MI300A system optimization guide internal draft (#117)

* MI300A system optimization guide internal draft

* Small changes to System BIOS paragraph

* Some minor edits

* Changes after external review feedback

* Add CPU Affinity debug setting

* Edit CPU Affinity debug setting

* Changes from external discussion

* Add glossary and other small fixes

* Additional changes from the review

* Update the IOMMU guidance

* Change description of CPU affinity setting

* Slight rewording

* Change Debian to Red Hat-based

* A few changes from the second internal review
This commit is contained in:
Jeffrey Novotny
2024-07-31 13:29:49 -04:00
committed by GitHub
parent 222865c8be
commit bdcb82372b
9 changed files with 400 additions and 0 deletions

Five binary image files added (not shown): 28 KiB, 54 KiB, 28 KiB, 103 KiB, and 113 KiB.


@@ -65,6 +65,12 @@ their own performance testing for additional tuning.
- `CDNA 3 architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf>`_
* - :doc:`AMD Instinct MI300A <mi300a>`
- `AMD Instinct MI300 instruction set architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf>`_
- `CDNA 3 architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf>`_
* - :doc:`AMD Instinct MI200 <mi200>`
- `AMD Instinct MI200 instruction set architecture <https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf>`_


@@ -0,0 +1,391 @@
.. meta::
:description: AMD Instinct MI300A system settings
:keywords: AMD, Instinct, MI300A, HPC, tuning, BIOS settings, NBIO, ROCm,
environment variable, performance, accelerator, GPU, EPYC, GRUB,
operating system
***************************************************
AMD Instinct MI300A system optimization
***************************************************
This topic discusses the operating system settings and system management commands for
the AMD Instinct MI300A accelerator and explains how to use them to optimize performance.
System settings
========================================
This section reviews the system settings required to configure an MI300A SOC system and
optimize its performance.
The MI300A system-on-a-chip (SOC) design requires you to review and potentially adjust your OS configuration as explained in
the :ref:`operating-system-settings-label` section. These settings are critical for
performance because the OS on an accelerated processing unit (APU) is responsible for memory management across the CPU and GPU accelerators.
In the APU memory model, system settings are available to limit GPU memory allocation.
This limit is important because legacy software often determines the
amount of allowable memory at start-up time
by probing discrete memory until it is exhausted. If left unchecked, this practice
can starve the OS of resources.
System BIOS settings
-----------------------------------
System BIOS settings are preconfigured by the platform vendor for optimal performance,
so you do not need to adjust them when using the MI300A.
If you have any questions regarding these settings,
contact your MI300A platform vendor.
GRUB settings
-----------------------------------
The ``/etc/default/grub`` file is used to configure the GRUB bootloader on modern Linux distributions.
Linux uses the string assigned to ``GRUB_CMDLINE_LINUX`` in this file as
its command line parameters during boot.
Appending strings using the Linux command line
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
It is recommended that you append the following string to ``GRUB_CMDLINE_LINUX``:
``pci=realloc=off``
This setting disables the automatic reallocation
of PCI resources, so Linux is able to unambiguously detect all GPUs on the
MI300A-based system. It's used when Single Root I/O Virtualization (SR-IOV) Base
Address Registers (BARs) have not been allocated by the BIOS. This can help
avoid potential issues with certain hardware configurations.
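For illustration only, a minimal ``/etc/default/grub`` entry containing just this option might look like the following. In practice, append the option to whatever options your distribution already sets in this string.

.. code-block:: cfg

   # Append pci=realloc=off to any options already present in this string.
   GRUB_CMDLINE_LINUX="pci=realloc=off"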
Validating the IOMMU setting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The IOMMU is a system-specific I/O mapping mechanism that provides DMA mapping
and isolation. The IOMMU is turned off by default in the operating system settings
for optimal performance.
To verify the IOMMU is turned off, first install the ``acpica-tools`` package using your
package manager. For example, on a Debian-based system:
.. code-block:: shell
sudo apt install acpica-tools
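On RHEL-based distributions, the same package is typically available through ``dnf`` (the package name is assumed to match your distribution's repositories):

.. code-block:: shell

   sudo dnf install acpica-tools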
Then confirm that the following commands do not return any results.
.. code-block:: shell
sudo acpidump | grep IVRS
sudo acpidump | grep DMAR
Update GRUB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Use this command to update GRUB to use the modified configuration:
.. code-block:: shell
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
On some Red Hat-based systems, the ``grub2-mkconfig`` command might not be available. In this case,
use ``grub-mkconfig`` instead. Verify that you have the
correct version by using the following command:
.. code-block:: shell
grub-mkconfig --version
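On Debian-based systems, the ``update-grub`` helper, where available, regenerates the configuration in the expected location:

.. code-block:: shell

   sudo update-grub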
.. _operating-system-settings-label:
Operating system settings
-----------------------------------
The operating system provides several options to customize and tune performance. For more information
about supported operating systems, see the :doc:`Compatibility matrix <../../compatibility/compatibility-matrix>`.
If you are using a distribution other than RHEL or SLES, the latest Linux kernel is recommended.
Performance considerations for Zen 4, the CPU core architecture in the MI300A,
require Linux kernel version 5.18 or higher.
This section describes performance-based settings.
* **Enable transparent huge pages**
To enable transparent huge pages, use one of the following methods:
* From the command line, run the following command:
.. code-block:: shell
echo always > /sys/kernel/mm/transparent_hugepage/enabled
* Set the Linux kernel parameter ``transparent_hugepage`` as follows in the
relevant ``.cfg`` file for your system.
.. code-block:: cfg
transparent_hugepage=always
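After enabling transparent huge pages, you can confirm the active policy by reading the same ``sysfs`` file; the selected mode appears in brackets (the output shown is illustrative):

.. code-block:: shell

   cat /sys/kernel/mm/transparent_hugepage/enabled
   # [always] madvise never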
* **Limit the maximum and single memory allocations on the GPU**
Many AI-related applications were originally developed on discrete GPUs. Some of these applications
have fixed problem sizes associated with the targeted GPU size, and some attempt to determine the
system memory limits by allocating chunks until failure. These techniques can cause issues in an
APU with a shared space.
To allow these applications to run on the APU without further changes,
ROCm supports a default memory policy that restricts the percentage of the GPU that can be allocated.
The following environment variables control this feature:
* ``GPU_MAX_ALLOC_PERCENT``
* ``GPU_SINGLE_ALLOC_PERCENT``
These settings can be added to the default shell environment or the user environment. The effect of the memory allocation
settings varies depending on the system, configuration, and task. They might require adjustment, especially when performing GPU benchmarks. Setting these values to ``100``
lets the GPU allocate any amount of free memory. However, the risk of encountering
an operating system out-of-memory (OOM) condition increases when almost
all the available memory is used.
Before setting either of these items to 100 percent,
carefully consider the expected CPU workload allocation and the anticipated OS usage.
For instance, if the OS requires 8GB on a 128GB system, setting these
variables to ``100`` authorizes a single
workload to allocate up to 120GB of memory. Unless the system has swap space configured,
any over-allocation attempts are handled by the OOM policies.
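As a sketch, both variables can be exported in the default shell environment or a user profile. The percentages below are illustrative starting points, not recommendations:

.. code-block:: shell

   # Illustrative limits: cap total GPU allocations at 90% and any
   # single allocation at 80% of available memory.
   export GPU_MAX_ALLOC_PERCENT=90
   export GPU_SINGLE_ALLOC_PERCENT=80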
* **Disable NUMA (Non-uniform memory access) balancing**
ROCm uses information from the compiled application to ensure an affinity exists
between the GPU agent processes and their CPU hosts or co-processing agents.
Because the APU relies on OS threads,
including memory-management threads, the default kernel NUMA policies can
adversely impact workload performance without additional tuning.
.. note::
At the kernel level, ``pci=realloc`` can also be set to ``off`` as an additional tuning measure.
To disable NUMA balancing, use one of the following methods:
* From the command line, run the following command:
.. code-block:: shell
echo 0 > /proc/sys/kernel/numa_balancing
* Set the following Linux kernel parameters in the
relevant ``.cfg`` file for your system.
.. code-block:: cfg
pci=realloc=off numa_balancing=disable
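To confirm that NUMA balancing is disabled, read back the setting; a value of ``0`` indicates it is off (the output shown is illustrative):

.. code-block:: shell

   cat /proc/sys/kernel/numa_balancing
   # 0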
* **Enable compaction**
Compaction is necessary for proper MI300A operation because the APU dynamically shares memory
between the CPU and GPU. Compaction can run proactively as a background activity, which reduces
allocation costs, or it can be performed on demand during allocation, which adds to the cost of the allocation.
Without compaction, MI300A application performance eventually degrades as fragmentation increases.
In RHEL distributions, compaction is disabled by default. In Ubuntu, it's enabled by default.
To enable compaction, enter the following commands using the command line:
.. code-block:: shell
echo 20 > /proc/sys/vm/compaction_proactiveness
echo 1 > /proc/sys/vm/compact_unevictable_allowed
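To make these values persist across reboots, one option is a sysctl drop-in file with the equivalent keys (the file name is illustrative):

.. code-block:: cfg

   # /etc/sysctl.d/99-mi300a-compaction.conf (illustrative name)
   vm.compaction_proactiveness = 20
   vm.compact_unevictable_allowed = 1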
* **Change affinity of ROCm helper threads**
This change prevents internal ROCm threads from having their CPU core affinity mask
set to all available CPU cores. With this setting, the threads instead inherit their parent's
CPU core affinity mask. If you have any questions regarding this setting,
contact your MI300A platform vendor. To enable this setting, enter the following command:
.. code-block:: shell
export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
* **CPU core states and C-states**
The system BIOS handles these settings for the MI300A.
They don't need to be configured on the operating system.
System management
========================================
For a complete guide on installing, managing, and uninstalling ROCm on Linux, see
:doc:`Quick-start (Linux)<rocm-install-on-linux:tutorial/quick-start>`. To verify that the
installation was successful, see the
:doc:`Post-installation instructions<rocm-install-on-linux:how-to/native-install/post-install>` and
:doc:`ROCm tools <../../reference/rocm-tools>` guides. If verification
fails, consult the :doc:`System debugging guide <../system-debugging>`.
.. _hw-verification-rocm-label:
Hardware verification with ROCm
-----------------------------------
ROCm includes tools to query the system structure. To query
the GPU hardware, use the ``rocm-smi`` command.
``rocm-smi`` reports statistics per socket, so the power results combine CPU and GPU utilization.
In an idle state on a multi-socket system, some power imbalances are expected because
the distribution of OS threads can keep some APU devices at higher power states.
.. note::
The MI300A VRAM settings show as ``N/A``.
.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-smi-output.png
:alt: Output from the rocm-smi command
The ``rocm-smi --showhw`` command shows the available system
GPUs and their device ID and firmware details.
In the MI300A hardware settings, the system BIOS handles the UMC RAS. The
ROCm-supplied GPU driver does not manage this setting.
This results in a value of ``DISABLED`` for the ``UMC RAS`` setting.
.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-smi-showhw-output.png
:alt: Output from the rocm-smi --showhw command
To see the system structure, the location of the GPUs in the system, and the
fabric connections between the system components, use the ``rocm-smi --showtopo`` command.
* The first block of the output shows the distance between the GPUs. The weight is a qualitative
measure of the “distance” data must travel to reach one GPU from another.
While the values do not have a precise physical meaning, the higher the value, the
more hops are required to reach the destination from the source GPU.
* The second block contains a matrix named “Hops between two GPUs”, where ``1`` means
the two GPUs are directly connected with XGMI, ``2`` means both GPUs are linked to the
same CPU socket and GPU communications go through the CPU, and ``3`` means
both GPUs are linked to different CPU sockets so communications go
through both CPU sockets.
* The third block indicates the link types between the GPUs. This can either be
``XGMI`` for AMD Infinity Fabric links or ``PCIE`` for PCIe Gen4 links.
* The fourth block shows the location of each GPU with respect to the NUMA organization
of the shared memory of the AMD EPYC processors.
.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-smi-showtopo-output.png
:alt: Output from the rocm-smi --showtopo command
Testing inter-device bandwidth
-----------------------------------
The ``rocm-smi --showtopo`` command from the :ref:`hw-verification-rocm-label` section
displays the system structure and shows how the GPUs are located and connected within this
structure. For more information, use the :doc:`ROCm Bandwidth Test <rocm_bandwidth_test:index>`, which can run benchmarks to
show the effective link bandwidth between the system components.
For information on how to install the ROCm Bandwidth Test, see :doc:`Building the environment <rocm_bandwidth_test:install/install>`.
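As an illustration, running the utility without arguments typically executes its default set of unidirectional and bidirectional copy benchmarks across all detected devices:

.. code-block:: shell

   rocm-bandwidth-test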
The output lists the available compute devices (CPUs and GPUs), including
their device ID and PCIe ID:
.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-bandwidth-test-output.png
:alt: Output from the rocm-bandwidth-test utility
It also displays the measured bandwidth for unidirectional and
bidirectional transfers between the devices on the CPU and GPU:
.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-peak-bandwidth-output.png
:alt: Bandwidth information from the rocm-bandwidth-test utility
Abbreviations
=============
APBDIS
Algorithmic Performance Boost Disable
APU
Accelerated processing unit
BAR
Base Address Register
BIOS
Basic Input/Output System
CBS
Common BIOS Settings
CCD
Compute Core Die
CDNA
Compute DNA
CLI
Command Line Interface
CPU
Central Processing Unit
cTDP
Configurable Thermal Design Power
DF
Data Fabric
DMA
Direct Memory Access
GPU
Graphics Processing Unit
GRUB
Grand Unified Bootloader
HBM
High Bandwidth Memory
HPC
High Performance Computing
IOMMU
Input-Output Memory Management Unit
ISA
Instruction Set Architecture
NBIO
North Bridge Input/Output
NUMA
Non-Uniform Memory Access
OOM
Out of Memory
PCI
Peripheral Component Interconnect
PCIe
PCI Express
POR
Power-On Reset
RAS
Reliability, Availability, and Serviceability
SMI
System Management Interface
SMT
Simultaneous Multi-threading
SOC
System On Chip
SR-IOV
Single Root I/O Virtualization
TSME
Transparent Secure Memory Encryption
UMC
Unified Memory Controller
VRAM
Video RAM
xGMI
Inter-chip Global Memory Interconnect


@@ -53,6 +53,7 @@ ROCm documentation is organized into the following categories:
* [Fine-tuning LLMs and inference optimization](./how-to/llm-fine-tuning-optimization/index.rst)
* [System optimization](./how-to/system-optimization/index.rst)
* [AMD Instinct MI300X](./how-to/system-optimization/mi300x.rst)
* [AMD Instinct MI300A](./how-to/system-optimization/mi300a.rst)
* [AMD Instinct MI200](./how-to/system-optimization/mi200.md)
* [AMD Instinct MI100](./how-to/system-optimization/mi100.md)
* [AMD Instinct RDNA2](./how-to/system-optimization/w6000-v620.md)


@@ -67,6 +67,8 @@ subtrees:
- entries:
- file: how-to/system-optimization/mi300x.rst
title: AMD Instinct MI300X
- file: how-to/system-optimization/mi300a.rst
title: AMD Instinct MI300A
- file: how-to/system-optimization/mi200.md
title: AMD Instinct MI200
- file: how-to/system-optimization/mi100.md