From 7f3846577099fb4f4613eff895c8e767133e0cd0 Mon Sep 17 00:00:00 2001
From: Bence Parajdi
Date: Wed, 24 Apr 2024 11:05:01 +0200
Subject: [PATCH] add cu setting page

---
 docs/conceptual/setting-cus.rst | 48 +++++++++++++++++++++++++++++++++
 docs/sphinx/_toc.yml.in         |  2 ++
 2 files changed, 50 insertions(+)
 create mode 100644 docs/conceptual/setting-cus.rst

diff --git a/docs/conceptual/setting-cus.rst b/docs/conceptual/setting-cus.rst
new file mode 100644
index 000000000..84607b1b8
--- /dev/null
+++ b/docs/conceptual/setting-cus.rst
@@ -0,0 +1,48 @@
+.. meta::
+  :description: Setting the number of CUs
+  :keywords: AMD, ROCm, cu, number of cus
+
+.. _env-variables-reference:
+
+*************************************************************
+Setting the number of CUs
+*************************************************************
+
+When using GPUs to accelerate compute workloads, it sometimes becomes necessary
+to configure how the Compute Units (CUs) of the hardware are used. This is an advanced
+option, so read this page before experimenting.
+
+The GPU driver provides two environment variables to set the number of CUs used:
+``HSA_CU_MASK`` and ``ROC_GLOBAL_CU_MASK``. The main difference is that
+``ROC_GLOBAL_CU_MASK`` sets the CU mask on queues created by the HIP or OpenCL
+runtimes, while ``HSA_CU_MASK`` sets the mask at a lower level of queue creation
+in the driver, which means that the mask is also applied to queues that are
+being profiled.
+
+The environment variables have the following syntax:
+
+::
+
+    ID = [0-9][0-9]*                         ex. base 10 numbers
+    ID_list = (ID | ID-ID)[, (ID | ID-ID)]*  ex. 0,2-4,7
+    GPU_list = ID_list                       ex. 0,2-4,7
+    CU_list = 0x[0-F]* | ID_list             ex. 0x337F OR 0,2-4,7
+    CU_Set = GPU_list : CU_list              ex. 0,2-4,7:0-15,32-47 OR 0,2-4,7:0x337F
+    HSA_CU_MASK = CU_Set [; CU_Set]*         ex. 0,2-4,7:0-15,32-47; 3-9:0x337F
+
+The GPU indices are taken post ``ROCR_VISIBLE_DEVICES`` reordering. For each listed
+GPU, the listed or masked CUs are enabled and the rest are disabled. Unlisted GPUs
+are not affected: all of their CUs remain enabled.
+
+Parsing of the variable stops when a syntax error occurs. The erroneous set and
+all sets following it are ignored. Repeating a GPU or CU ID is a syntax error.
+Specifying a mask with no usable CUs (``CU_list`` is ``0x0``) is also a syntax error. To
+exclude GPU devices, use ``ROCR_VISIBLE_DEVICES`` instead.
+
+These environment variables only affect ROCm software, not graphics applications.
+
+Note that not all CU configurations are valid on all devices. For instance, on
+devices where two CUs can be combined into a workgroup processor (WGP) (for kernels
+running in WGP mode), it is not legal to disable only a single CU in a WGP. `This paper
+`_ provides more information
+about what to expect when disabling CUs.
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index ebb46ad4e..69ccd9336 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -110,6 +110,8 @@ subtrees:
           title: White paper
       - file: conceptual/gpu-memory.md
         title: GPU memory
+      - file: conceptual/setting-cus
+        title: Configuring CUs
       - file: conceptual/file-reorg.md
         title: File structure (Linux FHS)
       - file: conceptual/gpu-isolation.md
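
To illustrate the mask grammar documented in ``setting-cus.rst``, here is a minimal Python sketch of a parser for ``HSA_CU_MASK``-style values. It is not the driver's implementation. The function names are hypothetical, and several behaviors are assumptions that the page does not state: bit ``i`` of a hex mask (least significant bit first) is taken to mean CU ``i``, whitespace around items is tolerated, descending ranges such as ``4-2`` are rejected, and a GPU index that reappears in a later set is treated as a repeat.

```python
import re

ID_ITEM = re.compile(r"(\d+)(?:-(\d+))?")      # ID or ID-ID
HEX_MASK = re.compile(r"0[xX]([0-9a-fA-F]+)")  # hex form of CU_list


def parse_id_list(text):
    """Expand an ID_list such as '0,2-4,7' into [0, 2, 3, 4, 7]."""
    ids = []
    for item in text.split(","):
        match = ID_ITEM.fullmatch(item.strip())
        if match is None:
            raise ValueError(f"bad ID or range: {item!r}")
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        if last < first:
            # Assumption: descending ranges are a syntax error.
            raise ValueError(f"descending range: {item!r}")
        ids.extend(range(first, last + 1))
    if len(set(ids)) != len(ids):
        raise ValueError(f"repeated ID in {text!r}")
    return ids


def parse_cu_list(text):
    """Parse a CU_list given either as a hex mask or as an ID_list."""
    text = text.strip()
    match = HEX_MASK.fullmatch(text)
    if match is None:
        return parse_id_list(text)
    mask = int(match.group(1), 16)
    if mask == 0:
        raise ValueError("mask enables no CUs")
    # Assumption: bit i of the mask (least significant bit = 0) is CU i.
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def parse_cu_mask(value):
    """Return {gpu_index: [enabled CU ids]} for the sets parsed so far.

    Parsing stops at the first erroneous set; that set and all sets
    after it are ignored, as the documentation describes.
    """
    result = {}
    for cu_set in value.split(";"):
        try:
            gpu_text, sep, cu_text = cu_set.partition(":")
            if not sep:
                raise ValueError("missing ':' between GPU_list and CU_list")
            gpus = parse_id_list(gpu_text)
            cus = parse_cu_list(cu_text)
            if any(gpu in result for gpu in gpus):
                # Assumption: a GPU repeated across sets counts as a repeat.
                raise ValueError("repeated GPU index")
        except ValueError:
            break
        for gpu in gpus:
            result[gpu] = cus
    return result
```

For example, ``parse_cu_mask("0:0-3; 1:0x0; 2:0-3")`` keeps only GPU 0, because the second set has an empty mask and everything from that set onward is dropped.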