From d0ecf51b0c9202475e2abe90a45b50df0de6d7ae Mon Sep 17 00:00:00 2001
From: Peter Park
Date: Fri, 11 Oct 2024 15:47:23 -0400
Subject: [PATCH] add oversubscription conceptual doc (#3885)

add mitigation steps

add to toc

move page for build

move doc

fix spelling

update doc

update oversubscription

update order

fix spelling

add oversubscription to wordlist

move oversubscription topic to bottom of toc and index
---
 .wordlist.txt                        |  2 ++
 docs/conceptual/oversubscription.rst | 34 ++++++++++++++++++++++++++++++++++
 docs/index.md                        |  1 +
 docs/sphinx/_toc.yml.in              |  2 ++
 4 files changed, 39 insertions(+)
 create mode 100644 docs/conceptual/oversubscription.rst

diff --git a/.wordlist.txt b/.wordlist.txt
index 6621603dc..580890845 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -272,6 +272,7 @@ OpenMPI
 OpenSSL
 OpenVX
 OpenXLA
+Oversubscription
 PCC
 PCI
 PCIe
@@ -620,6 +621,7 @@ openmp
 openssl
 optimizers
 os
+oversubscription
 pageable
 parallelization
 parameterization
diff --git a/docs/conceptual/oversubscription.rst b/docs/conceptual/oversubscription.rst
new file mode 100644
index 000000000..83865876a
--- /dev/null
+++ b/docs/conceptual/oversubscription.rst
@@ -0,0 +1,34 @@
+.. meta::
+  :description: Learn what causes oversubscription.
+  :keywords: warning, log, gpu, performance penalty, help
+
+*******************************************************************
+Oversubscription of hardware resources in AMD Instinct accelerators
+*******************************************************************
+
+When an AMD Instinct™ MI series accelerator enters an oversubscribed state, the ``amdgpu`` driver outputs the following
+message.
+
+``amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.``
+
+Oversubscription occurs when application demands exceed the available hardware resources. In an oversubscribed
+state, the hardware scheduler tries to manage resource usage in a round-robin fashion. However,
+this can result in reduced performance, as resources might be occupied by applications or queues not actively
+submitting work. The granularity of hardware resources occupied by an inactive queue can be on the order of
+milliseconds, during which the accelerator or GPU is effectively blocked and unable to process work submitted by other
+queues.
+
+What triggers oversubscription?
+===============================
+
+The system enters an oversubscribed state when one of the following conditions is met:
+
+* **Hardware queue limit exceeded**: The number of user-mode compute queues requested by applications exceeds the
+  hardware limit of 24 queues for current Instinct accelerators.
+
+* **Virtual memory context slots exceeded**: The number of user processes exceeds the number of available virtual memory
+  context slots, which is 11 for current Instinct accelerators.
+
+* **Multiple processes using cooperative workgroups**: More than one process attempts to use the cooperative workgroup
+  feature, leading to resource contention.
+
diff --git a/docs/index.md b/docs/index.md
index e6d689239..8513180ab 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -64,6 +64,7 @@ ROCm documentation is organized into the following categories:
 * [Using CMake](./conceptual/cmake-packages.rst)
 * [ROCm & PCIe atomics](./conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst)
 * [Inception v3 with PyTorch](./conceptual/ai-pytorch-inception.md)
+* [Oversubscription of hardware resources](./conceptual/oversubscription.rst)
 :::
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index b9afa4a5f..9dd8af346 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -158,6 +158,8 @@ subtrees:
       title: ROCm & PCIe atomics
     - file: conceptual/ai-pytorch-inception.md
       title: Inception v3 with PyTorch
+    - file: conceptual/oversubscription.rst
+      title: Oversubscription of hardware resources
   - caption: Reference
     entries: