add oversubscription conceptual doc (#3885)

* add mitigation steps
* add to toc
* move page for build
* move doc
* fix spelling
* update doc
* update oversubscription
* update order
* fix spelling
* add oversubscription to wordlist
* move oversubscription topic to bottom of toc and index
2026-01-10 07:08:08 -05:00 · 2024-10-11 15:47:23 -04:00
parent 5656ea9285
commit d0ecf51b0c
4 changed files with 39 additions and 0 deletions
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -272,6 +272,7 @@ OpenMPI
 OpenSSL
 OpenVX
 OpenXLA
+Oversubscription
 PCC
 PCI
 PCIe
@@ -620,6 +621,7 @@ openmp
 openssl
 optimizers
 os
+oversubscription
 pageable
 parallelization
 parameterization
--- a/docs/conceptual/oversubscription.rst
+++ b/docs/conceptual/oversubscription.rst
@@ -0,0 +1,34 @@
+.. meta::
+   :description: Learn what causes oversubscription.
+   :keywords: warning, log, gpu, performance penalty, help
+
+*******************************************************************
+Oversubscription of hardware resources in AMD Instinct accelerators
+*******************************************************************
+
+When an AMD Instinct™ MI series accelerator enters an oversubscribed state, the ``amdgpu`` driver logs the following
+message:
+
+``amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.``
+
+Oversubscription occurs when application demands exceed the available hardware resources. In an oversubscribed
+state, the hardware scheduler tries to manage resource usage in a round-robin fashion. However,
+this can result in reduced performance, as resources might be occupied by applications or queues not actively
+submitting work. An inactive queue can hold its hardware resources for periods on the order of
+milliseconds, during which the accelerator or GPU is effectively blocked and unable to process work submitted by other
+queues.
+
+What triggers oversubscription?
+===============================
+
+The system enters an oversubscribed state when any of the following conditions is met:
+
+* **Hardware queue limit exceeded**: The number of user-mode compute queues requested by applications exceeds the
+  hardware limit of 24 queues for current Instinct accelerators.
+
+* **Virtual memory context slots exceeded**: The number of user processes exceeds the number of available virtual memory
+  context slots, which is 11 for current Instinct accelerators.
+
+* **Multiple processes using cooperative workgroups**: More than one process attempts to use the cooperative workgroup
+  feature, leading to resource contention.
+
--- a/docs/index.md
+++ b/docs/index.md
@@ -64,6 +64,7 @@ ROCm documentation is organized into the following categories:
 * [Using CMake](./conceptual/cmake-packages.rst)
 * [ROCm & PCIe atomics](./conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst)
 * [Inception v3 with PyTorch](./conceptual/ai-pytorch-inception.md)
+* [Oversubscription of hardware resources](./conceptual/oversubscription.rst)
 :::

 <!-- markdownlint-disable MD051 -->
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -158,6 +158,8 @@ subtrees:
    title: ROCm & PCIe atomics
  - file: conceptual/ai-pytorch-inception.md
    title: Inception v3 with PyTorch
+  - file: conceptual/oversubscription.rst
+    title: Oversubscription of hardware resources

 - caption: Reference
  entries: