From d0ecf51b0c9202475e2abe90a45b50df0de6d7ae Mon Sep 17 00:00:00 2001
From: Peter Park <peter.park@amd.com>
Date: Fri, 11 Oct 2024 15:47:23 -0400
Subject: [PATCH] add oversubscription conceptual doc (#3885)

add mitigation steps

add to toc

move page for build

move doc

fix spelling

update doc

update oversubscription

update order

fix spelling

add oversubscription to wordlist

move oversubscription topic to bottom of toc and index
---
 .wordlist.txt                        |  2 ++
 docs/conceptual/oversubscription.rst | 34 ++++++++++++++++++++++++++++
 docs/index.md                        |  1 +
 docs/sphinx/_toc.yml.in              |  2 ++
 4 files changed, 39 insertions(+)
 create mode 100644 docs/conceptual/oversubscription.rst

diff --git a/.wordlist.txt b/.wordlist.txt
index 6621603dc..580890845 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -272,6 +272,7 @@ OpenMPI
 OpenSSL
 OpenVX
 OpenXLA
+Oversubscription
 PCC
 PCI
 PCIe
@@ -620,6 +621,7 @@ openmp
 openssl
 optimizers
 os
+oversubscription
 pageable
 parallelization
 parameterization
diff --git a/docs/conceptual/oversubscription.rst b/docs/conceptual/oversubscription.rst
new file mode 100644
index 000000000..83865876a
--- /dev/null
+++ b/docs/conceptual/oversubscription.rst
@@ -0,0 +1,34 @@
+.. meta::
+   :description: Learn what causes oversubscription.
+   :keywords: warning, log, gpu, performance penalty, help
+
+*******************************************************************
+Oversubscription of hardware resources in AMD Instinct accelerators
+*******************************************************************
+
+When an AMD Instinct™ MI series accelerator enters an oversubscribed state, the ``amdgpu`` driver outputs the following
+message.
+
+``amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.``
+
+Oversubscription occurs when application demands exceed the available hardware resources. In an oversubscribed
+state, the hardware scheduler tries to manage resource usage in a round-robin fashion. However,
+this can result in reduced performance, as resources might be occupied by applications or queues not actively
+submitting work. An inactive queue can hold hardware resources for intervals on the order of
+milliseconds, during which the accelerator or GPU is effectively blocked and unable to process work submitted by other
+queues.
+
+What triggers oversubscription?
+===============================
+
+The system enters an oversubscribed state when any of the following conditions is met:
+
+* **Hardware queue limit exceeded**: The number of user-mode compute queues requested by applications exceeds the
+  hardware limit of 24 queues for current Instinct accelerators.
+
+* **Virtual memory context slots exceeded**: The number of user processes exceeds the number of available virtual memory
+  context slots, which is 11 for current Instinct accelerators.
+
+* **Multiple processes using cooperative workgroups**: More than one process attempts to use the cooperative workgroup
+  feature, leading to resource contention.
+
diff --git a/docs/index.md b/docs/index.md
index e6d689239..8513180ab 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -64,6 +64,7 @@ ROCm documentation is organized into the following categories:
 * [Using CMake](./conceptual/cmake-packages.rst)
 * [ROCm & PCIe atomics](./conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst)
 * [Inception v3 with PyTorch](./conceptual/ai-pytorch-inception.md)
+* [Oversubscription of hardware resources](./conceptual/oversubscription.rst)
 :::
 
 <!-- markdownlint-disable MD051 -->
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index b9afa4a5f..9dd8af346 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -158,6 +158,8 @@ subtrees:
     title: ROCm & PCIe atomics
   - file: conceptual/ai-pytorch-inception.md
     title: Inception v3 with PyTorch
+  - file: conceptual/oversubscription.rst
+    title: Oversubscription of hardware resources
 
 - caption: Reference
   entries: