mirror of
https://github.com/ROCm/ROCm.git
synced 2026-01-09 14:48:06 -05:00
add oversubscription conceptual doc (#3885)
add mitigation steps
add to toc
move page for build
move doc
fix spelling
update doc
update oversubscription
update order
fix spelling
add oversubscription to wordlist
move oversubscription topic to bottom of toc and index
(cherry picked from commit d0ecf51b0c)
@@ -272,6 +272,7 @@ OpenMPI
OpenSSL
OpenVX
OpenXLA
Oversubscription
PCC
PCI
PCIe
@@ -620,6 +621,7 @@ openmp
openssl
optimizers
os
oversubscription
pageable
parallelization
parameterization
34
docs/conceptual/oversubscription.rst
Normal file
@@ -0,0 +1,34 @@
.. meta::
  :description: Learn what causes oversubscription.
  :keywords: warning, log, gpu, performance penalty, help

*******************************************************************
Oversubscription of hardware resources in AMD Instinct accelerators
*******************************************************************

When an AMD Instinct™ MI series accelerator enters an oversubscribed state, the ``amdgpu`` driver outputs the following
message.

``amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.``

Oversubscription occurs when application demands exceed the available hardware resources. In an oversubscribed
state, the hardware scheduler tries to manage resource usage in a round-robin fashion. However,
this can reduce performance, because resources might be held by applications or queues that are not actively
submitting work. An inactive queue can hold hardware resources for a time slice on the order of
milliseconds, during which the accelerator or GPU is effectively blocked and cannot process work submitted by other
queues.

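
A minimal sketch, not taken from the ROCm sources, of how the warning quoted above could be spotted in captured kernel log text (for example, saved ``dmesg`` output). The function name and the sample log lines are hypothetical; only the message text comes from this page.

```python
# Message text quoted from the amdgpu driver warning described above.
OVERSUBSCRIPTION_MARKER = "Runlist is getting oversubscribed"


def find_oversubscription_warnings(log_text: str) -> list[str]:
    """Return the log lines that contain the oversubscription warning."""
    return [line for line in log_text.splitlines() if OVERSUBSCRIPTION_MARKER in line]


if __name__ == "__main__":
    # Hypothetical sample; real kernel log lines usually carry timestamps and other prefixes.
    sample = (
        "amdgpu: some unrelated message\n"
        "amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.\n"
    )
    for hit in find_oversubscription_warnings(sample):
        print(hit)
```

Matching on a substring rather than the full line keeps the check tolerant of whatever prefix the logging system adds.
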
What triggers oversubscription?
===============================

The system enters an oversubscribed state when any of the following conditions is met:

* **Hardware queue limit exceeded**: The number of user-mode compute queues requested by applications exceeds the
  hardware limit of 24 queues for current Instinct accelerators.

* **Virtual memory context slots exceeded**: The number of user processes exceeds the number of available virtual memory
  context slots, which is 11 for current Instinct accelerators.

* **Multiple processes using cooperative workgroups**: More than one process attempts to use the cooperative workgroup
  feature, leading to resource contention.

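
The three conditions listed above can be expressed as a small check. This is an illustrative model only, not the driver's actual logic: the ``Workload`` type and its field names are hypothetical, and the limits are the values this page quotes for current Instinct accelerators.

```python
from dataclasses import dataclass

# Limits quoted on this page for current Instinct accelerators.
MAX_COMPUTE_QUEUES = 24
MAX_VM_CONTEXT_SLOTS = 11


@dataclass
class Workload:
    """Hypothetical summary of what applications request from one accelerator."""

    compute_queues: int         # user-mode compute queues requested
    processes: int              # user processes using the accelerator
    cooperative_processes: int  # processes using the cooperative workgroup feature


def oversubscription_reasons(w: Workload) -> list[str]:
    """Return which of the documented conditions the workload meets."""
    reasons = []
    if w.compute_queues > MAX_COMPUTE_QUEUES:
        reasons.append("hardware queue limit exceeded")
    if w.processes > MAX_VM_CONTEXT_SLOTS:
        reasons.append("virtual memory context slots exceeded")
    if w.cooperative_processes > 1:
        reasons.append("multiple processes using cooperative workgroups")
    return reasons
```

An empty list means none of the documented conditions is met. A workload sitting exactly at a limit (24 queues, 11 processes, one cooperative process) is not flagged, because each condition requires the limit to be exceeded.
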
@@ -64,7 +64,7 @@ ROCm documentation is organized into the following categories:
* [Using CMake](./conceptual/cmake-packages.rst)
* [ROCm & PCIe atomics](./conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst)
* [Inception v3 with PyTorch](./conceptual/ai-pytorch-inception.md)
* [Inference optimization with MIGraphX](./conceptual/ai-migraphx-optimization.md)
* [Oversubscription of hardware resources](./conceptual/oversubscription.rst)
:::

<!-- markdownlint-disable MD051 -->

@@ -158,8 +158,8 @@ subtrees:
      title: ROCm & PCIe atomics
    - file: conceptual/ai-pytorch-inception.md
      title: Inception v3 with PyTorch
    - file: conceptual/ai-migraphx-optimization.md
      title: Inference optimization with MIGraphX
    - file: conceptual/oversubscription.rst
      title: Oversubscription of hardware resources

  - caption: Reference
    entries:
