From 55b5b66901d7d36b422795214159f7a92a4c7312 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?M=C3=A9sz=C3=A1ros=20Gergely?=
Date: Thu, 4 May 2023 19:44:09 +0200
Subject: [PATCH] Add GPU isolation (#2114)

* Add GPU isolation guide
* Add hover text expansion of DKMS in linux quick start guide
---
 docs/.sphinx/_toc.yml.in         |   1 +
 docs/deploy/docker.md            |   7 +-
 docs/deploy/linux/quick_start.md |   4 +-
 docs/index.md                    |   2 +-
 docs/reference/gpu_isolation.md  | 108 +++++++++++++++++++++++++++++++
 5 files changed, 119 insertions(+), 3 deletions(-)
 create mode 100644 docs/reference/gpu_isolation.md

diff --git a/docs/.sphinx/_toc.yml.in b/docs/.sphinx/_toc.yml.in
index 99a06ade8..0be8809c8 100644
--- a/docs/.sphinx/_toc.yml.in
+++ b/docs/.sphinx/_toc.yml.in
@@ -171,6 +171,7 @@ subtrees:
         title: MI250
       - file: reference/gpu_arch/mi100
         title: MI100
+      - file: reference/gpu_isolation
   - caption: Understand ROCm
     entries:
       - title: Compiler Disambiguation
diff --git a/docs/deploy/docker.md b/docs/deploy/docker.md
index 6b84b64fc..86a6d9b11 100644
--- a/docs/deploy/docker.md
+++ b/docs/deploy/docker.md
@@ -8,6 +8,8 @@ ROCm kernel-mode driver must be installed on the host. Please refer to
 (like the HIP-runtime or math libraries) of the ROCm stack will be loaded
 from the container image and don't need to be installed to the host.
 
+(docker-access-gpus-in-container)=
+
 ## Accessing GPUs in containers
 
 In order to access GPUs in a container (to run applications using HIP, OpenCL or
@@ -38,6 +40,8 @@ docker run --device /dev/kfd --device /dev/dri
 
 Note that this gives more access than strictly required, as it also exposes
 the other device files found in that folder to the container.
 
+(docker-restrict-gpus)=
+
 ### Restricting a container to a subset of the GPUs
 
 If a `/dev/dri/renderD` device is not exposed to a container then it cannot use
@@ -82,4 +86,5 @@ applications, but does not include any libraries.
 
 AMD provides pre-built images for various GPU-ready applications through its
 Infinity Hub at .
-There are also examples of invocating each application and suggested parameters used for benchmarking.
+Examples for invoking each application and suggested parameters used for
+benchmarking are also provided there.
diff --git a/docs/deploy/linux/quick_start.md b/docs/deploy/linux/quick_start.md
index 4fcbb81e7..5262baa7a 100644
--- a/docs/deploy/linux/quick_start.md
+++ b/docs/deploy/linux/quick_start.md
@@ -3,13 +3,15 @@
 ## Install Prerequisites
 
 The driver package uses
-[`DKMS`](https://en.wikipedia.org/wiki/Dynamic_Kernel_Module_Support) to build
+[{abbr}`DKMS (Dynamic Kernel Module Support)`][DKMS-wiki] to build
 the `amdgpu-dkms` module (driver) for the installed kernels. This requires the
 Linux kernel headers and modules to be installed for each. Usually these are
 automatically installed with the kernel, but if you have multiple kernel
 versions or you have downloaded the kernel images and not the kernel
 meta-packages then they must be manually installed.
 
+[DKMS-wiki]: https://en.wikipedia.org/wiki/Dynamic_Kernel_Module_Support
+
 To install for the currently active kernel run the command corresponding to
 your distribution.
 ::::{tab-set}
diff --git a/docs/index.md b/docs/index.md
index f2e37c479..6a4fd5216 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -56,6 +56,7 @@ agile, flexible, rapid and secure manner.
 [more...](rocm)
 - [Management Tools](reference/management_tools)
 - [Validation Tools](reference/validation_tools)
 - [GPU Architecture](reference/gpu_arch)
+- [GPU Isolation Techniques](reference/gpu_isolation)
 
 :::
 
@@ -76,7 +77,6 @@ Understand ROCm
 How to Guides
 ^^^
 
-- [How to Isolate GPUs in Docker?](how_to/docker_gpu_isolation)
 - [Setting up for Deep Learning with ROCm](how_to/deep_learning_rocm)
 - [Magma Installation](how_to/magma_install/magma_install)
 - [PyTorch Installation](how_to/pytorch_install/pytorch_install)
diff --git a/docs/reference/gpu_isolation.md b/docs/reference/gpu_isolation.md
new file mode 100644
index 000000000..70ad2769e
--- /dev/null
+++ b/docs/reference/gpu_isolation.md
@@ -0,0 +1,108 @@
+# GPU Isolation Techniques
+
+Restricting the access of applications to a subset of GPUs, also known as
+isolating GPUs, allows users to hide GPU resources from programs. By default,
+programs will only use the "exposed" GPUs, ignoring other (hidden) GPUs in
+the system.
+
+There are multiple ways to achieve isolation of GPUs in the ROCm software
+stack, differing in which applications they apply to and the security they
+provide. This page serves as an overview of the techniques.
+
+## Environment Variables
+
+The runtimes in the ROCm software stack read these environment variables to
+select the exposed or default device to present to applications using them.
+
+Environment variables shouldn't be used for isolating untrusted applications,
+as an application can reset them before initializing the runtime.
+
+### `ROCR_VISIBLE_DEVICES`
+
+A list of device indices or {abbr}`UUID (universally unique identifier)`s
+that will be exposed to applications.
+
+Runtime
+: ROCm Platform Runtime. Applies to all applications using the user-mode ROCm
+  software stack.
+
+```{code-block} shell
+:caption: Example exposing the first device and a device selected by UUID.
+export ROCR_VISIBLE_DEVICES="0,GPU-DEADBEEFDEADBEEF"
+```
+
+### `GPU_DEVICE_ORDINAL`
+
+Device indices exposed to OpenCL and HIP applications.
+
+Runtime
+: ROCm Common Language Runtime (`ROCclr`). Applies to applications and
+  runtimes using the `ROCclr` abstraction layer, including HIP and OpenCL
+  applications.
+
+```{code-block} shell
+:caption: Example exposing the first and third devices in the system.
+export GPU_DEVICE_ORDINAL="0,2"
+```
+
+### `HIP_VISIBLE_DEVICES`
+
+Device indices exposed to HIP applications.
+
+Runtime
+: HIP Runtime. Applies only to applications using HIP on the AMD platform.
+
+```{code-block} shell
+:caption: Example exposing the first and third devices in the system.
+export HIP_VISIBLE_DEVICES="0,2"
+```
+
+### `CUDA_VISIBLE_DEVICES`
+
+Provided for CUDA compatibility; it has the same effect as
+`HIP_VISIBLE_DEVICES` on the AMD platform.
+
+Runtime
+: HIP or CUDA Runtime. Applies to HIP applications on the AMD or NVIDIA
+  platform and to CUDA applications.
+
+### `OMP_DEFAULT_DEVICE`
+
+Default device used for OpenMP target offloading.
+
+Runtime
+: OpenMP Runtime. Applies only to applications using OpenMP offloading.
+
+```{code-block} shell
+:caption: Example setting the third device as the default.
+export OMP_DEFAULT_DEVICE="2"
+```
+
+## Docker
+
+Docker uses Linux kernel namespaces to provide isolated environments for
+applications. This isolation applies to most devices by default, including
+GPUs. To access GPUs in containers, explicit access must be granted; see
+{ref}`docker-access-gpus-in-container` for details.
+In particular, refer to {ref}`docker-restrict-gpus` for exposing just a
+subset of all GPUs.
+
+Docker isolation is more secure than environment variables, and applies
+to all programs that use the `amdgpu` kernel module interfaces.
+Even programs that don't use the ROCm runtime, like graphics applications
+using OpenGL or Vulkan, can only access the GPUs exposed to the container.
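The environment-variable caveat above is easy to demonstrate: the variables live in the process environment, so a program can rewrite them before the runtime is initialized. A minimal shell sketch of the problem (illustrative only; no ROCm calls are made and no GPU is required):

```shell
# A launcher attempts to restrict the process to the first GPU.
export ROCR_VISIBLE_DEVICES="0"

# An untrusted application can simply drop the restriction before it
# initializes the ROCm runtime, regaining access to every GPU.
unset ROCR_VISIBLE_DEVICES

# Prints "unset": the restriction is gone.
echo "${ROCR_VISIBLE_DEVICES:-unset}"
```

Docker and PCIe passthrough do not share this weakness: the kernel decides which device nodes a container or guest can see, and that cannot be undone from inside the isolated environment.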
+
+## GPU Passthrough to Virtual Machines
+
+Virtual machines achieve the highest level of isolation, because even the
+kernel of the virtual machine is isolated from the host. Devices physically
+installed in the host system can be passed to the virtual machine using PCIe
+passthrough. This allows using the GPU with a different operating system,
+such as a Windows guest on a Linux host.
+
+Setting up PCIe passthrough is specific to the hypervisor used. ROCm
+officially supports [VMware ESXi](https://www.vmware.com/products/esxi-and-esx.html)
+for select GPUs.
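Hypervisor-specific setup aside, a common preliminary check on a Linux host is whether the IOMMU is enabled and how PCI devices are grouped, since a GPU can generally only be passed through together with the other devices in its IOMMU group. A hedged sketch using standard sysfs paths (an assumption of this example, not something the ROCm documentation prescribes; requires IOMMU support enabled in firmware and on the kernel command line):

```shell
# List each IOMMU group and the PCI devices it contains.
# An empty or missing /sys/kernel/iommu_groups directory means the IOMMU
# is disabled and PCIe passthrough will not be available.
for group in /sys/kernel/iommu_groups/*; do
  [ -d "$group" ] || continue
  echo "IOMMU group ${group##*/}:"
  ls "$group/devices"
done
```

If the GPU shares its group with unrelated devices, the platform's PCIe topology may prevent passing it through on its own.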