Add GPU isolation (#2114)
* Add GPU isolation guide
* Add hover text expansion of DKMS in linux quick start guide
@@ -171,6 +171,7 @@ subtrees:
      title: MI250
    - file: reference/gpu_arch/mi100
      title: MI100
+   - file: reference/gpu_isolation
- caption: Understand ROCm
  entries:
    - title: Compiler Disambiguation
@@ -8,6 +8,8 @@ ROCm kernel-mode driver must be installed on the host. Please refer to
(like the HIP-runtime or math libraries) of the ROCm stack will be loaded from
the container image and don't need to be installed to the host.

+(docker-access-gpus-in-container)=
+
## Accessing GPUs in containers

In order to access GPUs in a container (to run applications using HIP, OpenCL or
@@ -38,6 +40,8 @@ docker run --device /dev/kfd --device /dev/dri
Note that this gives more access than strictly required, as it also exposes the
other device files found in that folder to the container.

+(docker-restrict-gpus)=
+
### Restricting a container to a subset of the GPUs

If a `/dev/dri/renderD` device is not exposed to a container then it cannot use
@@ -82,4 +86,5 @@ applications, but does not include any libraries.

AMD provides pre-built images for various GPU-ready applications through its
Infinity Hub at <https://www.amd.com/en/technologies/infinity-hub>.
-There are also examples of invocating each application and suggested parameters used for benchmarking.
+Examples for invoking each application and suggested parameters used for
+benchmarking are also provided there.
@@ -3,13 +3,15 @@
## Install Prerequisites

The driver package uses
-[`DKMS`](https://en.wikipedia.org/wiki/Dynamic_Kernel_Module_Support) to build
+[{abbr}`DKMS (Dynamic Kernel Module Support)`][DKMS-wiki] to build
the `amdgpu-dkms` module (driver) for the installed kernels. This requires the Linux
kernel headers and modules to be installed for each. Usually these are
automatically installed with the kernel, but if you have multiple kernel
versions or you have downloaded the kernel images and not the kernel
meta-packages then they must be manually installed.

+[DKMS-wiki]: https://en.wikipedia.org/wiki/Dynamic_Kernel_Module_Support
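Before running the install command for your distribution below, you can check whether packages matching the running kernel are already present. This is a minimal sketch assuming a Debian or Ubuntu based system; package names differ on other distributions.

```{code-block} shell
:caption: Sketch (Debian/Ubuntu assumed) to check for headers matching the running kernel.
# Show the running kernel version
uname -r
# Query the matching header and extra-modules packages; "no packages found"
# in the output means they still need to be installed
dpkg -l "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
```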
+
To install for the currently active kernel, run the command corresponding
to your distribution.
::::{tab-set}
@@ -56,6 +56,7 @@ agile, flexible, rapid and secure manner. [more...](rocm)
- [Management Tools](reference/management_tools)
- [Validation Tools](reference/validation_tools)
- [GPU Architecture](reference/gpu_arch)
+- [GPU Isolation Techniques](reference/gpu_isolation)

:::
@@ -76,7 +77,6 @@ Understand ROCm
How to Guides
^^^

-- [How to Isolate GPUs in Docker?](how_to/docker_gpu_isolation)
- [Setting up for Deep Learning with ROCm](how_to/deep_learning_rocm)
- [Magma Installation](how_to/magma_install/magma_install)
- [PyTorch Installation](how_to/pytorch_install/pytorch_install)
docs/reference/gpu_isolation.md (new file, 108 lines)
@@ -0,0 +1,108 @@
# GPU Isolation Techniques

Restricting application access to a subset of the GPUs, also known as isolating
GPUs, allows users to hide GPU resources from programs. By default, programs
only use the "exposed" GPUs and ignore the other (hidden) GPUs in the system.

There are multiple ways to achieve isolation of GPUs in the ROCm software stack,
differing in which applications they apply to and the security they provide.
This page serves as an overview of the techniques.

## Environment Variables

The runtimes in the ROCm software stack read these environment variables to
select the exposed or default device to present to applications using them.

Environment variables shouldn't be used for isolating untrusted applications,
as an application can reset them before initializing the runtime.
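They are still convenient for cooperating applications, since processes only inherit the variables they are started with; different programs can therefore be given different views of the GPUs from the same shell. The application names in the sketch below are hypothetical.

```{code-block} shell
:caption: Sketch of per-process isolation with environment variables (application names are hypothetical).
# This (hypothetical) application only sees the first device
ROCR_VISIBLE_DEVICES=0 ./train_model &
# Started without the variable, this one still sees every device
./monitor_gpus
```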

### `ROCR_VISIBLE_DEVICES`

A list of device indices or {abbr}`UUID (universally unique identifier)`s
that will be exposed to applications.

Runtime
: ROCm Platform Runtime. Applies to all applications using the user mode ROCm
  software stack.

```{code-block} shell
:caption: Example to expose the first device and a device selected by UUID.
export ROCR_VISIBLE_DEVICES="0,GPU-DEADBEEFDEADBEEF"
```
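
The UUIDs accepted by `ROCR_VISIBLE_DEVICES` can be looked up on the host, for example with the `rocminfo` utility; this is only a sketch, and the exact output format varies by GPU and ROCm version.

```{code-block} shell
:caption: Sketch to look up device UUIDs (output format varies by GPU and ROCm version).
rocminfo | grep -i uuid
```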

### `GPU_DEVICE_ORDINAL`

Device indices exposed to OpenCL and HIP applications.

Runtime
: ROCm Common Language Runtime (`ROCclr`). Applies to applications and runtimes
  using the `ROCclr` abstraction layer, including HIP and OpenCL applications.

```{code-block} shell
:caption: Example to expose the first and third devices in the system.
export GPU_DEVICE_ORDINAL="0,2"
```

### `HIP_VISIBLE_DEVICES`

Device indices exposed to HIP applications.

Runtime
: HIP Runtime. Applies only to applications using HIP on the AMD platform.

```{code-block} shell
:caption: Example to expose the first and third devices in the system.
export HIP_VISIBLE_DEVICES="0,2"
```

### `CUDA_VISIBLE_DEVICES`

Provided for CUDA compatibility; it has the same effect as `HIP_VISIBLE_DEVICES`
on the AMD platform.

Runtime
: HIP or CUDA Runtime. Applies to HIP applications on the AMD or NVIDIA platform
  and to CUDA applications.
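
For completeness, a sketch mirroring the `HIP_VISIBLE_DEVICES` example above:

```{code-block} shell
:caption: Sketch to expose the first and third devices in the system.
export CUDA_VISIBLE_DEVICES="0,2"
```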

### `OMP_DEFAULT_DEVICE`

Default device used for OpenMP target offloading.

Runtime
: OpenMP Runtime. Applies only to applications using OpenMP offloading.

```{code-block} shell
:caption: Example of setting the default device to the third device.
export OMP_DEFAULT_DEVICE="2"
```

## Docker

Docker uses Linux kernel namespaces to provide isolated environments for
applications. This isolation applies to most devices by default, including
GPUs. To access GPUs in containers, explicit access must be granted; see
{ref}`docker-access-gpus-in-container` for details. In particular, refer to
{ref}`docker-restrict-gpus` for exposing just a subset of all GPUs.

Docker isolation is more secure than environment variables, and applies
to all programs that use the `amdgpu` kernel module interfaces.
Even programs that don't use the ROCm runtime, like graphics applications
using OpenGL or Vulkan, can only access the GPUs exposed to the container.
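
A minimal sketch of the approach described in the linked guide is shown below; the render node index (`renderD128`) is system specific, and the image name is only an example.

```{code-block} shell
:caption: Sketch to expose a single GPU's render node to a container (device index and image are examples).
docker run -it --device /dev/kfd --device /dev/dri/renderD128 \
    rocm/rocm-terminal
```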

## GPU Passthrough to Virtual Machines

Virtual machines achieve the highest level of isolation, because even the kernel
of the virtual machine is isolated from the host. Devices physically installed
in the host system can be passed to the virtual machine using PCIe passthrough.
This allows using the GPU with a different operating system, such as a Windows
guest running on a Linux host.

Setting up PCIe passthrough is specific to the hypervisor used. ROCm officially
supports [VMware ESXi](https://www.vmware.com/products/esxi-and-esx.html)
for select GPUs.
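
Regardless of the hypervisor, passthrough requires the host IOMMU to be enabled and the GPU to sit in a suitable IOMMU group. The sketch below shows one way to inspect this on a Linux host.

```{code-block} shell
:caption: Sketch to verify IOMMU support and locate the GPU's IOMMU group on the host.
# Check that the IOMMU (AMD-Vi or Intel VT-d) is enabled
sudo dmesg | grep -iE 'iommu|amd-vi'
# List PCI devices grouped by IOMMU group to locate the GPU
for dev in /sys/kernel/iommu_groups/*/devices/*; do
    group=$(basename "$(dirname "$(dirname "$dev")")")
    printf 'IOMMU group %s: ' "$group"
    lspci -nns "${dev##*/}"
done
```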

<!--
TODO: This should link to a page about virtualization that explains
pass-through and SR-IOV and how-tos for maybe `libvirt` and `VMWare`
-->