Inception V3 Example, Deep Learning Guide Decomposed and OpenMP Guide (#1937)
BIN docs/data/understand/deep_learning/TextClassification_3.png (new file, 66 KiB)
BIN docs/data/understand/deep_learning/TextClassification_4.png (new file, 36 KiB)
BIN docs/data/understand/deep_learning/TextClassification_5.png (new file, 87 KiB)
BIN docs/data/understand/deep_learning/mnist_4.png (new file, 9.1 KiB)
BIN docs/data/understand/deep_learning/mnist_5.png (new file, 4.8 KiB)
@@ -1 +1,12 @@
# Deep Learning Guide

# Frameworks Installation

The following sections cover the different framework installations for ROCm and
Deep Learning applications. {numref}`Rocm-Compat-Frameworks-Flowchart` provides
the sequential flow for the use of each framework. For each framework's most
current release notes, refer to the
[Framework Release Notes](https://docs.amd.com/bundle/ROCm-Compatible-Frameworks-Release-Notes/page/Framework_Release_Notes.html).

```{figure} ../data/how_to/magma_install/image.005.png
:name: Rocm-Compat-Frameworks-Flowchart
:align: center

ROCm Compatible Frameworks Flowchart
```
@@ -1,415 +1,5 @@
# Magma Installation for ROCm

Pull content from
<https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4.1/page/Frameworks_Installation.html>

The following sections cover the different framework installations for ROCm and
Deep Learning applications. {numref}`rocm-compat-frameworks-flowchart` provides
the sequential flow for the use of each framework. For each framework's most
current release notes, refer to the
[Framework Release Notes](/bundle/ROCm-Compatible-Frameworks-Release-Notes/page/Framework_Release_Notes.html).

:::{figure-md} rocm-compat-frameworks-flowchart

<img src="../../data/how_to/magma_install/image.005.png" alt="ROCm Compatible Frameworks Flowchart">

ROCm Compatible Frameworks Flowchart
:::
## PyTorch

PyTorch is an open source Machine Learning Python library, primarily
differentiated by Tensor computing with GPU acceleration and a type-based
automatic differentiation. Other advanced features include:

- Support for distributed training
- Native ONNX support
- C++ frontend
- The ability to deploy at scale using TorchServe
- A production-ready deployment mechanism through TorchScript

### Installing PyTorch

To install ROCm on bare metal, refer to the section
[ROCm Installation](https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4-/page/Prerequisites.html#d2999e60).
The recommended option to get a PyTorch environment is through Docker. However,
installing the PyTorch wheels package on bare metal is also supported.

#### Option 1 (Recommended): Use Docker Image with PyTorch Pre-installed

Using Docker gives you portability and access to a prebuilt Docker container
that has been rigorously tested within AMD. This might also save compilation
time and should perform as it did when tested, without potential installation
issues.

Follow these steps:

1. Pull the latest public PyTorch Docker image.

   ```bash
   docker pull rocm/pytorch:latest
   ```

   Optionally, you may download a specific and supported configuration with
   different user-space ROCm versions, PyTorch versions, and supported
   operating systems. To download the PyTorch Docker image, refer to
   [https://hub.docker.com/r/rocm/pytorch](https://hub.docker.com/r/rocm/pytorch).

2. Start a Docker container using the downloaded image.

   ```bash
   docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
   ```

   :::{note}
   This will automatically download the image if it does not exist on the host.
   You can also pass the `-v` argument to mount any data directories from the
   host onto the container.
   :::

#### Option 2: Install PyTorch Using Wheels Package

PyTorch supports the ROCm platform by providing tested wheels packages. To
access this feature, refer to
[https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/)
and choose the "ROCm" compute platform. Figure 6 is a matrix from pytorch.org
that illustrates the installation compatibility between ROCm and the PyTorch
build.

Figure 6. Installation Matrix from Pytorch.org
To install PyTorch using the wheels package, follow these installation steps:

1. Choose one of the following options:

   a. Obtain a base Docker image with the correct user-space ROCm version
      installed from
      [https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04](https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04).

   or

   b. Download a base OS Docker image and install ROCm following the
      installation directions in the section
      [Installation](https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4-/page/Prerequisites.html#d2999e60).
      ROCm 5.2 is installed in this example, as supported by the installation
      matrix from pytorch.org.

   or

   c. Install on bare metal. Skip to Step 3.

2. Start the Docker container, if not installing on bare metal.

   ```bash
   docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest
   ```

3. Install any dependencies needed for installing the wheels package.

   ```bash
   sudo apt update
   sudo apt install libjpeg-dev python3-dev
   pip3 install wheel setuptools
   ```

4. Install torch, torchvision, and torchaudio as specified by the installation
   matrix.

   :::{note}
   The ROCm 5.2 PyTorch wheel in the command below is shown for reference.
   :::

   ```bash
   pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/rocm5.2/
   ```

#### Option 3: Install PyTorch Using PyTorch ROCm Base Docker Image

A prebuilt base Docker image is used to build PyTorch in this option. The base
Docker image has all dependencies installed, including:

- ROCm
- Torchvision
- Conda packages
- Compiler toolchain

Additionally, a particular environment flag (`BUILD_ENVIRONMENT`) is set, and
the build scripts utilize that to determine the build environment
configuration.

Follow these steps:

1. Obtain the Docker image.

   ```bash
   docker pull rocm/pytorch:latest-base
   ```

   The above command downloads the base container, which does not contain
   PyTorch.

2. Start a Docker container using the image.

   ```bash
   docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest-base
   ```

   You can also pass the `-v` argument to mount any data directories from the
   host onto the container.

3. Clone the PyTorch repository.

   ```bash
   cd ~
   git clone https://github.com/pytorch/pytorch.git
   cd pytorch
   git submodule update --init --recursive
   ```

4. Build PyTorch for ROCm.

   :::{note}
   By default in rocm/pytorch:latest-base, PyTorch builds for these
   architectures simultaneously:
   - gfx900
   - gfx906
   - gfx908
   - gfx90a
   - gfx1030
   :::

5. To determine your AMD uarch, run:

   ```bash
   rocminfo | grep gfx
   ```

6. If you want to compile only for your uarch, use:

   ```bash
   export PYTORCH_ROCM_ARCH=<uarch>
   ```

   `<uarch>` is the architecture reported by the `rocminfo` command.

7. Build PyTorch using the following command:

   ```bash
   ./.jenkins/pytorch/build.sh
   ```

   This will first convert PyTorch sources for HIP compatibility and then
   build the PyTorch framework.

8. Alternatively, build PyTorch by issuing the following commands:

   ```bash
   python3 tools/amd_build/build_amd.py
   USE_ROCM=1 MAX_JOBS=4 python3 setup.py install --user
   ```
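
The architecture selection in steps 5 and 6 can be scripted. A convenience
sketch, assuming `rocminfo`-style output; a hypothetical sample is embedded
below so the extraction logic is visible without a GPU present:

```shell
# Hypothetical excerpt of `rocminfo` output on a gfx90a machine.
sample_rocminfo="  Name:                    gfx90a
  Name:                    amdgcn-amd-amdhsa--gfx90a"

# Extract the first gfx target reported and export it for the PyTorch build.
uarch="$(printf '%s\n' "$sample_rocminfo" | grep -o -m1 'gfx[0-9a-f]*')"
export PYTORCH_ROCM_ARCH="$uarch"
echo "$PYTORCH_ROCM_ARCH"
```

On a real system, replace the sample with the tool itself, for example
`export PYTORCH_ROCM_ARCH="$(rocminfo | grep -o -m1 'gfx[0-9a-f]*')"`.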

#### Option 4: Install Using PyTorch Upstream Docker File

Instead of using a prebuilt base Docker image, you can build a custom base
Docker image using scripts from the PyTorch repository. This will utilize a
standard Docker image from operating system maintainers and install all the
dependencies required to build PyTorch, including:

- ROCm
- Torchvision
- Conda packages
- Compiler toolchain

Follow these steps:

1. Clone the PyTorch repository on the host.

   ```bash
   cd ~
   git clone https://github.com/pytorch/pytorch.git
   cd pytorch
   git submodule update --init --recursive
   ```

2. Build the PyTorch Docker image.

   ```bash
   cd .circleci/docker
   ./build.sh pytorch-linux-bionic-rocm<version>-py3.7
   # e.g. ./build.sh pytorch-linux-bionic-rocm3.10-py3.7
   ```

   This should complete with the message "Successfully built `<image_id>`."

3. Start a Docker container using the image:

   ```bash
   docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G <image_id>
   ```

   You can also pass the `-v` argument to mount any data directories from the
   host onto the container.

4. Clone the PyTorch repository.

   ```bash
   cd ~
   git clone https://github.com/pytorch/pytorch.git
   cd pytorch
   git submodule update --init --recursive
   ```

5. Build PyTorch for ROCm.

   :::{note}
   By default in rocm/pytorch:latest-base, PyTorch builds for these
   architectures simultaneously:
   - gfx900
   - gfx906
   - gfx908
   - gfx90a
   - gfx1030
   :::

6. To determine your AMD uarch, run:

   ```bash
   rocminfo | grep gfx
   ```

7. If you want to compile only for your uarch:

   ```bash
   export PYTORCH_ROCM_ARCH=<uarch>
   ```

   `<uarch>` is the architecture reported by the `rocminfo` command.

8. Build PyTorch using:

   ```bash
   ./.jenkins/pytorch/build.sh
   ```

   This will first convert PyTorch sources to be HIP compatible and then build
   the PyTorch framework.

   Alternatively, build PyTorch by issuing the following commands:

   ```bash
   python3 tools/amd_build/build_amd.py
   USE_ROCM=1 MAX_JOBS=4 python3 setup.py install --user
   ```
### Test the PyTorch Installation

You can use PyTorch unit tests to validate a PyTorch installation. If you are
using a prebuilt PyTorch Docker image from AMD ROCm Docker Hub or installed an
official wheels package, these tests are already run on those configurations.
Alternatively, you can manually run the unit tests to validate the PyTorch
installation fully.

Follow these steps:

1. Test if PyTorch is installed and accessible by importing the torch package
   in Python.

   :::{note}
   Do not run this in the PyTorch git folder.
   :::

   ```bash
   python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure'
   ```

2. Test if the GPU is accessible from PyTorch. In the PyTorch framework,
   `torch.cuda` is a generic mechanism to access the GPU; it will access an
   AMD GPU only if one is available.

   ```bash
   python3 -c 'import torch; print(torch.cuda.is_available())'
   ```

3. Run the unit tests to validate the PyTorch installation fully. Run the
   following command from the PyTorch home directory:

   ```bash
   BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT:-rocm} ./.jenkins/pytorch/test.sh
   ```

   This ensures that, even for wheel installs in a non-controlled environment,
   the required environment variable is set to skip certain unit tests for
   ROCm.

   :::{note}
   Make sure the PyTorch source code corresponds to the PyTorch wheel or
   installation in the Docker image. Incompatible PyTorch source code might
   give errors when running the unit tests.
   :::

   This will first install some dependencies, such as a torchvision version
   supported by the installed PyTorch. Torchvision is used in some PyTorch
   tests for loading models. Next, it will run all the unit tests.

   :::{note}
   Some tests may be skipped, as appropriate, based on your system
   configuration. Not all features of PyTorch are supported on ROCm, and the
   tests that evaluate these features are skipped. In addition, depending on
   the host memory or the number of available GPUs, other tests may be
   skipped. No test should fail if the compilation and installation are
   correct.
   :::

4. Run individual unit tests with the following command:

   ```bash
   PYTORCH_TEST_WITH_ROCM=1 python3 test/test_nn.py --verbose
   ```

   `test_nn.py` can be replaced with any other test set.
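
Step 3 relies on the shell's `${VAR:-default}` parameter expansion to supply a
fallback value. A minimal sketch of that behavior (`custom-env` is a made-up
example value):

```shell
# ${BUILD_ENVIRONMENT:-rocm} expands to the value of BUILD_ENVIRONMENT if it
# is set and non-empty, and to the literal fallback "rocm" otherwise.
unset BUILD_ENVIRONMENT
echo "${BUILD_ENVIRONMENT:-rocm}"

BUILD_ENVIRONMENT=custom-env
echo "${BUILD_ENVIRONMENT:-rocm}"
```

The first `echo` prints the fallback `rocm`; the second prints `custom-env`.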

### Run a Basic PyTorch Example

The PyTorch examples repository provides basic examples that exercise the
functionality of the framework. The MNIST (Modified National Institute of
Standards and Technology) database is a collection of handwritten digits that
may be used to train a Convolutional Neural Network for handwriting
recognition. Alternatively, ImageNet is a database of images used to train a
network for visual object recognition.

Follow these steps:

1. Clone the PyTorch examples repository.

   ```bash
   git clone https://github.com/pytorch/examples.git
   ```

2. Run the MNIST example.

   ```bash
   cd examples/mnist
   ```

3. Follow the instructions in the README file in this folder. In this case:

   ```bash
   pip3 install -r requirements.txt
   python3 main.py
   ```

4. Run the ImageNet example.

   ```bash
   cd examples/imagenet
   ```

5. Follow the instructions in the README file in this folder. In this case:

   ```bash
   pip3 install -r requirements.txt
   python3 main.py
   ```

## MAGMA for ROCm

Matrix Algebra on GPU and Multicore Architectures, abbreviated as MAGMA, is a

@@ -472,180 +62,3 @@ To build MAGMA from the source, follow these steps:
popd
mv magma /opt/rocm
```

## TensorFlow

TensorFlow is an open source library for solving Machine Learning, Deep
Learning, and Artificial Intelligence problems. It can be used to solve many
problems across different sectors and industries, but primarily focuses on
training and inference in neural networks. It is one of the most popular and
in-demand frameworks and is very active in open source contribution and
development.

### Installing TensorFlow

The following sections contain options for installing TensorFlow.

#### Option 1: Install TensorFlow Using Docker Image

To install ROCm on bare metal, follow the section
[ROCm Installation](https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4-/page/Prerequisites.html#d2999e60).
The recommended option to get a TensorFlow environment is through Docker.

Using Docker provides portability and access to a prebuilt Docker container
that has been rigorously tested within AMD. This might also save compilation
time and should perform as tested without potential installation issues.

Follow these steps:

1. Pull the latest public TensorFlow Docker image.

   ```bash
   docker pull rocm/tensorflow:latest
   ```

2. Once you have pulled the image, run it by using the command below:

   ```bash
   docker run -it --network=host --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/tensorflow:latest
   ```

#### Option 2: Install TensorFlow Using Wheels Package

To install TensorFlow using the wheels package, follow these steps:

1. Check the Python version.

   ```bash
   python3 --version
   ```

   | If:                                  | Then:                            |
   |:------------------------------------:|:--------------------------------:|
   | The Python version is less than 3.7  | Upgrade Python.                  |
   | The Python version is 3.7 or greater | Skip this step and go to Step 3. |

   :::{note}
   The supported Python versions are:

   - 3.7
   - 3.8
   - 3.9
   - 3.10
   :::

   ```bash
   sudo apt-get install python3.7 # or python3.8, python3.9, or python3.10
   ```

2. Set up multiple Python versions using update-alternatives.

   ```bash
   update-alternatives --query python3
   sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python[version] [priority]
   ```

   :::{note}
   Follow the instructions in Step 2 for incompatible Python versions.
   :::

   ```bash
   sudo update-alternatives --config python3
   ```

3. Follow the screen prompts, and select the Python version installed in
   Step 2.

4. Install or upgrade PIP.

   ```bash
   sudo apt install python3-pip
   ```

   To install PIP, use the following:

   ```bash
   /usr/bin/python[version] -m pip install --upgrade pip
   ```

   Upgrade PIP for the Python version installed in Step 2:

   ```bash
   sudo pip3 install --upgrade pip
   ```

5. Install TensorFlow for the Python version as indicated in Step 2.

   ```bash
   /usr/bin/python[version] -m pip install --user tensorflow-rocm==[wheel-version] --upgrade
   ```

   For a valid wheel version for a given ROCm release, refer to the
   tensorflow-rocm compatibility note at the end of this section.

6. Update protobuf to 3.19 or lower.

   ```bash
   /usr/bin/python3.7 -m pip install protobuf==3.19.0
   sudo pip3 install tensorflow
   ```

7. Set the environment variable `PYTHONPATH`.

   ```bash
   export PYTHONPATH="./.local/lib/python[version]/site-packages:$PYTHONPATH" # Use the same Python version as in Step 2
   ```

8. Install libraries.

   ```bash
   sudo apt install rocm-libs rccl
   ```

9. Test the installation.

   ```bash
   python3 -c 'import tensorflow' 2> /dev/null && echo 'Success' || echo 'Failure'
   ```

:::{note}
For details on tensorflow-rocm wheels and ROCm version compatibility, see:
[https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md](https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md)
:::
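
The version gate described in Step 1 of Option 2 can be sketched as a small
shell helper. `check_python_version` is a hypothetical function name; the
accepted versions are the ones listed in the note above:

```shell
# Print "supported" for the Python versions the wheels support (3.7-3.10),
# "not supported" otherwise. The major.minor string is passed as an argument.
check_python_version() {
  case "$1" in
    3.7|3.8|3.9|3.10) echo "supported" ;;
    *) echo "not supported" ;;
  esac
}

check_python_version 3.9    # supported
check_python_version 3.6    # not supported
```

On a real system you would pass the running interpreter's version, for example
`check_python_version "$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"`.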

### Test the TensorFlow Installation

To test the installation of TensorFlow, run the container image as specified
in the previous section, Installing TensorFlow. Ensure you have access to the
Python shell in the Docker container.

```bash
python3 -c 'import tensorflow' 2> /dev/null && echo 'Success' || echo 'Failure'
```
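
The one-liner above follows a reusable pattern: attempt the import, silence the
traceback, and report the outcome. A sketch that parameterizes the module name
(`check_import` is a hypothetical helper, demonstrated with a standard-library
module so it runs anywhere):

```shell
# Print Success if the named Python module can be imported, Failure otherwise.
check_import() {
  python3 -c "import $1" 2> /dev/null && echo 'Success' || echo 'Failure'
}

check_import json                     # stdlib module: prints Success
check_import no_such_module_example   # prints Failure
```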

### Run a Basic TensorFlow Example

The TensorFlow examples repository provides basic examples that exercise the
framework's functionality. The MNIST database is a collection of handwritten
digits that may be used to train a Convolutional Neural Network for
handwriting recognition.

Follow these steps:

1. Clone the TensorFlow example repository.

   ```bash
   cd ~
   git clone https://github.com/tensorflow/models.git
   ```

2. Install the dependencies of the code, and run the code.

   ```bash
   pip3 install -r requirements.txt
   python3 mnist_tf.py
   ```
@@ -1,6 +1,402 @@
# PyTorch Installation for ROCm

Pull content from
<https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4.1/page/Frameworks_Installation.html>

## PyTorch

PyTorch is an open source Machine Learning Python library, primarily
differentiated by Tensor computing with GPU acceleration and a type-based
automatic differentiation. Other advanced features include:

- Support for distributed training
- Native ONNX support
- C++ frontend
- The ability to deploy at scale using TorchServe
- A production-ready deployment mechanism through TorchScript

### Installing PyTorch

To install ROCm on bare metal, refer to the section
[ROCm Installation](https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4-/page/Prerequisites.html#d2999e60).
The recommended option to get a PyTorch environment is through Docker. However,
installing the PyTorch wheels package on bare metal is also supported.

#### Option 1 (Recommended): Use Docker Image with PyTorch Pre-installed

Using Docker gives you portability and access to a prebuilt Docker container
that has been rigorously tested within AMD. This might also save compilation
time and should perform as it did when tested, without potential installation
issues.

Follow these steps:

1. Pull the latest public PyTorch Docker image.

   ```bash
   docker pull rocm/pytorch:latest
   ```

   Optionally, you may download a specific and supported configuration with
   different user-space ROCm versions, PyTorch versions, and supported
   operating systems. To download the PyTorch Docker image, refer to
   [https://hub.docker.com/r/rocm/pytorch](https://hub.docker.com/r/rocm/pytorch).

2. Start a Docker container using the downloaded image.

   ```bash
   docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
   ```

   :::{note}
   This will automatically download the image if it does not exist on the host.
   You can also pass the `-v` argument to mount any data directories from the
   host onto the container.
   :::

#### Option 2: Install PyTorch Using Wheels Package

PyTorch supports the ROCm platform by providing tested wheels packages. To
access this feature, refer to
[https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/)
and choose the "ROCm" compute platform.
{numref}`Installation-Matrix-from-Pytorch` is a matrix from pytorch.org that
illustrates the installation compatibility between ROCm and the PyTorch build.

```{figure} ../../data/how_to/magma_install/image.006.png
:name: Installation-Matrix-from-Pytorch
:align: center

Installation Matrix from Pytorch.org
```

To install PyTorch using the wheels package, follow these installation steps:

1. Choose one of the following options:

   a. Obtain a base Docker image with the correct user-space ROCm version
      installed from
      [https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04](https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04).

   or

   b. Download a base OS Docker image and install ROCm following the
      installation directions in the section
      [Installation](https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4-/page/Prerequisites.html#d2999e60).
      ROCm 5.2 is installed in this example, as supported by the installation
      matrix from pytorch.org.

   or

   c. Install on bare metal. Skip to Step 3.

2. Start the Docker container, if not installing on bare metal.

   ```bash
   docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest
   ```

3. Install any dependencies needed for installing the wheels package.

   ```bash
   sudo apt update
   sudo apt install libjpeg-dev python3-dev
   pip3 install wheel setuptools
   ```

4. Install torch, torchvision, and torchaudio as specified by the installation
   matrix.

   :::{note}
   The ROCm 5.2 PyTorch wheel in the command below is shown for reference.
   :::

   ```bash
   pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/rocm5.2/
   ```
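
The `--extra-index-url` in step 4 embeds the ROCm version in the wheel index
path. A sketch of composing it from a version variable (`rocm_version` and
`index_url` are illustrative names, with the 5.2 value taken from the example
above):

```shell
# Compose the nightly wheel index URL for a given ROCm version.
rocm_version=5.2
index_url="https://download.pytorch.org/whl/nightly/rocm${rocm_version}/"
echo "$index_url"
```

Changing `rocm_version` selects the matching wheel index for another release,
subject to the pytorch.org installation matrix.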

#### Option 3: Install PyTorch Using PyTorch ROCm Base Docker Image

A prebuilt base Docker image is used to build PyTorch in this option. The base
Docker image has all dependencies installed, including:

- ROCm
- Torchvision
- Conda packages
- Compiler toolchain

Additionally, a particular environment flag (`BUILD_ENVIRONMENT`) is set, and
the build scripts utilize that to determine the build environment
configuration.

Follow these steps:

1. Obtain the Docker image.

   ```bash
   docker pull rocm/pytorch:latest-base
   ```

   The above command downloads the base container, which does not contain
   PyTorch.

2. Start a Docker container using the image.

   ```bash
   docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest-base
   ```

   You can also pass the `-v` argument to mount any data directories from the
   host onto the container.

3. Clone the PyTorch repository.

   ```bash
   cd ~
   git clone https://github.com/pytorch/pytorch.git
   cd pytorch
   git submodule update --init --recursive
   ```

4. Build PyTorch for ROCm.

   :::{note}
   By default in rocm/pytorch:latest-base, PyTorch builds for these
   architectures simultaneously:
   - gfx900
   - gfx906
   - gfx908
   - gfx90a
   - gfx1030
   :::

5. To determine your AMD uarch, run:

   ```bash
   rocminfo | grep gfx
   ```

6. If you want to compile only for your uarch, use:

   ```bash
   export PYTORCH_ROCM_ARCH=<uarch>
   ```

   `<uarch>` is the architecture reported by the `rocminfo` command.

7. Build PyTorch using the following command:

   ```bash
   ./.jenkins/pytorch/build.sh
   ```

   This will first convert PyTorch sources for HIP compatibility and then
   build the PyTorch framework.

8. Alternatively, build PyTorch by issuing the following commands:

   ```bash
   python3 tools/amd_build/build_amd.py
   USE_ROCM=1 MAX_JOBS=4 python3 setup.py install --user
   ```

#### Option 4: Install Using PyTorch Upstream Docker File

Instead of using a prebuilt base Docker image, you can build a custom base
Docker image using scripts from the PyTorch repository. This will utilize a
standard Docker image from operating system maintainers and install all the
dependencies required to build PyTorch, including:

- ROCm
- Torchvision
- Conda packages
- Compiler toolchain

Follow these steps:

1. Clone the PyTorch repository on the host.

   ```bash
   cd ~
   git clone https://github.com/pytorch/pytorch.git
   cd pytorch
   git submodule update --init --recursive
   ```

2. Build the PyTorch Docker image.

   ```bash
   cd .circleci/docker
   ./build.sh pytorch-linux-bionic-rocm<version>-py3.7
   # e.g. ./build.sh pytorch-linux-bionic-rocm3.10-py3.7
   ```

   This should complete with the message "Successfully built `<image_id>`."

3. Start a Docker container using the image:

   ```bash
   docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G <image_id>
   ```

   You can also pass the `-v` argument to mount any data directories from the
   host onto the container.

4. Clone the PyTorch repository.

   ```bash
   cd ~
   git clone https://github.com/pytorch/pytorch.git
   cd pytorch
   git submodule update --init --recursive
   ```

5. Build PyTorch for ROCm.

   :::{note}
   By default in rocm/pytorch:latest-base, PyTorch builds for these
   architectures simultaneously:
   - gfx900
   - gfx906
   - gfx908
   - gfx90a
   - gfx1030
   :::

6. To determine your AMD uarch, run:

   ```bash
   rocminfo | grep gfx
   ```

7. If you want to compile only for your uarch:

   ```bash
   export PYTORCH_ROCM_ARCH=<uarch>
   ```

   `<uarch>` is the architecture reported by the `rocminfo` command.

8. Build PyTorch using:

   ```bash
   ./.jenkins/pytorch/build.sh
   ```

   This will first convert PyTorch sources to be HIP compatible and then build
   the PyTorch framework.

   Alternatively, build PyTorch by issuing the following commands:

   ```bash
   python3 tools/amd_build/build_amd.py
   USE_ROCM=1 MAX_JOBS=4 python3 setup.py install --user
   ```
|
||||
|
||||
### Test the PyTorch Installation
|
||||
|
||||
You can use PyTorch unit tests to validate a PyTorch installation. If using a
|
||||
prebuilt PyTorch Docker image from AMD ROCm DockerHub or installing an official
|
||||
wheels package, these tests are already run on those configurations.
|
||||
Alternatively, you can manually run the unit tests to validate the PyTorch
|
||||
installation fully.
|
||||
|
||||
Follow these steps:
|
||||
|
||||
1. Test if PyTorch is installed and accessible by importing the torch package in
|
||||
Python.
|
||||
|
||||
:::{note}
|
||||
Do not run in the PyTorch git folder.
|
||||
:::
|
||||
|
||||
```bash
|
||||
python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure'
|
||||
```
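The same one-liner pattern can be wrapped in a small reusable helper when several packages need checking; a sketch using only the standard library (the module names checked here are the ones this guide installs):

```python
import importlib

def check_import(module_name: str) -> str:
    """Return 'Success' if the module imports cleanly, 'Failure' otherwise."""
    try:
        importlib.import_module(module_name)
        return "Success"
    except ImportError:
        return "Failure"

for mod in ("torch", "torchvision"):
    print(mod, check_import(mod))
```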
|
||||
|
||||
2. Test if the GPU is accessible from PyTorch. In the PyTorch framework,
|
||||
torch.cuda is a generic mechanism to access the GPU; it will access an AMD
|
||||
GPU only if available.
|
||||
|
||||
```bash
|
||||
python3 -c 'import torch; print(torch.cuda.is_available())'
|
||||
```
|
||||
|
||||
3. Run the unit tests to validate the PyTorch installation fully. Run the
|
||||
following command from the PyTorch home directory:
|
||||
|
||||
```bash
|
||||
BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT:-rocm} ./.jenkins/pytorch/test.sh
|
||||
```
|
||||
|
||||
This ensures that even for wheel installs in a non-controlled environment,
|
||||
the required environment variable will be set to skip certain unit tests for
|
||||
ROCm.
|
||||
|
||||
:::{note}
|
||||
Make sure the PyTorch source code corresponds to the PyTorch wheel or
|
||||
installation in the Docker image. Incompatible PyTorch source code might give
|
||||
errors when running the unit tests.
|
||||
:::
|
||||
|
||||
This will first install some dependencies, such as a supported torchvision
|
||||
version for PyTorch. Torchvision is used in some PyTorch tests for loading
|
||||
models. Next, this will run all the unit tests.
|
||||
|
||||
:::{note}
|
||||
Some tests may be skipped, as appropriate, based on your system
|
||||
configuration. Not all PyTorch features are supported on ROCm, and the
|
||||
tests that evaluate these features are skipped. In addition, depending on the
|
||||
host memory, or the number of available GPUs, other tests may be skipped. No
|
||||
test should fail if the compilation and installation are correct.
|
||||
:::
|
||||
|
||||
4. Run individual unit tests with the following command:
|
||||
|
||||
```bash
|
||||
PYTORCH_TEST_WITH_ROCM=1 python3 test/test_nn.py --verbose
|
||||
```
|
||||
|
||||
test_nn.py can be replaced with any other test set.
|
||||
|
||||
### Run a Basic PyTorch Example
|
||||
|
||||
The PyTorch examples repository provides basic examples that exercise the
|
||||
functionality of the framework. MNIST (Modified National Institute of Standards
|
||||
and Technology) database is a collection of handwritten digits that may be used
|
||||
to train a Convolutional Neural Network for handwriting recognition.
|
||||
Alternatively, ImageNet is a database of images used to train a network for
|
||||
visual object recognition.
|
||||
|
||||
Follow these steps:
|
||||
|
||||
1. Clone the PyTorch examples repository.
|
||||
|
||||
```bash
|
||||
git clone https://github.com/pytorch/examples.git
|
||||
```
|
||||
|
||||
2. Run the MNIST example.
|
||||
|
||||
```bash
|
||||
cd examples/mnist
|
||||
```
|
||||
|
||||
3. Follow the instructions in the README file in this folder. In this case:
|
||||
|
||||
```bash
|
||||
pip3 install -r requirements.txt
|
||||
python3 main.py
|
||||
```
|
||||
|
||||
4. Run the ImageNet example.
|
||||
|
||||
```bash
|
||||
cd examples/imagenet
|
||||
```
|
||||
|
||||
5. Follow the instructions in the README file in this folder. In this case:
|
||||
|
||||
```bash
|
||||
pip3 install -r requirements.txt
|
||||
python3 main.py
|
||||
```
|
||||
|
||||
@@ -1,4 +1,178 @@
|
||||
# TensorFlow Installation for ROCm
|
||||
|
||||
Pull content from
|
||||
<https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4.1/page/Frameworks_Installation.html>
|
||||
## TensorFlow
|
||||
|
||||
TensorFlow is an open source library for solving Machine Learning,
|
||||
Deep Learning, and Artificial Intelligence problems. It can be used to solve
|
||||
many problems across different sectors and industries but primarily focuses on
|
||||
training and inference in neural networks. It is one of the most popular and
|
||||
in-demand frameworks and is very active in open source contribution and
|
||||
development.
|
||||
|
||||
### Installing TensorFlow
|
||||
|
||||
The following sections contain options for installing TensorFlow.
|
||||
|
||||
#### Option 1: Install TensorFlow Using Docker Image
|
||||
|
||||
To install ROCm on bare metal, follow the section
|
||||
[ROCm Installation](https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4-/page/Prerequisites.html#d2999e60).
|
||||
The recommended option to get a TensorFlow environment is through Docker.
|
||||
|
||||
Using Docker provides portability and access to a prebuilt Docker container that
|
||||
has been rigorously tested within AMD. This might also save compilation time and
|
||||
should perform as tested without facing potential installation issues.
|
||||
Follow these steps:
|
||||
|
||||
1. Pull the latest public TensorFlow Docker image.
|
||||
|
||||
```bash
|
||||
docker pull rocm/tensorflow:latest
|
||||
```
|
||||
|
||||
2. Once you have pulled the image, run it by using the command below:
|
||||
|
||||
```bash
|
||||
docker run -it --network=host --device=/dev/kfd --device=/dev/dri
|
||||
--ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE
|
||||
--security-opt seccomp=unconfined rocm/tensorflow:latest
|
||||
```
|
||||
|
||||
#### Option 2: Install TensorFlow Using Wheels Package
|
||||
|
||||
To install TensorFlow using the wheels package, follow these steps:
|
||||
|
||||
1. Check the Python version.
|
||||
|
||||
```bash
|
||||
python3 --version
|
||||
```
|
||||
|
||||
| If: | Then: |
|
||||
|:-----------------------------------:|:--------------------------------:|
|
||||
| The Python version is less than 3.7 | Upgrade Python. |
|
||||
| The Python version is 3.7 or later  | Skip this step and go to Step 3. |
|
||||
|
||||
:::{note}
|
||||
The supported Python versions are:
|
||||
|
||||
- 3.7
|
||||
- 3.8
|
||||
- 3.9
|
||||
- 3.10
|
||||
:::
|
||||
|
||||
```bash
|
||||
sudo apt-get install python3.7 # or python3.8, python3.9, or python3.10
|
||||
```
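The version check above can also be done programmatically; a minimal sketch of the comparison (the 3.7 floor comes from the supported versions listed in the note):

```python
import sys

MIN_VERSION = (3, 7)  # minimum Python supported by tensorflow-rocm per this guide

def python_ok(version=(sys.version_info.major, sys.version_info.minor)):
    """True when the interpreter meets the minimum supported version."""
    return version >= MIN_VERSION

print(python_ok())
```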
|
||||
|
||||
2. Set up multiple Python versions using update-alternatives.
|
||||
|
||||
```bash
|
||||
update-alternatives --query python3
|
||||
sudo update-alternatives --install
|
||||
/usr/bin/python3 python3 /usr/bin/python[version] [priority]
|
||||
```
|
||||
|
||||
:::{note}
|
||||
Follow the instruction in Step 2 for incompatible Python versions.
|
||||
:::
|
||||
|
||||
```bash
|
||||
sudo update-alternatives --config python3
|
||||
```
|
||||
|
||||
3. Follow the screen prompts, and select the Python version installed in Step 1.
|
||||
|
||||
4. Install or upgrade PIP.
|
||||
|
||||
```bash
|
||||
sudo apt install python3-pip
|
||||
```
|
||||
|
||||
To install PIP, use the following:
|
||||
|
||||
```bash
|
||||
/usr/bin/python[version] -m pip install --upgrade pip
|
||||
```
|
||||
|
||||
Upgrade PIP for the Python version installed in Step 1:
|
||||
|
||||
```bash
|
||||
sudo pip3 install --upgrade pip
|
||||
```
|
||||
|
||||
5. Install TensorFlow for the Python version as indicated in Step 1.
|
||||
|
||||
```bash
|
||||
/usr/bin/python[version] -m pip install --user tensorflow-rocm==[wheel-version] --upgrade
|
||||
```
|
||||
|
||||
For valid wheel versions for each ROCm release, see the tensorflow-rocm release compatibility table at [https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md](https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md).
|
||||
|
||||
6. Update protobuf to 3.19 or lower.
|
||||
|
||||
```bash
|
||||
/usr/bin/python3.7 -m pip install protobuf==3.19.0
|
||||
sudo pip3 install tensorflow
|
||||
```
|
||||
|
||||
7. Set the environment variable PYTHONPATH.
|
||||
|
||||
```bash
|
||||
export PYTHONPATH="./.local/lib/python[version]/site-packages:$PYTHONPATH" # Use the same Python version as in Step 1
|
||||
```
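The site-packages path in the export must match the interpreter's own version; a hedged sketch of constructing it programmatically (the `user_site_packages` helper is hypothetical, not part of the guide):

```python
import os
import sys

def user_site_packages(home: str = "~") -> str:
    """Build the per-user site-packages path used in the PYTHONPATH export."""
    ver = f"{sys.version_info.major}.{sys.version_info.minor}"
    return os.path.expanduser(f"{home}/.local/lib/python{ver}/site-packages")

print(user_site_packages())
```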
|
||||
|
||||
8. Install libraries.
|
||||
|
||||
```bash
|
||||
sudo apt install rocm-libs rccl
|
||||
```
|
||||
|
||||
9. Test installation.
|
||||
|
||||
```bash
|
||||
python3 -c 'import tensorflow' 2> /dev/null && echo 'Success' || echo 'Failure'
|
||||
```
|
||||
|
||||
:::{note}
|
||||
For details on tensorflow-rocm wheels and ROCm version compatibility, see:
|
||||
[https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md](https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md)
|
||||
:::
|
||||
|
||||
### Test the TensorFlow Installation
|
||||
|
||||
To test the installation of TensorFlow, run the container image as specified in
|
||||
the previous section Installing TensorFlow. Ensure you have access to the Python
|
||||
shell in the Docker container.
|
||||
|
||||
```bash
|
||||
python3 -c 'import tensorflow' 2> /dev/null && echo 'Success' || echo 'Failure'
|
||||
```
|
||||
|
||||
### Run a Basic TensorFlow Example
|
||||
|
||||
The TensorFlow examples repository provides basic examples that exercise the
|
||||
framework's functionality. The MNIST database is a collection of handwritten
|
||||
digits that may be used to train a Convolutional Neural Network for handwriting
|
||||
recognition.
|
||||
|
||||
Follow these steps:
|
||||
|
||||
1. Clone the TensorFlow example repository.
|
||||
|
||||
```bash
|
||||
cd ~
|
||||
git clone https://github.com/tensorflow/models.git
|
||||
```
|
||||
|
||||
2. Install the dependencies of the code, and run the code.
|
||||
|
||||
```bash
|
||||
pip3 install -r requirements.txt
|
||||
python3 mnist_tf.py
|
||||
```
|
||||
|
||||
@@ -1,4 +1,330 @@
|
||||
# OpenMP Support in ROCm
|
||||
|
||||
Pull from
|
||||
<https://docs.amd.com/bundle/OpenMP-Support-Guide-v5.4/page/Introduction_to_OpenMP_Support_Guide.html>
|
||||
## Introduction to OpenMP Support Guide
|
||||
|
||||
The ROCm™ installation includes an LLVM-based implementation that fully supports the OpenMP 4.5 standard and a subset of the OpenMP 5.0, 5.1, and 5.2 standards. Fortran, C/C++ compilers, and corresponding runtime libraries are included. Along with host APIs, the OpenMP compilers support offloading code and data onto GPU devices. This document briefly describes the installation location of the OpenMP toolchain, example usage of device offloading, and usage of rocprof with OpenMP applications. The GPUs supported are the same as those supported by this ROCm release. See the list of supported GPUs in the installation guide at [https://docs.amd.com/](https://docs.amd.com/).
|
||||
|
||||
### Installation
|
||||
|
||||
The OpenMP toolchain is automatically installed as part of the standard ROCm installation and is available under /opt/rocm-{version}/llvm. The sub-directories are:
|
||||
|
||||
- bin: Compilers (flang and clang) and other binaries.
|
||||
|
||||
- examples: The usage section below shows how to compile and run these programs.
|
||||
|
||||
- include: Header files.
|
||||
|
||||
- lib: Libraries including those required for target offload.
|
||||
|
||||
- lib-debug: Debug versions of the above libraries.
|
||||
|
||||
## OpenMP: Usage
|
||||
|
||||
The example programs can be compiled and run by pointing the environment variable AOMP to the OpenMP install directory.
|
||||
|
||||
**Example:**
|
||||
|
||||
```bash
|
||||
% export AOMP=/opt/rocm-{version}/llvm
|
||||
% cd $AOMP/examples/openmp/veccopy
|
||||
% make run
|
||||
```
|
||||
|
||||
The above invocation of Make compiles and runs the program. Note the options that are required for target offload from an OpenMP program:
|
||||
|
||||
```bash
|
||||
-target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<gpu-arch>
|
||||
```
|
||||
|
||||
Obtain the value of gpu-arch by running the following command:
|
||||
|
||||
```bash
|
||||
% /opt/rocm-{version}/bin/rocminfo | grep gfx
|
||||
```
|
||||
|
||||
[//]: # (dated link below, needs updating)
|
||||
|
||||
See the complete list of compiler command-line references [here](https://github.com/RadeonOpenCompute/llvm-project/blob/amd-stg-open/clang/docs/ClangCommandLineReference.rst).
|
||||
|
||||
### Using rocprof with OpenMP
|
||||
|
||||
The following steps describe a typical workflow for using rocprof with OpenMP code compiled with AOMP:
|
||||
|
||||
1. Run rocprof with the program command line:
|
||||
|
||||
```bash
|
||||
% rocprof <application> <args>
|
||||
```
|
||||
|
||||
This produces a results.csv file in the user’s current directory that shows basic stats such as kernel names, grid size, number of registers used, etc. You can specify a preferred output file name using the -o option.
|
||||
|
||||
2. Add options for a detailed result:
|
||||
|
||||
```bash
|
||||
% rocprof --stats <application> <args>
|
||||
```
|
||||
|
||||
The --stats option produces timestamps for the kernels. Look in the output CSV file for the DurationNs field, which helps identify the critical kernels in the code.
|
||||
|
||||
Apart from --stats, the option --timestamp on produces a timestamp for the kernels.
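A quick way to act on the DurationNs field is to post-process results.csv with a short script. This is a sketch that assumes `KernelName` and `DurationNs` columns; actual column names can vary across rocprof versions, and the sample rows are illustrative.

```python
import csv
import io

def top_kernels(results_csv_text: str, n: int = 3):
    """Return (kernel, total DurationNs) pairs, most expensive first."""
    totals = {}
    for row in csv.DictReader(io.StringIO(results_csv_text)):
        name = row["KernelName"]
        totals[name] = totals.get(name, 0) + int(row["DurationNs"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Illustrative rows; a real results.csv has many more columns.
sample = """KernelName,DurationNs
vecadd,12000
vecadd,11000
init,3000
"""
print(top_kernels(sample))
```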
|
||||
|
||||
3. After learning about the required kernels, the user can take a detailed look at each one of them. rocprof has support for hardware counters: a set of basic and a set of derived ones. See the complete list of counters using options --list-basic and --list-derived. rocprof accepts either a text or an XML file as an input.
|
||||
|
||||
For more details on rocprof, refer to the ROCm Profiling Tools document on [https://docs.amd.com](https://docs.amd.com).
|
||||
|
||||
### Using Tracing Options
|
||||
|
||||
**Prerequisite:** When using the --sys-trace option, compile the OpenMP program with:
|
||||
|
||||
```bash
|
||||
-Wl,--rpath,/opt/rocm-{version}/lib -lamdhip64
|
||||
```
|
||||
|
||||
The following tracing options are widely used to generate useful information:
|
||||
|
||||
- **--hsa-trace**: This option is used to get a JSON output file with the HSA API execution traces and a flat profile in a CSV file.
|
||||
|
||||
- **--sys-trace**: This allows programmers to trace both HIP and HSA calls. Since this option results in loading ``libamdhip64.so``, follow the prerequisite as mentioned above.
|
||||
|
||||
A CSV and a JSON file are produced by the above trace options. The CSV file presents the data in a tabular format, and the JSON file can be visualized using Google Chrome at chrome://tracing/ or [Perfetto](https://perfetto.dev/). Navigate to Chrome or Perfetto and load the JSON file to see the timeline of the HSA calls.
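The JSON file follows the Chrome trace-event format, so it can also be inspected offline. Below is a hedged sketch of extracting durations for complete ("X") events; the sample events are made up for illustration.

```python
import json

def event_durations(trace_text: str):
    """Map event name -> duration (us) for complete ('X') trace events."""
    events = json.loads(trace_text)["traceEvents"]
    return {e["name"]: e["dur"] for e in events if e.get("ph") == "X"}

sample = json.dumps({
    "traceEvents": [
        {"name": "hsa_queue_create", "ph": "X", "ts": 100, "dur": 40},
        {"name": "hsa_signal_wait", "ph": "X", "ts": 200, "dur": 15},
    ]
})
print(event_durations(sample))
```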
|
||||
|
||||
For more details on tracing, refer to the ROCm Profiling Tools document on [https://docs.amd.com](https://docs.amd.com).
|
||||
|
||||
### Environment Variables
|
||||
|
||||
:::{table}
|
||||
:widths: auto
|
||||
| Environment Variable | Description |
|
||||
| ----------- | ----------- |
|
||||
| OMP_NUM_TEAMS | The implementation chooses the number of teams for kernel launch. The user can change this number for performance tuning using this environment variable, subject to implementation limits. |
|
||||
| OMPX_DISABLE_MAPS | Under USM mode, the implementation automatically checks for correctness of the map clauses without performing any copying. The user can disable this check by setting this environment variable to 1. |
|
||||
| LIBOMPTARGET_KERNEL_TRACE | This environment variable is used to print useful statistics for device operations. Setting it to 1 and running the program emits the name of every kernel launched, the number of teams and threads used, and the corresponding register usage. Setting it to 2 additionally emits timing information for kernel launches and data transfer operations between the host and the device. |
|
||||
| LIBOMPTARGET_INFO | This environment variable is used to print informational messages from the device runtime as the program executes. Users can request fine-grain information by setting it to the value of 1 or higher and can set the value of -1 for complete information. |
|
||||
| LIBOMPTARGET_DEBUG | If a debug version of the device library is present, setting this environment variable to 1 and using that library emits further detailed debugging information about data transfer operations and kernel launch. |
|
||||
| GPU_MAX_HW_QUEUES | This environment variable is used to set the number of HSA queues in the OpenMP runtime. |
|
||||
:::
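These variables take effect in the environment of the launched process. Here is a sketch of setting one before running a program from Python; the child process is a trivial stand-in for a real OpenMP binary.

```python
import os
import subprocess
import sys

def run_with_env(cmd, extra_env):
    """Run cmd with extra_env merged over the current environment."""
    env = dict(os.environ, **extra_env)
    return subprocess.run(cmd, env=env, capture_output=True, text=True).stdout

# Stand-in child process; a real run would launch the OpenMP application.
out = run_with_env(
    [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_TEAMS'])"],
    {"OMP_NUM_TEAMS": "8"},
)
print(out.strip())
```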
|
||||
|
||||
## OpenMP: Features
|
||||
|
||||
The OpenMP programming model is greatly enhanced with the following new features implemented in the past releases.
|
||||
|
||||
### Asynchronous Behavior in OpenMP Target Regions
|
||||
|
||||
- Multithreaded offloading on the same device
|
||||
|
||||
The libomptarget plugin for GPU offloading allows creation of separate configurable HSA queues per chiplet, which enables two or more threads to concurrently offload to the same device.
|
||||
|
||||
- Parallel memory copy invocations
|
||||
|
||||
Implicit asynchronous execution of a single target region enables parallel memory copy invocations.
|
||||
|
||||
### Unified Shared Memory
|
||||
|
||||
Unified Shared Memory (USM) provides a pointer-based approach to memory management. To implement USM, fulfill the following system requirements along with Xnack capability.
|
||||
|
||||
#### Prerequisites
|
||||
|
||||
- Linux Kernel versions above 5.14
|
||||
|
||||
- Latest KFD driver packaged in ROCm stack
|
||||
|
||||
- Xnack, as USM support can only be tested with applications compiled with Xnack capability
|
||||
|
||||
#### Xnack Capability
|
||||
|
||||
When enabled, Xnack capability allows programmers to handle page faults at runtime gracefully. When executing the binaries compiled with Xnack replay enabled, any page fault at runtime leads to a repeated attempt to access the memory.
|
||||
|
||||
```bash
|
||||
xnack+ with --offload-arch=gfx908:xnack+
|
||||
```
|
||||
|
||||
The programmer must write offloading kernels carefully to avoid any page faults on the GPU at runtime when choosing to disable Xnack replay.
|
||||
|
||||
```bash
|
||||
xnack- with --offload-arch=gfx908:xnack-
|
||||
```
|
||||
|
||||
#### Unified Shared Memory Pragma
|
||||
|
||||
This OpenMP pragma is available on MI200 through xnack+ support.
|
||||
|
||||
```bash
|
||||
#pragma omp requires unified_shared_memory
|
||||
```
|
||||
|
||||
As stated in the OpenMP specifications, this pragma makes the map clause on target constructs optional. By default, on MI200, all memory allocated on the host is fine grain. Using the map clause on a target clause is allowed, which transforms the access semantics of the associated memory to coarse grain.
|
||||
|
||||
A simple program demonstrating the use of this feature is:

```bash
|
||||
$ cat parallel_for.cpp
|
||||
#include <stdlib.h>
|
||||
#include <stdio.h>
|
||||
|
||||
#define N 64
|
||||
#pragma omp requires unified_shared_memory
|
||||
int main() {
|
||||
int n = N;
|
||||
int *a = new int[n];
|
||||
int *b = new int[n];
|
||||
|
||||
for(int i = 0; i < n; i++)
|
||||
b[i] = i;
|
||||
|
||||
#pragma omp target parallel for map(to:b[:n])
|
||||
for(int i = 0; i < n; i++)
|
||||
a[i] = b[i];
|
||||
|
||||
for(int i = 0; i < n; i++)
|
||||
if(a[i] != i)
|
||||
printf("error at %d: expected %d, got %d\n", i, i, a[i]);
|
||||
|
||||
return 0;
|
||||
}
|
||||
$ clang++ -O2 -target x86_64-pc-linux-gnu -fopenmp --offload-arch=gfx90a:xnack+ parallel_for.cpp
|
||||
$ HSA_XNACK=1 ./a.out
|
||||
```
|
||||
|
||||
In the above code example, pointer “a” is not mapped in the target region, while pointer “b” is. Both are valid pointers on the GPU device and passed by-value to the kernel implementing the target region. This means the pointer values on the host and the device are the same.
|
||||
|
||||
The difference between the memory pages pointed to by these two variables is that the pages pointed by “a” are in fine-grain memory, while the pages pointed to by “b” are in coarse-grain memory during and after the execution of the target region. This is accomplished in the OpenMP runtime library with calls to the ROCR runtime to set the pages pointed by “b” as coarse grain.
|
||||
|
||||
### OMPT Target Support
|
||||
|
||||
The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the OpenMP specification document. These APIs allow first-party tools to examine the profile and kernel traces that execute on a device. A tool can register callbacks for data transfer and kernel dispatch entry points or use APIs to start and stop tracing for device-related activities such as data transfer and kernel dispatch timings and associated metadata. If device tracing is enabled, trace records for device activities are collected during program execution and returned to the tool using the APIs described in the specification.
|
||||
|
||||
The following example demonstrates how a tool uses the supported OMPT target APIs. The README in /opt/rocm/llvm/examples/tools/ompt outlines the steps to be followed, and the provided example can be run as shown below:
|
||||
|
||||
```bash
|
||||
% cd /opt/rocm/llvm/examples/tools/ompt/veccopy-ompt-target-tracing
|
||||
% make run
|
||||
```
|
||||
|
||||
The file veccopy-ompt-target-tracing.c simulates how a tool initiates device activity tracing. The file callbacks.h shows the callbacks registered and implemented by the tool.
|
||||
|
||||
### Floating Point Atomic Operations
|
||||
|
||||
The MI200-series GPUs support the generation of hardware floating-point atomics using the OpenMP atomic pragma. The support includes single- and double-precision floating-point atomic operations. The programmer must ensure that the memory subjected to the atomic operation is in coarse-grain memory by mapping it explicitly with the help of map clauses when not implicitly mapped by the compiler as per the [OpenMP specifications](https://www.openmp.org/specifications/). This makes these hardware floating-point atomic instructions “fast,” as they are faster than using a default compare-and-swap loop scheme, but at the same time “unsafe,” as they are not supported on fine-grain memory. The operation in unified_shared_memory mode also requires programmers to map the memory explicitly when not implicitly mapped by the compiler.
|
||||
|
||||
To request fast floating-point atomic instructions at the file level, use compiler flag -munsafe-fp-atomics or a hint clause on a specific pragma:
|
||||
|
||||
```bash
|
||||
double a = 0.0;
|
||||
#pragma omp atomic hint(AMD_fast_fp_atomics)
|
||||
a = a + 1.0;
|
||||
```
|
||||
|
||||
:::{note}
AMD_unsafe_fp_atomics is an alias for AMD_fast_fp_atomics, and AMD_safe_fp_atomics is implemented with a compare-and-swap loop.
:::
|
||||
|
||||
To disable the generation of fast floating-point atomic instructions at the file level, build using the option -msafe-fp-atomics or use a hint clause on a specific pragma:
|
||||
|
||||
```bash
|
||||
double a = 0.0;
|
||||
#pragma omp atomic hint(AMD_safe_fp_atomics)
|
||||
a = a + 1.0;
|
||||
```
|
||||
|
||||
The hint clause value always has a precedence over the compiler flag, which allows programmers to create atomic constructs with a different behavior than the rest of the file.
|
||||
|
||||
See the example below, where the user builds the program using -msafe-fp-atomics to select a file-wide “safe atomic” compilation. However, the fast atomics hint clause over variable “a” takes precedence and operates on “a” using a fast/unsafe floating-point atomic, while the variable “b” in the absence of a hint clause is operated upon using safe floating-point atomics as per the compiler flag.
|
||||
|
||||
```bash
|
||||
double a = 0.0;
|
||||
#pragma omp atomic hint(AMD_fast_fp_atomics)
|
||||
a = a + 1.0;
|
||||
|
||||
double b = 0.0;
|
||||
#pragma omp atomic
|
||||
b = b + 1.0;
|
||||
```
|
||||
|
||||
### Address Sanitizer (ASan) Tool
|
||||
|
||||
Address Sanitizer is a memory error detector tool utilized by applications to detect various errors ranging from spatial issues such as out-of-bound access to temporal issues such as use-after-free. The AOMP compiler supports ASan for AMDGPUs with applications written in both HIP and OpenMP.
|
||||
|
||||
**Features Supported on Host Platform (Target x86_64):**
|
||||
|
||||
- Use-after-free
|
||||
|
||||
- Buffer overflows
|
||||
|
||||
- Heap buffer overflow
|
||||
|
||||
- Stack buffer overflow
|
||||
|
||||
- Global buffer overflow
|
||||
|
||||
- Use-after-return
|
||||
|
||||
- Use-after-scope
|
||||
|
||||
- Initialization order bugs
|
||||
|
||||
**Features Supported on AMDGPU Platform (amdgcn-amd-amdhsa):**
|
||||
- Heap buffer overflow
|
||||
|
||||
- Global buffer overflow
|
||||
|
||||
**Software (Kernel/OS) Requirements:** Unified Shared Memory support with Xnack capability. See the section on [Unified Shared Memory](#unified-shared-memory) for prerequisites and details on Xnack.
|
||||
|
||||
**Example:**
|
||||
|
||||
- Heap buffer overflow
|
||||
|
||||
```c
|
||||
int main() {
|
||||
....... // Some program statements
|
||||
....... // Some program statements
|
||||
#pragma omp target map(to : A[0:N], B[0:N]) map(from: C[0:N])
|
||||
{
|
||||
#pragma omp parallel for
|
||||
for(int i =0 ; i < N; i++){
|
||||
C[i+10] = A[i] + B[i];
|
||||
} // end of for loop
|
||||
}
|
||||
....... // Some program statements
|
||||
}// end of main
|
||||
```
|
||||
|
||||
See the complete sample code for heap buffer overflow [here](https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/examples/tools/asan/heap_buffer_overflow/openmp/vecadd-HBO.cpp).
|
||||
|
||||
- Global buffer overflow
|
||||
|
||||
```c
|
||||
#pragma omp declare target
|
||||
int A[N],B[N],C[N];
|
||||
#pragma omp end declare target
|
||||
int main(){
|
||||
...... // some program statements
|
||||
...... // some program statements
|
||||
#pragma omp target data map(to:A[0:N],B[0:N]) map(from: C[0:N])
|
||||
{
|
||||
#pragma omp target update to(A,B)
|
||||
#pragma omp target parallel for
|
||||
for(int i=0; i<N; i++){
|
||||
C[i]=A[i*100]+B[i+22];
|
||||
} // end of for loop
|
||||
#pragma omp target update from(C)
|
||||
}
|
||||
........ // some program statements
|
||||
} // end of main
|
||||
```
|
||||
|
||||
See the complete sample code for global buffer overflow [here](https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/examples/tools/asan/global_buffer_overflow/openmp/vecadd-GBO.cpp).
|
||||
|
||||
### No-loop Kernel Generation
|
||||
|
||||
The No-loop kernel generation feature optimizes the compiler performance by generating a specialized kernel for certain OpenMP Target Constructs such as target teams distribute parallel for. The specialized kernel generation assumes that every thread executes a single iteration of the user loop, which implies that the runtime launches a total number of GPU threads equal to or greater than the iteration space size of the target region loop. This allows the compiler to generate code for the loop body without an enclosing loop, resulting in reduced control-flow complexity and potentially better performance.
|
||||
|
||||
To enable the generation of the specialized kernel, follow these guidelines:
|
||||
|
||||
- Do not specify teams, threads, and schedule-related environment variables. The num_teams or a thread_limit clause in an OpenMP target construct acts as an override and prevents the generation of the specialized kernel. As the user is unable to specify the number of teams and threads used within target regions in the absence of the above-mentioned environment variables, the runtime will select the best values for the launch configuration based on runtime knowledge of the program.
|
||||
|
||||
- Assert the absence of the above-mentioned environment variables by adding the command-line option -fopenmp-target-ignore-env-vars. This option also allows programmers to enable the No-loop functionality at lower optimization levels.
|
||||
|
||||
- Also, the No-loop functionality is automatically enabled when -O3 or -Ofast is used for compilation. To disable this feature, use -fno-openmp-target-ignore-env-vars.
|
||||
|
||||
:::{note}
The compiler might not generate the No-loop kernel in certain scenarios where the performance improvement is not substantial.
:::
|
||||
|
||||
### Cross-Team Optimized Reductions
|
||||
|
||||
In scenarios where a No-loop kernel is generated but the OpenMP construct has a reduction clause, the compiler may generate optimized code utilizing efficient Cross-Team (Xteam) communication. No separate user option is required, and there is a significant performance improvement with Xteam reduction. New APIs for Xteam reduction are implemented in the device runtime, and clang generates these APIs automatically.
|
||||
|
||||