Update titles to sentence case (#2455)

Lisa
2023-09-18 12:26:31 -06:00
committed by GitHub
parent 772b51a7d2
commit d0d4eed1a6
128 changed files with 1857 additions and 2019 deletions

View File

@@ -1 +1 @@
# Development Tools
# Development tools

View File

@@ -9,7 +9,6 @@ The AMD Debugger API is a library that provides all the support necessary for a
debugger and other tools to perform low level control of the execution and
inspection of execution state of AMD's commercially available GPU architectures.
* {doc}`Documentation <rocdbgapi:index>`
* [GitHub](https://github.com/ROCm-Developer-Tools/ROCdbgapi/)
:::
@@ -21,15 +20,12 @@ ROCmCC is a Clang/LLVM-based compiler. It is optimized for high-performance
computing on AMD GPUs and CPUs and supports various heterogeneous programming
models such as HIP, OpenMP, and OpenCL.
* [Documentation](../rocmcc/rocmcc.md)
:::
:::{grid-item-card} {doc}`ROCgdb <rocgdb:index>`
This is ROCgdb, the ROCm source-level debugger for Linux, based on GDB, the GNU source-level debugger.
* {doc}`Documentation <rocgdb:index>`
* [GitHub](https://github.com/ROCm-Developer-Tools/ROCgdb/)
:::
@@ -38,7 +34,6 @@ This is ROCgdb, the ROCm source-level debugger for Linux, based on GDB, the GNU
ROC profiler library. Profiling with performance counters and derived metrics. Library supports GFX8/GFX9. Hardware specific low-level performance analysis interface for profiling of GPU compute applications. The profiling includes hardware performance counters with complex performance metrics.
* {doc}`Documentation <rocprofiler:rocprof>`
* [GitHub](https://github.com/ROCm-Developer-Tools/rocprofiler/)
:::
@@ -47,13 +42,12 @@ ROC profiler library. Profiling with performance counters and derived metrics. L
Callback/Activity Library for Performance tracing AMD GPUs
* {doc}`Documentation <roctracer:index>`
* [GitHub](https://github.com/ROCm-Developer-Tools/roctracer)
:::
:::::
## See Also
## See also
* [Compiler Disambiguation](../../conceptual/compiler-disambiguation.md)
* [Compiler disambiguation](../../conceptual/compiler-disambiguation.md)

View File

@@ -1,4 +1,4 @@
# Management Tools
# Management tools
:::::{grid} 1 1 3 3
:gutter: 1
@@ -14,9 +14,8 @@ The AMD System Management Interface Library, or AMD SMI library, is a C library
:::{grid-item-card} {doc}`ROCm SMI LIB <rocm_smi_lib:index>`
This tool acts as a command line interface for manipulating and monitoring the AMD GPU kernel, and is intended to replace and deprecate the existing `rocm_smi.py` CLI tool. It uses `ctypes` to call the `rocm_smi_lib` API.
This tool acts as a command-line interface (CLI) for manipulating and monitoring the AMD GPU kernel, and is intended to replace and deprecate the existing `rocm_smi.py` CLI tool. It uses `ctypes` to call the `rocm_smi_lib` API.
* {doc}`Documentation <rocm_smi_lib:index>`
* [GitHub](https://github.com/RadeonOpenCompute/rocm_smi_lib)
* [Examples](https://github.com/RadeonOpenCompute/rocm_smi_lib/tree/master/python_smi_tools)

View File

@@ -1,4 +1,4 @@
# Validation Tools
# Validation tools
:::::{grid} 1 1 2 2
:gutter: 1
@@ -7,7 +7,6 @@
The ROCm Validation Suite is a system administrator's and cluster manager's tool for detecting and troubleshooting common problems affecting AMD GPU(s) running in a high-performance computing environment, enabled using the ROCm software stack on a compatible platform.
* {doc}`Documentation <rocmvalidationsuite:index>`
* [GitHub](https://github.com/ROCm-Developer-Tools/ROCmValidationSuite)
* [Changelog](https://github.com/ROCm-Developer-Tools/ROCmValidationSuite/blob/master/CHANGELOG.md)
@@ -17,7 +16,6 @@ The ROCm Validation Suite is a system administrators and cluster manager's to
TransferBench is a simple utility capable of benchmarking simultaneous transfers between user-specified devices (CPUs/GPUs).
* {doc}`Documentation <transferbench:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/TransferBench/)
* [Changelog](https://github.com/ROCmSoftwarePlatform/TransferBench/blob/develop/CHANGELOG.md)
* {doc}`transferbench:examples/index`

View File

@@ -3,17 +3,16 @@
HIP is both AMD's GPU programming language extension and the GPU runtime. This
page introduces the HIP runtime and other HIP libraries and tools.
## HIP Runtime
## HIP runtime
:::::{grid} 1 1 2 2
:gutter: 1
:::{grid-item-card} {doc}`HIP Runtime <hip:index>`
:::{grid-item-card} {doc}`HIP runtime <hip:index>`
The HIP Runtime is used to enable GPU acceleration for all HIP language based
The HIP runtime is used to enable GPU acceleration for all HIP language based
products.
* {doc}`Documentation <hip:index>`
* [GitHub](https://github.com/ROCm-Developer-Tools/HIP)
* [Examples](https://github.com/amd/rocm-examples/tree/develop/HIP-Basic)
@@ -28,10 +27,9 @@ products.
:::{grid-item-card} {doc}`HIPIFY <hipify:index>`
HIPIFY assists with porting applications based on CUDA to the HIP Runtime.
HIPIFY assists with porting applications based on CUDA to the HIP runtime.
Supported CUDA APIs are documented here as well.
* {doc}`Documentation <hipify:index>`
* [GitHub](https://github.com/ROCm-Developer-Tools/HIPIFY/)
* [Changelog](https://github.com/ROCm-Developer-Tools/HIPIFY/blob/amd-staging/CHANGELOG.md)

View File

@@ -1,6 +1,6 @@
# Reference material
## ROCm Software Groups
## ROCm software groups
:::::{grid} 1 1 2 2
:gutter: 1
@@ -10,14 +10,13 @@
HIP is both AMD's GPU programming language extension and the GPU runtime.
* {doc}`HIP <hip:index>`
* [HIP Examples](https://github.com/amd/rocm-examples/tree/develop/HIP-Basic)
* {doc}`HIPIFY <hipify:index>`
:::
:::{grid-item-card}
**[Math Libraries](./libraries/gpu-libraries/math.md)**
**[Math libraries](./libraries/gpu-libraries/math.md)**
HIP Math Libraries support the following domains:
@@ -28,7 +27,7 @@ HIP Math Libraries support the following domains:
:::
:::{grid-item-card}
**[C++ Primitive Libraries](./libraries/gpu-libraries/c++primitives.md)**
**[C++ primitive libraries](./libraries/gpu-libraries/c++primitives.md)**
ROCm template libraries for C++ primitives and algorithms are as follows:
@@ -39,7 +38,7 @@ ROCm template libraries for C++ primitives and algorithms are as follows:
:::
:::{grid-item-card} [Communication Libraries](./libraries/gpu-libraries/communication.md)
:::{grid-item-card} [Communication libraries](./libraries/gpu-libraries/communication.md)
Inter and intra-node communication is supported by the following projects:
* {doc}`RCCL <rccl:index>`
@@ -47,26 +46,26 @@ Inter and intra-node communication is supported by the following projects:
:::
:::{grid-item-card}
**[Artificial intelligence](../rocm-ai.md)**
**Artificial intelligence**
Libraries related to AI.
Libraries related to artificial intelligence.
* {doc}`MIOpen <miopen:index>`
* {doc}`Composable Kernel <composable_kernel:index>`
* {doc}`MIGraphX <amdmigraphx:index>`
* {doc}`MIVisionX <mivisionx:README>`
* {doc}`rocAL <rocal:README>`
* {doc}`ROCm Augmentation Library (rocAL) <rocal:README>`
:::
:::{grid-item-card}
**[OpenMP](./openmp/openmp.md)**
* [OpenMP Support Guide](./openmp/openmp.md)
* [OpenMP support guide](./openmp/openmp.md)
:::
:::{grid-item-card}
**[Compilers and Tools](./compilers-tools/index.md)**
**[Compilers and tools](./compilers-tools/index.md)**
* [ROCmCC](./rocmcc/rocmcc.md)
* {doc}`ROCdbgapi <rocdbgapi:index>`
@@ -77,7 +76,7 @@ Libraries related to AI.
:::
:::{grid-item-card}
**[Management Tools](./compilers-tools/management-tools.md)**
**[Management tools](./compilers-tools/management-tools.md)**
* {doc}`AMD SMI <amdsmi:index>`
* {doc}`ROCm SMI <rocm_smi_lib:index>`
@@ -86,14 +85,14 @@ Libraries related to AI.
:::
:::{grid-item-card}
**[Validation Tools](./compilers-tools/validation-tools.md)**
**[Validation tools](./compilers-tools/validation-tools.md)**
* {doc}`ROCm Validation Suite <rocmvalidationsuite:index>`
* {doc}`TransferBench <transferbench:index>`
:::
:::{grid-item-card} **GPU Architectures**
:::{grid-item-card} **GPU architectures**
* [AMD Instinct MI200](../conceptual/gpu-arch/mi250.md)
* [AMD Instinct MI100](../conceptual/gpu-arch/mi100.md)

View File

@@ -1,4 +1,4 @@
# AI Libraries
# AI libraries
::::{grid} 1 1 2 2
:gutter: 1
@@ -7,7 +7,6 @@
AMD's library for high performance machine learning primitives.
* {doc}`Documentation <miopen:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/MIOpen)
* [Changelog](https://github.com/ROCmSoftwarePlatform/MIOpen/blob/develop/CHANGELOG.md)
@@ -17,7 +16,6 @@ AMD's library for high performance machine learning primitives.
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
* {doc}`Documentation <composable_kernel:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/composable_kernel)
* [Changelog](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/develop/CHANGELOG.md)
@@ -27,7 +25,6 @@ Composable Kernel: Performance Portable Programming Model for Machine Learning T
AMD MIGraphX is AMD's graph inference engine that accelerates machine learning model inference.
* {doc}`Documentation <amdmigraphx:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX)
* [Changelog](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/CHANGELOG.md)

View File

@@ -1,4 +1,4 @@
# C++ Primitive Libraries
# C++ primitive libraries
ROCm template libraries for algorithms are as follows:
@@ -11,7 +11,6 @@ rocPRIM is an AMD GPU optimized template library of algorithm primitives, like
transforms, reductions, scans, etc. It also serves as a common back-end for
similar libraries found inside ROCm.
* {doc}`Documentation <rocprim:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/rocPRIM/)
* [Changelog](https://github.com/ROCmSoftwarePlatform/rocPRIM/blob/develop/CHANGELOG.md)
* [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/rocPRIM)
@@ -24,7 +23,6 @@ rocThrust is a template library of algorithm primitives with a Thrust-compatible
interface. Their CPU back-ends are identical, while the GPU back-end calls into
rocPRIM.
* {doc}`Documentation <rocthrust:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/rocThrust)
* [Changelog](https://github.com/ROCmSoftwarePlatform/rocThrust/blob/develop/CHANGELOG.md)
* [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/rocThrust)
@@ -36,7 +34,6 @@ rocPRIM.
hipCUB is a template library of algorithm primitives with a CUB-compatible
interface. Its back-end is rocPRIM.
* {doc}`Documentation <hipcub:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/hipCUB)
* [Changelog](https://github.com/ROCmSoftwarePlatform/hipCUB/blob/develop/CHANGELOG.md)
* [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/hipCUB)
@@ -49,7 +46,6 @@ hipTensor is AMD's C++ library for accelerating tensor primitives
based on the composable kernel library,
through general purpose kernel languages, like HIP C++.
* {doc}`Documentation <hiptensor:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/hipTensor)
:::

View File

@@ -1,4 +1,4 @@
# Communication Libraries
# Communication libraries
:::::{grid} 1 1 1 1
:gutter: 1
@@ -10,7 +10,6 @@ implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, gather,
The collective operations are implemented using ring and tree algorithms and have been optimized for
throughput and latency.
* {doc}`Documentation <rccl:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/rccl)
* [Changelog](https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/CHANGELOG.md)
* [Examples](https://github.com/ROCmSoftwarePlatform/rccl/tree/develop/tools)

View File

@@ -1,6 +1,6 @@
# Fast Fourier transforms
ROCm libraries for FFT are as follows:
ROCm libraries for fast Fourier transforms (FFTs) are as follows:
:::::{grid} 1 1 2 2
:gutter: 1

View File

@@ -1,4 +1,4 @@
# Linear Algebra Libraries
# Linear algebra libraries
ROCm libraries for linear algebra are as follows:
@@ -9,7 +9,6 @@ ROCm libraries for linear algebra are as follows:
`rocBLAS` is an AMD GPU optimized library for BLAS (Basic Linear Algebra Subprograms).
* {doc}`Documentation <rocblas:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/rocBLAS)
* [Changelog](https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/develop/CHANGELOG.md)
* [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/rocBLAS)
@@ -22,7 +21,6 @@ ROCm libraries for linear algebra are as follows:
via `rocBLAS` and `rocSOLVER`. `hipBLAS` allows for a common interface for other GPU
BLAS libraries.
* {doc}`Documentation <hipblas:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/hipBLAS)
* [Changelog](https://github.com/ROCmSoftwarePlatform/hipBLAS/blob/develop/CHANGELOG.md)
@@ -35,7 +33,6 @@ flexible API and extends functionalities beyond traditional BLAS library.
`hipBLASLt` exposes APIs in the HIP programming language with an underlying
optimized generator as a back-end kernel provider.
* {doc}`Documentation <hipblaslt:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/hipBLASLt)
* [Changelog](https://github.com/ROCmSoftwarePlatform/hipBLASLt/blob/develop/CHANGELOG.md)
@@ -47,7 +44,6 @@ optimized generator as a back-end kernel provider.
fine-grained parallelism on top of AMD's ROCm runtime and toolchains, targeting
modern CPU and GPU platforms.
* {doc}`Documentation <rocalution:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/rocALUTION)
* [Changelog](https://github.com/ROCmSoftwarePlatform/rocALUTION/blob/develop/CHANGELOG.md)
@@ -58,7 +54,6 @@ modern CPU and GPU platforms.
`rocWMMA` provides an API to break down mixed precision matrix multiply-accumulate
(MMA) problems into fragments and distributes these over GPU wavefronts.
* {doc}`Documentation <rocwmma:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/rocWMMA)
* [Changelog](https://github.com/ROCmSoftwarePlatform/rocWMMA/blob/develop/CHANGELOG.md)
@@ -68,7 +63,6 @@ modern CPU and GPU platforms.
`rocSOLVER` provides a subset of LAPACK (Linear Algebra Package) functionality on the ROCm platform.
* {doc}`Documentation <rocsolver:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/rocSOLVER)
* [Changelog](https://github.com/ROCmSoftwarePlatform/rocSOLVER/blob/develop/CHANGELOG.md)
@@ -79,7 +73,6 @@ modern CPU and GPU platforms.
`hipSOLVER` is a LAPACK marshalling library supporting both `rocSOLVER` and `cuSOLVER`
as backends whilst exporting a unified interface.
* {doc}`Documentation <hipsolver:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/hipSOLVER)
* [Changelog](https://github.com/ROCmSoftwarePlatform/hipSOLVER/blob/develop/CHANGELOG.md)
@@ -89,7 +82,6 @@ as backends whilst exporting a unified interface.
`rocSPARSE` is a library to provide BLAS for sparse computations.
* {doc}`Documentation <rocsparse:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/rocSPARSE)
* [Changelog](https://github.com/ROCmSoftwarePlatform/rocSOLVER/blob/develop/CHANGELOG.md)
@@ -100,7 +92,6 @@ as backends whilst exporting a unified interface.
`hipSPARSE` is a marshalling library to provide sparse BLAS functionality,
supporting both `rocSPARSE` and `cuSPARSE` as backends.
* {doc}`Documentation <hipsparse:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/hipSPARSE)
* [Changelog](https://github.com/ROCmSoftwarePlatform/hipSOLVER/blob/develop/CHANGELOG.md)
@@ -111,7 +102,6 @@ supporting both `rocSPARSE` and `cuSPARSE` as backends.
`hipSPARSELt` is a marshalling library to provide sparse BLAS functionality,
supporting both `rocSPARSELt` and `cuSPARSELt` as backends.
* {doc}`Documentation <hipsparselt:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/hipSPARSELt)
:::

View File

@@ -16,7 +16,7 @@ at compile-time of the hipLIB in question. For dynamic dispatch between vendor i
:gutter: 1
:::{grid-item-card}
**[Linear Algebra Libraries](./math-linear-algebra.md)**
**[Linear algebra libraries](./math-linear-algebra.md)**
* {doc}`rocBLAS <rocblas:index>`
* {doc}`hipBLAS <hipblas:index>`
@@ -40,7 +40,7 @@ at compile-time of the hipLIB in question. For dynamic dispatch between vendor i
:::
:::{grid-item-card}
**[Random Numbers](./rand.md)**
**[Random numbers](./rand.md)**
* {doc}`rocRAND <rocrand:index>`
* {doc}`hipRAND <hiprand:index>`

View File

@@ -1,13 +1,12 @@
# Random Numbers
# Random numbers
:::::{grid} 1 1 2 2
:gutter: 1
:::{grid-item-card} {doc}`rocRAND <rocrand:index>`
rocRAND is an AMD GPU optimized library for pseudo-random number generators (PRNG).
rocRAND is an AMD GPU optimized library for pseudorandom number generators.
* {doc}`Documentation <rocrand:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/rocRAND/)
* [Changelog](https://github.com/ROCmSoftwarePlatform/rocRAND/blob/develop/CHANGELOG.md)
* [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/rocRAND)
@@ -16,11 +15,10 @@ rocRAND is an AMD GPU optimized library for pseudo-random number generators (PRN
:::{grid-item-card} {doc}`hipRAND <hiprand:index>`
hipRAND is a compatibility layer for GPU accelerated pseudo-random number
generation (PRNG) optimized for AMD GPUs using rocRAND. hipRAND allows for a
common interface for other non AMD GPU PRNG libraries.
hipRAND is a compatibility layer for GPU accelerated pseudorandom number generation optimized for
AMD GPUs using rocRAND. hipRAND allows for a common interface for other non AMD GPU
pseudorandom number generation libraries.
* {doc}`Documentation <hiprand:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/hipRAND/)
* [Changelog](https://github.com/ROCmSoftwarePlatform/hipRAND/blob/develop/CHANGELOG.md)

View File

@@ -1,4 +1,4 @@
# OpenMP Support in ROCm
# OpenMP support in ROCm
## Introduction
@@ -14,9 +14,7 @@ this ROCm release. See the list of supported GPUs for [Linux](../../about/compat
The ROCm OpenMP compiler is implemented using LLVM compiler technology.
The following image illustrates the internal steps taken to translate a user's application into an executable that can offload computation to the AMDGPU. The compilation is a two-pass process. Pass 1 compiles the application to generate the CPU code and Pass 2 links the CPU code to the AMDGPU device code.
```{figure} ../../data/reference/openmp/openmp-toolchain.svg
:name: openmp-toolchain
```
![OpenMP toolchain](../../data/reference/openmp/openmp-toolchain.svg "OpenMP toolchain")
### Installation
@@ -31,7 +29,7 @@ bin: Compilers (`flang` and `clang`) and other binaries.
* lib: Libraries including those required for target offload.
* lib-debug: Debug versions of the above libraries.
## OpenMP: Usage
## OpenMP: usage
The example programs can be compiled and run by pointing the environment
variable `ROCM_PATH` to the ROCm install directory.
@@ -44,10 +42,10 @@ cd $ROCM_PATH/share/openmp-extras/examples/openmp/veccopy
sudo make run
```
:::{note}
```{note}
`sudo` is required since we are building inside the `/opt` directory.
Alternatively, copy the files to your home directory first.
:::
```
The above invocation of Make compiles and runs the program. Note the options
that are required for target offload from an OpenMP program:
@@ -56,15 +54,13 @@ that are required for target offload from an OpenMP program:
-fopenmp --offload-arch=<gpu-arch>
```
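
As a rough illustration of the kind of program these options apply to, here is a minimal vector-copy sketch in C. The function name `veccopy` and its signature are assumptions made for this example; it is not the source of the shipped `veccopy` sample.

```c
/* Hypothetical helper for illustration; not the shipped veccopy example.
 * Copies n ints from src to dst inside an OpenMP target region.
 * Compiled with "-fopenmp --offload-arch=<gpu-arch>", the loop is a
 * candidate for offload; compiled without -fopenmp, the pragma is
 * ignored and the loop runs on the host. */
void veccopy(const int *src, int *dst, int n)
{
    /* map(to:) copies src to the device; map(from:) copies dst back. */
    #pragma omp target teams distribute parallel for \
        map(to: src[0:n]) map(from: dst[0:n])
    for (int i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}
```

Calling `veccopy(a, b, n)` from a host `main` should leave `b` equal to `a` whether or not the region is actually offloaded.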
:::{note}
```{note}
The compiler also accepts the alternative offloading notation:
```bash
-fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<gpu-arch>
-fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<gpu-arch>
```
:::
Obtain the value of `gpu-arch` by running the following command:
```bash
@@ -113,7 +109,7 @@ code compiled with AOMP:
For more details on `rocprof`, refer to the {doc}`ROCProfilerV1 User Manual <rocprofiler:rocprofv1>`.
### Using Tracing Options
### Using tracing options
**Prerequisite:** When using the `--sys-trace` option, compile the OpenMP
program with:
@@ -139,7 +135,7 @@ HSA calls.
For more details on tracing, refer to the {doc}`ROCProfilerV1 User Manual <rocprofiler:rocprofv1>`.
### Environment Variables
### Environment variables
:::{table}
:widths: auto
@@ -154,14 +150,14 @@ For more details on tracing, refer to the {doc}`ROCProfilerV1 User Manual <rocpr
| `OMPX_FORCE_SYNC_REGIONS` | To force the runtime to execute all operations synchronously, i.e., wait for an operation to complete immediately. This affects data transfers and kernel execution. While it is mainly designed for debugging, it may have a minor positive effect on performance in certain situations. |
:::
## OpenMP: Features
## OpenMP: features
The OpenMP programming model is greatly enhanced with the following new features
implemented in the past releases.
(openmp_usm)=
### Asynchronous Behavior in OpenMP Target Regions
### Asynchronous behavior in OpenMP target regions
* Controlling Asynchronous Behavior
@@ -176,7 +172,7 @@ The `libomptarget` plugin for GPU offloading allows creation of separate configu
Implicit asynchronous execution of single target region enables parallel memory copy invocations.
### Unified Shared Memory
### Unified shared memory
Unified Shared Memory (USM) provides a pointer-based approach to memory
management. To implement USM, fulfill the following system requirements along
@@ -189,7 +185,7 @@ with Xnack capability.
* Xnack, as USM support can only be tested with applications compiled with Xnack
capability
#### Xnack Capability
#### Xnack capability
When enabled, Xnack capability allows GPU threads to access CPU (system) memory,
allocated with OS-allocators, such as `malloc`, `new`, and `mmap`. Xnack must be
@@ -223,7 +219,7 @@ When Xnack support is not needed:
* At runtime, set the `HSA_XNACK` environment variable to 0.
#### Unified Shared Memory Pragma
#### Unified shared memory pragma
This OpenMP pragma is available on MI200 through `xnack+` support.
@@ -277,7 +273,7 @@ to by “b” are in coarse-grain memory during and after the execution of the
target region. This is accomplished in the OpenMP runtime library with calls to
the ROCr runtime to set the pages pointed by “b” as coarse grain.
### OMPT Target Support
### OMPT target support
The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as
described in the OpenMP specification document. These APIs allow first-party
@@ -302,7 +298,7 @@ The file `veccopy-ompt-target-tracing.c` simulates how a tool initiates device
activity tracing. The file `callbacks.h` shows the callbacks registered and
implemented by the tool.
### Floating Point Atomic Operations
### Floating point atomic operations
The MI200-series GPUs support the generation of hardware floating-point atomics
using the OpenMP atomic pragma. The support includes single- and
@@ -326,10 +322,10 @@ double a = 0.0;
a = a + 1.0;
```
:::{note}
```{note}
`AMD_unsafe_fp_atomics` is an alias for `AMD_fast_fp_atomics`, and
`AMD_safe_fp_atomics` is implemented with a compare-and-swap loop.
:::
```
To disable the generation of fast floating-point atomic instructions at the file
level, build using the option `-msafe-fp-atomics` or use a hint clause on a
@@ -362,14 +358,14 @@ double b = 0.0;
b = b + 1.0;
```
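
For context, the atomic update pattern used in this section can be sketched as a complete function. This is a minimal sketch that uses only the standard `#pragma omp atomic` form; the AMD-specific hint clauses described above are deliberately left out so that the code compiles with any OpenMP compiler, and the function name `atomic_sum` is an assumption made for this example.

```c
/* Hypothetical helper for illustration.
 * Sums n doubles into a single accumulator with an OpenMP atomic update.
 * Whether the compiler emits a hardware floating-point atomic or a
 * compare-and-swap loop depends on the target and the options in use. */
double atomic_sum(const double *x, int n)
{
    double a = 0.0;

    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: a)
    for (int i = 0; i < n; i++) {
        /* Every iteration updates the same location, so the update is atomic. */
        #pragma omp atomic
        a = a + x[i];
    }
    return a;
}
```

A reduction clause is usually a better fit for a plain sum; the atomic form is shown here only because it matches the construct this section discusses.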
### Address Sanitizer (ASan) Tool
### AddressSanitizer tool
Address Sanitizer is a memory error detector tool utilized by applications to
AddressSanitizer (ASan) is a memory error detector tool utilized by applications to
detect various errors ranging from spatial issues such as out-of-bound access to
temporal issues such as use-after-free. The AOMP compiler supports ASan for AMD
GPUs with applications written in both HIP and OpenMP.
**Features Supported on Host Platform (Target x86_64):**
**Features supported on host platform (Target x86_64):**
* Use-after-free
* Buffer overflows
@@ -380,12 +376,12 @@ GPUs with applications written in both HIP and OpenMP.
* Use-after-scope
* Initialization order bugs
**Features Supported on AMDGPU Platform (`amdgcn-amd-amdhsa`):**
**Features supported on AMDGPU platform (`amdgcn-amd-amdhsa`):**
* Heap buffer overflow
* Global buffer overflow
**Software (Kernel/OS) Requirements:** Unified Shared Memory support with Xnack
**Software (kernel/OS) requirements:** Unified Shared Memory support with Xnack
capability. See the section on [Unified Shared Memory](#unified-shared-memory)
for prerequisites and details on Xnack.
@@ -436,7 +432,7 @@ for(int i=0; i<N; i++){
See the complete sample code for global buffer overflow
[here](https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/examples/tools/asan/global_buffer_overflow/openmp/vecadd-GBO.cpp).
### Clang Compiler Option for Kernel Optimization
### Clang compiler option for kernel optimization
You can use the clang compiler option `-fopenmp-target-fast` for kernel optimization if certain constraints implied by its component options are satisfied. `-fopenmp-target-fast` enables the following options:
@@ -448,7 +444,7 @@ You can use the clang compiler option `-fopenmp-target-fast` for kernel optimiza
* `-O3` if no `-O*` is specified by the user.
### Specialized Kernels
### Specialized kernels
Clang will attempt to generate specialized kernels based on compiler options and OpenMP constructs. The following specialized kernels are supported:
@@ -466,14 +462,14 @@ To enable the generation of specialized kernels, follow these guidelines:
* To disable specialized kernel generation, use `-fno-openmp-target-ignore-env-vars`.
#### No-Loop Kernel Generation
#### No-loop kernel generation
The No-loop kernel generation feature optimizes the generated code by emitting a specialized kernel for certain OpenMP target constructs, such as `target teams distribute parallel for`. The specialized kernel generation feature assumes every thread executes a single iteration of the user loop, which leads the runtime to launch a total number of GPU threads equal to or greater than the iteration space size of the target region loop. This allows the compiler to generate code for the loop body without an enclosing loop, resulting in reduced control-flow complexity and potentially better performance.
#### Big-Jump-Loop Kernel Generation
#### Big-jump-loop kernel generation
A No-Loop kernel is not generated if the OpenMP teams construct uses a `num_teams` clause. Instead, the compiler attempts to generate a different specialized kernel called the Big-Jump-Loop kernel. The compiler launches the kernel with a grid size determined by the number of teams specified by the OpenMP `num_teams` clause and the `blocksize` chosen either by the compiler or specified by the corresponding OpenMP clause.
A No-loop kernel is not generated if the OpenMP teams construct uses a `num_teams` clause. Instead, the compiler attempts to generate a different specialized kernel called the Big-Jump-Loop kernel. The compiler launches the kernel with a grid size determined by the number of teams specified by the OpenMP `num_teams` clause and the `blocksize` chosen either by the compiler or specified by the corresponding OpenMP clause.
#### Xteam Optimized Reduction Kernel Generation
#### Cross-team optimized reduction kernel generation
If the OpenMP construct has a reduction clause, the compiler attempts to generate optimized code by utilizing efficient Xteam communication. New APIs for Xteam reduction are implemented in the device runtime and are automatically generated by clang.
If the OpenMP construct has a reduction clause, the compiler attempts to generate optimized code by utilizing efficient cross-team communication. New APIs for cross-team reduction are implemented in the device runtime and are automatically generated by clang.
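
The three loop shapes described in this section can be sketched in C as follows. This is only an illustration of the source-level constructs: the function names are assumptions made for this example, and whether the compiler actually emits a specialized kernel for each loop depends on the compiler options in use.

```c
/* Hypothetical helpers for illustration. */

/* No num_teams clause: a candidate for a No-loop kernel,
 * where each GPU thread handles a single iteration. */
void scale_no_loop(double *v, int n, double s)
{
    #pragma omp target teams distribute parallel for map(tofrom: v[0:n])
    for (int i = 0; i < n; i++) {
        v[i] *= s;
    }
}

/* num_teams fixes the number of teams: a candidate for a
 * Big-Jump-Loop kernel instead of a No-loop kernel. */
void scale_big_jump(double *v, int n, double s)
{
    #pragma omp target teams distribute parallel for num_teams(4) \
        map(tofrom: v[0:n])
    for (int i = 0; i < n; i++) {
        v[i] *= s;
    }
}

/* A reduction clause: a candidate for the cross-team
 * optimized reduction kernel. */
double sum_reduce(const double *v, int n)
{
    double sum = 0.0;

    #pragma omp target teams distribute parallel for reduction(+:sum) \
        map(to: v[0:n]) map(tofrom: sum)
    for (int i = 0; i < n; i++) {
        sum += v[i];
    }
    return sum;
}
```

All three functions produce the same results on the host when OpenMP is disabled, which makes them convenient for checking correctness before comparing kernel performance.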

View File

@@ -1,6 +1,6 @@
# Compiler Reference Guide
# Compiler reference guide
## Introduction to Compiler Reference Guide
## Introduction to compiler reference guide
ROCmCC is a Clang/LLVM-based compiler. It is optimized for high-performance
computing on AMD GPUs and CPUs and supports various heterogeneous programming
@@ -22,7 +22,7 @@ For more details, see:
* AMD GPU usage: [llvm.org/docs/AMDGPUUsage.html](https://llvm.org/docs/AMDGPUUsage.html)
* Releases and source: <https://github.com/RadeonOpenCompute/llvm-project>
### ROCm Compiler Interfaces
### ROCm compiler interfaces
ROCm currently provides two compiler interfaces for compiling HIP programs:
@@ -52,11 +52,11 @@ The major differences between `hipcc` and `amdclang++` are listed below:
| Source code location | <https://github.com/ROCm-Developer-Tools/HIPCC> | <https://github.com/RadeonOpenCompute/llvm-project> |
::::
## Compiler Options and Features
## Compiler options and features
This chapter discusses compiler options and features.
### AMD GPU Compilation
### AMD GPU compilation
This section outlines commonly used compiler flags for `hipcc` and `amdclang++`.
:::{option} -x hip
@@ -111,7 +111,7 @@ This section outlines commonly used compiler flags for `hipcc` and `amdclang++`.
Generates relocatable device code, also known as separate compilation mode.
:::
### AMD Optimizations for Zen Architectures
### AMD optimizations for zen architectures
The CPU compiler optimizations described in this chapter originate from the AMD
Optimizing C/C++ Compiler (AOCC) compiler. They are available in ROCmCC if the
@@ -134,12 +134,12 @@ The `-famd-opt` flag is useful when a user wants to build with the proprietary
optimization compiler and not have to depend on setting any of the other
proprietary optimization flags.
:::{note}
```{note}
`-famd-opt` can be used in addition to the other proprietary CPU optimization
flags. The table of optimizations below implicitly enables the invocation of the
AMD proprietary optimizations compiler, whereas the `-famd-opt` flag requires
this to be handled explicitly.
:::
```
#### `-fstruct-layout=[1,2,3,4,5,6,7]`
@@ -213,7 +213,7 @@ This is an experimental option to generate non-temporal store instruction for
array accesses in a loop, whose iteration count cannot be determined at compile
time. In this case, the compiler assumes the iteration count to be huge.
#### Optimizations Through Driver `-mllvm <options>`
#### Optimizations through driver `-mllvm <options>`
The following optimization options must be invoked through driver
`-mllvm <options>`:
@@ -255,12 +255,12 @@ loop. The heuristic can be controlled with the following options:
Here, `n` is a positive integer; a higher value of `n` facilitates more unswitching.
:::{note}
```{note}
These options may facilitate more unswitching under some workloads. Since
loop-unswitching inherently leads to code bloat, facilitating more
unswitching may significantly increase the code size. Hence, it may also lead
to longer compilation times.
:::
```
##### `-enable-strided-vectorization`
@@ -430,7 +430,7 @@ such as loop transformations and other optimizations requiring de-linearized
index expressions should use the Hz option. This option has no impact on any
other aspects of the Flang front end.
### Inline ASM Statements
### Inline ASM statements
Inline assembly (ASM) statements allow a developer to include assembly
instructions directly in either host or device code. While the ROCm compiler
@@ -451,18 +451,18 @@ supports ASM statements, their use is not recommended for the following reasons:
* Writing correct ASM statements is often difficult; we strongly recommend
thorough testing of any use of ASM statements.
:::{note}
```{note}
For developers who choose to include ASM statements in the code, AMD is
interested in understanding the use case and appreciates feedback at
[https://github.com/RadeonOpenCompute/ROCm/issues](https://github.com/RadeonOpenCompute/ROCm/issues)
:::
```
### Miscellaneous OpenMP Compiler Features
### Miscellaneous OpenMP compiler features
This section discusses features that have been added or enhanced in the OpenMP
compiler.
#### Offload-arch Tool
#### Offload-arch tool
`offload-arch` is an LLVM library and tool that is used to query the execution capability of the
current system as well as the requirements of a binary file. It is used by
@@ -534,7 +534,7 @@ There are symbolic link aliases `amdgpu-offload-arch` and `nvidia-arch` for
These aliases are useful in determining whether architecture-specific tests
should be run or to conditionally load architecture-specific software.
#### Command-Line Simplification Using `offload-arch` Flag
#### Command-line simplification using `offload-arch` flag
The legacy mechanism for specifying the offloading target for OpenMP involves three
flags, `-fopenmp-targets`, `-Xopenmp-target`, and `-march`. The first two flags
@@ -563,7 +563,7 @@ clang -fopenmp -target x86_64-linux-gnu \
To ensure backward compatibility, both styles are supported. This option is
compatible with target ID support and multi-image fat binaries.
#### Target ID Support for OpenMP
#### Target ID support for OpenMP
The ROCmCC compiler supports specification of target features along with the GPU
name while specifying a target offload device in the command line, using
@@ -603,7 +603,7 @@ to linker using `-plugin-opt=-mattr` flag. This feature is compatible with
offload-arch command-line option and multi-image binaries for multiple
architectures.
#### Multi-image Fat Binary for OpenMP
#### Multi-image fat binary for OpenMP
The ROCmCC compiler is enhanced to generate binaries that can contain
heterogeneous images. This heterogeneity could be in terms of:
@@ -656,7 +656,7 @@ of target triple and the target GPU (along with the associated target features).
modified to query this structure to identify a compatible image based on the
capability of the current system.
#### Unified Shared Memory (USM)
#### Unified shared memory (USM)
The following OpenMP pragma is available on MI200, and it must be executed with
`xnack+` support.
@@ -668,7 +668,7 @@ omp requires unified_shared_memory
For more details on USM refer to the {ref}`openmp_usm` section of the OpenMP
Guide.
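
As a minimal sketch of what the pragma enables, the function below touches a `malloc`-allocated buffer from a target region without any `map` clause. The function name `usm_increment` is an assumption made for this example. On hardware without `xnack+` support, or with a compiler that does not implement the `requires` directive, the behavior will differ.

```c
#include <stdlib.h>

/* Must appear at file scope, before any device construct. */
#pragma omp requires unified_shared_memory

/* Hypothetical helper for illustration.
 * Fills a host buffer with 0..n-1, increments every element inside a
 * target region, and returns the last element (n on success, -1 on error). */
int usm_increment(int n)
{
    if (n <= 0) {
        return -1;
    }

    int *data = malloc((size_t)n * sizeof *data);
    if (data == NULL) {
        return -1;
    }
    for (int i = 0; i < n; i++) {
        data[i] = i;
    }

    /* No map clause: with unified shared memory the device is expected
     * to dereference the host pointer directly. */
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++) {
        data[i] += 1;
    }

    int last = data[n - 1];
    free(data);
    return last;
}
```

For example, `usm_increment(16)` is expected to return 16, because the last element starts at 15 and is incremented once.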
### Support Status of Other Clang Options
### Support status of other Clang options
The following table lists the other Clang options and their support status.
@@ -679,8 +679,8 @@ The following table lists the other Clang options and their support status.
:widths: auto
:align: center
| **Option** | **Support Status** | **Description** |
|------------------------------------------|:------------------:|--------------------------------------------------------------------------------------------------------------------------------|
| **Option** | **Support Status** | **Description** |
|--------------|:-----------------------:|-------------------------|
| `-###` | Supported | Prints (but does not run) the commands to run for this compilation |
| `--analyzer-output <value>` | Supported | "Static analyzer report output format (`html|plist|plist-multi-file|plist-html|sarif|text`)" |
| `--analyze` | Supported | Runs the static analyzer |
@@ -1184,7 +1184,7 @@ The following table lists the other Clang options and their support status.
|-isysroot \<dir\>|Supported|Sets the system root directory (usually /)|
|-isystem-after \<directory\>|Supported|Adds the directory to end of the SYSTEM include search path|
|-isystem \<directory\>|Supported|Adds the directory to SYSTEM include search path|
|-ivfsoverlay \<value\>|Supported|Overlays the virtual filesystem described by the specified file over the real file system|
|-ivfsoverlay \<value\>|Supported|Overlays the virtual file system described by the specified file over the real file system|
|-iwithprefixbefore \<dir\>|Supported|Sets the directory to include search path with prefix|
|-iwithprefix \<dir\>|Supported|Sets the directory to SYSTEM include search path with prefix|
|-iwithsysroot \<directory\>|Supported|Adds directory to SYSTEM include search path; absolute paths are relative to -isysroot|