mirror of
https://github.com/ROCm/ROCm.git
synced 2026-04-27 03:01:52 -04:00
Spell checking (#2070)
* ci: cleanup linters and add spelling checker
* docs: fix spelling and styling issues
@@ -32,7 +32,7 @@ ROCm template libraries for C++ primitives and algorithms are as follows:
 :::

 :::{grid-item-card} [Communication Libraries](gpu_libraries/communication)

-Inter and intra node communication is supported by the following projects:
+Inter and intra-node communication is supported by the following projects:

 - [RCCL](https://rocmdocs.amd.com/projects/rccl/en/latest/)
@@ -18,7 +18,7 @@ This is ROCgdb, the ROCm source-level debugger for Linux, based on GDB, the GNU
 :::

 :::{grid-item-card} [ROCProfiler](https://rocmdocs.amd.com/projects/rocprofiler/en/latest/)

-ROC profiler library. Profiling with perf-counters and derived metrics. Library supports GFX8/GFX9. HW specific low-level performance analysis interface for profiling of GPU compute applications. The profiling includes HW performance counters with complex performance metrics.
+ROC profiler library. Profiling with performance counters and derived metrics. The library supports GFX8/GFX9. Hardware-specific low-level performance analysis interface for profiling of GPU compute applications. The profiling includes hardware performance counters with complex performance metrics.

 - [Documentation](https://rocmdocs.amd.com/projects/rocprofiler/en/latest/)
@@ -2,7 +2,7 @@

 Pull content from
 <https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4.1/page/Prerequisites.html>.
-Only the frameworks content. Link to kernel/userspace guide.
+Only the frameworks content. Link to kernel/user space guide.

 Also pull content from
 <https://docs.amd.com/bundle/ROCm-Compatible-Frameworks-Release-Notes/page/Framework_Release_Notes.html>
@@ -11,12 +11,12 @@
 - [AMD RDNA Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/rdna-shader-instruction-set-architecture.pdf)
 - [AMD GCN3 Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/gcn3-instruction-set-architecture.pdf)

-## Whitepapers
+## White Papers

-- [AMD CDNA™ 2 Architecture Whitepaper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf)
-- [AMD CDNA Architecture Whitepaper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf)
-- [AMD Vega Architecture Whitepaper](https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf)
-- [AMD RDNA Architecture Whitepaper](https://www.amd.com/system/files/documents/rdna-whitepaper.pdf)
+- [AMD CDNA™ 2 Architecture White Paper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf)
+- [AMD CDNA Architecture White Paper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf)
+- [AMD Vega Architecture White Paper](https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf)
+- [AMD RDNA Architecture White Paper](https://www.amd.com/system/files/documents/rdna-whitepaper.pdf)

 ## Architecture Guides
@@ -39,7 +39,7 @@ Fabric™ bridge for the AMD Instinct™ accelerators.
 The micro-architecture of the AMD Instinct accelerators is based on the AMD CDNA
 architecture, which targets compute applications such as high-performance
 computing (HPC) and AI & machine learning (ML) that run on everything from
-individual servers to the world’s largest exascale supercomputers. The overall
+individual servers to the world's largest exascale supercomputers. The overall
 system architecture is designed for extreme scalability and compute performance.

 :::{figure-md} mi100-block
@@ -56,7 +56,7 @@ high-speed links (23 GT/sec, also at the bottom) to the other GPUs of the local
 hive as shown in {numref}`mi100-arch`.

 On the left and right of the floor plan, the High Bandwidth Memory (HBM)
-attaches via the GPU’s memory controller. The MI100 generation of the AMD
+attaches via the GPU's memory controller. The MI100 generation of the AMD
 Instinct accelerator offers four stacks of HBM generation 2 (HBM2) for a total
 of 32 GB with a 4,096-bit-wide memory interface. The peak memory bandwidth of the
 attached HBM2 is 1.228 TB/sec at a memory clock frequency of 1.2 GHz.
@@ -66,7 +66,7 @@ Units (CU). There are a total of 120 compute units that are physically organized
 into eight Shader Engines (SE) with fifteen compute units per shader engine.
 Each compute unit is further sub-divided into four SIMD units that process SIMD
 instructions of 16 data elements per instruction. This enables the CU to process
-64 data elements (a so-called ‘wavefront’) at a peak clock frequency of 1.5 GHz.
+64 data elements (a so-called 'wavefront') at a peak clock frequency of 1.5 GHz.
 Therefore, the theoretical maximum FP64 peak performance is 11.5 TFLOPS
 (`4 [SIMD units] x 16 [elements per instruction] x 120 [CU] x 1.5 [GHz]`).
@@ -91,7 +91,7 @@ thus may affect execution performance.

 A wavefront can occupy any number of VGPRs from 0 to 256, directly affecting
 occupancy; that is, the number of concurrently active wavefronts in the CU. For
-instance, with 119 VPGRs used, only two wavefronts can be active in the CU at
+instance, with 119 VGPRs used, only two wavefronts can be active in the CU at
 the same time. With the instruction latency of four cycles per SIMD instruction,
 the occupancy should be as high as possible such that the compute unit can
 improve execution efficiency by scheduling instructions from multiple
@@ -15,7 +15,7 @@ rocBLAS is an AMD GPU optimized library for BLAS.
 :::

 :::{grid-item-card} [hipBLAS](https://rocmdocs.amd.com/projects/hipBLAS/en/develop/)

-hipBLAS is a compatiblity layer for GPU accelerated BLAS optimized for AMD GPUs
+hipBLAS is a compatibility layer for GPU accelerated BLAS optimized for AMD GPUs
 via rocBLAS and rocSOLVER. hipBLAS allows for a common interface for other GPU
 BLAS libraries.
@@ -25,7 +25,7 @@ BLAS libraries.
 :::

 :::{grid-item-card} [hipBLASLt](https://rocmdocs.amd.com/projects/hipBLASLt/en/develop/)

-hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends funtionalities beyond traditional BLAS library. hipBLASLt is exposed APIs in HIP programming language with an underlying optimized generator as a backend kernel provider.
+hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library. hipBLASLt exposes its APIs in the HIP programming language, with an underlying optimized generator as a back-end kernel provider.

 - [Documentation](https://rocmdocs.amd.com/projects/hipBLASLt/en/develop/)
 - [Changelog](https://github.com/ROCmSoftwarePlatform/hipBLASLt/blob/develop/CHANGELOG.md)
@@ -13,7 +13,7 @@ rocRAND is an AMD GPU optimized library for pseudo-random number generators (PRN
 :::

 :::{grid-item-card} [hipRAND](https://rocmdocs.amd.com/projects/hipRAND/en/rtd/)

-hipRAND is a compatiblity layer for GPU accelerated FFT optimized for AMD GPUs
+hipRAND is a compatibility layer for GPU accelerated random number generation optimized for AMD GPUs
 using rocRAND. hipRAND allows for a common interface for other non-AMD GPU
 RNG libraries.
@@ -1 +1 @@
-# Kernel Userspace Compatibility Reference
+# Kernel User Space Compatibility Reference
@@ -17,7 +17,7 @@ provided by the AMD E-SMI inband library and the ROCm SMI GPU library to the Pro
 :::

 :::{grid-item-card} [ROCm SMI](https://rocmdocs.amd.com/projects/rocmsmi/en/latest/)

-This tool acts as a command line interface for manipulating and monitoring the amdgpu kernel, and is intended to replace and deprecate the existing rocm_smi.py CLI tool. It uses Ctypes to call the rocm_smi_lib API.
+This tool acts as a command line interface for manipulating and monitoring the `amdgpu` kernel module, and is intended to replace and deprecate the existing `rocm_smi.py` CLI tool. It uses `ctypes` to call the `rocm_smi_lib` API.

 - [Documentation](https://rocmdocs.amd.com/projects/rocmsmi/en/latest/)
 - [Examples](https://github.com/RadeonOpenCompute/rocm_smi_lib/tree/master/python_smi_tools)
@@ -25,7 +25,7 @@ This tool acts as a command line interface for manipulating and monitoring the a
 :::

 :::{grid-item-card} [ROCm Datacenter Tool](https://rocmdocs.amd.com/projects/rdc/en/latest/)

-The ROCm™ Data Center Tool simplifies the administration and addresses key infrastructure challenges in AMD GPUs in cluster and datacenter environments.
+The ROCm™ Data Center Tool simplifies the administration and addresses key infrastructure challenges in AMD GPUs in cluster and data center environments.

 - [Documentation](https://rocmdocs.amd.com/projects/rdc/en/latest/)
 - [Examples](https://github.com/RadeonOpenCompute/rdc/tree/master/example)
@@ -6,9 +6,9 @@ The ROCm™ installation includes an LLVM-based implementation that fully suppor

 ### Installation

-The OpenMP toolchain is automatically installed as part of the standard ROCm installation and is available under /opt/rocm-{version}/llvm. The sub-directories are:
+The OpenMP toolchain is automatically installed as part of the standard ROCm installation and is available under `/opt/rocm-{version}/llvm`. The sub-directories are:

-bin: Compilers (flang and clang) and other binaries.
+- bin: Compilers (`flang` and `clang`) and other binaries.

 - examples: The usage section below shows how to compile and run these programs.
@@ -36,13 +36,13 @@ The above invocation of Make compiles and runs the program. Note the options tha
 -fopenmp --offload-arch=<gpu-arch>
 ```

-Obtain the value of gpu-arch by running the following command:
+Obtain the value of `gpu-arch` by running the following command:

 ```bash
 % /opt/rocm-{version}/bin/rocminfo | grep gfx
 ```

-[//]: # (dated link below, needs upading)
+[//]: # (dated link below, needs updating)

 See the complete list of compiler command-line references [here](https://github.com/RadeonOpenCompute/llvm-project/blob/amd-stg-open/clang/docs/CommandGuide/clang.rst).
@@ -56,7 +56,7 @@ The following steps describe a typical workflow for using rocprof with OpenMP co
 % rocprof <application> <args>
 ```

-This produces a results.csv file in the user’s current directory that shows basic stats such as kernel names, grid size, number of registers used, etc. The user can choose to specify the preferred output file name using the o option.
+This produces a `results.csv` file in the user's current directory that shows basic stats such as kernel names, grid size, number of registers used, etc. The user can choose to specify the preferred output file name using the `-o` option.

 2. Add options for a detailed result:
@@ -64,9 +64,9 @@ The following steps describe a typical workflow for using rocprof with OpenMP co
 --stats: % rocprof --stats <application> <args>
 ```

-The stats option produces timestamps for the kernels. Look into the output CSV file for the field, DurationNs, which is useful in getting an understanding of the critical kernels in the code.
+The stats option produces timestamps for the kernels. Look into the output CSV file for the field `DurationNs`, which is useful in getting an understanding of the critical kernels in the code.

-Apart from --stats, the option --timestamp on produces a timestamp for the kernels.
+Apart from `--stats`, the option `--timestamp on` produces a timestamp for the kernels.

 3. After learning about the required kernels, the user can take a detailed look at each one of them. rocprof has support for hardware counters: a set of basic and a set of derived ones. See the complete list of counters using options --list-basic and --list-derived. rocprof accepts either a text or an XML file as an input.
@@ -74,7 +74,7 @@ For more details on rocprof, refer to the ROCm Profiling Tools document on [http

 ### Using Tracing Options

-**Prerequisite:** When using the --sys-trace option, compile the OpenMP program with:
+**Prerequisite:** When using the `--sys-trace` option, compile the OpenMP program with:

 ```bash
 -Wl,--rpath,/opt/rocm-{version}/lib -lamdhip64
@@ -82,9 +82,9 @@ For more details on rocprof, refer to the ROCm Profiling Tools document on [http

 The following tracing options are widely used to generate useful information:

-- **--hsa-trace**: This option is used to get a JSON output file with the HSA API execution traces and a flat profile in a CSV file.
+- **`--hsa-trace`**: This option is used to get a JSON output file with the HSA API execution traces and a flat profile in a CSV file.

-- **--sys-trace**: This allows programmers to trace both HIP and HSA calls. Since this option results in loading ``libamdhip64.so``, follow the prerequisite as mentioned above.
+- **`--sys-trace`**: This allows programmers to trace both HIP and HSA calls. Since this option results in loading ``libamdhip64.so``, follow the prerequisite as mentioned above.

 A CSV and a JSON file are produced by the above trace options. The CSV file presents the data in a tabular format, and the JSON file can be visualized using Google Chrome at chrome://tracing/ or [Perfetto](https://perfetto.dev/). Navigate to Chrome or Perfetto and load the JSON file to see the timeline of the HSA calls.
@@ -94,14 +94,14 @@ For more details on tracing, refer to the ROCm Profiling Tools document on [http

 :::{table}
 :widths: auto
-| Environment Variable | Description |
-| ----------- | ----------- |
-| OMP_NUM_TEAMS | The implementation chooses the number of teams for kernel launch. The user can change this number for performance tuning using this environment variable, subject to implementation limits. |
-| OMPX_DISABLE_MAPS | Under USM mode, the implementation automatically checks for correctness of the map clauses without performing any copying. The user can disable this check by setting this environment variable to 1. |
-| LIBOMPTARGET_KERNEL_TRACE | This environment variable is used to print useful statistics for device operations. Setting it to 1 and running the program emits the name of every kernel launched, the number of teams and threads used, and the corresponding register usage. Setting it to 2 additionally emits timing information for kernel launches and data transfer operations between the host and the device. |
-| LIBOMPTARGET_INFO | This environment variable is used to print informational messages from the device runtime as the program executes. Users can request fine-grain information by setting it to the value of 1 or higher and can set the value of -1 for complete information. |
-| LIBOMPTARGET_DEBUG | If a debug version of the device library is present, setting this environment variable to 1 and using that library emits further detailed debugging information about data transfer operations and kernel launch. |
-| GPU_MAX_HW_QUEUES | This environment variable is used to set the number of HSA queues in the OpenMP runtime. |
+| Environment Variable | Description |
+| --------------------------- | ----------- |
+| `OMP_NUM_TEAMS` | The implementation chooses the number of teams for kernel launch. The user can change this number for performance tuning using this environment variable, subject to implementation limits. |
+| `OMPX_DISABLE_MAPS` | Under USM mode, the implementation automatically checks for correctness of the map clauses without performing any copying. The user can disable this check by setting this environment variable to 1. |
+| `LIBOMPTARGET_KERNEL_TRACE` | This environment variable is used to print useful statistics for device operations. Setting it to 1 and running the program emits the name of every kernel launched, the number of teams and threads used, and the corresponding register usage. Setting it to 2 additionally emits timing information for kernel launches and data transfer operations between the host and the device. |
+| `LIBOMPTARGET_INFO` | This environment variable is used to print informational messages from the device runtime as the program executes. Users can request fine-grain information by setting it to the value of 1 or higher and can set the value of -1 for complete information. |
+| `LIBOMPTARGET_DEBUG` | If a debug version of the device library is present, setting this environment variable to 1 and using that library emits further detailed debugging information about data transfer operations and kernel launch. |
+| `GPU_MAX_HW_QUEUES` | This environment variable is used to set the number of HSA queues in the OpenMP runtime. |
 :::

 ## OpenMP: Features
@@ -110,9 +110,9 @@ The OpenMP programming model is greatly enhanced with the following new features

 ### Asynchronous Behavior in OpenMP Target Regions

-- Multithreaded offloading on the same device
+- Multi-threaded offloading on the same device

-The libomptarget plugin for GPU offloading allows creation of separate configurable HSA queues per chiplet, which enables two or more threads to concurrently offload to the same device.
+The `libomptarget` plugin for GPU offloading allows creation of separate configurable HSA queues per chiplet, which enables two or more threads to concurrently offload to the same device.

 - Parallel memory copy invocations
@@ -146,7 +146,7 @@ xnack- with --offload-arch=gfx908:xnack-

 #### Unified Shared Memory Pragma

-This OpenMP pragma is available on MI200 through xnack+ support.
+This OpenMP pragma is available on MI200 through `xnack+` support.

 ```bash
 omp requires unified_shared_memory
@@ -192,20 +192,20 @@ The difference between the memory pages pointed to by these two variables is tha

 The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the OpenMP specification document. These APIs allow first-party tools to examine the profile and kernel traces that execute on a device. A tool can register callbacks for data transfer and kernel dispatch entry points or use APIs to start and stop tracing for device-related activities such as data transfer and kernel dispatch timings and associated metadata. If device tracing is enabled, trace records for device activities are collected during program execution and returned to the tool using the APIs described in the specification.

-The following example demonstrates how a tool uses the supported OMPT target APIs. The README in /opt/rocm/llvm/examples/tools/ompt outlines the steps to be followed, and the provided example can be run as shown below:
+The following example demonstrates how a tool uses the supported OMPT target APIs. The `README` in `/opt/rocm/llvm/examples/tools/ompt` outlines the steps to be followed, and the provided example can be run as shown below:

 ```bash
 % cd $ROCM_PATH/share/openmp-extras/examples/tools/ompt/veccopy-ompt-target-tracing
 % make run
 ```

-The file veccopy-ompt-target-tracing.c simulates how a tool initiates device activity tracing. The file callbacks.h shows the callbacks registered and implemented by the tool.
+The file `veccopy-ompt-target-tracing.c` simulates how a tool initiates device activity tracing. The file `callbacks.h` shows the callbacks registered and implemented by the tool.
 ### Floating Point Atomic Operations

-The MI200-series GPUs support the generation of hardware floating-point atomics using the OpenMP atomic pragma. The support includes single- and double-precision floating-point atomic operations. The programmer must ensure that the memory subjected to the atomic operation is in coarse-grain memory by mapping it explicitly with the help of map clauses when not implicitly mapped by the compiler as per the [OpenMP specifications](https://www.openmp.org/specifications/). This makes these hardware floating-point atomic instructions “fast,” as they are faster than using a default compare-and-swap loop scheme, but at the same time “unsafe,” as they are not supported on fine-grain memory. The operation in unified_shared_memory mode also requires programmers to map the memory explicitly when not implicitly mapped by the compiler.
+The MI200-series GPUs support the generation of hardware floating-point atomics using the OpenMP atomic pragma. The support includes single- and double-precision floating-point atomic operations. The programmer must ensure that the memory subjected to the atomic operation is in coarse-grain memory by mapping it explicitly with the help of map clauses when not implicitly mapped by the compiler as per the [OpenMP specifications](https://www.openmp.org/specifications/). This makes these hardware floating-point atomic instructions “fast,” as they are faster than using a default compare-and-swap loop scheme, but at the same time “unsafe,” as they are not supported on fine-grain memory. The operation in `unified_shared_memory` mode also requires programmers to map the memory explicitly when not implicitly mapped by the compiler.

-To request fast floating-point atomic instructions at the file level, use compiler flag -munsafe-fp-atomics or a hint clause on a specific pragma:
+To request fast floating-point atomic instructions at the file level, use the compiler flag `-munsafe-fp-atomics` or a hint clause on a specific pragma:

 ```bash
 double a = 0.0;
@@ -213,9 +213,9 @@ double a = 0.0;
 a = a + 1.0;
 ```

-NOTE AMD_unsafe_fp_atomics is an alias for AMD_fast_fp_atomics, and AMD_safe_fp_atomics is implemented with a compare-and-swap loop.
+NOTE: `AMD_unsafe_fp_atomics` is an alias for `AMD_fast_fp_atomics`, and `AMD_safe_fp_atomics` is implemented with a compare-and-swap loop.

-To disable the generation of fast floating-point atomic instructions at the file level, build using the option -msafe-fp-atomics or use a hint clause on a specific pragma:
+To disable the generation of fast floating-point atomic instructions at the file level, build using the option `-msafe-fp-atomics` or use a hint clause on a specific pragma:
 ```bash
 double a = 0.0;
@@ -225,7 +225,7 @@ a = a + 1.0;

 The hint clause value always takes precedence over the compiler flag, which allows programmers to create atomic constructs with a different behavior than the rest of the file.

-See the example below, where the user builds the program using -msafe-fp-atomics to select a file-wide “safe atomic” compilation. However, the fast atomics hint clause over variable “a” takes precedence and operates on “a” using a fast/unsafe floating-point atomic, while the variable “b” in the absence of a hint clause is operated upon using safe floating-point atomics as per the compiler flag.
+See the example below, where the user builds the program using `-msafe-fp-atomics` to select a file-wide “safe atomic” compilation. However, the fast atomics hint clause over variable “a” takes precedence and operates on “a” using a fast/unsafe floating-point atomic, while the variable “b” in the absence of a hint clause is operated upon using safe floating-point atomics as per the compiler flag.

 ```bash
 double a = 0.0;
@@ -239,7 +239,7 @@ b = b + 1.0;

 ### Address Sanitizer (ASan) Tool

-Address Sanitizer is a memory error detector tool utilized by applications to detect various errors ranging from spatial issues such as out-of-bound access to temporal issues such as use-after-free. The AOMP compiler supports ASan for AMDGPUs with applications written in both HIP and OpenMP.
+Address Sanitizer is a memory error detector tool utilized by applications to detect various errors ranging from spatial issues such as out-of-bounds access to temporal issues such as use-after-free. The AOMP compiler supports ASan for AMD GPUs with applications written in both HIP and OpenMP.

 **Features Supported on Host Platform (Target x86_64):**
@@ -259,7 +259,7 @@ Address Sanitizer is a memory error detector tool utilized by applications to de

 - Initialization order bugs

-**Features Supported on AMDGPU Platform (amdgcn-amd-amdhsa):**
+**Features Supported on AMDGPU Platform (`amdgcn-amd-amdhsa`):**

 - Heap buffer overflow
@@ -318,11 +318,11 @@ The No-loop kernel generation feature optimizes the compiler performance by gene

 To enable the generation of the specialized kernel, follow these guidelines:

-- Do not specify teams, threads, and schedule-related environment variables. The num_teams or a thread_limit clause in an OpenMP target construct acts as an override and prevents the generation of the specialized kernel. As the user is unable to specify the number of teams and threads used within target regions in the absence of the above-mentioned environment variables, the runtime will select the best values for the launch configuration based on runtime knowledge of the program.
+- Do not specify teams, threads, and schedule-related environment variables. The `num_teams` or a `thread_limit` clause in an OpenMP target construct acts as an override and prevents the generation of the specialized kernel. As the user is unable to specify the number of teams and threads used within target regions in the absence of the above-mentioned environment variables, the runtime will select the best values for the launch configuration based on runtime knowledge of the program.

-- Assert the absence of the above-mentioned environment variables by adding the command-line option fopenmp-target-ignore-env-vars. This option also allows programmers to enable the No-loop functionality at lower optimization levels.
+- Assert the absence of the above-mentioned environment variables by adding the command-line option `-fopenmp-target-ignore-env-vars`. This option also allows programmers to enable the No-loop functionality at lower optimization levels.

-- Also, the No-loop functionality is automatically enabled when -O3 or -Ofast is used for compilation. To disable this feature, use -fno-openmp-target-ignore-env-vars.
+- Also, the No-loop functionality is automatically enabled when `-O3` or `-Ofast` is used for compilation. To disable this feature, use `-fno-openmp-target-ignore-env-vars`.

 Note: The compiler might not generate the No-loop kernel in certain scenarios where the performance improvement is not substantial.
@@ -11,10 +11,10 @@ The differences are listed in [the table below](rocm-llvm-vs-alt).

 :::{table} Differences between `rocm-llvm` and `rocm-llvm-alt`
 :name: rocm-llvm-vs-alt
-| **rocm-llvm** | **rocm-llvm-alt** |
-|:---------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------:|
-| Installed by default when ROCm™ itself is installed | An optional package |
-| Provides an open-source compiler | Provides an additional closed-source compiler for users interested in additional CPU optimizations not available in rocm-llvm |
+| **`rocm-llvm`** | **`rocm-llvm-alt`** |
+|:---------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------:|
+| Installed by default when ROCm™ itself is installed | An optional package |
+| Provides an open-source compiler | Provides an additional closed-source compiler for users interested in additional CPU optimizations not available in `rocm-llvm` |
 :::

 For more details, see:
@@ -30,25 +30,25 @@ ROCm currently provides two compiler interfaces for compiling HIP programs:
 - `/opt/rocm/bin/amdclang++`

 Both leverage the same LLVM compiler technology with the AMD GCN GPU support;
-however, they offer a slightly different user experience. The hipcc command-line
+however, they offer a slightly different user experience. The `hipcc` command-line
 interface aims to provide a more familiar user interface to users who are
 experienced in CUDA but relatively new to the ROCm/HIP development environment.
-On the other hand, amdclang++ provides a user interface identical to the clang++
+On the other hand, `amdclang++` provides a user interface identical to the `clang++`
 compiler. It is more suitable for experienced developers who want to directly
 interact with the clang compiler and gain full control of their application's
 build process.

-The major differences between hipcc and amdclang++ are listed below:
+The major differences between `hipcc` and `amdclang++` are listed below:

-::::{table} Differences between hipcc and amdclang++
+::::{table} Differences between `hipcc` and `amdclang++`
 :name: hipcc-vs-amdclang
-| * | **hipcc** | **amdclang++** |
-|:----------------------------------:|:------------------------------------------------------------------------------------------------------------------------:|:--------------:|
-| Compiling HIP source files | Treats all source files as HIP language source files | Enables the HIP language support for files with the “.hip” extension or through the -x hip compiler option |
-| Detecting GPU architecture | Auto-detects the GPUs available on the system and generates code for those devices when no GPU architecture is specified | Has AMD GCN gfx803 as the default GPU architecture. The --offload-arch compiler option may be used to target other GPU architectures |
-| Finding a HIP installation | Finds the HIP installation based on its own location and its knowledge about the ROCm directory structure | First looks for HIP under the same parent directory as its own LLVM directory and then falls back on /opt/rocm. Users can use the --rocm-path option to instruct the compiler to use HIP from the specified ROCm installation. |
-| Linking to the HIP runtime library | Is configured to automatically link to the HIP runtime from the detected HIP installation | Requires the --hip-link flag to be specified to link to the HIP runtime. Alternatively, users can use the -l`<dir>` -lamdhip64 option to link to a HIP runtime library. |
-| Device function inlining | Inlines all GPU device functions, which provide greater performance and compatibility for codes that contain file scoped or device function scoped `__shared__` variables. However, it may increase compile time. | Relies on inlining heuristics to control inlining. Users experiencing performance or compilation issues with code using file scoped or device function scoped `__shared__` variables could try -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false to work around the issue. There are plans to address these issues with future compiler improvements. |
+| * | **`hipcc`** | **`amdclang++`** |
+|:----------------------------------:|:------------------------------------------------------------------------------------------------------------------------:|:----------------:|
+| Compiling HIP source files | Treats all source files as HIP language source files | Enables the HIP language support for files with the `.hip` extension or through the `-x hip` compiler option |
+| Detecting GPU architecture | Auto-detects the GPUs available on the system and generates code for those devices when no GPU architecture is specified | Has AMD GCN gfx803 as the default GPU architecture. The `--offload-arch` compiler option may be used to target other GPU architectures |
+| Finding a HIP installation | Finds the HIP installation based on its own location and its knowledge about the ROCm directory structure | First looks for HIP under the same parent directory as its own LLVM directory and then falls back on `/opt/rocm`. Users can use the `--rocm-path` option to instruct the compiler to use HIP from the specified ROCm installation. |
+| Linking to the HIP runtime library | Is configured to automatically link to the HIP runtime from the detected HIP installation | Requires the `--hip-link` flag to be specified to link to the HIP runtime. Alternatively, users can use the `-l<dir> -lamdhip64` option to link to a HIP runtime library. |
+| Device function inlining | Inlines all GPU device functions, which provide greater performance and compatibility for codes that contain file scoped or device function scoped `__shared__` variables. However, it may increase compile time. | Relies on inlining heuristics to control inlining. Users experiencing performance or compilation issues with code using file scoped or device function scoped `__shared__` variables could try `-mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false` to work around the issue. There are plans to address these issues with future compiler improvements. |
|
||||
| Source code location | <https://github.com/ROCm-Developer-Tools/HIPCC> | <https://github.com/RadeonOpenCompute/llvm-project> |
|
||||
::::
|
||||
|
||||
@@ -115,8 +115,8 @@ This section outlines commonly used compiler flags for `hipcc` and `amdclang++`.
 
 The CPU compiler optimizations described in this chapter originate from the AMD
 Optimizing C/C++ Compiler (AOCC) compiler. They are available in ROCmCC if the
-optional rocm-llvm-alt package is installed. The user’s interaction with the
-compiler does not change once rocm-llvm-alt is installed. The user should use
+optional `rocm-llvm-alt` package is installed. The user’s interaction with the
+compiler does not change once `rocm-llvm-alt` is installed. The user should use
 the same compiler entry point, provided AMD provides high-performance compiler
 optimizations for Zen-based processors in AOCC.
@@ -149,13 +149,13 @@ feasible, this optimization transforms the code to enable these improvements.
 This transformation is likely to improve cache utilization and memory bandwidth.
 It is expected to improve the scalability of programs executed on multiple cores.
 
-This is effective only under `flto`, as the whole program analysis is required
+This is effective only under `-flto`, as the whole program analysis is required
 to perform this optimization. Users can choose different levels of
 aggressiveness with which this optimization can be applied to the application,
 with 1 being the least aggressive and 7 being the most aggressive level.
 
 :::{table} -fstruct-layout Values and Their Effects
-| -fstruct-layout value | Structure peeling | Pointer size after selective compression of self-referential pointers in structures, wherever safe | Type of structure fields eligible for compression | Whether compression performed under safety check |
+| `-fstruct-layout` value | Structure peeling | Pointer size after selective compression of self-referential pointers in structures, wherever safe | Type of structure fields eligible for compression | Whether compression performed under safety check |
 | ----------- | ----------- | ----------- | ----------- | ----------- |
 | 1 | Enabled | NA | NA | NA |
 | 2 | Enabled | 32-bit | NA | NA |
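The structure peeling named in the table above can be pictured with a small, self-contained sketch. This is not ROCmCC code and all names are invented for illustration; it only models the layout transformation the flag automates, splitting a frequently accessed field out of an array of structs so a scan touches fewer cache lines:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structs layout: every element drags 56 rarely used bytes
// through the cache alongside the one "hot" field.
struct Record {
    double hot;       // read on every traversal
    char   cold[56];  // rarely touched metadata
};

// Baseline traversal over the array-of-structs layout.
double sum_aos(std::size_t n) {
    std::vector<Record> recs(n);
    for (std::size_t i = 0; i < n; ++i) recs[i].hot = double(i);
    double s = 0;
    for (const Record& r : recs) s += r.hot;
    return s;
}

// What peeling effectively produces: the hot field gets its own densely
// packed array, so the traversal streams through pure hot data.
double sum_peeled(std::size_t n) {
    std::vector<double> hot(n);  // peeled hot-field array
    for (std::size_t i = 0; i < n; ++i) hot[i] = double(i);
    double s = 0;
    for (double h : hot) s += h;
    return s;
}
```

Both traversals compute the same result; the flag's value is that the compiler performs this layout change (and the associated safety analysis) automatically under `-flto`.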
@@ -191,14 +191,14 @@ optimization, which is invoked as `-flto -fitodcallsbyclone`.
 #### `-fremap-arrays`
 
 Transforms the data layout of a single dimensional array to provide better cache
-locality. This optimization is effective only under `flto`, as the whole program
+locality. This optimization is effective only under `-flto`, as the whole program
 needs to be analyzed to perform this optimization, which can be invoked as
 `-flto -fremap-arrays`.
 
 #### `-finline-aggressive`
 
 Enables improved inlining capability through better heuristics. This
-optimization is more effective when used with `flto`, as the whole program
+optimization is more effective when used with `-flto`, as the whole program
 analysis is required to perform this optimization, which can be invoked as
 `-flto -finline-aggressive`.
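Because both flags rely on whole-program analysis, `-flto` must be present on the compile and link steps alike. A minimal build-line sketch (file names are placeholders; only the flag spellings come from the text above):

```bash
# LTO must be active everywhere for -fremap-arrays / -finline-aggressive to fire.
clang -O3 -flto -fremap-arrays -finline-aggressive -c kernel.c -o kernel.o
clang -O3 -flto -fremap-arrays -finline-aggressive -c driver.c -o driver.o
clang -O3 -flto -fremap-arrays -finline-aggressive kernel.o driver.o -o app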
@@ -282,7 +282,7 @@ or factor of 16. This vectorization width of 16 may be overwritten by
 
 ##### `-enable-redundant-movs`
 
-Removes any redundant mov operations including redundant loads from memory and
+Removes any redundant `mov` operations including redundant loads from memory and
 stores to memory. This can be invoked using
 `-Wl,-plugin-opt=-enable-redundant-movs`.
@@ -322,13 +322,13 @@ functions at call sites.
 | 4 | 10 |
 :::
 
-This is more effective with flto as the whole program needs to be analyzed to
+This is more effective with `-flto` as the whole program needs to be analyzed to
 perform this optimization, which can be invoked as
 `-flto -inline-recursion=[1,2,3,4]`.
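The kind of self-recursive call site that `-inline-recursion` targets looks like the toy routine below; the source needs no changes, the flag only controls how many levels of the self-call the compiler may expand inline at each call site (illustrative only, not ROCmCC code):

```cpp
#include <cassert>

// A self-recursive call: with -inline-recursion=4 the compiler may expand
// up to 10 levels of fib's calls to itself inline, trading code size for
// reduced call overhead. Semantics are unchanged.
long fib(int n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}
```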
 ##### `-reduce-array-computations=[1,2,3]`
 
-Performs array dataflow analysis and optimizes the unused array computations.
+Performs array data flow analysis and optimizes the unused array computations.
 
 :::{table} -reduce-array-computations Values and Their Effects
 | -reduce-array-computations value | Array elements eligible for elimination of computations |
@@ -338,7 +338,7 @@ Performs array dataflow analysis and optimizes the unused array computations.
 | 3 | Both unused and zero valued |
 :::
 
-This optimization is effective with flto as the whole program needs to be
+This optimization is effective with `-flto` as the whole program needs to be
 analyzed to perform this optimization, which can be invoked as
 `-flto -reduce-array-computations=[1,2,3]`.
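An "unused array computation" in the sense of this flag is a store sequence whose results never reach an observable output. The following sketch (invented names, not ROCmCC code) shows dead array work that the data flow analysis is designed to prove removable:

```cpp
#include <cassert>
#include <vector>

// Only `used` feeds the return value; every store into `unused` is dead
// work that -reduce-array-computations (under -flto) may eliminate.
long reduce_demo(int n) {
    std::vector<long> used(n), unused(n);
    for (int i = 0; i < n; ++i) {
        used[i]   = i;
        unused[i] = long(i) * i;  // never read again: eligible for elimination
    }
    long s = 0;
    for (int i = 0; i < n; ++i) s += used[i];
    return s;
}
```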
@@ -352,7 +352,7 @@ vector operations. This option is set to **true** by default.
 Experimental flag for enabling vectorization on certain loops with complex
 control flow, which the normal vectorizer cannot handle.
 
-This optimization is effective with flto as the whole program needs to be
+This optimization is effective with `-flto` as the whole program needs to be
 analyzed to perform this optimization, which can be invoked as
 `-flto -region-vectorize`.
@@ -423,12 +423,12 @@ This option is set to false by default.
 ##### `-Hz,1,0x1 [Fortran]`
 
 Helps to preserve array index information for array access expressions which get
-linearized in the compiler frontend. The preserved information is used by the
+linearized in the compiler front end. The preserved information is used by the
 compiler optimization phase in performing optimizations such as loop
 transformations. It is recommended that any user who is using optimizations
 such as loop transformations and other optimizations requiring de-linearized
 index expressions should use the Hz option. This option has no impact on any
-other aspects of the Flang frontend.
+other aspects of the Flang front end.
 
 ### Inline ASM Statements
@@ -467,7 +467,7 @@ compiler.
 An LLVM library and tool that is used to query the execution capability of the
 current system as well as to query requirements of a binary file. It is used by
 OpenMP device runtime to ensure compatibility of an image with the current
-system while loading it. It is compatible with TargetID support and multi-image
+system while loading it. It is compatible with target ID support and multi-image
 fat binary support.
 
 **Usage:**
@@ -478,7 +478,7 @@ offload-arch [Options] [Optional lookup-value]
 
 When used without an option, offload-arch prints the value of the first offload
 arch found in the underlying system. This can be used by various clang
-frontends. For example, to compile for OpenMP offloading on your current system,
+front ends. For example, to compile for OpenMP offloading on your current system,
 invoke clang with the following command:
 
 ```bash
@@ -507,11 +507,11 @@ The options are listed below:
 :::
 
 :::{option} -m
-Prints device code name (often found in pci.ids file).
+Prints device code name (often found in `pci.ids` file).
 :::
 
 :::{option} -n
-Prints numeric pci-id.
+Prints numeric `pci-id`.
 :::
 
 :::{option} -t
@@ -530,12 +530,12 @@ The options are listed below:
 Prints offload capabilities of the underlying system. This option is used by the language runtime to select an image when multiple images are available. A capability must exist for each requirement of the selected image.
 :::
 
-There are symbolic link aliases amdgpu-offload-arch and nvidia-arch for
-offload-arch. These aliases return 1 if no amdgcn GPU or cuda GPU is found.
+There are symbolic link aliases `amdgpu-offload-arch` and `nvidia-arch` for
+`offload-arch`. These aliases return 1 if no AMD GCN GPU or CUDA GPU is found.
 These aliases are useful in determining whether architecture-specific tests
 should be run or to conditionally load architecture-specific software.
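The gating pattern these aliases enable can be sketched as a short shell fragment. This is an assumed usage pattern built only on the documented nonzero exit status; the script name is a placeholder:

```bash
# Run GPU tests only when an AMD GCN device is present; the alias exits
# nonzero when no device is found.
if amdgpu-offload-arch > /dev/null 2>&1; then
    ./run_gpu_tests.sh   # placeholder for architecture-specific tests
else
    echo "No AMD GPU detected; skipping GPU tests."
fi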
-#### Command-Line Simplification Using offload-arch Flag
+#### Command-Line Simplification Using `offload-arch` Flag
 
 Legacy mechanism of specifying offloading target for OpenMP involves using three
 flags, `-fopenmp-targets`, `-Xopenmp-target`, and `-march`. The first two flags
@@ -562,14 +562,14 @@ clang -fopenmp -target x86_64-linux-gnu \
 ```
 
 To ensure backward compatibility, both styles are supported. This option is
-compatible with TargetID support and multi-image fat binaries.
+compatible with target ID support and multi-image fat binaries.
 
-#### TargetID Support for OpenMP
+#### Target ID Support for OpenMP
 
 The ROCmCC compiler supports specification of target features along with the GPU
 name while specifying a target offload device in the command line, using
 `-march` or `--offload-arch` options. The compiled image in such cases is
-specialized for a given configuration of device and target features (TargetID).
+specialized for a given configuration of device and target features (target ID).
 
 **Example:**
@@ -598,8 +598,8 @@ clang -fopenmp -target x86_64-linux-gnu \
 -march=gfx908:sramecc+:xnack- helloworld.c -o helloworld
 ```
 
-The TargetID specified on the command line is passed to the clang driver using
-target-feature flag, to the LLVM optimizer and backend using `-mattr` flag, and
+The target ID specified on the command line is passed to the clang driver using
+`target-feature` flag, to the LLVM optimizer and back end using `-mattr` flag, and
 to linker using `-plugin-opt=-mattr` flag. This feature is compatible with
 offload-arch command-line option and multi-image binaries for multiple
 architectures.
@@ -609,14 +609,14 @@ architectures.
 The ROCmCC compiler is enhanced to generate binaries that can contain
 heterogenous images. This heterogeneity could be in terms of:
 
-- Images of different architectures, like amdgcn and nvptx
+- Images of different architectures, like AMD GCN and NVPTX
 - Images of same architectures but for different GPUs, like gfx906 and gfx908
 - Images of same architecture and same GPU but for different target features,
-  like gfx908:xnack+ and gfx908:xnack-
+  like `gfx908:xnack+` and `gfx908:xnack-`
 
 An appropriate image is selected by the OpenMP device runtime for execution
 depending on the capability of the current system. This feature is compatible
-with TargetID support and offload-arch command-line options and uses
+with target ID support and offload-arch command-line options and uses
 offload-arch tool to determine capability of the current system.
 
 **Example:**
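A multi-image fat binary of the kind described above is produced by naming several offload targets on one command line. This sketch assumes repeated `--offload-arch` options are accepted as in the target ID examples quoted earlier; the source file name is a placeholder:

```bash
# One fat binary carrying images for two GPUs and two xnack modes; the
# OpenMP device runtime selects the image matching the current system.
clang -fopenmp --offload-arch=gfx906 \
      --offload-arch=gfx908:xnack+ \
      --offload-arch=gfx908:xnack- \
      helloworld.c -o helloworld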
@@ -660,7 +660,7 @@ capability of the current system.
 #### Unified Shared Memory (USM)
 
 The following OpenMP pragma is available on MI200, and it must be executed with
-xnack+ support.
+`xnack+` support.
 
 ```cpp
 omp requires unified_shared_memory
@@ -674,6 +674,8 @@ refer to the OpenMP Support Guide at [https://docs.amd.com](https://docs.amd.com
 
 The following table lists the other Clang options and their support status.
 
+<!-- spellcheck-disable -->
+
 :::{table} Clang Options
 :name: clang-options
 :widths: auto
@@ -1440,3 +1442,4 @@ The following table lists the other Clang options and their support status.
 |-x \<language\>|Supported|Assumes subsequent input files to have the given type \<language\>|
 |-z \<arg\>|Supported|Passes -z \<arg\> to the linker|
 :::
+<!-- spellcheck-enable -->