Spell checking (#2070)

* ci: cleanup linters and add spelling checker

* docs: fix spelling and styling issues
Nara, 2023-04-24 15:09:09 +02:00 (committed by GitHub)
parent 08821f1098, commit 48db1eea8d
32 changed files with 312 additions and 300 deletions


@@ -32,7 +32,7 @@ ROCm template libraries for C++ primitives and algorithms are as follows:
:::
:::{grid-item-card} [Communication Libraries](gpu_libraries/communication)
Inter and intra node communication is supported by the following projects:
Inter and intra-node communication is supported by the following projects:
- [RCCL](https://rocmdocs.amd.com/projects/rccl/en/latest/)


@@ -18,7 +18,7 @@ This is ROCgdb, the ROCm source-level debugger for Linux, based on GDB, the GNU
:::
:::{grid-item-card} [ROCProfiler](https://rocmdocs.amd.com/projects/rocprofiler/en/latest/)
ROC profiler library. Profiling with perf-counters and derived metrics. Library supports GFX8/GFX9. HW specific low-level performance analysis interface for profiling of GPU compute applications. The profiling includes HW performance counters with complex performance metrics.
ROC profiler library. Profiling with performance counters and derived metrics. The library supports GFX8/GFX9. Hardware-specific low-level performance analysis interface for profiling GPU compute applications. The profiling includes hardware performance counters with complex performance metrics.
- [Documentation](https://rocmdocs.amd.com/projects/rocprofiler/en/latest/)


@@ -2,7 +2,7 @@
Pull content from
<https://docs.amd.com/bundle/ROCm-Deep-Learning-Guide-v5.4.1/page/Prerequisites.html>.
Only the frameworks content. Link to kernel/userspace guide.
Only the frameworks content. Link to kernel/user space guide.
Also pull content from
<https://docs.amd.com/bundle/ROCm-Compatible-Frameworks-Release-Notes/page/Framework_Release_Notes.html>


@@ -11,12 +11,12 @@
- [AMD RDNA Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/rdna-shader-instruction-set-architecture.pdf)
- [AMD GCN3 Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/gcn3-instruction-set-architecture.pdf)
## Whitepapers
## White Papers
- [AMD CDNA™ 2 Architecture Whitepaper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf)
- [AMD CDNA Architecture Whitepaper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf)
- [AMD Vega Architecture Whitepaper](https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf)
- [AMD RDNA Architecture Whitepaper](https://www.amd.com/system/files/documents/rdna-whitepaper.pdf)
- [AMD CDNA™ 2 Architecture White Paper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf)
- [AMD CDNA Architecture White Paper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf)
- [AMD Vega Architecture White Paper](https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf)
- [AMD RDNA Architecture White Paper](https://www.amd.com/system/files/documents/rdna-whitepaper.pdf)
## Architecture Guides


@@ -39,7 +39,7 @@ Fabric™ bridge for the AMD Instinct™ accelerators.
The micro-architecture of the AMD Instinct accelerators is based on the AMD CDNA
architecture, which targets compute applications such as high-performance
computing (HPC) and AI & machine learning (ML) that run on everything from
individual servers to the worlds largest exascale supercomputers. The overall
individual servers to the world's largest exascale supercomputers. The overall
system architecture is designed for extreme scalability and compute performance.
:::{figure-md} mi100-block
@@ -56,7 +56,7 @@ high-speed links (23 GT/sec, also at the bottom) to the other GPUs of the local
hive as shown in {numref}`mi100-arch`.
On the left and right of the floor plan, the High Bandwidth Memory (HBM)
attaches via the GPUs memory controller. The MI100 generation of the AMD
attaches via the GPU's memory controller. The MI100 generation of the AMD
Instinct accelerator offers four stacks of HBM generation 2 (HBM2) for a total
of 32 GB with a 4,096-bit wide memory interface. The peak memory bandwidth of the
attached HBM2 is 1.228 TB/sec at a memory clock frequency of 1.2 GHz.
@@ -66,7 +66,7 @@ Units (CU). There are a total 120 compute units that are physically organized
into eight Shader Engines (SE) with fifteen compute units per shader engine.
Each compute unit is further sub-divided into four SIMD units that process SIMD
instructions of 16 data elements per instruction. This enables the CU to process
64 data elements (a so-called wavefront) at a peak clock frequency of 1.5 GHz.
64 data elements (a so-called 'wavefront') at a peak clock frequency of 1.5 GHz.
Therefore, the theoretical maximum FP64 peak performance is 11.5 TFLOPS
(`4 [SIMD units] x 16 [elements per instruction] x 120 [CU] x 1.5 [GHz]`).
@@ -91,7 +91,7 @@ thus may affect execution performance.
A wavefront can occupy any number of VGPRs from 0 to 256, directly affecting
occupancy; that is, the number of concurrently active wavefronts in the CU. For
instance, with 119 VPGRs used, only two wavefronts can be active in the CU at
instance, with 119 VGPRs used, only two wavefronts can be active in the CU at
the same time. With the instruction latency of four cycles per SIMD instruction,
the occupancy should be as high as possible such that the compute unit can
improve execution efficiency by scheduling instructions from multiple


@@ -15,7 +15,7 @@ rocBLAS is an AMD GPU optimized library for BLAS.
:::
:::{grid-item-card} [hipBLAS](https://rocmdocs.amd.com/projects/hipBLAS/en/develop/)
hipBLAS is a compatiblity layer for GPU accelerated BLAS optimized for AMD GPUs
hipBLAS is a compatibility layer for GPU accelerated BLAS optimized for AMD GPUs
via rocBLAS and rocSOLVER. hipBLAS allows for a common interface for other GPU
BLAS libraries.
@@ -25,7 +25,7 @@ BLAS libraries.
:::
:::{grid-item-card} [hipBLASLt](https://rocmdocs.amd.com/projects/hipBLASLt/en/develop/)
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends funtionalities beyond traditional BLAS library. hipBLASLt is exposed APIs in HIP programming language with an underlying optimized generator as a backend kernel provider.
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond a traditional BLAS library. hipBLASLt exposes its APIs in the HIP programming language, with an underlying optimized generator as a back-end kernel provider.
- [Documentation](https://rocmdocs.amd.com/projects/hipBLASLt/en/develop/)
- [Changelog](https://github.com/ROCmSoftwarePlatform/hipBLASLt/blob/develop/CHANGELOG.md)


@@ -13,7 +13,7 @@ rocRAND is an AMD GPU optimized library for pseudo-random number generators (PRN
:::
:::{grid-item-card} [hipRAND](https://rocmdocs.amd.com/projects/hipRAND/en/rtd/)
hipRAND is a compatiblity layer for GPU accelerated FFT optimized for AMD GPUs
hipRAND is a compatibility layer for GPU-accelerated random number generation optimized for AMD GPUs
using rocRAND. hipRAND allows for a common interface for other non-AMD GPU
RNG libraries.


@@ -1 +1 @@
# Kernel Userspace Compatibility Reference
# Kernel User Space Compatibility Reference


@@ -17,7 +17,7 @@ provided by the AMD E-SMI inband library and the ROCm SMI GPU library to the Pro
:::
:::{grid-item-card} [ROCm SMI](https://rocmdocs.amd.com/projects/rocmsmi/en/latest/)
This tool acts as a command line interface for manipulating and monitoring the amdgpu kernel, and is intended to replace and deprecate the existing rocm_smi.py CLI tool. It uses Ctypes to call the rocm_smi_lib API.
This tool acts as a command line interface for manipulating and monitoring the AMD GPU kernel, and is intended to replace and deprecate the existing `rocm_smi.py` CLI tool. It uses `ctypes` to call the `rocm_smi_lib` API.
- [Documentation](https://rocmdocs.amd.com/projects/rocmsmi/en/latest/)
- [Examples](https://github.com/RadeonOpenCompute/rocm_smi_lib/tree/master/python_smi_tools)
@@ -25,7 +25,7 @@ This tool acts as a command line interface for manipulating and monitoring the a
:::
:::{grid-item-card} [ROCm Datacenter Tool](https://rocmdocs.amd.com/projects/rdc/en/latest/)
The ROCm™ Data Center Tool simplifies the administration and addresses key infrastructure challenges in AMD GPUs in cluster and datacenter environments.
The ROCm™ Data Center Tool simplifies the administration and addresses key infrastructure challenges in AMD GPUs in cluster and data center environments.
- [Documentation](https://rocmdocs.amd.com/projects/rdc/en/latest/)
- [Examples](https://github.com/RadeonOpenCompute/rdc/tree/master/example)


@@ -6,9 +6,9 @@ The ROCm™ installation includes an LLVM-based implementation that fully suppor
### Installation
The OpenMP toolchain is automatically installed as part of the standard ROCm installation and is available under /opt/rocm-{version}/llvm. The sub-directories are:
The OpenMP toolchain is automatically installed as part of the standard ROCm installation and is available under `/opt/rocm-{version}/llvm`. The sub-directories are:
bin: Compilers (flang and clang) and other binaries.
- bin: Compilers (`flang` and `clang`) and other binaries.
- examples: The usage section below shows how to compile and run these programs.
@@ -36,13 +36,13 @@ The above invocation of Make compiles and runs the program. Note the options tha
-fopenmp --offload-arch=<gpu-arch>
```
Obtain the value of gpu-arch by running the following command:
Obtain the value of `gpu-arch` by running the following command:
```bash
% /opt/rocm-{version}/bin/rocminfo | grep gfx
```
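For example, a minimal sketch of compiling an OpenMP offload program for a device that reports `gfx908` (the source file name and the architecture value are illustrative; substitute the value printed by `rocminfo` on your system):

```bash
# Hypothetical compile-and-run example for a gfx908 device
% /opt/rocm-{version}/llvm/bin/clang -O2 -fopenmp --offload-arch=gfx908 helloworld.c -o helloworld
% ./helloworld
```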
[//]: # (dated link below, needs upading)
[//]: # (dated link below, needs updating)
See the complete list of compiler command-line references [here](https://github.com/RadeonOpenCompute/llvm-project/blob/amd-stg-open/clang/docs/CommandGuide/clang.rst).
@@ -56,7 +56,7 @@ The following steps describe a typical workflow for using rocprof with OpenMP co
% rocprof <application> <args>
```
This produces a results.csv file in the users current directory that shows basic stats such as kernel names, grid size, number of registers used, etc. The user can choose to specify the preferred output file name using the o option.
This produces a `results.csv` file in the user's current directory that shows basic stats such as kernel names, grid size, number of registers used, etc. The user can choose to specify the preferred output file name using the `-o` option.
2. Add options for a detailed result:
@@ -64,9 +64,9 @@ The following steps describe a typical workflow for using rocprof with OpenMP co
--stats: % rocprof --stats <application> <args>
```
The stats option produces timestamps for the kernels. Look into the output CSV file for the field, DurationNs, which is useful in getting an understanding of the critical kernels in the code.
The stats option produces timestamps for the kernels. Look into the output CSV file for the field `DurationNs`, which is useful in getting an understanding of the critical kernels in the code.
Apart from --stats, the option --timestamp on produces a timestamp for the kernels.
Apart from `--stats`, the option `--timestamp on` produces a timestamp for the kernels.
3. After learning about the required kernels, the user can take a detailed look at each one of them. rocprof has support for hardware counters: a set of basic and a set of derived ones. See the complete list of counters using the options `--list-basic` and `--list-derived`. rocprof accepts either a text or an XML file as an input (see the sketch following this list).
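A combined sketch of the three steps above (`./app` stands for your application; the `-o` and `-i` flags, the `pmc :` input syntax, and the counter names are assumptions to verify against `rocprof --help`, `--list-basic`, and `--list-derived` on your system):

```bash
# Step 1: basic profile written to a custom CSV file
% rocprof -o run1.csv ./app

# Step 2: add kernel timing statistics (see the DurationNs column in the stats CSV)
% rocprof --stats ./app

# Step 3: request specific hardware counters from a text input file
% printf 'pmc : SQ_WAVES GRBM_COUNT\n' > counters.txt
% rocprof -i counters.txt ./app
```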
@@ -74,7 +74,7 @@ For more details on rocprof, refer to the ROCm Profiling Tools document on [http
### Using Tracing Options
**Prerequisite:** When using the --sys-trace option, compile the OpenMP program with:
**Prerequisite:** When using the `--sys-trace` option, compile the OpenMP program with:
```bash
-Wl,-rpath,/opt/rocm-{version}/lib -lamdhip64
@@ -82,9 +82,9 @@ For more details on rocprof, refer to the ROCm Profiling Tools document on [http
The following tracing options are widely used to generate useful information:
- **--hsa-trace**: This option is used to get a JSON output file with the HSA API execution traces and a flat profile in a CSV file.
- **`--hsa-trace`**: This option is used to get a JSON output file with the HSA API execution traces and a flat profile in a CSV file.
- **--sys-trace**: This allows programmers to trace both HIP and HSA calls. Since this option results in loading ``libamdhip64.so``, follow the prerequisite as mentioned above.
- **`--sys-trace`**: This allows programmers to trace both HIP and HSA calls. Since this option results in loading ``libamdhip64.so``, follow the prerequisite as mentioned above.
A CSV and a JSON file are produced by the above trace options. The CSV file presents the data in a tabular format, and the JSON file can be visualized using Google Chrome at chrome://tracing/ or [Perfetto](https://perfetto.dev/). Navigate to Chrome or Perfetto and load the JSON file to see the timeline of the HSA calls.
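As a sketch (the application name is illustrative; remember the `-Wl,-rpath,... -lamdhip64` prerequisite above when using `--sys-trace`):

```bash
# HSA API traces plus a flat profile
% rocprof --hsa-trace ./app

# HIP and HSA traces together; open the generated JSON in chrome://tracing/ or Perfetto
% rocprof --sys-trace ./app
```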
@@ -94,14 +94,14 @@ For more details on tracing, refer to the ROCm Profiling Tools document on [http
:::{table}
:widths: auto
| Environment Variable | Description |
| ----------- | ----------- |
| OMP_NUM_TEAMS | The implementation chooses the number of teams for kernel launch. The user can change this number for performance tuning using this environment variable, subject to implementation limits. |
| OMPX_DISABLE_MAPS | Under USM mode, the implementation automatically checks for correctness of the map clauses without performing any copying. The user can disable this check by setting this environment variable to 1. |
| LIBOMPTARGET_KERNEL_TRACE | This environment variable is used to print useful statistics for device operations. Setting it to 1 and running the program emits the name of every kernel launched, the number of teams and threads used, and the corresponding register usage. Setting it to 2 additionally emits timing information for kernel launches and data transfer operations between the host and the device. |
| LIBOMPTARGET_INFO | This environment variable is used to print informational messages from the device runtime as the program executes. Users can request fine-grain information by setting it to the value of 1 or higher and can set the value of -1 for complete information. |
| LIBOMPTARGET_DEBUG | If a debug version of the device library is present, setting this environment variable to 1 and using that library emits further detailed debugging information about data transfer operations and kernel launch. |
| GPU_MAX_HW_QUEUES | This environment variable is used to set the number of HSA queues in the OpenMP runtime. |
| Environment Variable | Description |
| --------------------------- | ----------- |
| `OMP_NUM_TEAMS` | The implementation chooses the number of teams for kernel launch. The user can change this number for performance tuning using this environment variable, subject to implementation limits. |
| `OMPX_DISABLE_MAPS` | Under USM mode, the implementation automatically checks for correctness of the map clauses without performing any copying. The user can disable this check by setting this environment variable to 1. |
| `LIBOMPTARGET_KERNEL_TRACE` | This environment variable is used to print useful statistics for device operations. Setting it to 1 and running the program emits the name of every kernel launched, the number of teams and threads used, and the corresponding register usage. Setting it to 2 additionally emits timing information for kernel launches and data transfer operations between the host and the device. |
| `LIBOMPTARGET_INFO` | This environment variable is used to print informational messages from the device runtime as the program executes. Users can request fine-grain information by setting it to the value of 1 or higher and can set the value of -1 for complete information. |
| `LIBOMPTARGET_DEBUG` | If a debug version of the device library is present, setting this environment variable to 1 and using that library emits further detailed debugging information about data transfer operations and kernel launch. |
| `GPU_MAX_HW_QUEUES` | This environment variable is used to set the number of HSA queues in the OpenMP runtime. |
:::
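For instance, a hedged sketch of combining two of these variables for a single run (the values are only examples):

```bash
# Print per-kernel launch statistics and cap the number of teams for this run
% export LIBOMPTARGET_KERNEL_TRACE=1
% export OMP_NUM_TEAMS=120
% ./app
```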
## OpenMP: Features
@@ -110,9 +110,9 @@ The OpenMP programming model is greatly enhanced with the following new features
### Asynchronous Behavior in OpenMP Target Regions
- Multithreaded offloading on the same device
- Multi-threaded offloading on the same device
The libomptarget plugin for GPU offloading allows creation of separate configurable HSA queues per chiplet, which enables two or more threads to concurrently offload to the same device.
The `libomptarget` plugin for GPU offloading allows creation of separate configurable HSA queues per chiplet, which enables two or more threads to concurrently offload to the same device.
- Parallel memory copy invocations
@@ -146,7 +146,7 @@ xnack- with -offload-arch=gfx908:xnack-
#### Unified Shared Memory Pragma
This OpenMP pragma is available on MI200 through xnack+ support.
This OpenMP pragma is available on MI200 through `xnack+` support.
```bash
omp requires unified_shared_memory
@@ -192,20 +192,20 @@ The difference between the memory pages pointed to by these two variables is tha
The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the OpenMP specification document. These APIs allow first-party tools to examine the profile and kernel traces that execute on a device. A tool can register callbacks for data transfer and kernel dispatch entry points or use APIs to start and stop tracing for device-related activities such as data transfer and kernel dispatch timings and associated metadata. If device tracing is enabled, trace records for device activities are collected during program execution and returned to the tool using the APIs described in the specification.
The following example demonstrates how a tool uses the supported OMPT target APIs. The README in /opt/rocm/llvm/examples/tools/ompt outlines the steps to be followed, and the provided example can be run as shown below:
The following example demonstrates how a tool uses the supported OMPT target APIs. The `README` in `/opt/rocm/llvm/examples/tools/ompt` outlines the steps to be followed, and the provided example can be run as shown below:
```bash
% cd $ROCM_PATH/share/openmp-extras/examples/tools/ompt/veccopy-ompt-target-tracing
% make run
```
The file veccopy-ompt-target-tracing.c simulates how a tool initiates device activity tracing. The file callbacks.h shows the callbacks registered and implemented by the tool.
The file `veccopy-ompt-target-tracing.c` simulates how a tool initiates device activity tracing. The file `callbacks.h` shows the callbacks registered and implemented by the tool.
### Floating Point Atomic Operations
The MI200-series GPUs support the generation of hardware floating-point atomics using the OpenMP atomic pragma. The support includes single- and double-precision floating-point atomic operations. The programmer must ensure that the memory subjected to the atomic operation is in coarse-grain memory by mapping it explicitly with the help of map clauses when not implicitly mapped by the compiler as per the [OpenMP specifications](https://www.openmp.org/specifications/). This makes these hardware floating-point atomic instructions “fast,” as they are faster than using a default compare-and-swap loop scheme, but at the same time “unsafe,” as they are not supported on fine-grain memory. The operation in unified_shared_memory mode also requires programmers to map the memory explicitly when not implicitly mapped by the compiler.
The MI200-series GPUs support the generation of hardware floating-point atomics using the OpenMP atomic pragma. The support includes single- and double-precision floating-point atomic operations. The programmer must ensure that the memory subjected to the atomic operation is in coarse-grain memory by mapping it explicitly with the help of map clauses when not implicitly mapped by the compiler as per the [OpenMP specifications](https://www.openmp.org/specifications/). This makes these hardware floating-point atomic instructions “fast,” as they are faster than using a default compare-and-swap loop scheme, but at the same time “unsafe,” as they are not supported on fine-grain memory. The operation in `unified_shared_memory` mode also requires programmers to map the memory explicitly when not implicitly mapped by the compiler.
To request fast floating-point atomic instructions at the file level, use compiler flag -munsafe-fp-atomics or a hint clause on a specific pragma:
To request fast floating-point atomic instructions at the file level, use compiler flag `-munsafe-fp-atomics` or a hint clause on a specific pragma:
```bash
double a = 0.0;
@@ -213,9 +213,9 @@ double a = 0.0;
a = a + 1.0;
```
NOTE AMD_unsafe_fp_atomics is an alias for AMD_fast_fp_atomics, and AMD_safe_fp_atomics is implemented with a compare-and-swap loop.
**Note:** `AMD_unsafe_fp_atomics` is an alias for `AMD_fast_fp_atomics`, and `AMD_safe_fp_atomics` is implemented with a compare-and-swap loop.
To disable the generation of fast floating-point atomic instructions at the file level, build using the option -msafe-fp-atomics or use a hint clause on a specific pragma:
To disable the generation of fast floating-point atomic instructions at the file level, build using the option `-msafe-fp-atomics` or use a hint clause on a specific pragma:
```bash
double a = 0.0;
@@ -225,7 +225,7 @@ a = a + 1.0;
The hint clause value always has precedence over the compiler flag, which allows programmers to create atomic constructs with a different behavior than the rest of the file.
See the example below, where the user builds the program using -msafe-fp-atomics to select a file-wide “safe atomic” compilation. However, the fast atomics hint clause over variable “a” takes precedence and operates on “a” using a fast/unsafe floating-point atomic, while the variable “b” in the absence of a hint clause is operated upon using safe floating-point atomics as per the compiler flag.
See the example below, where the user builds the program using `-msafe-fp-atomics` to select a file-wide “safe atomic” compilation. However, the fast atomics hint clause over variable “a” takes precedence and operates on “a” using a fast/unsafe floating-point atomic, while the variable “b” in the absence of a hint clause is operated upon using safe floating-point atomics as per the compiler flag.
```bash
double a = 0.0;
@@ -239,7 +239,7 @@ b = b + 1.0;
### Address Sanitizer (ASan) Tool
Address Sanitizer is a memory error detector tool utilized by applications to detect various errors ranging from spatial issues such as out-of-bound access to temporal issues such as use-after-free. The AOMP compiler supports ASan for AMDGPUs with applications written in both HIP and OpenMP.
Address Sanitizer is a memory error detector tool utilized by applications to detect various errors, ranging from spatial issues such as out-of-bounds access to temporal issues such as use-after-free. The AOMP compiler supports ASan for AMD GPUs with applications written in both HIP and OpenMP.
**Features Supported on Host Platform (Target x86_64):**
@@ -259,7 +259,7 @@ Address Sanitizer is a memory error detector tool utilized by applications to de
- Initialization order bugs
**Features Supported on AMDGPU Platform (amdgcn-amd-amdhsa):**
**Features Supported on AMDGPU Platform (`amdgcn-amd-amdhsa`):**
- Heap buffer overflow
@@ -318,11 +318,11 @@ The No-loop kernel generation feature optimizes the compiler performance by gene
To enable the generation of the specialized kernel, follow these guidelines:
- Do not specify teams, threads, and schedule-related environment variables. The num_teams or a thread_limit clause in an OpenMP target construct acts as an override and prevents the generation of the specialized kernel. As the user is unable to specify the number of teams and threads used within target regions in the absence of the above-mentioned environment variables, the runtime will select the best values for the launch configuration based on runtime knowledge of the program.
- Do not specify teams, threads, and schedule-related environment variables. The `num_teams` or a `thread_limit` clause in an OpenMP target construct acts as an override and prevents the generation of the specialized kernel. As the user is unable to specify the number of teams and threads used within target regions in the absence of the above-mentioned environment variables, the runtime will select the best values for the launch configuration based on runtime knowledge of the program.
- Assert the absence of the above-mentioned environment variables by adding the command-line option fopenmp-target-ignore-env-vars. This option also allows programmers to enable the No-loop functionality at lower optimization levels.
- Assert the absence of the above-mentioned environment variables by adding the command-line option `-fopenmp-target-ignore-env-vars`. This option also allows programmers to enable the No-loop functionality at lower optimization levels.
- Also, the No-loop functionality is automatically enabled when -O3 or -Ofast is used for compilation. To disable this feature, use -fno-openmp-target-ignore-env-vars.
- Also, the No-loop functionality is automatically enabled when `-O3` or `-Ofast` is used for compilation. To disable this feature, use `-fno-openmp-target-ignore-env-vars`.
**Note:** The compiler might not generate the No-loop kernel in certain scenarios where the performance improvement is not substantial.
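A minimal sketch of opting in to the specialized kernel at a lower optimization level (the flag comes from the guidelines above; the source file and GPU architecture are placeholders):

```bash
# Enable No-loop kernel generation at -O2; it is enabled automatically at -O3/-Ofast
% clang -O2 -fopenmp --offload-arch=gfx90a -fopenmp-target-ignore-env-vars target_loop.c -o target_loop
```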


@@ -11,10 +11,10 @@ The differences are listed in [the table below](rocm-llvm-vs-alt).
:::{table} Differences between `rocm-llvm` and `rocm-llvm-alt`
:name: rocm-llvm-vs-alt
| **rocm-llvm** | **rocm-llvm-alt** |
|:---------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------:|
| Installed by default when ROCm™ itself is installed | An optional package |
| Provides an open-source compiler | Provides an additional closed-source compiler for users interested in additional CPU optimizations not available in rocm-llvm |
| **`rocm-llvm`** | **`rocm-llvm-alt`** |
|:---------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------:|
| Installed by default when ROCm™ itself is installed | An optional package |
| Provides an open-source compiler | Provides an additional closed-source compiler for users interested in additional CPU optimizations not available in `rocm-llvm` |
:::
For more details, see:
@@ -30,25 +30,25 @@ ROCm currently provides two compiler interfaces for compiling HIP programs:
- `/opt/rocm/bin/amdclang++`
Both leverage the same LLVM compiler technology with the AMD GCN GPU support;
however, they offer a slightly different user experience. The hipcc command-line
however, they offer a slightly different user experience. The `hipcc` command-line
interface aims to provide a more familiar user interface to users who are
experienced in CUDA but relatively new to the ROCm/HIP development environment.
On the other hand, amdclang++ provides a user interface identical to the clang++
On the other hand, `amdclang++` provides a user interface identical to the clang++
compiler. It is more suitable for experienced developers who want to directly
interact with the clang compiler and gain full control of their application's
build process.
The major differences between hipcc and amdclang++ are listed below:
The major differences between `hipcc` and `amdclang++` are listed below:
::::{table} Differences between hipcc and amdclang++
::::{table} Differences between `hipcc` and `amdclang++`
:name: hipcc-vs-amdclang
| * | **hipcc** | **amdclang++** |
|:----------------------------------:|:------------------------------------------------------------------------------------------------------------------------:|:--------------:|
| Compiling HIP source files | Treats all source files as HIP language source files | Enables the HIP language support for files with the .hip extension or through the -x hip compiler option |
| Detecting GPU architecture | Auto-detects the GPUs available on the system and generates code for those devices when no GPU architecture is specified | Has AMD GCN gfx803 as the default GPU architecture. The --offload-arch compiler option may be used to target other GPU architectures |
| Finding a HIP installation | Finds the HIP installation based on its own location and its knowledge about the ROCm directory structure | First looks for HIP under the same parent directory as its own LLVM directory and then falls back on /opt/rocm. Users can use the --rocm-path option to instruct the compiler to use HIP from the specified ROCm installation. |
| Linking to the HIP runtime library | Is configured to automatically link to the HIP runtime from the detected HIP installation | Requires the --hip-link flag to be specified to link to the HIP runtime. Alternatively, users can use the -l`<dir>` -lamdhip64 option to link to a HIP runtime library. |
| Device function inlining | Inlines all GPU device functions, which provide greater performance and compatibility for codes that contain file scoped or device function scoped `__shared__` variables. However, it may increase compile time. | Relies on inlining heuristics to control inlining. Users experiencing performance or compilation issues with code using file scoped or device function scoped `__shared__` variables could try -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false to work around the issue. There are plans to address these issues with future compiler improvements. |
| * | **`hipcc`** | **`amdclang++`** |
|:----------------------------------:|:------------------------------------------------------------------------------------------------------------------------:|:----------------:|
| Compiling HIP source files | Treats all source files as HIP language source files | Enables the HIP language support for files with the `.hip` extension or through the `-x hip` compiler option |
| Detecting GPU architecture | Auto-detects the GPUs available on the system and generates code for those devices when no GPU architecture is specified | Has AMD GCN gfx803 as the default GPU architecture. The `--offload-arch` compiler option may be used to target other GPU architectures |
| Finding a HIP installation | Finds the HIP installation based on its own location and its knowledge about the ROCm directory structure | First looks for HIP under the same parent directory as its own LLVM directory and then falls back on `/opt/rocm`. Users can use the `--rocm-path` option to instruct the compiler to use HIP from the specified ROCm installation. |
| Linking to the HIP runtime library | Is configured to automatically link to the HIP runtime from the detected HIP installation | Requires the `--hip-link` flag to be specified to link to the HIP runtime. Alternatively, users can use the `-l<dir> -lamdhip64` option to link to a HIP runtime library. |
| Device function inlining | Inlines all GPU device functions, which provide greater performance and compatibility for codes that contain file scoped or device function scoped `__shared__` variables. However, it may increase compile time. | Relies on inlining heuristics to control inlining. Users experiencing performance or compilation issues with code using file scoped or device function scoped `__shared__` variables could try `-mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false` to work around the issue. There are plans to address these issues with future compiler improvements. |
| Source code location | <https://github.com/ROCm-Developer-Tools/HIPCC> | <https://github.com/RadeonOpenCompute/llvm-project> |
::::
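As a hedged illustration of the differences above (the flags are taken from the table; the source file and GPU architecture are placeholders):

```bash
# hipcc treats the source as HIP, can auto-detect the GPU, and links the HIP runtime automatically
% hipcc --offload-arch=gfx90a saxpy.cpp -o saxpy

# amdclang++ needs HIP support enabled explicitly and the HIP runtime linked explicitly
% amdclang++ -x hip --offload-arch=gfx90a --hip-link saxpy.cpp -o saxpy
```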
@@ -115,8 +115,8 @@ This section outlines commonly used compiler flags for `hipcc` and `amdclang++`.
The CPU compiler optimizations described in this chapter originate from the AMD
Optimizing C/C++ Compiler (AOCC) compiler. They are available in ROCmCC if the
optional rocm-llvm-alt package is installed. The users interaction with the
compiler does not change once rocm-llvm-alt is installed. The user should use
optional `rocm-llvm-alt` package is installed. The user's interaction with the
compiler does not change once `rocm-llvm-alt` is installed. The user should use
the same compiler entry point, provided AMD provides high-performance compiler
optimizations for Zen-based processors in AOCC.
@@ -149,13 +149,13 @@ feasible, this optimization transforms the code to enable these improvements.
This transformation is likely to improve cache utilization and memory bandwidth.
It is expected to improve the scalability of programs executed on multiple cores.
This is effective only under `flto`, as the whole program analysis is required
This is effective only under `-flto`, as the whole program analysis is required
to perform this optimization. Users can choose different levels of
aggressiveness with which this optimization can be applied to the application,
with 1 being the least aggressive and 7 being the most aggressive level.
:::{table} -fstruct-layout Values and Their Effects
| -fstruct-layout value | Structure peeling | Pointer size after selective compression of self-referential pointers in structures, wherever safe | Type of structure fields eligible for compression | Whether compression performed under safety check |
| `-fstruct-layout` value | Structure peeling | Pointer size after selective compression of self-referential pointers in structures, wherever safe | Type of structure fields eligible for compression | Whether compression performed under safety check |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| 1 | Enabled | NA | NA | NA |
| 2 | Enabled | 32-bit | NA | NA |
@@ -191,14 +191,14 @@ optimization, which is invoked as `-flto -fitodcallsbyclone`.
#### `-fremap-arrays`
Transforms the data layout of a single dimensional array to provide better cache
locality. This optimization is effective only under `flto`, as the whole program
locality. This optimization is effective only under `-flto`, as the whole program
needs to be analyzed to perform this optimization, which can be invoked as
`-flto -fremap-arrays`.
#### `-finline-aggressive`
Enables improved inlining capability through better heuristics. This
optimization is more effective when used with `flto`, as the whole program
optimization is more effective when used with `-flto`, as the whole program
analysis is required to perform this optimization, which can be invoked as
`-flto -finline-aggressive`.
@@ -282,7 +282,7 @@ or factor of 16. This vectorization width of 16 may be overwritten by
##### `-enable-redundant-movs`
Removes any redundant mov operations including redundant loads from memory and
Removes any redundant `mov` operations including redundant loads from memory and
stores to memory. This can be invoked using
`-Wl,-plugin-opt=-enable-redundant-movs`.
@@ -322,13 +322,13 @@ functions at call sites.
| 4 | 10 |
:::
This is more effective with flto as the whole program needs to be analyzed to
This is more effective with `-flto` as the whole program needs to be analyzed to
perform this optimization, which can be invoked as
`-flto -inline-recursion=[1,2,3,4]`.
##### `-reduce-array-computations=[1,2,3]`
Performs array dataflow analysis and optimizes the unused array computations.
Performs array data flow analysis and optimizes the unused array computations.
:::{table} -reduce-array-computations Values and Their Effects
| -reduce-array-computations value | Array elements eligible for elimination of computations |
@@ -338,7 +338,7 @@ Performs array dataflow analysis and optimizes the unused array computations.
| 3 | Both unused and zero valued |
:::
This optimization is effective with flto as the whole program needs to be
This optimization is effective with `-flto` as the whole program needs to be
analyzed to perform this optimization, which can be invoked as
`-flto -reduce-array-computations=[1,2,3]`.
@@ -352,7 +352,7 @@ vector operations. This option is set to **true** by default.
Experimental flag for enabling vectorization on certain loops with complex
control flow, which the normal vectorizer cannot handle.
This optimization is effective with flto as the whole program needs to be
This optimization is effective with `-flto` as the whole program needs to be
analyzed to perform this optimization, which can be invoked as
`-flto -region-vectorize`.
@@ -423,12 +423,12 @@ This option is set to false by default.
##### `-Hz,1,0x1 [Fortran]`
Helps to preserve array index information for array access expressions which get
linearized in the compiler frontend. The preserved information is used by the
linearized in the compiler front end. The preserved information is used by the
compiler optimization phase in performing optimizations such as loop
transformations. It is recommended that any user who is using optimizations
such as loop transformations and other optimizations requiring de-linearized
index expressions should use the Hz option. This option has no impact on any
other aspects of the Flang frontend.
other aspects of the Flang front end.
### Inline ASM Statements
@@ -467,7 +467,7 @@ compiler.
An LLVM library and tool that is used to query the execution capability of the
current system as well as to query requirements of a binary file. It is used by
the OpenMP device runtime to ensure compatibility of an image with the current
system while loading it. It is compatible with TargetID support and multi-image
system while loading it. It is compatible with target ID support and multi-image
fat binary support.
**Usage:**
@@ -478,7 +478,7 @@ offload-arch [Options] [Optional lookup-value]
When used without an option, `offload-arch` prints the value of the first offload
arch found in the underlying system. This can be used by various clang
frontends. For example, to compile for OpenMP offloading on your current system,
front ends. For example, to compile for OpenMP offloading on your current system,
invoke clang with the following command:
```bash
@@ -507,11 +507,11 @@ The options are listed below:
:::
:::{option} -m
Prints device code name (often found in pci.ids file).
Prints device code name (often found in `pci.ids` file).
:::
:::{option} -n
Prints numeric pci-id.
Prints numeric `pci-id`.
:::
:::{option} -t
@@ -530,12 +530,12 @@ The options are listed below:
Prints offload capabilities of the underlying system. This option is used by the language runtime to select an image when multiple images are available. A capability must exist for each requirement of the selected image.
:::
There are symbolic link aliases amdgpu-offload-arch and nvidia-arch for
offload-arch. These aliases return 1 if no amdgcn GPU or cuda GPU is found.
There are symbolic link aliases `amdgpu-offload-arch` and `nvidia-arch` for
`offload-arch`. These aliases return 1 if no AMD GCN GPU or CUDA GPU is found.
These aliases are useful in determining whether architecture-specific tests
should be run or to conditionally load architecture-specific software.
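For example, a sketch that relies on the alias's exit status (the test script name is hypothetical):

```bash
# Run GPU-specific tests only when an AMD GCN device is detected
% if amdgpu-offload-arch > /dev/null 2>&1; then ./run_amdgpu_tests.sh; fi
```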
#### Command-Line Simplification Using offload-arch Flag
#### Command-Line Simplification Using `offload-arch` Flag
The legacy mechanism of specifying an offloading target for OpenMP involves using three
flags, `-fopenmp-targets`, `-Xopenmp-target`, and `-march`. The first two flags
@@ -562,14 +562,14 @@ clang -fopenmp -target x86_64-linux-gnu \
```
To ensure backward compatibility, both styles are supported. This option is
compatible with TargetID support and multi-image fat binaries.
compatible with target ID support and multi-image fat binaries.
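A hedged sketch of the two equivalent invocations (the GPU architecture and file names are illustrative; the legacy flags are the three named above):

```bash
# Legacy style: three separate flags
% clang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa \
    -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 helloworld.c -o helloworld

# Simplified style
% clang -fopenmp --offload-arch=gfx908 helloworld.c -o helloworld
```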
#### TargetID Support for OpenMP
#### Target ID Support for OpenMP
The ROCmCC compiler supports specification of target features along with the GPU
name while specifying a target offload device in the command line, using
`-march` or `--offload-arch` options. The compiled image in such cases is
specialized for a given configuration of device and target features (TargetID).
specialized for a given configuration of device and target features (target ID).
**Example:**
@@ -598,8 +598,8 @@ clang -fopenmp -target x86_64-linux-gnu \
-march=gfx908:sramecc+:xnack- helloworld.c -o helloworld
```
The TargetID specified on the command line is passed to the clang driver using
target-feature flag, to the LLVM optimizer and backend using `-mattr` flag, and
The target ID specified on the command line is passed to the clang driver using
the `target-feature` flag, to the LLVM optimizer and back end using the `-mattr` flag, and
to the linker using the `-plugin-opt=-mattr` flag. This feature is compatible with
the `offload-arch` command-line option and multi-image binaries for multiple
architectures.
@@ -609,14 +609,14 @@ architectures.
The ROCmCC compiler is enhanced to generate binaries that can contain
heterogeneous images. This heterogeneity could be in terms of:
- Images of different architectures, like amdgcn and nvptx
- Images of different architectures, like AMD GCN and NVPTX
- Images of the same architecture but for different GPUs, like gfx906 and gfx908
- Images of the same architecture and the same GPU but for different target features,
like gfx908:xnack+ and gfx908:xnack-
like `gfx908:xnack+` and `gfx908:xnack-`
An appropriate image is selected by the OpenMP device runtime for execution
depending on the capability of the current system. This feature is compatible
with TargetID support and offload-arch command-line options and uses
with target ID support and `offload-arch` command-line options, and uses the
`offload-arch` tool to determine the capability of the current system.
**Example:**
@@ -660,7 +660,7 @@ capability of the current system.
#### Unified Shared Memory (USM)
The following OpenMP pragma is available on MI200, and it must be executed with
xnack+ support.
`xnack+` support.
```cpp
omp requires unified_shared_memory
@@ -674,6 +674,8 @@ refer to the OpenMP Support Guide at [https://docs.amd.com](https://docs.amd.com
The following table lists the other Clang options and their support status.
<!-- spellcheck-disable -->
:::{table} Clang Options
:name: clang-options
:widths: auto
@@ -1440,3 +1442,4 @@ The following table lists the other Clang options and their support status.
|-x \<language\>|Supported|Assumes subsequent input files to have the given type \<language\>|
|-z \<arg\>|Supported|Passes -z \<arg\> to the linker|
:::
<!-- spellcheck-enable -->