# Compiler reference guide

## Introduction to compiler reference guide

ROCmCC is a Clang/LLVM-based compiler. It is optimized for high-performance computing on AMD GPUs and CPUs and supports various heterogeneous programming models such as HIP, OpenMP, and OpenCL.

ROCmCC is made available via two packages: `rocm-llvm` and `rocm-llvm-alt`. The differences are listed in [the table below](rocm-llvm-vs-alt).

:::{table} Differences between `rocm-llvm` and `rocm-llvm-alt`
:name: rocm-llvm-vs-alt

| **`rocm-llvm`** | **`rocm-llvm-alt`** |
|:---------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------:|
| Installed by default when ROCm™ itself is installed | An optional package |
| Provides an open-source compiler | Provides an additional closed-source compiler for users interested in additional CPU optimizations not available in `rocm-llvm` |

:::

For more details, see:

* AMD GPU usage: [llvm.org/docs/AMDGPUUsage.html](https://llvm.org/docs/AMDGPUUsage.html)
* Releases and source:

### ROCm compiler interfaces

ROCm currently provides two compiler interfaces for compiling HIP programs:

* `/opt/rocm/bin/hipcc`
* `/opt/rocm/bin/amdclang++`

Both leverage the same LLVM compiler technology with AMD GCN GPU support; however, they offer a slightly different user experience. The `hipcc` command-line interface aims to provide a more familiar user interface to users who are experienced in CUDA but relatively new to the ROCm/HIP development environment. On the other hand, `amdclang++` provides a user interface identical to the clang++ compiler. It is more suitable for experienced developers who want to directly interact with the clang compiler and gain full control of their application's build process.

The major differences between `hipcc` and `amdclang++` are listed below:

::::{table} Differences between `hipcc` and `amdclang++`
:name: hipcc-vs-amdclang

| | **`hipcc`** | **`amdclang++`** |
|:----------------------------------:|:------------------------------------------------------------------------------------------------------------------------:|:----------------:|
| Compiling HIP source files | Treats all source files as HIP language source files | Enables the HIP language support for files with the `.hip` extension or through the `-x hip` compiler option |
| Detecting GPU architecture | Auto-detects the GPUs available on the system and generates code for those devices when no GPU architecture is specified | Has AMD GCN gfx803 as the default GPU architecture. The `--offload-arch` compiler option may be used to target other GPU architectures |
| Finding a HIP installation | Finds the HIP installation based on its own location and its knowledge about the ROCm directory structure | First looks for HIP under the same parent directory as its own LLVM directory and then falls back on `/opt/rocm`. Users can use the `--rocm-path` option to instruct the compiler to use HIP from the specified ROCm installation. |
| Linking to the HIP runtime library | Is configured to automatically link to the HIP runtime from the detected HIP installation | Requires the `--hip-link` flag to be specified to link to the HIP runtime. Alternatively, users can use the `-lamdhip64` option to link to a HIP runtime library. |
| Device function inlining | Inlines all GPU device functions, which provides greater performance and compatibility for codes that contain file scoped or device function scoped `__shared__` variables. However, it may increase compile time. | Relies on inlining heuristics to control inlining. Users experiencing performance or compilation issues with code using file scoped or device function scoped `__shared__` variables could try `-mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false` to work around the issue. There are plans to address these issues with future compiler improvements. |
| Source code location | | |

::::

## Compiler options and features

This chapter discusses compiler options and features.

### AMD GPU compilation

This section outlines commonly used compiler flags for `hipcc` and `amdclang++`.

:::{option} -x hip
Compiles the source file as a HIP program.
:::

:::{option} -fopenmp
Enables the OpenMP support.
:::

:::{option} -fopenmp-targets=<gpu>
Enables the OpenMP target offload support for the specified GPU architecture.

:gpu: The GPU architecture, e.g. `gfx908`.
:::

:::{option} --gpu-max-threads-per-block=<value>
Sets the default limit of threads per block. Also referred to as the launch bounds.

:value: The default maximum number of threads per block.
:::

:::{option} -munsafe-fp-atomics
Enables unsafe floating-point atomic instructions (AMDGPU only).
:::

:::{option} -ffast-math
Allows aggressive, lossy floating-point optimizations.
:::

:::{option} -mwavefrontsize64, -mno-wavefrontsize64
Sets the wavefront size to 64 or 32 on RDNA architectures.
:::

:::{option} -mcumode
Switches between CU and WGP modes on RDNA architectures.
:::

:::{option} --offload-arch=<gpu>
HIP offloading target ID. May be specified more than once.

:gpu: The device architecture followed by target ID features delimited by a colon. Each target ID feature is a predefined string followed by a plus or minus sign (e.g. `gfx908:xnack+:sramecc-`).
:::

:::{option} -g
Generates source-level debug information.
:::

:::{option} -fgpu-rdc, -fno-gpu-rdc
Generates relocatable device code, also known as separate compilation mode.
:::

### AMD optimizations for Zen architectures

The CPU compiler optimizations described in this chapter originate from the AMD Optimizing C/C++ Compiler (AOCC). They are available in ROCmCC if the optional `rocm-llvm-alt` package is installed. Once `rocm-llvm-alt` is installed, the user's interaction with the compiler does not change: the same compiler entry points are used. AOCC provides high-performance compiler optimizations for Zen-based processors.

For more information, refer to [https://www.amd.com/en/developer/aocc.html](https://www.amd.com/en/developer/aocc.html).

#### `-famd-opt`

Enables a default set of AMD proprietary optimizations for the AMD Zen CPU architectures. `-fno-amd-opt` disables the AMD proprietary optimizations.

The `-famd-opt` flag is useful when a user wants to build with the proprietary optimization compiler without having to set any of the other proprietary optimization flags.

:::{note}
`-famd-opt` can be used in addition to the other proprietary CPU optimization flags. The optimization flags described below implicitly enable the invocation of the AMD proprietary optimizations compiler, whereas the `-famd-opt` flag requests this invocation explicitly.
:::
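For illustration, a minimal sketch of a build that enables the proprietary optimizations through this single flag, assuming the optional `rocm-llvm-alt` package is installed (the source file name is hypothetical):

```bash
# Enable the default set of AMD proprietary Zen optimizations with one flag;
# no other proprietary optimization flags need to be set.
clang -O3 -famd-opt foo.c -o foo
```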
#### `-fstruct-layout=[1,2,3,4,5,6,7]`

Analyzes the whole program to determine if the structures in the code can be peeled and the pointer or integer fields in the structure can be compressed. If feasible, this optimization transforms the code to enable these improvements. This transformation is likely to improve cache utilization and memory bandwidth. It is expected to improve the scalability of programs executed on multiple cores.

This is effective only under `-flto`, as whole-program analysis is required to perform this optimization. Users can choose different levels of aggressiveness with which this optimization can be applied to the application, with 1 being the least aggressive and 7 being the most aggressive level.

:::{table} -fstruct-layout Values and Their Effects

| `-fstruct-layout` value | Structure peeling | Pointer size after selective compression of self-referential pointers in structures, wherever safe | Type of structure fields eligible for compression | Whether compression is performed under a safety check |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| 1 | Enabled | NA | NA | NA |
| 2 | Enabled | 32-bit | NA | NA |
| 3 | Enabled | 16-bit | NA | NA |
| 4 | Enabled | 32-bit | Integer | Yes |
| 5 | Enabled | 16-bit | Integer | Yes |
| 6 | Enabled | 32-bit | 64-bit signed int or unsigned int. Users must ensure that the values assigned to 64-bit signed int fields are in the range -(2^31 - 1) to +(2^31 - 1) and 64-bit unsigned int fields are in the range 0 to +(2^31 - 1). Otherwise, you may obtain incorrect results. | No. Users must ensure the safety based on the program compiled. |
| 7 | Enabled | 16-bit | 64-bit signed int or unsigned int. Users must ensure that the values assigned to 64-bit signed int fields are in the range -(2^31 - 1) to +(2^31 - 1) and 64-bit unsigned int fields are in the range 0 to +(2^31 - 1). Otherwise, you may obtain incorrect results. | No. Users must ensure the safety based on the program compiled. |

:::

#### `-fitodcalls`

Promotes indirect-to-direct calls by placing conditional calls. Applications or benchmarks that have a small and deterministic set of target functions for function pointers passed as call parameters benefit from this optimization. Indirect-to-direct call promotion transforms the code to use all possible determined targets under runtime checks and falls back to the original code for all the other cases. Runtime checks are introduced by the compiler for each of these possible function pointer targets, followed by direct calls to the targets.

This is a link-time optimization, which is invoked as `-flto -fitodcalls`.

#### `-fitodcallsbyclone`

Performs value specialization for functions with function pointers passed as an argument. It does this specialization by generating a clone of the function. The cloning of the function happens in the call chain as needed, to allow conversion of an indirect function call to a direct call.

This complements the `-fitodcalls` optimization and is also a link-time optimization, which is invoked as `-flto -fitodcallsbyclone`.

#### `-fremap-arrays`

Transforms the data layout of a single-dimensional array to provide better cache locality. This optimization is effective only under `-flto`, as the whole program needs to be analyzed to perform this optimization, which can be invoked as `-flto -fremap-arrays`.

#### `-finline-aggressive`

Enables improved inlining capability through better heuristics. This optimization is more effective when used with `-flto`, as whole-program analysis is required to perform this optimization, which can be invoked as `-flto -finline-aggressive`.
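A minimal sketch of combining several of these link-time optimizations in a single whole-program build, assuming the optional `rocm-llvm-alt` package is installed (the source file names are hypothetical):

```bash
# These flags rely on whole-program analysis, so they are only effective
# when the build is performed under -flto.
clang -O3 -flto -fstruct-layout=3 -fitodcalls -fremap-arrays -finline-aggressive \
  main.c util.c -o app
```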
#### `-fnt-store (non-temporal store)`

Generates a non-temporal store instruction for array accesses in a loop with a large trip count.

#### `-fnt-store=aggressive`

This is an experimental option to generate non-temporal store instructions for array accesses in a loop whose iteration count cannot be determined at compile time. In this case, the compiler assumes the iteration count to be huge.

#### Optimizations through driver `-mllvm`

The following optimization options must be invoked through the compiler driver using `-mllvm <option>`:

##### `-enable-partial-unswitch`

Enables partial loop unswitching, which is an enhancement to the existing loop unswitching optimization in LLVM. Partial loop unswitching hoists a condition inside a loop from a path for which the execution condition remains invariant, whereas the original loop unswitching works for a condition that is completely loop invariant. The condition inside the loop gets hoisted out from the invariant path, and the original loop is retained for the path where the condition is variant.

##### `-aggressive-loop-unswitch`

Experimental option that enables an aggressive loop unswitching heuristic (including `-enable-partial-unswitch`) based on the usage of the branch conditional values. Loop unswitching leads to code bloat. Code bloat can be minimized if the hoisted condition is executed more often. This heuristic prioritizes the conditions based on the number of times they are used within the loop.

The heuristic can be controlled with the following options:

* `-unswitch-identical-branches-min-count=<n>`
  * Enables unswitching of a loop with respect to a branch conditional value (B), where B appears in at least `<n>` compares in the loop. This option is enabled with `-aggressive-loop-unswitch`. The default value is 3.

    **Usage:** `-mllvm -aggressive-loop-unswitch -mllvm -unswitch-identical-branches-min-count=<n>`

    Where `n` is a positive integer; a lower value of `<n>` facilitates more unswitching.

* `-unswitch-identical-branches-max-count=<n>`
  * Enables unswitching of a loop with respect to a branch conditional value (B), where B appears in at most `<n>` compares in the loop. This option is enabled with `-aggressive-loop-unswitch`. The default value is 6.

    **Usage:** `-mllvm -aggressive-loop-unswitch -mllvm -unswitch-identical-branches-max-count=<n>`

    Where `n` is a positive integer; a higher value of `<n>` facilitates more unswitching.

:::{note}
These options may facilitate more unswitching under some workloads. Since loop unswitching inherently leads to code bloat, facilitating more unswitching may significantly increase the code size. Hence, it may also lead to longer compilation times.
:::

##### `-enable-strided-vectorization`

Enables strided memory vectorization as an enhancement to the interleaved vectorization framework present in LLVM. It enables the effective use of gather- and scatter-type instruction patterns. This flag must be used along with the interleave vectorization flag.

##### `-enable-epilog-vectorization`

Enables vectorization of epilog iterations as an enhancement to the existing vectorization framework. This enables generation of an additional epilog vector loop version for the remainder iterations of the original vector loop. The vector size or factor of the original loop should be large enough to allow an effective epilog vectorization of the remaining iterations. This optimization takes place only when the original vector loop is vectorized with a vector width or factor of 16. This vectorization width of 16 may be overridden with the `-min-width-epilog-vectorization` command-line option.
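For illustration, a minimal sketch of passing one of these optimization options to the driver with `-mllvm`, following the usage pattern shown above (the source file name is hypothetical):

```bash
# Each LLVM optimization option is passed to the driver with its own -mllvm prefix.
clang -O3 -mllvm -enable-partial-unswitch foo.c -o foo
```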
##### `-enable-redundant-movs`

Removes any redundant `mov` operations, including redundant loads from memory and stores to memory. This can be invoked using `-Wl,-plugin-opt=-enable-redundant-movs`.

##### `-merge-constant`

Attempts to promote frequently occurring constants to registers. The aim is to reduce the size of the instruction encoding for instructions using constants and obtain a performance improvement.

##### `-function-specialize`

Optimizes functions with compile-time constant formal arguments.

##### `-lv-function-specialization`

Generates specialized function versions when the loops inside a function are vectorizable and the arguments are not aliased with each other.

##### `-enable-vectorize-compares`

Enables vectorization on certain loops with conditional breaks, assuming the memory accesses are safely bounded within the page boundary.

##### `-inline-recursion=[1,2,3,4]`

Enables inlining for recursive functions based on heuristics, where the aggressiveness of the heuristics increases with the level (1-4). The default level is 2. Higher levels may lead to code bloat due to expansion of recursive functions at call sites.

:::{table} -inline-recursion Level and Their Effects

| `-inline-recursion` **value** | **Inline depth of heuristics used to enable inlining for recursive functions** |
|:-----------------------------:|:------------------------------------------------------------------------------:|
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 10 |

:::

This is more effective with `-flto`, as the whole program needs to be analyzed to perform this optimization, which can be invoked as `-flto -inline-recursion=[1,2,3,4]`.

##### `-reduce-array-computations=[1,2,3]`

Performs array data flow analysis and optimizes the unused array computations.

:::{table} -reduce-array-computations Values and Their Effects

| `-reduce-array-computations` value | Array elements eligible for elimination of computations |
| -------------------------------- | --------------------------- |
| 1 | Unused |
| 2 | Zero valued |
| 3 | Both unused and zero valued |

:::

This optimization is effective with `-flto`, as the whole program needs to be analyzed to perform this optimization, which can be invoked as `-flto -reduce-array-computations=[1,2,3]`.

##### `-global-vectorize-slp={true,false}`

Vectorizes the straight-line code inside a basic block with data reordering vector operations. This option is set to **true** by default.

##### `-region-vectorize`

Experimental flag for enabling vectorization on certain loops with complex control flow, which the normal vectorizer cannot handle.

This optimization is effective with `-flto`, as the whole program needs to be analyzed to perform this optimization, which can be invoked as `-flto -region-vectorize`.

##### `-enable-x86-prefetching`

Enables the generation of x86 prefetch instructions for the memory references inside a loop, or inside the innermost loop of a loop nest, to prefetch the second dimension of multidimensional array/memory references in the innermost loop of a loop nest. This is an experimental pass; its profitability is being improved.

##### `-suppress-fmas`

Identifies reduction patterns on FMAs and suppresses FMA generation, as it is not profitable on the reduction patterns.
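For illustration, a minimal sketch of enabling the FMA suppression described above, assuming the option is passed through the driver with `-mllvm` like the other options in this subsection (the source file name is hypothetical):

```bash
# Suppress FMA generation for reduction patterns (assumed -mllvm usage).
clang -O3 -mllvm -suppress-fmas foo.c -o foo
```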
##### `-enable-icm-vrp`

Enables estimation of the virtual register pressure before performing loop invariant code motion. This estimation is used to control the number of loop invariants that will be hoisted during the loop invariant code motion.

##### `-loop-splitting`

Enables splitting of loops into multiple loops to eliminate the branches that compare the loop induction with an invariant or constant expression. This option is enabled under `-O3` by default. To disable this optimization, use `-loop-splitting=false`.

##### `-enable-ipo-loop-split`

Enables splitting of loops into multiple loops to eliminate the branches that compare the loop induction with a constant expression. This constant expression can be derived through inter-procedural analysis. This option is enabled under `-O3` by default. To disable this optimization, use `-enable-ipo-loop-split=false`.

##### `-compute-interchange-order`

Enables a heuristic for finding the best possible interchange order for a loop nest. To enable this option, use `-enable-loopinterchange`. This option is set to **false** by default.

**Usage:**

```bash
-mllvm -enable-loopinterchange -mllvm -compute-interchange-order
```

##### `-convert-pow-exp-to-int={true,false}`

Converts calls to the floating-point exponent version of `pow` to its integer exponent version if the floating-point exponent can be converted to an integer. This option is set to **true** by default.

##### `-do-lock-reordering={none,normal,aggressive}`

Reorders the control predicates in increasing order of complexity from the outer predicate to the inner when it is safe. The **normal** mode reorders simple expressions, while the **aggressive** mode reorders predicates involving function calls if no side effects are determined. This option is set to **normal** by default.

##### `-fuse-tile-inner-loop`

Enables fusion of adjacent tiled loops as a part of the loop tiling transformation. This option is set to **false** by default.

##### `-Hz,1,0x1 [Fortran]`

Helps to preserve array index information for array access expressions which get linearized in the compiler front end. The preserved information is used by the compiler optimization phase in performing optimizations such as loop transformations. Users who rely on loop transformations or other optimizations requiring de-linearized index expressions are recommended to use the `Hz` option. This option has no impact on any other aspects of the Flang front end.

### Inline ASM statements

Inline assembly (ASM) statements allow a developer to include assembly instructions directly in either host or device code. While the ROCm compiler supports ASM statements, their use is not recommended for the following reasons:

* The compiler's ability both to produce correct code and to optimize surrounding code is impeded.
* The compiler does not parse the content of the ASM statements and so cannot "see" its contents.
* The compiler must make conservative assumptions in an effort to retain correctness.
* The conservative assumptions may yield code that, on the whole, is less performant compared to code without ASM statements. It is possible that a syntactically correct ASM statement may cause incorrect runtime behavior.
* ASM statements are often ASIC-specific; code containing them is less portable and adds a maintenance burden to the developer if different ASICs are targeted.
* Writing correct ASM statements is often difficult; we strongly recommend thorough testing of any use of ASM statements.
:::{note}
For developers who choose to include ASM statements in the code, AMD is interested in understanding the use case and appreciates feedback at [https://github.com/ROCm/ROCm/issues](https://github.com/ROCm/ROCm/issues).
:::

### Miscellaneous OpenMP compiler features

This section discusses features that have been added or enhanced in the OpenMP compiler.

#### Offload-arch tool

`offload-arch` is an LLVM library and tool that is used to query the execution capability of the current system as well as to query the requirements of a binary file. It is used by the OpenMP device runtime to ensure compatibility of an image with the current system while loading it. It is compatible with target ID support and multi-image fat binary support.

**Usage:**

```bash
offload-arch [Options] [Optional lookup-value]
```

When used without an option, `offload-arch` prints the value of the first offload arch found in the underlying system. This can be used by various clang front ends. For example, to compile for OpenMP offloading on your current system, invoke clang with the following command:

```bash
clang -fopenmp -fopenmp-targets=`offload-arch` foo.c
```

If an optional lookup-value is specified, `offload-arch` will check if the value is either a valid offload-arch or a codename and look up the requested additional information. The following command provides all the information for offload-arch gfx906:

```bash
offload-arch gfx906 -v
```

The options are listed below:

:::{option} -h
Prints the help message.
:::

:::{option} -a
Prints values for all devices. Do not stop at the first device found.
:::

:::{option} -m
Prints the device code name (often found in the `pci.ids` file).
:::

:::{option} -n
Prints the numeric `pci-id`.
:::

:::{option} -t
Prints the clang offload triple to use for the offload arch.
:::

:::{option} -v
Verbose. Implies `-a -m -n -t`: for all devices, prints the code name, numeric value, and triple.
:::

:::{option} -f
Prints offload requirements, including offload-arch, for each compiled offload image built into an application binary file.
:::

:::{option} -c
Prints offload capabilities of the underlying system. This option is used by the language runtime to select an image when multiple images are available. A capability must exist for each requirement of the selected image.
:::

There are symbolic link aliases `amdgpu-offload-arch` and `nvidia-arch` for `offload-arch`. These aliases return 1 if no AMD GCN GPU or CUDA GPU is found. They are useful in determining whether architecture-specific tests should be run, or for conditionally loading architecture-specific software.

#### Command-line simplification using the `--offload-arch` flag

The legacy mechanism of specifying an offloading target for OpenMP involves three flags: `-fopenmp-targets`, `-Xopenmp-target`, and `-march`. The first two flags take a target triple (like `amdgcn-amd-amdhsa` or `nvptx64-nvidia-cuda`), while the last flag takes a device name (like `gfx908` or `sm_70`) as input. Alternatively, users of the ROCmCC compiler can use the `--offload-arch` flag for a combined effect of the above three flags.

**Example:**

```bash
# Legacy mechanism
clang -fopenmp -target x86_64-linux-gnu \
  -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
  -march=gfx906 helloworld.c -o helloworld
```

**Example:**

```bash
# Using the --offload-arch flag
clang -fopenmp -target x86_64-linux-gnu \
  --offload-arch=gfx906 helloworld.c -o helloworld
```

To ensure backward compatibility, both styles are supported. This option is compatible with target ID support and multi-image fat binaries.
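As an illustration of the `offload-arch` aliases described above, a build script might gate architecture-specific tests on the presence of an AMD GPU. A minimal sketch (the test script name is hypothetical):

```bash
# amdgpu-offload-arch returns a non-zero exit status when no AMD GCN GPU is found.
if amdgpu-offload-arch > /dev/null 2>&1; then
  ./run_amdgpu_tests.sh   # hypothetical architecture-specific test script
else
  echo "No AMD GPU detected; skipping GPU-specific tests."
fi
```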
#### Target ID support for OpenMP

The ROCmCC compiler supports specification of target features along with the GPU name while specifying a target offload device on the command line, using the `-march` or `--offload-arch` options. The compiled image in such cases is specialized for a given configuration of device and target features (target ID).

**Example:**

```bash
# Compiling for a gfx908 device with XNACK paging support turned ON
clang -fopenmp -target x86_64-linux-gnu \
  -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
  -march=gfx908:xnack+ helloworld.c -o helloworld
```

**Example:**

```bash
# Compiling for a gfx908 device with SRAMECC support turned OFF
clang -fopenmp -target x86_64-linux-gnu \
  -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
  -march=gfx908:sramecc- helloworld.c -o helloworld
```

**Example:**

```bash
# Compiling for a gfx908 device with SRAMECC support turned ON and XNACK paging support turned OFF
clang -fopenmp -target x86_64-linux-gnu \
  -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
  -march=gfx908:sramecc+:xnack- helloworld.c -o helloworld
```

The target ID specified on the command line is passed to the clang driver using the `target-feature` flag, to the LLVM optimizer and back end using the `-mattr` flag, and to the linker using the `-plugin-opt=-mattr` flag. This feature is compatible with the `--offload-arch` command-line option and multi-image binaries for multiple architectures.

#### Multi-image fat binary for OpenMP

The ROCmCC compiler is enhanced to generate binaries that can contain heterogeneous images. This heterogeneity could be in terms of:

* Images of different architectures, like AMD GCN and NVPTX
* Images of the same architecture but for different GPUs, like gfx906 and gfx908
* Images of the same architecture and the same GPU but for different target features, like `gfx908:xnack+` and `gfx908:xnack-`

An appropriate image is selected by the OpenMP device runtime for execution depending on the capability of the current system. This feature is compatible with the target ID support and `--offload-arch` command-line options and uses the `offload-arch` tool to determine the capability of the current system.

**Example:**

```bash
clang -fopenmp -target x86_64-linux-gnu \
  -fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa \
  -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 \
  -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 \
  helloworld.c -o helloworld
```

**Example:**

```bash
clang -fopenmp -target x86_64-linux-gnu \
  --offload-arch=gfx906 \
  --offload-arch=gfx908 \
  helloworld.c -o helloworld
```

**Example:**

```bash
clang -fopenmp -target x86_64-linux-gnu \
  -fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa,amdgcn-amd-amdhsa,amdgcn-amd-amdhsa \
  -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc+:xnack+ \
  -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc-:xnack+ \
  -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc+:xnack- \
  -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc-:xnack- \
  helloworld.c -o helloworld
```

The ROCmCC compiler creates an instance of the toolchain for each unique combination of target triple and target GPU (along with the associated target features). The `clang-offload-wrapper` tool is modified to insert a new structure, `__tgt_image_info`, along with each image in the binary. The device runtime is also modified to query this structure to identify a compatible image based on the capability of the current system.
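To see what a fat binary contains and what the current system supports, the `offload-arch` tool described earlier can be used. A minimal sketch (the binary name follows the examples above):

```bash
# Print the offload requirements (including offload-arch) of each image
# built into the application binary.
offload-arch -f ./helloworld

# Print the offload capabilities of the current system; the OpenMP device
# runtime matches these against the image requirements at load time.
offload-arch -c
```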
#### Unified shared memory

The following OpenMP pragma is available on MI200, and it must be executed with `xnack+` support.

```cpp
#pragma omp requires unified_shared_memory
```

For more details on unified shared memory, refer to the {ref}`openmp_usm` section of the OpenMP Guide.

### Support status of other Clang options

The following table lists the other Clang options and their support status.

:::{table} Clang Options
:name: clang-options
:widths: auto
:align: center

| **Option** | **Support Status** | **Description** |
|--------------|:-----------------------:|-------------------------|
| `-###` | Supported | Prints (but does not run) the commands to run for this compilation |
| `--analyzer-output <value>` | Supported | Static analyzer report output format (`html\|plist\|plist-multi-file\|plist-html\|sarif\|text`) |
| `--analyze` | Supported | Runs the static analyzer |
| `-arcmt-migrate-emit-errors` | Unsupported | Emits ARC errors even if the migrator can fix them |
| `-arcmt-migrate-report-output <path>` | Unsupported | Output path for the plist report |
| `-byteswapio` | Supported | Swaps byte-order for unformatted input/output |
| `-B <dir>` | Supported | Adds `<dir>` to the search path for binaries and object files used implicitly |
| `-CC` | Supported | Includes comments from within the macros in the preprocessed output |
| `-cl-denorms-are-zero` | Supported | OpenCL only. Allows denormals to be flushed to zero |
| `-cl-fast-relaxed-math` | Supported | OpenCL only. Sets `-cl-finite-math-only` and `-cl-unsafe-math-optimizations` and defines `__FAST_RELAXED_MATH__` |
| `-cl-finite-math-only` | Supported | OpenCL only. Allows floating-point optimizations that assume arguments and results are not `NaN`s or `+-Inf` |
| `-cl-fp32-correctly-rounded-divide-sqrt` | Supported | OpenCL only. Specifies that single-precision floating-point divide and `sqrt` used in the program source are correctly rounded |
| `-cl-kernel-arg-info` | Supported | OpenCL only. Generates kernel argument metadata |
| `-cl-mad-enable` | Supported | OpenCL only. Allows use of less precise MAD computations in the generated binary |
| `-cl-no-signed-zeros` | Supported | OpenCL only. Allows use of less precise no-signed-zeros computations in the generated binary |
| `-cl-opt-disable` | Supported | OpenCL only. Disables all optimizations. By default, optimizations are enabled. |
| `-cl-single-precision-constant` | Supported | OpenCL only. Treats double-precision floating-point constants as single-precision constants |
| `-cl-std=<value>` | Supported | OpenCL language standard to compile for |
| `-cl-strict-aliasing` | Supported | OpenCL only. This option is added for compatibility with OpenCL 1.0. |
| `-cl-uniform-work-group-size` | Supported | OpenCL only. Defines the global work-size to be a multiple of the work-group size specified for `clEnqueueNDRangeKernel` |
| `-cl-unsafe-math-optimizations` | Supported | OpenCL only. Allows unsafe floating-point optimizations. Also implies `-cl-no-signed-zeros` and `-cl-mad-enable` |
| `--config <file>` | Supported | Specifies configuration file |
| `--cuda-compile-host-device` | Supported | Compiles CUDA code for both host and device (default). Has no effect on non-CUDA compilations |
| `--cuda-device-only` | Supported | Compiles CUDA code for device only |
| `--cuda-host-only` | Supported | Compiles CUDA code for host only. Has no effect on non-CUDA compilations |
| `--cuda-include-ptx=<value>` | Unsupported | Includes PTX for the following GPU architecture (e.g. `sm_35`) or "all". May be specified more than once |
| `--cuda-noopt-device-debug` | Unsupported | Enables device-side debug info generation. Disables ptxas optimizations |
| `--cuda-path-ignore-env` | Unsupported | Ignores environment variables to detect CUDA installation |
| `--cuda-path=<path>` | Unsupported | CUDA installation path |
| `-cxx-isystem <directory>` | Supported | Adds a directory to the C++ SYSTEM include search path |
| `-C` | Supported | Includes comments in the preprocessed output |
| `-c` | Supported | Runs only preprocess, compile, and assemble steps |
| `-dD` | Supported | Prints macro definitions in `-E` mode in addition to the normal output |
| `-dependency-dot <filename>` | Supported | Writes DOT-formatted header dependencies to the specified filename |
| `-dependency-file <filename>` | Supported | Writes dependency output to the specified filename (or `-`) |
| `-dI` | Supported | Prints include directives in `-E` mode in addition to the normal output |
| `-dM` | Supported | Prints macro definitions in `-E` mode instead of the normal output |
| `-dsym-dir <dir>` | Unsupported | Outputs dSYMs (if any) to the specified directory |
| `-D <macro>=<value>` | Supported | Defines `<macro>` to `<value>` (or `1` if `<value>` omitted) |
| `-emit-ast` | Supported | Emits Clang AST files for source inputs |
| `-emit-interface-stubs` | Supported | Generates interface stub files |
| `-emit-llvm` | Supported | Uses the LLVM representation for assembler and object files |
| `-emit-merged-ifs` | Supported | Generates interface stub files and emits merged text, not binary |
| `--emit-static-lib` | Supported | Enables linker job to emit a static library |
| `-enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang` | Supported | Declares enabling trivial automatic variable initialization to zero for benchmarking purposes, with the knowledge that it will eventually be removed |
| `-E` | Supported | Runs the preprocessor only |
| `-fAAPCSBitfieldLoad` | Unsupported | Follows the AAPCS standard where all volatile bit-field writes generate at least one load (ARM only) |
| `-faddrsig` | Supported | Emits an address-significance table |
| `-faligned-allocation` | Supported | Enables C++17 aligned allocation functions |
| `-fallow-editor-placeholders` | Supported | Treats editor placeholders as valid source code |
| `-fallow-fortran-gnu-ext` | Supported | Allows Fortran GNU extensions |
| `-fansi-escape-codes` | Supported | Uses ANSI escape codes for diagnostics |
| `-fapple-kext` | Unsupported | Uses Apple's kernel extensions ABI |
| `-fapple-link-rtlib` | Unsupported | Forces linking of the clang built-ins runtime library |
| `-fapple-pragma-pack` | Unsupported | Enables Apple gcc-compatible `#pragma pack` handling |
| `-fapplication-extension` | Unsupported | Restricts code to those available for App Extensions |
| `-fbackslash` | Supported | Treats backslash as a C-style escape character |
| `-fbasic-block-sections=<value>` | Supported | Places each function's basic blocks in unique sections (ELF only): all \| labels \| none \| `list=<file>` |
| `-fblocks` | Supported | Enables the 'blocks' language feature |
| `-fborland-extensions` | Unsupported | Accepts non-standard constructs supported by the Borland compiler |
| `-fbuild-session-file=<file>` | Supported | Uses the last modification time of `<file>` as the build session timestamp |
| `-fbuild-session-timestamp=<time>`