diff --git a/docs/index.md b/docs/index.md
index 397ef5134..5ed488e8e 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -64,6 +64,7 @@ ROCm documentation is organized into the following categories:
 
 * [ROCm libraries](./reference/api-libraries.md)
 * [ROCm tools, compilers, and runtimes](./reference/rocm-tools.md)
+* [ROCm glossary](./reference/glossary.rst)
 * [GPU hardware specifications](./reference/gpu-arch-specs.rst)
 * [Hardware atomics operation support](./reference/gpu-atomics-operation.rst)
 * [Environment variables](./reference/env-variables.rst)
diff --git a/docs/reference/glossary.rst b/docs/reference/glossary.rst
new file mode 100644
index 000000000..e80941551
--- /dev/null
+++ b/docs/reference/glossary.rst
@@ -0,0 +1,24 @@
+.. meta::
+   :description: AMD ROCm Glossary
+   :keywords: AMD, ROCm, glossary, terminology, device hardware,
+              device software, host software, performance
+
+.. _glossary:
+
+********************************************************************************
+Glossary
+********************************************************************************
+
+This glossary provides concise definitions of key terms and concepts in AMD ROCm
+programming. Each entry includes a brief description and a link to detailed
+documentation for in-depth information.
+
+The glossary is organized into four sections:
+
+* :doc:`glossary/device-hardware` — Hardware components (Compute Units, cores,
+  memory)
+* :doc:`glossary/device-software` — Software abstractions (programming model,
+  ISA, thread hierarchy)
+* :doc:`glossary/host-software` — Development tools (HIP, compilers, libraries,
+  profilers)
+* :doc:`glossary/performance` — Performance metrics and optimization concepts
diff --git a/docs/reference/glossary/device-hardware.rst b/docs/reference/glossary/device-hardware.rst
new file mode 100644
index 000000000..d664b2e74
--- /dev/null
+++ b/docs/reference/glossary/device-hardware.rst
@@ -0,0 +1,84 @@
+.. meta::
+   :description: Device hardware glossary for AMD GPUs
+   :keywords: AMD, ROCm, GPU, device hardware, compute units, cores, MFMA,
+              architecture, register file, cache, HBM
+
+.. _glossary-device-hardware:
+
+***************
+Device hardware
+***************
+
+This section provides brief definitions of hardware components and architectural
+features of AMD GPUs.
+
+.. glossary::
+   :sorted:
+
+   AMD device architecture
+      AMD's device architecture is based on unified, programmable compute
+      engines called Compute Units. See :ref:`hip:hardware_implementation` for
+      details.
+
+   Compute units
+      Compute Units (CUs) are the fundamental programmable execution engines
+      in AMD GPUs that manage thousands of lightweight threads. See
+      :ref:`hip:compute_unit` for details.
+
+   Vector arithmetic logic units
+      Vector arithmetic logic units (VALUs) are the primary arithmetic engines
+      that execute mathematical and logical operations within AMD Compute
+      Units. See :ref:`hip:valu` for details.
+
+   Special function unit
+      Special Function Units (SFUs) accelerate transcendental and reciprocal
+      mathematical functions such as ``exp``, ``log``, ``sin``, and ``cos``.
+      See :ref:`hip:sfu` for details.
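+
+      As a minimal sketch (the kernel below is illustrative, not taken from
+      the HIP docs), the fast-math intrinsic ``__expf`` typically maps onto
+      the SFU, trading a small amount of precision for throughput compared to
+      the standard ``expf``:
+
+      .. code-block:: cpp
+
+         __global__ void fast_exp(const float* in, float* out, int n) {
+             int i = blockIdx.x * blockDim.x + threadIdx.x;
+             if (i < n) {
+                 out[i] = __expf(in[i]); // fast, lower-precision exponential
+             }
+         }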
+
+   Load and store unit
+      Load/Store Units (LSUs) handle data transfer between Compute Units and
+      the GPU's memory subsystems, managing thousands of concurrent memory
+      operations. See :ref:`hip:lsu` for details.
+
+   Wavefront scheduler
+      The Wavefront Scheduler in each Compute Unit decides which group of
+      threads to execute each clock cycle, enabling rapid context switching
+      for latency hiding. See :ref:`hip:wave-scheduling` for details.
+
+   SIMD core
+      SIMD Cores are execution lanes that perform scalar and vector arithmetic
+      operations inside each Compute Unit. See :ref:`hip:cdna_architecture`
+      and :ref:`hip:rdna_architecture` for details.
+
+   Matrix core and MFMA
+      Matrix Cores (MFMA units) are specialized execution units that perform
+      large-scale matrix operations in a single instruction, delivering high
+      throughput for AI and HPC workloads. See :ref:`hip:mfma_units` for
+      details.
+
+   Data movement engine
+      Data Movement Engines (DMEs) are specialized hardware units in CDNA3 and
+      CDNA4 that accelerate multi-dimensional tensor data copies between
+      global memory and on-chip memory. See :ref:`hip:dme` for details.
+
+   Compute unit versioning
+      Compute Units are versioned with GFX IP identifiers that define their
+      microarchitectural features and instruction set compatibility. See
+      :ref:`hip:gfx_ip` for details.
+
+   Register file
+      The register file is the primary on-chip memory store in each Compute
+      Unit, holding data between arithmetic and memory operations. See
+      :ref:`hip:memory_hierarchy` for details.
+
+   L1 data cache
+      The L1 Data Cache is the private on-chip memory associated with each
+      Compute Unit, providing fast access to recently used data. See
+      :ref:`hip:vl1`, :ref:`hip:sl1`, and :ref:`hip:memory_coherence` for
+      details.
+
+   GPU RAM and HBM
+      GPU RAM, also known as global memory in the HIP programming model, is
+      the large, high-capacity High Bandwidth Memory (HBM) subsystem
+      accessible by all Compute Units, forming the foundation of the device's
+      data storage hierarchy. See :ref:`hip:hbm` for details.
\ No newline at end of file
diff --git a/docs/reference/glossary/device-software.rst b/docs/reference/glossary/device-software.rst
new file mode 100644
index 000000000..c38b5a595
--- /dev/null
+++ b/docs/reference/glossary/device-software.rst
@@ -0,0 +1,92 @@
+.. meta::
+   :description: Device software glossary for AMD GPUs
+   :keywords: AMD, ROCm, GPU, device software, programming model, AMDGPU,
+              assembly, IR, GFX IP, wavefront, work-group, HIP kernel, thread hierarchy
+
+.. _glossary-device-software:
+
+***************
+Device software
+***************
+
+This section provides brief definitions of software abstractions and programming
+models that run on AMD GPUs.
+
+.. glossary::
+   :sorted:
+
+   ROCm programming model
+      The ROCm programming model defines how AMD GPUs execute massively
+      parallel programs through hierarchical work-groups, memory scopes, and
+      barrier synchronization. See :ref:`hip:programming_model` for complete
+      details.
+
+   AMDGPU assembly
+      AMDGPU Assembly (GFX ISA) is the low-level assembly format for programs
+      running on AMD GPUs, generated by the ROCm compiler toolchain. See
+      :ref:`hip:amdgpu_assembly` for instruction set details.
+
+   AMDGPU intermediate representation
+      AMDGPU IR is an intermediate representation for GPU code, serving as a
+      virtual instruction set between high-level languages and
+      architecture-specific assembly. See :ref:`hip:amdgpu_ir` for compilation
+      details.
+
+   GFX IP
+      GFX IP versions are identifiers that specify which instruction formats,
+      memory models, and compute features are supported by each AMD GPU
+      generation. See :ref:`hip:gfx_ip` for versioning information.
+
+   Work-item
+      A work-item (also called a thread) is the smallest unit of execution in
+      the AMD GPU programming model. See :ref:`hip:work-item` for thread
+      hierarchy details.
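+
+      As a minimal sketch (the kernel is illustrative), each work-item
+      typically derives a unique global index from the built-in ``blockIdx``,
+      ``blockDim``, and ``threadIdx`` coordinates to select the element it
+      operates on:
+
+      .. code-block:: cpp
+
+         __global__ void scale(float* data, float factor, int n) {
+             // One work-item processes one array element.
+             int gid = blockIdx.x * blockDim.x + threadIdx.x;
+             if (gid < n) {
+                 data[gid] *= factor;
+             }
+         }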
+
+   Wavefront
+      A wavefront is a group of threads that execute together in parallel on a
+      single Compute Unit, sharing one instruction stream. See
+      :ref:`hip:wavefront` for execution details.
+
+   Work-group
+      A work-group is a collection of threads scheduled together on a single
+      Compute Unit that can coordinate through Local Data Share memory. A
+      work-group may consist of multiple wavefronts that execute in parallel
+      on the same Compute Unit. See
+      :ref:`hip:inherent_thread_hierarchy_block` for work-group details.
+
+   Grid
+      A grid represents the collection of all work-groups executing a single
+      kernel across the entire GPU. See :ref:`hip:inherent_thread_hierarchy`
+      for grid execution details.
+
+   HIP kernel
+      A kernel is the unit of GPU code that executes in parallel across many
+      threads, distributed across the GPU's Compute Units. See
+      :ref:`hip:device_program` for kernel programming details.
+
+   HIP thread hierarchy
+      The thread hierarchy structures parallel work from individual threads to
+      work-groups to grids, mapping onto hardware from SIMD lanes to Compute
+      Units to the entire GPU. See :ref:`hip:inherent_thread_model` for
+      complete details.
+
+   HIP memory hierarchy
+      The memory hierarchy pairs each thread hierarchy level with
+      corresponding memory scopes, from private registers to shared LDS to
+      global HBM. See :ref:`hip:memory_hierarchy` for memory architecture
+      details.
+
+   Registers
+      Registers are the lowest level of the memory hierarchy, storing
+      per-thread temporary variables and intermediate results. See
+      :ref:`hip:memory_hierarchy` for register usage details.
+
+   Local data share
+      Local Data Share (LDS) is fast on-chip memory shared among threads in a
+      work-group, enabling efficient coordination and data reuse. See
+      :ref:`hip:lds` for LDS programming details.
+
+   Global memory
+      Global memory is the device-wide memory accessible to all threads,
+      physically implemented in HBM or GDDR. See
+      :ref:`hip:memory_hierarchy` for global memory details.
diff --git a/docs/reference/glossary/host-software.rst b/docs/reference/glossary/host-software.rst
new file mode 100644
index 000000000..7425b95be
--- /dev/null
+++ b/docs/reference/glossary/host-software.rst
@@ -0,0 +1,66 @@
+.. meta::
+   :description: Host software glossary for AMD GPUs
+   :keywords: AMD, ROCm, GPU, host software, HIP, compiler, runtime, libraries,
+              profiler, amd-smi
+
+.. _glossary-host-software:
+
+*************
+Host software
+*************
+
+This section provides brief definitions of development tools, compilers,
+libraries, and runtime environments for programming AMD GPUs.
+
+.. glossary::
+   :sorted:
+
+   ROCm software platform
+      ROCm is AMD's GPU software stack, providing compiler toolchains,
+      runtime environments, and performance libraries for HPC and AI
+      applications. See :doc:`../../what-is-rocm` for a complete component
+      overview.
+
+   HIP C++ language extension
+      HIP extends the C++ language with additional features designed for
+      programming heterogeneous applications. These extensions mostly relate
+      to the kernel language, but some can also be applied to host
+      functionality. See :doc:`hip:how-to/hip_cpp_language_extensions` for
+      language fundamentals.
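+
+      As a minimal sketch (names are illustrative), the ``__global__`` and
+      ``__device__`` function qualifiers and the triple-chevron launch syntax
+      are among the most visible of these extensions:
+
+      .. code-block:: cpp
+
+         #include <hip/hip_runtime.h>
+
+         __device__ float square(float x) { return x * x; }
+
+         __global__ void square_all(float* data, int n) {
+             int i = blockIdx.x * blockDim.x + threadIdx.x;
+             if (i < n) data[i] = square(data[i]);
+         }
+
+         // Host code: launch 256-thread work-groups covering n elements.
+         // square_all<<<(n + 255) / 256, 256>>>(d_data, n);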
+
+   amd-smi
+      The ``amd-smi`` command-line utility queries, monitors, and manages AMD
+      GPU state, providing hardware information and performance metrics. See
+      :doc:`amdsmi:index` for detailed usage.
+
+   HIP runtime API
+      The HIP runtime API provides an interface for GPU programming, offering
+      functions for memory management, kernel launches, and synchronization.
+      See :ref:`hip:hip_runtime_api_how-to` for an API overview.
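+
+      As a minimal sketch of the typical allocate/copy/launch/copy-back
+      pattern (error handling omitted; the function name is illustrative):
+
+      .. code-block:: cpp
+
+         #include <hip/hip_runtime.h>
+
+         void round_trip() {
+             float host[256] = {};
+             float* device = nullptr;
+
+             hipMalloc(&device, sizeof(host));                             // device allocation
+             hipMemcpy(device, host, sizeof(host), hipMemcpyHostToDevice); // host -> device
+             // ... launch a kernel that reads or writes `device` here ...
+             hipMemcpy(host, device, sizeof(host), hipMemcpyDeviceToHost); // device -> host
+             hipFree(device);                                              // release device memory
+         }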
+
+   HIP compiler
+      The HIP compiler ``amdclang++`` compiles HIP C++ programs into binaries
+      containing both host CPU and device GPU code. See
+      :doc:`llvm-project:reference/rocmcc` for compiler flags and options.
+
+   HIP runtime compiler
+      The HIP Runtime Compiler (HIPRTC) compiles HIP source code at runtime
+      into AMDGPU binary code objects, enabling just-in-time kernel generation,
+      device-specific optimization, and dynamic code creation for different
+      GPUs. See :ref:`hip:hip_runtime_compiler_how-to` for API details.
+
+   ROCgdb
+      ROCgdb is AMD's source-level debugger for HIP and ROCm applications,
+      enabling debugging of both host CPU and GPU device code, including
+      kernel breakpoints, stepping, and variable inspection. See
+      :doc:`rocgdb:index` for usage and command reference.
+
+   ROCm profiler
+      The ROCm profiler (``rocprofv3``) is AMD's primary performance analysis
+      tool, providing profiling, tracing, and performance counter collection.
+      See :ref:`rocprofiler-sdk:using-rocprofv3` for profiling workflows.
+
+   ROCm and LLVM binary utilities
+      ROCm and LLVM binary utilities are command-line tools for examining and
+      manipulating GPU binaries and code objects. See
+      :ref:`hip:binary_utilities` for utility details.
diff --git a/docs/reference/glossary/performance.rst b/docs/reference/glossary/performance.rst
new file mode 100644
index 000000000..5e62291b3
--- /dev/null
+++ b/docs/reference/glossary/performance.rst
@@ -0,0 +1,121 @@
+.. meta::
+   :description: Performance glossary for AMD GPUs
+   :keywords: AMD, ROCm, GPU, performance, optimization, roofline, bottleneck,
+              occupancy, bandwidth, latency hiding, divergence
+
+.. _glossary-performance:
+
+***********
+Performance
+***********
+
+This section provides brief definitions of performance analysis concepts and
+optimization techniques.
+
+.. glossary::
+   :sorted:
+
+   Roofline model
+      The roofline model is a visual performance model that determines whether
+      a program is compute-bound or memory-bound. See
+      :ref:`hip:roofline_model` for roofline analysis.
+
+   Compute-bound
+      Compute-bound kernels are limited by the arithmetic bandwidth of the
+      GPU's compute units rather than memory bandwidth. See
+      :ref:`hip:compute_bound` for compute-bound analysis.
+
+   Memory-bound
+      Memory-bound kernels are limited by memory bandwidth rather than
+      arithmetic throughput, typically due to low arithmetic intensity. See
+      :ref:`hip:memory_bound` for memory-bound analysis.
+
+   Arithmetic intensity
+      Arithmetic intensity is the ratio of arithmetic operations to memory
+      operations in a kernel, determining performance characteristics. See
+      :ref:`hip:arithmetic_intensity` for intensity analysis.
+
+   Overhead
+      Overhead latency is time spent with no useful work being done, often
+      from CPU-side bottlenecks or kernel launch delays. See
+      :ref:`hip:performance_bottlenecks` for details.
+
+   Little's Law
+      Little's Law relates concurrency, latency, and throughput, determining
+      how much independent work must be in flight to hide latency. See
+      :ref:`hip:littles_law` for latency hiding details.
+
+   Memory bandwidth
+      Memory bandwidth is the maximum rate at which data can be transferred
+      between memory hierarchy levels, typically measured in bytes per
+      second. See :ref:`hip:memory_bound` for details.
+
+   Arithmetic bandwidth
+      Arithmetic bandwidth is the peak rate at which arithmetic work can be
+      performed, defining the compute roof in roofline models. See
+      :ref:`hip:compute_bound` for details.
+
+   Latency hiding
+      Latency hiding masks long-latency operations by running many concurrent
+      threads, keeping execution pipelines busy. See :ref:`hip:latency_hiding`
+      for details.
+
+   Wavefront execution state
+      Wavefront execution states (*active*, *stalled*, *eligible*, *selected*)
+      describe the scheduling status of wavefronts on AMD GPUs. See
+      :ref:`hip:wavefront_execution` for state definitions.
+
+   Active cycle
+      An active cycle is a clock cycle in which a Compute Unit has at least
+      one active wavefront resident. See :ref:`hip:wavefront_execution` for
+      details.
+
+   Occupancy
+      Occupancy is the ratio of active wavefronts to the maximum number of
+      wavefronts that can be active on a Compute Unit. See
+      :ref:`hip:occupancy` for occupancy analysis.
+
+   Pipe utilization
+      Pipe utilization measures how effectively a kernel uses the execution
+      pipelines within each Compute Unit. See :ref:`hip:pipe_utilization` for
+      utilization details.
+
+   Peak rate
+      Peak rate is the theoretical maximum throughput at which a hardware
+      system can complete work under ideal conditions. See
+      :ref:`hip:theoretical_performance_limits` for details.
+
+   Issue efficiency
+      Issue efficiency measures how effectively the wavefront scheduler keeps
+      execution pipelines busy by issuing instructions. See
+      :ref:`hip:issue_efficiency` for efficiency metrics.
+
+   CU utilization
+      CU utilization measures the percentage of time that Compute Units are
+      actively executing instructions. See :ref:`hip:cu_utilization` for
+      utilization analysis.
+
+   Wavefront divergence
+      Wavefront divergence occurs when threads within a wavefront take
+      different execution paths due to conditional statements. See
+      :ref:`hip:branch_efficiency` for divergence handling details.
+
+   Branch efficiency
+      Branch efficiency measures how often all threads within a wavefront take
+      the same execution path, quantifying control flow uniformity. See
+      :ref:`hip:branch_efficiency` for branch analysis.
+
+   Memory coalescing
+      Memory coalescing improves memory bandwidth by servicing many logical
+      loads or stores with fewer physical memory transactions. See
+      :ref:`hip:memory_coalescing_theory` for coalescing patterns.
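+
+      As a minimal sketch (both kernels are illustrative), adjacent
+      work-items reading adjacent addresses coalesce into a few wide
+      transactions, while a strided pattern can fragment into many:
+
+      .. code-block:: cpp
+
+         __global__ void copy_coalesced(const float* in, float* out, int n) {
+             int i = blockIdx.x * blockDim.x + threadIdx.x;
+             if (i < n) out[i] = in[i]; // lanes touch consecutive floats
+         }
+
+         __global__ void copy_strided(const float* in, float* out, int n,
+                                      int stride) {
+             int i = blockIdx.x * blockDim.x + threadIdx.x;
+             if (i * stride < n) out[i] = in[i * stride]; // scattered reads
+         }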
+
+   Bank conflict
+      A bank conflict occurs when multiple threads simultaneously access
+      different addresses in the same LDS bank, serializing accesses. See
+      :ref:`hip:bank_conflicts_theory` for details.
+
+   Register pressure
+      Register pressure occurs when excessive register demand limits the
+      number of active wavefronts per Compute Unit, reducing occupancy. See
+      :ref:`hip:register_pressure_theory` for details.
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index 87cee514f..58bff52fe 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -224,6 +224,18 @@ subtrees:
       - file: reference/api-libraries.md
         title: ROCm libraries
       - file: reference/rocm-tools.md
        title: ROCm tools, compilers, and runtimes
+      - file: reference/glossary.rst
+        title: ROCm glossary
+        subtrees:
+        - entries:
+          - file: reference/glossary/device-hardware.rst
+            title: Device hardware
+          - file: reference/glossary/device-software.rst
+            title: Device software
+          - file: reference/glossary/host-software.rst
+            title: Host software
+          - file: reference/glossary/performance.rst
+            title: Performance
       - file: reference/gpu-arch-specs.rst
       - file: reference/gpu-atomics-operation.rst
       - file: reference/env-variables.rst