In modern use cases, Python with TensorFlow and PyTorch is popular due to its
role in AI and Machine Learning.
Programming in HIP
================================================================================
When programming a heterogeneous application that runs on a host CPU and
offloads kernels to GPUs, the following key steps and considerations help
ensure efficient execution and good performance:
#. Understand the Target Architecture (CPU + GPU): CPUs are designed to execute
   a sequence of operations and control logic as fast as possible, while GPUs
   excel at parallel execution of large workloads across many threads. Target
   specific tasks at the appropriate architecture to optimize your application
   performance: run computationally intensive, parallelizable parts on the GPU,
   and control-heavy, sequential logic on the CPU. For more information, see
   :doc:`Hardware Implementation <hip:understand/hardware_implementation>`.
#. Write GPU Kernels for Parallel Execution: Efficient GPU kernels can greatly
   speed up computation by leveraging massive parallelism. Write kernels that
   take advantage of the GPU's SIMD (Single Instruction, Multiple Data)
   architecture. Ensure that each thread operates on independent memory
   locations to avoid memory contention, and avoid branching (e.g., if-else
   statements) inside kernels as much as possible, since divergent branches
   slow down parallel execution. A minimal kernel sketch follows this list.
   For more information, see :doc:`Programming Model <hip:understand/programming_model>`.
#. Optimize the Thread and Block Sizes: Correctly configuring the kernel launch
   configuration (threads per block, blocks per grid) is crucial for maximizing
   GPU performance. Choose the number of threads per block and blocks per grid
   based on the specific hardware capabilities (e.g., the number of compute
   units (CUs), called streaming multiprocessors (SMs) on NVIDIA hardware, and
   cores on the GPU). Ensure that the number of threads per block is a multiple
   of the warp size (32 on NVIDIA and AMD RDNA GPUs, 64 on AMD CDNA GPUs) for
   efficient execution. Test different configurations, as the best combination
   varies with the problem size and hardware; see the launch-configuration
   sketch after this list.
#. Data Management and Transfer Between CPU and GPU: GPUs have their own memory
   (device memory), separate from CPU memory (host memory), and transferring
   data between the host and the device is one of the most expensive
   operations, so managing data movement is crucial for performance. Minimize
   transfers between the CPU and GPU by keeping data on the GPU for as long as
   possible, and use asynchronous transfer functions where available, such as
   ``hipMemcpyAsync()``, to overlap data transfer with kernel execution, as
   sketched after this list. For more information, see
   :doc:`HIP Programming Manual <hip:how-to/hip_runtime_api/memory_management>`.
#. Memory Management on the GPU: GPU memory accesses can become a performance
   bottleneck if not handled correctly. Use the different GPU memory types
   effectively (e.g., global, shared, constant, and local memory). Shared
   memory is faster than global memory but limited in size, which makes it
   ideal for reusing data across the threads in a block. Ensure memory accesses
   are coalesced (i.e., threads in a warp access consecutive memory locations),
   as uncoalesced access patterns can significantly degrade performance; a
   shared-memory sketch follows this list.
#. Synchronize CPU and GPU Workloads: The host (CPU) and the device (GPU)
   execute tasks asynchronously, so proper synchronization is needed to ensure
   correct results. Use synchronization functions such as
   ``hipDeviceSynchronize()`` or ``hipStreamSynchronize()`` to ensure that
   kernels have completed before their results are used, and take advantage of
   asynchronous execution to overlap data transfers, kernel execution, and CPU
   work where possible; see the synchronization sketch after this list.
#. Error Handling: Check for errors after memory transfers and kernel launches,
   for example with ``hipGetLastError()``. Catch and handle errors so the
   application can exit gracefully with an appropriate message; a common
   checking macro is sketched after this list. For more information, see
   `Error Handling <https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/error_handling.html>`_.
#. Multi-GPU and Load Balancing: Large-scale applications that need more
   compute power can use multiple GPUs in the system. This requires
   distributing workloads across the GPUs to balance the load and prevent some
   GPUs from being overutilized while others sit idle; a simple partitioning
   sketch follows this list.
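
The short sketches below illustrate several of these steps. All buffer names,
sizes, and helper functions are illustrative assumptions, not parts of the HIP
API unless noted. First, for the kernel-design guidance in step 2, a minimal
element-wise vector addition kernel: each thread touches only its own element,
so there is no memory contention and no divergence beyond the bounds check.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Step 2: each thread processes one independent element, avoiding both
   // memory contention and divergent branching (except the bounds check).
   __global__ void vector_add(const float* a, const float* b, float* c, size_t n)
   {
       size_t i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           c[i] = a[i] + b[i];
       }
   }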
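
For step 3, a sketch of choosing a launch configuration for the ``vector_add``
kernel above. The block size of 256 is an assumed starting point (a multiple of
both 32 and 64), and ``n``, ``d_a``, ``d_b``, and ``d_c`` are an assumed
problem size and device buffers; ``hipGetDeviceProperties()`` reports the
actual warp size and CU/SM count for tuning.

.. code-block:: cpp

   hipDeviceProp_t props;
   hipGetDeviceProperties(&props, 0);     // query device 0
   // props.warpSize is 32 or 64; props.multiProcessorCount is the CU/SM count.

   int threads_per_block = 256;           // multiple of both 32 and 64
   int blocks_per_grid =
       static_cast<int>((n + threads_per_block - 1) / threads_per_block);

   vector_add<<<blocks_per_grid, threads_per_block>>>(d_a, d_b, d_c, n);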
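
For step 4, a sketch of overlapping a host-to-device copy with kernel execution
on a stream. Copies only become truly asynchronous when the host buffer is
pinned, hence the ``hipHostMalloc()`` allocation; the launch parameters carry
over from the previous sketch.

.. code-block:: cpp

   float* h_a = nullptr;
   float* d_a = nullptr;
   // Pinned host memory is required for hipMemcpyAsync() to overlap with GPU work.
   hipHostMalloc(reinterpret_cast<void**>(&h_a), n * sizeof(float),
                 hipHostMallocDefault);
   hipMalloc(reinterpret_cast<void**>(&d_a), n * sizeof(float));

   hipStream_t stream;
   hipStreamCreate(&stream);

   // The copy and the kernel are ordered within the stream, while the host
   // thread returns immediately and is free to do unrelated work.
   hipMemcpyAsync(d_a, h_a, n * sizeof(float), hipMemcpyHostToDevice, stream);
   vector_add<<<blocks_per_grid, threads_per_block, 0, stream>>>(d_a, d_b, d_c, n);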
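
For step 5, a sketch of using shared memory: the threads of a block
cooperatively load a tile with coalesced global reads, then reuse neighboring
values from the much faster shared memory. A block size of 256 is assumed to
match the tile.

.. code-block:: cpp

   // Step 5: coalesced loads into a shared tile, then data reuse within the block.
   __global__ void average_with_left(const float* in, float* out, size_t n)
   {
       __shared__ float tile[256];        // one element per thread (assumed block size)
       size_t i = blockIdx.x * blockDim.x + threadIdx.x;

       if (i < n) {
           tile[threadIdx.x] = in[i];     // consecutive threads read consecutive addresses
       }
       __syncthreads();                   // wait until the tile is fully populated

       if (i > 0 && i < n) {
           // The left neighbor usually comes from shared memory; only the first
           // thread of each block falls back to a global read.
           float left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : in[i - 1];
           out[i] = 0.5f * (left + tile[threadIdx.x]);
       }
   }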
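
For step 6, a sketch of synchronizing before results are consumed, continuing
the stream example above; ``h_result`` and ``do_other_cpu_work()`` are assumed
placeholders.

.. code-block:: cpp

   // Bring results back asynchronously on the same stream as the kernel.
   hipMemcpyAsync(h_result, d_c, n * sizeof(float), hipMemcpyDeviceToHost, stream);

   do_other_cpu_work();                   // overlaps with the GPU work still in flight

   hipStreamSynchronize(stream);          // block until this stream's work is done
   // h_result is now safe to read; hipDeviceSynchronize() would instead wait
   // for all outstanding work on the device, across all streams.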
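
For step 7, a common error-checking pattern (a sketch, not a macro provided by
HIP itself) that wraps runtime calls and checks kernel launches:

.. code-block:: cpp

   #include <cstdio>
   #include <cstdlib>
   #include <hip/hip_runtime.h>

   // Exit with a readable message if any HIP runtime call fails.
   #define HIP_CHECK(expr)                                                   \
       do {                                                                  \
           hipError_t status = (expr);                                       \
           if (status != hipSuccess) {                                       \
               std::fprintf(stderr, "HIP error %s at %s:%d\n",               \
                            hipGetErrorString(status), __FILE__, __LINE__);  \
               std::exit(EXIT_FAILURE);                                      \
           }                                                                 \
       } while (0)

   HIP_CHECK(hipMalloc(reinterpret_cast<void**>(&d_a), n * sizeof(float)));
   vector_add<<<blocks_per_grid, threads_per_block>>>(d_a, d_b, d_c, n);
   HIP_CHECK(hipGetLastError());          // detect kernel launch failures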
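
Finally, for step 8, a sketch of naive static partitioning across all visible
devices; real load balancing would also account for differing device
capabilities.

.. code-block:: cpp

   #include <algorithm>

   int device_count = 0;
   hipGetDeviceCount(&device_count);

   // Give each GPU an equal slice of the n elements (naive static split).
   size_t chunk = (n + device_count - 1) / device_count;
   for (int dev = 0; dev < device_count; ++dev) {
       hipSetDevice(dev);                 // subsequent calls target this GPU
       size_t begin = static_cast<size_t>(dev) * chunk;
       if (begin >= n) break;
       size_t count = std::min(chunk, n - begin);
       // ... allocate, copy, and launch the kernel for [begin, begin + count) ...
   }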
For a complete description of the HIP programming language, see the :doc:`HIP documentation <hip:index>`.