In modern use cases, Python with TensorFlow and PyTorch is popular due to its
role in AI and Machine Learning.

Programming in HIP
================================================================================

When programming a heterogeneous application to run on a host CPU and offload
kernels to GPUs, the following are key steps and considerations to ensure
efficient execution and performance:

#. Understand the Target Architecture (CPU + GPU): CPUs are designed to excel
   at executing a sequence of operations and control logic as fast as
   possible, while GPUs excel at parallel execution of large workloads across
   many threads. Target specific tasks to the appropriate architecture to
   optimize your application's performance: run computationally intensive,
   parallelizable parts on the GPU, and control-heavy, sequential logic on
   the CPU. For more information, see :doc:`Hardware Implementation <hip:understand/hardware_implementation>`.

#. Write GPU Kernels for Parallel Execution: Efficient GPU kernels can
   greatly speed up computation by leveraging massive parallelism. Write
   kernels that take advantage of the GPU's SIMD (Single Instruction,
   Multiple Data) architecture. Ensure that each thread operates on
   independent memory locations to avoid memory contention. Avoid branching
   (e.g., if-else statements) inside kernels as much as possible, since it
   can lead to divergence, which slows down parallel execution. A minimal
   kernel is sketched after this list. For more information, see
   :doc:`Programming Model <hip:understand/programming_model>`.

#. Optimize the Thread and Block Sizes: Correctly configuring the kernel
   launch configuration (e.g., threads per block, blocks per grid) is crucial
   for maximizing GPU performance. Choose an optimal number of threads per
   block and blocks per grid based on the specific hardware capabilities
   (e.g., the number of compute units (CUs) and cores on the GPU). Ensure
   that the number of threads per block is a multiple of the warp size (also
   called the wavefront size; 32 or 64 depending on the GPU architecture) for
   efficient execution. Test different configurations, as the best
   combination can vary with the specific problem size and hardware; the
   first sketch after this list shows a common launch-configuration pattern.

#. Data Management and Transfer Between CPU and GPU: GPUs have their own
   memory (device memory), separate from CPU memory (host memory).
   Transferring data between the host CPU and the device GPU is one of the
   most expensive operations, so managing data movement is crucial for
   performance. Minimize data transfers by keeping data on the GPU for as
   long as possible, and use asynchronous transfer functions where available,
   like ``hipMemcpyAsync()``, to overlap data transfer with kernel execution,
   as sketched after this list. For more information, see
   :doc:`HIP Programming Manual <hip:how-to/hip_runtime_api/memory_management>`.

#. Memory Management on the GPU: GPU memory accesses can become a performance
   bottleneck if not handled correctly. Use the different GPU memory types
   effectively (e.g., global, shared, constant, and local memory). Shared
   memory is faster than global memory but limited in size, which makes it
   ideal for reusing data across threads in a block; see the shared-memory
   sketch after this list. Ensure memory accesses are coalesced (i.e.,
   threads in a warp access consecutive memory locations), as uncoalesced
   access patterns can significantly degrade performance.

#. Synchronize CPU and GPU Workloads: The host (CPU) and device (GPU) execute
   tasks asynchronously, so proper synchronization is needed to ensure
   correct results. Use synchronization functions like
   ``hipDeviceSynchronize()`` or ``hipStreamSynchronize()`` to ensure that
   kernels have completed execution before using their results, and take
   advantage of asynchronous execution to overlap data transfers, kernel
   execution, and CPU tasks where possible; the stream sketch after this
   list combines asynchronous copies with explicit synchronization.

#. Error Handling: Check for errors after memory transfers and kernel
   launches, for example with ``hipGetLastError()``. Catch and handle errors
   so the application can exit gracefully with appropriate messaging; a small
   error-checking macro is sketched after this list. For more information,
   see `Error Handling <https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/error_handling.html>`_.

#. Multi-GPU and Load Balancing: Large-scale applications that need more
   compute power can use multiple GPUs in the system. This requires
   distributing workloads across the available GPUs to balance the load, so
   that no GPU is overutilized while others sit idle; the multi-GPU sketch
   after this list shows the basic device-selection pattern.
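
The sketches below illustrate several of the steps above. They are minimal
examples under stated assumptions, not definitive implementations; the kernel
names, sizes, and helper names are chosen for illustration only.

First, a simple elementwise kernel and a launch configuration that rounds the
grid up to cover ``n`` elements. The kernel name ``vector_add`` and the block
size of 256 (a multiple of both common wavefront sizes) are assumptions:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Each thread handles one independent element, so there is no memory
   // contention, and the only branch is the bounds check.
   __global__ void vector_add(const float* a, const float* b,
                              float* c, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           c[i] = a[i] + b[i];
       }
   }

   void launch_vector_add(const float* d_a, const float* d_b,
                          float* d_c, int n)
   {
       int threads_per_block = 256;  // tune per device and problem size
       int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
       vector_add<<<blocks_per_grid, threads_per_block>>>(d_a, d_b, d_c, n);
   }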
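
Next, a sketch of overlapping data transfer with execution on a stream, then
synchronizing before the host reads the results. The buffer names and the
``scale`` kernel are assumptions; for genuine copy/compute overlap, the host
buffers would also need to be pinned (for example, with ``hipHostMalloc()``):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void scale(float* data, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) data[i] *= 2.0f;
   }

   void copy_compute_copy(const float* h_in, float* h_out, int n)
   {
       size_t bytes = n * sizeof(float);
       float* d_buf = nullptr;
       hipMalloc(&d_buf, bytes);

       hipStream_t stream;
       hipStreamCreate(&stream);

       // The copy in, the kernel, and the copy out are enqueued on one
       // stream and run asynchronously with respect to the host.
       hipMemcpyAsync(d_buf, h_in, bytes, hipMemcpyHostToDevice, stream);
       scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
       hipMemcpyAsync(h_out, d_buf, bytes, hipMemcpyDeviceToHost, stream);

       // Block the host until the stream finishes, so h_out is safe to read.
       hipStreamSynchronize(stream);

       hipStreamDestroy(stream);
       hipFree(d_buf);
   }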
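
A sketch of shared-memory reuse: each block stages a tile of global memory in
fast on-chip shared memory, synchronizes, and reduces the tile without going
back to global memory. The tile size and the block-sum reduction are
illustrative; launch with ``TILE`` threads per block (a power of two):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   #define TILE 256

   __global__ void block_sum(const float* in, float* block_results, int n)
   {
       __shared__ float tile[TILE];

       int i   = blockIdx.x * blockDim.x + threadIdx.x;
       int lid = threadIdx.x;

       // One coalesced global load per thread: consecutive threads read
       // consecutive addresses.
       tile[lid] = (i < n) ? in[i] : 0.0f;
       __syncthreads();

       // Tree reduction entirely in shared memory.
       for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
           if (lid < stride) {
               tile[lid] += tile[lid + stride];
           }
           __syncthreads();
       }

       if (lid == 0) {
           block_results[blockIdx.x] = tile[0];
       }
   }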
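
For error handling, one common pattern is a small macro that checks the
``hipError_t`` returned by runtime calls; kernel launches return no status,
so they are checked through ``hipGetLastError()``. The macro name
``HIP_CHECK`` is a convention, not a fixed API:

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <cstdio>
   #include <cstdlib>

   // Wrap each HIP runtime call; print a descriptive message and exit
   // gracefully if the call did not succeed.
   #define HIP_CHECK(expr)                                             \
       do {                                                            \
           hipError_t err = (expr);                                    \
           if (err != hipSuccess) {                                    \
               fprintf(stderr, "HIP error '%s' at %s:%d\n",            \
                       hipGetErrorString(err), __FILE__, __LINE__);    \
               exit(EXIT_FAILURE);                                     \
           }                                                           \
       } while (0)

   // Usage:
   //   HIP_CHECK(hipMemcpy(dst, src, bytes, hipMemcpyHostToDevice));
   //   some_kernel<<<grid, block>>>(/* args */);
   //   HIP_CHECK(hipGetLastError());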
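
Finally, the basic multi-GPU pattern: query the device count, select each
device with ``hipSetDevice()``, and give it a share of the work. The even
split is a simplifying assumption; real applications may weight shares by
each device's capability:

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <cstdio>

   // Distribute n items evenly across all visible GPUs. Allocations and
   // launches issued after hipSetDevice(dev) target that device.
   void split_across_gpus(int n)
   {
       int device_count = 0;
       hipGetDeviceCount(&device_count);
       if (device_count == 0) return;

       int chunk = (n + device_count - 1) / device_count;
       for (int dev = 0; dev < device_count; ++dev) {
           hipSetDevice(dev);
           int begin = dev * chunk;
           int end   = (begin + chunk < n) ? begin + chunk : n;
           // ... allocate device memory and launch kernels for
           // [begin, end) here ...
           printf("device %d handles [%d, %d)\n", dev, begin, end);
       }
   }
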
For a complete description of the HIP programming language, see the :doc:`HIP documentation <hip:index>`.