In modern use cases, Python with TensorFlow and PyTorch is popular due to its
role in AI and Machine Learning.

Programming in HIP
================================================================================

When programming a heterogeneous application to run on a host CPU and offload
kernels to GPUs, the following are key steps and considerations to ensure
efficient execution and performance:

#. Understand the Target Architecture (CPU + GPU): CPUs are designed to excel
   at executing a sequence of operations and control logic as fast as
   possible, while GPUs excel at parallel execution of large workloads across
   many threads. Target specific tasks to the appropriate architecture to
   optimize your application's performance: run computationally intensive,
   parallelizable parts on the GPU, and control-heavy, sequential logic on
   the CPU. For more information, see :doc:`Hardware Implementation <hip:understand/hardware_implementation>`.

#. Write GPU Kernels for Parallel Execution: Efficient GPU kernels can
   greatly speed up computation by leveraging massive parallelism. Write
   kernels that take advantage of the GPU's SIMD (Single Instruction,
   Multiple Data) architecture. Ensure that each thread operates on
   independent memory locations to avoid memory contention. Avoid branching
   (e.g., if-else statements) inside kernels as much as possible, since it
   can lead to divergence, which slows down parallel execution. A minimal
   kernel is sketched after this list. For more information, see
   :doc:`Programming Model <hip:understand/programming_model>`.

#. Optimize the Thread and Block Sizes: Correctly configuring the kernel
   launch configuration (e.g., threads per block, blocks per grid) is crucial
   for maximizing GPU performance. Choose an optimal number of threads per
   block and blocks per grid based on the specific hardware capabilities
   (e.g., the number of compute units (CUs) and cores on the GPU). Ensure
   that the number of threads per block is a multiple of the warp size (also
   called the wavefront size; 32 or 64 depending on the GPU architecture) for
   efficient execution. Test different configurations, as the best
   combination can vary with the specific problem size and hardware; the
   first sketch after this list shows a common launch-configuration pattern.

#. Data Management and Transfer Between CPU and GPU: GPUs have their own
   memory (device memory), separate from CPU memory (host memory).
   Transferring data between the host CPU and the device GPU is one of the
   most expensive operations, so managing data movement is crucial for
   performance. Minimize data transfers by keeping data on the GPU for as
   long as possible, and use asynchronous transfer functions where available,
   like ``hipMemcpyAsync()``, to overlap data transfer with kernel execution,
   as sketched after this list. For more information, see
   :doc:`HIP Programming Manual <hip:how-to/hip_runtime_api/memory_management>`.

#. Memory Management on the GPU: GPU memory accesses can become a performance
   bottleneck if not handled correctly. Use the different GPU memory types
   effectively (e.g., global, shared, constant, and local memory). Shared
   memory is faster than global memory but limited in size, which makes it
   ideal for reusing data across threads in a block; see the shared-memory
   sketch after this list. Ensure memory accesses are coalesced (i.e.,
   threads in a warp access consecutive memory locations), as uncoalesced
   access patterns can significantly degrade performance.

#. Synchronize CPU and GPU Workloads: The host (CPU) and device (GPU) execute
   tasks asynchronously, so proper synchronization is needed to ensure
   correct results. Use synchronization functions like
   ``hipDeviceSynchronize()`` or ``hipStreamSynchronize()`` to ensure that
   kernels have completed execution before using their results, and take
   advantage of asynchronous execution to overlap data transfers, kernel
   execution, and CPU tasks where possible; the stream sketch after this
   list combines asynchronous copies with explicit synchronization.

#. Error Handling: Check for errors after memory transfers and kernel
   launches, for example with ``hipGetLastError()``. Catch and handle errors
   so the application can exit gracefully with appropriate messaging; a small
   error-checking macro is sketched after this list. For more information,
   see `Error Handling <https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/error_handling.html>`_.

#. Multi-GPU and Load Balancing: Large-scale applications that need more
   compute power can use multiple GPUs in the system. This requires
   distributing workloads across the available GPUs to balance the load, so
   that no GPU is overutilized while others sit idle; the multi-GPU sketch
   after this list shows the basic device-selection pattern.
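
The sketches below illustrate several of the steps above. They are minimal
examples under stated assumptions, not definitive implementations; the kernel
names, sizes, and helper names are chosen for illustration only.

First, a simple elementwise kernel and a launch configuration that rounds the
grid up to cover ``n`` elements. The kernel name ``vector_add`` and the block
size of 256 (a multiple of both common wavefront sizes) are assumptions:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Each thread handles one independent element, so there is no memory
   // contention, and the only branch is the bounds check.
   __global__ void vector_add(const float* a, const float* b,
                              float* c, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           c[i] = a[i] + b[i];
       }
   }

   void launch_vector_add(const float* d_a, const float* d_b,
                          float* d_c, int n)
   {
       int threads_per_block = 256;  // tune per device and problem size
       int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
       vector_add<<<blocks_per_grid, threads_per_block>>>(d_a, d_b, d_c, n);
   }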
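
Next, a sketch of overlapping data transfer with execution on a stream, then
synchronizing before the host reads the results. The buffer names and the
``scale`` kernel are assumptions; for genuine copy/compute overlap, the host
buffers would also need to be pinned (for example, with ``hipHostMalloc()``):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void scale(float* data, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) data[i] *= 2.0f;
   }

   void copy_compute_copy(const float* h_in, float* h_out, int n)
   {
       size_t bytes = n * sizeof(float);
       float* d_buf = nullptr;
       hipMalloc(&d_buf, bytes);

       hipStream_t stream;
       hipStreamCreate(&stream);

       // The copy in, the kernel, and the copy out are enqueued on one
       // stream and run asynchronously with respect to the host.
       hipMemcpyAsync(d_buf, h_in, bytes, hipMemcpyHostToDevice, stream);
       scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
       hipMemcpyAsync(h_out, d_buf, bytes, hipMemcpyDeviceToHost, stream);

       // Block the host until the stream finishes, so h_out is safe to read.
       hipStreamSynchronize(stream);

       hipStreamDestroy(stream);
       hipFree(d_buf);
   }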
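
A sketch of shared-memory reuse: each block stages a tile of global memory in
fast on-chip shared memory, synchronizes, and reduces the tile without going
back to global memory. The tile size and the block-sum reduction are
illustrative; launch with ``TILE`` threads per block (a power of two):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   #define TILE 256

   __global__ void block_sum(const float* in, float* block_results, int n)
   {
       __shared__ float tile[TILE];

       int i   = blockIdx.x * blockDim.x + threadIdx.x;
       int lid = threadIdx.x;

       // One coalesced global load per thread: consecutive threads read
       // consecutive addresses.
       tile[lid] = (i < n) ? in[i] : 0.0f;
       __syncthreads();

       // Tree reduction entirely in shared memory.
       for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
           if (lid < stride) {
               tile[lid] += tile[lid + stride];
           }
           __syncthreads();
       }

       if (lid == 0) {
           block_results[blockIdx.x] = tile[0];
       }
   }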
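
For error handling, one common pattern is a small macro that checks the
``hipError_t`` returned by runtime calls; kernel launches return no status,
so they are checked through ``hipGetLastError()``. The macro name
``HIP_CHECK`` is a convention, not a fixed API:

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <cstdio>
   #include <cstdlib>

   // Wrap each HIP runtime call; print a descriptive message and exit
   // gracefully if the call did not succeed.
   #define HIP_CHECK(expr)                                             \
       do {                                                            \
           hipError_t err = (expr);                                    \
           if (err != hipSuccess) {                                    \
               fprintf(stderr, "HIP error '%s' at %s:%d\n",            \
                       hipGetErrorString(err), __FILE__, __LINE__);    \
               exit(EXIT_FAILURE);                                     \
           }                                                           \
       } while (0)

   // Usage:
   //   HIP_CHECK(hipMemcpy(dst, src, bytes, hipMemcpyHostToDevice));
   //   some_kernel<<<grid, block>>>(/* args */);
   //   HIP_CHECK(hipGetLastError());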
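
Finally, the basic multi-GPU pattern: query the device count, select each
device with ``hipSetDevice()``, and give it a share of the work. The even
split is a simplifying assumption; real applications may weight shares by
each device's capability:

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <cstdio>

   // Distribute n items evenly across all visible GPUs. Allocations and
   // launches issued after hipSetDevice(dev) target that device.
   void split_across_gpus(int n)
   {
       int device_count = 0;
       hipGetDeviceCount(&device_count);
       if (device_count == 0) return;

       int chunk = (n + device_count - 1) / device_count;
       for (int dev = 0; dev < device_count; ++dev) {
           hipSetDevice(dev);
           int begin = dev * chunk;
           int end   = (begin + chunk < n) ? begin + chunk : n;
           // ... allocate device memory and launch kernels for
           // [begin, end) here ...
           printf("device %d handles [%d, %d)\n", dev, begin, end);
       }
   }
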
For a complete description of the HIP programming language, see the :doc:`HIP documentation <hip:index>`.