start hcq docs (#5411)

* start hcq docs
* more hcq docs
* docs
* docs
* linter
* correct args
* linter
* ts returns int
2026-01-09 15:08:02 -05:00 · 2024-07-15 21:31:11 +03:00
parent 9a7d5a148e
commit c9ec7ce070
7 changed files with 371 additions and 32 deletions
--- a/docs/developer.md
+++ b/docs/developer.md
@@ -44,3 +44,13 @@ Creating `ExecItem`, which has a run method
        members: true

 Lists of `ExecItem` can be condensed into a single ExecItem with the Graph API (rename to Queue?)
+
+## Runtime
+
+Runtimes are responsible for device-specific interactions. They handle tasks such as initializing devices, allocating memory, loading/launching programs, and more. You can find more information about the runtimes API on the [runtime overview page](runtime/overview.md).
+
+All runtime implementations can be found in the [runtime directory](https://github.com/tinygrad/tinygrad/tree/master/tinygrad/runtime).
+
+### HCQ Compatible Runtimes
+
+The HCQ API is a lower-level API for defining runtimes. Interaction with HCQ-compatible devices occurs at a lower level, with commands issued directly to hardware queues. Some examples of such backends are [NV](https://github.com/tinygrad/tinygrad/tree/master/tinygrad/runtime/ops_nv.py) and [AMD](https://github.com/tinygrad/tinygrad/tree/master/tinygrad/runtime/ops_amd.py), which are userspace drivers for NVIDIA and AMD devices respectively. You can find more information about the API on the [HCQ overview page](runtime/hcq.md).
--- a/docs/runtime/hcq.md
+++ b/docs/runtime/hcq.md
@@ -0,0 +1,137 @@
+# HCQ Compatible Runtime
+
+## Overview
+
+The defining aspect of HCQ-compatible runtimes is how they interact with devices. In HCQ, all interactions with devices occur in a hardware-friendly manner using [command queues](#command-queues). This approach allows commands to be issued directly to devices, bypassing the overhead of runtimes such as HIP or CUDA. Additionally, by using the HCQ API, these runtimes can benefit from various optimizations and features, including [HCQGraph](#hcqgraph) and built-in profiling capabilities.
+
+### Command Queues
+
+To interact with devices, there are 2 types of queues: `HWComputeQueue` and `HWCopyQueue`. Commands defined in the base `HWCommandQueue` class must be supported by both queues. These are the timestamp and synchronization commands, such as [signal](#tinygrad.device.HWCommandQueue.signal) and [wait](#tinygrad.device.HWCommandQueue.wait).
+
+For example, the following Python code enqueues a wait, execute, and signal command on the HCQ-compatible device:
+```python
+HWComputeQueue().wait(signal_to_wait, value_to_wait) \
+                .exec(program, kernargs_ptr, global_dims, local_dims) \
+                .signal(signal_to_fire, value_to_fire) \
+                .submit(your_device)
+```
+
+Each runtime should implement the required functions that are defined in the `HWCommandQueue`, `HWComputeQueue`, and `HWCopyQueue` classes.
+
+::: tinygrad.device.HWCommandQueue
+    options:
+        members: [
+            "signal",
+            "wait",
+            "timestamp",
+            "update_signal",
+            "update_wait",
+            "submit",
+        ]
+        show_source: false
+
+::: tinygrad.device.HWComputeQueue
+    options:
+        members: [
+            "memory_barrier",
+            "exec",
+            "update_exec",
+        ]
+        show_source: false
+
+::: tinygrad.device.HWCopyQueue
+    options:
+        members: [
+            "copy",
+            "update_copy",
+        ]
+        show_source: false
+
+#### Implementing custom commands
+
+To implement custom commands in the queue, use the `@hcq_command` decorator for your command implementations.
+
+::: tinygrad.device.hcq_command
+    options:
+        members: [
+            "copy",
+            "update_copy",
+        ]
+        show_source: false
+
+### HCQ Compatible Device
+
+The `HCQCompatCompiled` class defines the API for HCQ-compatible devices. This class serves as an abstract base class that device-specific implementations should inherit from and implement.
+
+::: tinygrad.device.HCQCompatCompiled
+    options:
+        members: [
+            "_alloc_signal",
+            "_free_signal",
+            "_read_signal",
+            "_read_timestamp",
+            "_set_signal",
+            "_wait_signal",
+            "_gpu2cpu_time",
+        ]
+        show_source: false
+
+#### Signals
+
+Signals are device-dependent structures used for synchronization and timing in HCQ-compatible devices. They should be designed to record both a `value` and a `timestamp` within the same signal. The following Python code demonstrates the usage of signals:
+
+```python
+signal = your_device._alloc_signal()
+
+HWComputeQueue().timestamp(signal) \
+                .signal(signal, value_to_fire) \
+                .submit(your_device)
+
+your_device._wait_signal(signal, value_to_fire)
+timestamp = your_device._read_timestamp(signal)
+```
+
+##### Synchronization signals
+
+Each HCQ-compatible device must allocate two signals for global synchronization purposes. These signals are passed to the `HCQCompatCompiled` base class during initialization: an active timeline signal `self.timeline_signal` and a shadow timeline signal `self._shadow_timeline_signal`, which helps to handle signal value overflow. You can find more about synchronization in the [synchronization section](#synchronization).
+
+### HCQ Compatible Allocator
+
+The `HCQCompatAllocator` base class simplifies allocator logic by leveraging the [command queue](#command-queues) abstractions. This class handles copy and transfer operations, leaving only the alloc and free functions to be implemented by individual backends.
+
+::: tinygrad.device.HCQCompatAllocator
+    options:
+        members: [
+            "_alloc",
+            "_free",
+        ]
+        show_source: false
+
+#### HCQ Allocator Result Protocol
+
+Backends must adhere to the `HCQCompatAllocRes` protocol when returning allocation results.
+
+::: tinygrad.device.HCQCompatAllocRes
+    options:
+        members: true
+        show_source: false
+
+### Synchronization
+
+HCQ-compatible devices use a global timeline signal for synchronizing all operations. This mechanism ensures proper ordering and completion of tasks across the device. By convention, `self.timeline_value` holds the next value to signal. So, to wait for all previous operations on the device to complete, wait for the timeline signal to reach `self.timeline_value - 1`. The following Python code demonstrates the typical usage of signals to synchronize execution with other operations on the device:
+
+```python
+HWComputeQueue().wait(your_device.timeline_signal, your_device.timeline_value - 1) \
+                .exec(...) \
+                .signal(your_device.timeline_signal, your_device.timeline_value) \
+                .submit(your_device)
+your_device.timeline_value += 1
+
+# Optionally wait for execution
+your_device._wait_signal(your_device.timeline_signal, your_device.timeline_value - 1)
+```
+
+## HCQGraph
+
+[HCQGraph](https://github.com/tinygrad/tinygrad/tree/master/tinygrad/runtime/graph/hcq.py) is a core feature that implements `GraphRunner` for HCQ-compatible devices. `HCQGraph` builds a static `HWComputeQueue` and `HWCopyQueue` for all operations per device. To optimize enqueue time, only the necessary parts of the queues are updated for each run using the update APIs of the queues, avoiding a complete rebuild.
+Optionally, queues can implement a `bind` API, which allows further optimization by eliminating the need to copy the queues into the device ring.
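Note on the synchronization convention documented in `hcq.md`: the following is a minimal, self-contained sketch of the timeline bookkeeping, not tinygrad code. `ToyDevice` and `ToyQueue` are hypothetical stand-ins, the signal is modelled as a one-element list, and `submit` runs the commands immediately on the host instead of handing them to hardware. It only illustrates why each submission waits for `timeline_value - 1`, signals `timeline_value`, and then increments the counter.

```python
class ToyDevice:
    """Hypothetical stand-in for an HCQ-compatible device (not tinygrad code)."""
    def __init__(self):
        self.timeline_signal = [0]  # toy signal: a one-element list holding the current value
        self.timeline_value = 1     # next value to signal, following the documented convention
        self.executed = []          # names of the kernels "run" so far


class ToyQueue:
    """Hypothetical queue that records commands and replays them in order on submit."""
    def __init__(self):
        self.cmds = []

    def wait(self, signal, value):
        self.cmds.append(("wait", signal, value))
        return self

    def exec(self, name):
        self.cmds.append(("exec", name))
        return self

    def signal(self, signal, value):
        self.cmds.append(("signal", signal, value))
        return self

    def submit(self, device):
        for cmd in self.cmds:
            if cmd[0] == "wait":
                # Real hardware would stall here; the toy model only checks the condition.
                assert cmd[1][0] >= cmd[2], "wait would never complete in this toy model"
            elif cmd[0] == "exec":
                device.executed.append(cmd[1])
            elif cmd[0] == "signal":
                cmd[1][0] = cmd[2]
        return self


def run_kernel(dev, name):
    # Wait for everything submitted before, run, then publish the new timeline value.
    ToyQueue().wait(dev.timeline_signal, dev.timeline_value - 1) \
              .exec(name) \
              .signal(dev.timeline_signal, dev.timeline_value) \
              .submit(dev)
    dev.timeline_value += 1


dev = ToyDevice()
run_kernel(dev, "kernel_a")
run_kernel(dev, "kernel_b")
```

After the two submissions the toy signal holds 2 and `timeline_value` is 3, so a host-side wait for `timeline_value - 1` covers both kernels.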
--- a/docs/runtime/overview.md
+++ b/docs/runtime/overview.md
@@ -0,0 +1,51 @@
+# Runtime Overview
+
+## Overview
+
+A typical runtime consists of the following parts:
+
+- [Compiled](#compiled)
+- [Allocator](#allocator)
+- [Program](#program)
+- [Compiler](#compiler)
+
+### Compiled
+
+The `Compiled` class is responsible for initializing and managing a device.
+
+::: tinygrad.device.Compiled
+    options:
+        members: [
+            "synchronize"
+        ]
+        show_source: false
+
+### Allocator
+
+The `Allocator` class is responsible for managing memory on the device. There is also a version called the `LRUAllocator`, which caches allocated buffers to optimize performance.
+
+::: tinygrad.device.Allocator
+    options:
+        members: true
+        show_source: false
+
+::: tinygrad.device.LRUAllocator
+    options:
+        members: true
+        show_source: false
+
+### Program
+
+The `Program` class is created for each loaded program. It is responsible for loading and executing the program on the device. As an example, here is the `ClangProgram` implementation, which loads a program and runs it.
+
+::: tinygrad.runtime.ops_clang.ClangProgram
+    options:
+        members: true
+
+### Compiler
+
+The `Compiler` class compiles the output from the `Renderer` into a device-specific format.
+
+::: tinygrad.device.Compiler
+    options:
+        members: true
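Note on the `LRUAllocator` description in `overview.md`: a minimal sketch of the caching idea, assuming that freed buffers are kept in a per-size list and handed back on the next allocation of the same size. `ToyAllocator`, `ToyLRUAllocator` and `free_cache` are hypothetical names for illustration; the real class also keys the cache on `BufferOptions`, which is omitted here.

```python
from collections import defaultdict


class ToyAllocator:
    """Hypothetical backend allocator that counts real allocations and frees."""
    def __init__(self):
        self.real_allocs = 0
        self.real_frees = 0

    def _alloc(self, size):
        self.real_allocs += 1
        return bytearray(size)

    def _free(self, buf):
        self.real_frees += 1


class ToyLRUAllocator(ToyAllocator):
    """Keeps released buffers in a per-size cache instead of freeing them right away."""
    def __init__(self):
        super().__init__()
        self.cache = defaultdict(list)  # size -> list of released buffers

    def alloc(self, size):
        # Reuse a cached buffer of the same size if one is available.
        if len(c := self.cache[size]):
            return c.pop()
        return self._alloc(size)

    def free(self, buf):
        # Defer the real free: park the buffer in the cache.
        self.cache[len(buf)].append(buf)

    def free_cache(self):
        # Actually release everything that is still cached.
        for bufs in self.cache.values():
            for buf in bufs:
                self._free(buf)
        self.cache.clear()


allocator = ToyLRUAllocator()
first = allocator.alloc(1024)
allocator.free(first)
second = allocator.alloc(1024)  # served from the cache, no new backend allocation
```

Here `second` is the same object as `first`, and the backend allocator has been called only once.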
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -18,7 +18,11 @@ nav:
    - dtypes: dtypes.md
    - nn (Neural Networks): nn.md
    - Environment Variables: env_vars.md
-  - Developer: developer.md
+  - Developer:
+    - developer.md
+    - Runtime:
+      - runtime/overview.md
+      - HCQ: runtime/hcq.md
  - tinybox: tinybox.md
 #- tinygrad: reference/

--- a/tinygrad/device.py
+++ b/tinygrad/device.py
@@ -140,6 +140,10 @@ class Allocator:
  def copyout(self, dest:memoryview, src): raise NotImplementedError("need copyout")

 class LRUAllocator(Allocator):  # pylint: disable=abstract-method
+  """
+  The LRU Allocator is responsible for caching buffers.
+  It ensures that buffers are not freed until it is absolutely necessary, optimizing performance.
+  """
  def __init__(self): self.cache: Dict[Tuple[int, Optional[BufferOptions]], Any] = defaultdict(list)
  def alloc(self, size:int, options:Optional[BufferOptions]=None):
    if len(c := self.cache[(size, options)]): return c.pop()
@@ -182,19 +186,25 @@ class Compiled:
  def __init__(self, device:str, allocator:Allocator, renderer:Optional[Renderer], compiler:Optional[Compiler], runtime, graph=None):
    self.dname, self.allocator, self.compiler, self.runtime, self.graph = device, allocator, compiler or Compiler(), runtime, graph
    self.renderer = renderer or Renderer()
-  def synchronize(self): pass  # override this in your device
+  def synchronize(self):
+    """
+    Synchronize all pending operations on the device.
+
+    This method ensures that all previously queued operations on the device have been completed before proceeding.
+    """
+    # override this in your device implementation

 # **************** for HCQ Compatible Devices ****************

 def hcq_command(func):
  """
-  Decorator for HWCommandQueue commands.
+  Decorator for HWCommandQueue commands. Enables command indexing and stores metadata for command updates.

-  Enables command indexing and stores metadata for command updates.
-
-  Usage:
-  @hcq_command
-  def command_method(self, ...): ...
+  For example:
+    ```python
+      @hcq_command
+      def command_method(self, ...): ...
+    ```
  """
  def __wrapper(self, *args, **kwargs):
    self.cmds_offset.append(len(self.q))
@@ -205,49 +215,121 @@ def hcq_command(func):
  return __wrapper

 class HWCommandQueue:
+  """
+  A base class for hardware command queues in the HCQ (Hardware Command Queue) API.
+  Both compute and copy queues should have the following commands implemented.
+  """
+
  def __init__(self): self.q, self.binded_device, self.cmds_offset, self.cmds_len, self.cmds_meta = [], None, [], [], []
  def __len__(self): return len(self.cmds_offset)
  def _patch(self, cmd_idx, offset, data): self.q[(st:=self.cmds_offset[cmd_idx]+offset):st+len(data)] = array.array('I', data)

  @hcq_command
-  def signal(self, signal, value): self._signal(signal, value)
+  def signal(self, signal, value):
+    """
+    Enqueues a signal command which sets the signal to the given value, ensuring all previous operations are completed.
+
+    Args:
+      signal: The signal to set
+      value: The value to set the signal to
+    """
+    self._signal(signal, value)
  def _signal(self, signal, value): raise NotImplementedError("backend should overload this function")

  @hcq_command
-  def wait(self, signal, value): self._wait(signal, value)
+  def wait(self, signal, value):
+    """
+    Enqueues a wait command which halts execution until the signal is greater than or equal to a specific value.
+
+    Args:
+      signal: The signal to wait on
+      value: The value to wait for
+    """
+    self._wait(signal, value)
  def _wait(self, signal, value): raise NotImplementedError("backend should overload this function")

  @hcq_command
-  def timestamp(self, signal): self._timestamp(signal)
+  def timestamp(self, signal):
+    """
+    Enqueues a timestamp command which records the current time in a signal after all previously enqueued commands are completed.
+
+    Args:
+      signal: The signal to store the timestamp
+    """
+    self._timestamp(signal)
  def _timestamp(self, signal): raise NotImplementedError("backend should overload this function")

  def update_signal(self, cmd_idx, signal=None, value=None):
+    """
+    Updates a previously queued signal command.
+
+    Args:
+      cmd_idx: Index of the signal command to update
+      signal: New signal to set (if None, keeps the original)
+      value: New value to set (if None, keeps the original)
+    """
    if self.cmds_meta[cmd_idx] != "signal": raise RuntimeError("called update_signal not on a signal command")
    self._update_signal(cmd_idx, signal, value)
    return self
  def _update_signal(self, cmd_idx, signal, value): raise NotImplementedError("backend should overload this function")

  def update_wait(self, cmd_idx, signal=None, value=None):
+    """
+    Updates a previously queued wait command.
+
+    Args:
+      cmd_idx: Index of the wait command to update
+      signal: New signal to wait on (if None, keeps the original)
+      value: New value to wait for (if None, keeps the original)
+    """
    if self.cmds_meta[cmd_idx] != "wait": raise RuntimeError("called update_wait not on a wait command")
    self._update_wait(cmd_idx, signal, value)
    return self
  def _update_wait(self, cmd_idx, signal, value): raise NotImplementedError("backend should overload this function")

  def submit(self, device:HCQCompatCompiled):
+    """
+    Submits the command queue to a specific device for execution.
+
+    Args:
+      device: The device to submit the queue to
+    """
    self._submit(device)
    return self
  def _submit(self, device:HCQCompatCompiled): raise NotImplementedError("backend should overload this function")

 class HWComputeQueue(HWCommandQueue):
  @hcq_command
-  def memory_barrier(self): self._memory_barrier()
+  def memory_barrier(self):
+    """
+    Enqueues a memory barrier command to ensure memory coherence between agents.
+    """
+    self._memory_barrier()
  def _memory_barrier(self): pass

  @hcq_command
-  def exec(self, prg, kernargs, global_size, local_size): self._exec(prg, kernargs, global_size, local_size)
+  def exec(self, prg, kernargs, global_size, local_size):
+    """
+    Enqueues an execution command for a kernel program.
+
+    Args:
+      prg: The program to execute
+      kernargs: The pointer to kernel arguments
+      global_size: The global work size
+      local_size: The local work size
+    """
+    self._exec(prg, kernargs, global_size, local_size)
  def _exec(self, prg, kernargs, global_size, local_size): raise NotImplementedError("backend should overload this function")

  def update_exec(self, cmd_idx, global_size, local_size):
+    """
+    Updates a previously queued execution command.
+
+    Args:
+      cmd_idx: Index of the execution command to update
+      global_size: New global work size
+      local_size: New local work size
+    """
    if self.cmds_meta[cmd_idx] != "exec": raise RuntimeError("called update_exec not on an exec command")
    self._update_exec(cmd_idx, global_size, local_size)
    return self
@@ -255,10 +337,27 @@ class HWComputeQueue(HWCommandQueue):

 class HWCopyQueue(HWCommandQueue):
  @hcq_command
-  def copy(self, dest, src, copy_size): self._copy(dest, src, copy_size)
+  def copy(self, dest, src, copy_size):
+    """
+    Enqueues a copy command to transfer data.
+
+    Args:
+      dest: The destination of the copy
+      src: The source of the copy
+      copy_size: The size of data to copy
+    """
+    self._copy(dest, src, copy_size)
  def _copy(self, dest, src, copy_size): raise NotImplementedError("backend should overload this function")

  def update_copy(self, cmd_idx, dest=None, src=None):
+    """
+    Updates a previously queued copy command.
+
+    Args:
+      cmd_idx: Index of the copy command to update
+      dest: New destination of the copy (if None, keeps the original)
+      src: New source of the copy (if None, keeps the original)
+    """
    if self.cmds_meta[cmd_idx] != "copy": raise RuntimeError("called update_copy not on an copy command")
    self._update_copy(cmd_idx, dest, src)
    return self
@@ -284,36 +383,68 @@ class HCQCompatProgram:
  def fill_kernargs(self, kernargs_ptr:int, bufs:Tuple[Any, ...], vals:Tuple[int, ...]=()): raise NotImplementedError("need fill_kernargs")

 class HCQCompatCompiled(Compiled):
+  """
+  A base class for devices compatible with the HCQ (Hardware Command Queue) API.
+  """
+
  def __init__(self, device:str, allocator:Allocator, renderer:Renderer, compiler:Compiler, runtime, comp_queue_t, copy_queue_t, timeline_signals):
    self.hw_compute_queue_t, self.hw_copy_queue_t = comp_queue_t, copy_queue_t
-    self.timeline_value: int = 1
+    self.timeline_value:int = 1
    self.timeline_signal, self._shadow_timeline_signal = timeline_signals
-    self.sig_prof_records: List[Tuple[Any, Any, str, bool]] = []
-    self.raw_prof_records: List[Tuple[int, int, str, bool]] = []
+    self.sig_prof_records:List[Tuple[Any, Any, str, bool]] = []
+    self.raw_prof_records:List[Tuple[int, int, str, bool]] = []
    if PROFILE: self._prof_setup()

    from tinygrad.runtime.graph.hcq import HCQGraph
    super().__init__(device, allocator, renderer, compiler, runtime, HCQGraph)

  @classmethod
-  def _read_signal(self, sig): raise NotImplementedError("need _read_signal") # reads a value for a signal
+  def _read_signal(cls, signal:Any) -> Any:
+    """
+    Read a value for a signal.
+    """
+    raise NotImplementedError("_read_signal needs to be implemented")

  @classmethod
-  def _read_timestamp(self, sig): raise NotImplementedError("need _read_timestamp") # reads a timestamp for a signal
+  def _read_timestamp(cls, signal:Any) -> int:
+    """
+    Read a timestamp for a signal.
+    """
+    raise NotImplementedError("_read_timestamp needs to be implemented")

  @classmethod
-  def _set_signal(self, sig, value): raise NotImplementedError("need _set_signal") # sets a value for a signal
+  def _set_signal(cls, signal:Any, value:Any) -> None:
+    """
+    Set a value for a signal.
+    """
+    raise NotImplementedError("_set_signal needs to be implemented")

  @classmethod
-  def _alloc_signal(self, value=0, **kwargs): raise NotImplementedError("need _alloc_signal") # allocates a new signal
+  def _alloc_signal(cls, value:Any = 0, **kwargs) -> Any:
+    """
+    Allocate a new signal.
+    """
+    raise NotImplementedError("_alloc_signal needs to be implemented")

  @classmethod
-  def _free_signal(self, sig): raise NotImplementedError("need _free_signal") # frees a signal
+  def _free_signal(cls, signal:Any) -> None:
+    """
+    Free a signal.
+    """
+    raise NotImplementedError("_free_signal needs to be implemented")

  @classmethod
-  def _wait_signal(self, signal, value=0, timeout=10000): raise NotImplementedError("need _wait_signal") # waits for a signal value
+  def _wait_signal(cls, signal:Any, value:Any = 0, timeout:int = 10000) -> None:
+    """
+    Wait for a signal to reach a specific value.
+    """
+    raise NotImplementedError("_wait_signal needs to be implemented")

-  def _gpu2cpu_time(self, gpu_time, is_copy): raise NotImplementedError("need _gpu2cpu_time")
+  def _gpu2cpu_time(self, gpu_time:float, is_copy:bool) -> float:
+    """
+    Convert GPU time to CPU time. The `is_copy` flag indicates whether this is a copy queue.
+    """
+    raise NotImplementedError("_gpu2cpu_time needs to be implemented")

  def _prof_setup(self):
    if not hasattr(self, 'profile_logger'): atexit.register(self._prof_finalize)
@@ -347,6 +478,12 @@ class HCQCompatCompiled(Compiled):
 class HCQCompatAllocRes(Protocol): va_addr: int; size: int # noqa: E702

 class HCQCompatAllocator(LRUAllocator): # pylint: disable=abstract-method
+  """
+  A base allocator class compatible with the HCQ (Hardware Command Queue) API.
+
+  This class implements basic copy operations following the HCQ API, utilizing both `HWComputeQueue` and `HWCopyQueue`.
+  """
+
  def __init__(self, device, batch_size=(2 << 20), batch_cnt=32):
    self.device = device
    self.b = [self._alloc(batch_size, BufferOptions(host=True)) for _ in range(batch_cnt)]
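Note on `hcq_command` and the `update_*` methods in `device.py`: the sketch below models the per-command bookkeeping that makes in-place updates possible. Only part of the wrapper body is visible in this diff, so recording `cmds_len` and `cmds_meta` inside the wrapper is an assumption inferred from the fields in `HWCommandQueue.__init__` and the `cmds_meta` checks in `update_signal`. `toy_hcq_command`, `ToyQueue` and the 3-word command encoding are hypothetical.

```python
import functools


def toy_hcq_command(func):
    """Hypothetical model of the decorator: records where each command lives in the queue."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        self.cmds_offset.append(len(self.q))                      # start of this command in self.q
        func(self, *args, **kwargs)                               # the command appends its words
        self.cmds_len.append(len(self.q) - self.cmds_offset[-1])  # assumed: length bookkeeping
        self.cmds_meta.append(func.__name__)                      # assumed: used by update_* checks
        return self                                               # enables call chaining
    return wrapper


class ToyQueue:
    def __init__(self):
        self.q, self.cmds_offset, self.cmds_len, self.cmds_meta = [], [], [], []

    def __len__(self):
        return len(self.cmds_offset)

    def _patch(self, cmd_idx, offset, data):
        # Overwrite words of an already enqueued command in place.
        start = self.cmds_offset[cmd_idx] + offset
        self.q[start:start + len(data)] = data

    @toy_hcq_command
    def signal(self, signal_id, value):
        self.q += [0x1, signal_id, value]  # made-up encoding: opcode, signal, value

    @toy_hcq_command
    def wait(self, signal_id, value):
        self.q += [0x2, signal_id, value]

    def update_signal(self, cmd_idx, signal_id=None, value=None):
        if self.cmds_meta[cmd_idx] != "signal":
            raise RuntimeError("called update_signal not on a signal command")
        if signal_id is not None:
            self._patch(cmd_idx, 1, [signal_id])
        if value is not None:
            self._patch(cmd_idx, 2, [value])
        return self


queue = ToyQueue().wait(7, 1).signal(7, 2)
queue.update_signal(1, value=5)  # patch only the value word of the second command
```

Because each command's offset is recorded when it is enqueued, an update touches only the affected words, which is the property `HCQGraph` relies on to avoid rebuilding its queues on every run.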
--- a/tinygrad/runtime/ops_amd.py
+++ b/tinygrad/runtime/ops_amd.py
@@ -410,13 +410,13 @@ class AMDDevice(HCQCompatCompiled):
    kio.free_memory_of_gpu(self.kfd, handle=mem.handle)

  @classmethod
-  def _read_signal(self, sig): return sig.value
+  def _read_signal(self, signal): return signal.value

  @classmethod
-  def _read_timestamp(self, sig): return sig.start_ts
+  def _read_timestamp(self, signal): return signal.start_ts

  @classmethod
-  def _set_signal(self, sig, value): sig.value = value
+  def _set_signal(self, signal, value): signal.value = value

  @classmethod
  def _alloc_signal(self, value=0, **kwargs) -> hsa.amd_signal_t:
@@ -428,7 +428,7 @@ class AMDDevice(HCQCompatCompiled):
    return ret

  @classmethod
-  def _free_signal(self, sig): self.signals_pool.append(sig)
+  def _free_signal(self, signal): self.signals_pool.append(signal)

  @classmethod
  def _wait_signal(self, signal:hsa.amd_signal_t, value=0, timeout=10000):
--- a/tinygrad/runtime/ops_nv.py
+++ b/tinygrad/runtime/ops_nv.py
@@ -552,13 +552,13 @@ class NVDevice(HCQCompatCompiled):
    NVDevice.devices.append(self)

  @classmethod
-  def _read_signal(self, sig): return sig[0]
+  def _read_signal(self, signal): return signal[0]

  @classmethod
-  def _read_timestamp(self, sig): return sig[1]
+  def _read_timestamp(self, signal): return signal[1]

  @classmethod
-  def _set_signal(self, sig, value): sig[0] = value
+  def _set_signal(self, signal, value): signal[0] = value

  @classmethod
  def _alloc_signal(self, value=0, **kwargs) -> memoryview:
@@ -566,7 +566,7 @@ class NVDevice(HCQCompatCompiled):
    return sig

  @classmethod
-  def _free_signal(self, sig): self.signals_pool.append(sig)
+  def _free_signal(self, signal): self.signals_pool.append(signal)

  @classmethod
  def _wait_signal(self, signal, value=0, timeout=10000):