docs: user runtime docs (#5756)

This commit is contained in:
nimlgen
2024-07-27 23:21:54 +03:00
committed by GitHub
parent 5d53fa491b
commit fff19b961b
7 changed files with 22 additions and 8 deletions

@@ -0,0 +1,56 @@
The tinygrad framework has four pieces:
* a PyTorch-like <b>frontend</b>.
* a <b>scheduler</b> which breaks the compute into kernels.
* a <b>lowering</b> engine which converts ASTs into code that can run on the accelerator.
* an <b>execution</b> engine which can run that code.
There is a good [set of tutorials](https://mesozoic-egg.github.io/tinygrad-notes/) by Di Zhu that goes over tinygrad internals.
## Frontend
Everything in [Tensor](../tensor/index.md) is syntactic sugar around [function.py](function.md), where the forward and backward passes are implemented for the different functions. There are about 25 of them, implemented using about 20 basic ops. Those basic ops go on to construct a graph of:
::: tinygrad.lazy.LazyBuffer
options:
show_source: false
The `LazyBuffer` graph specifies the compute in terms of low level tinygrad ops. Not all LazyBuffers will actually become realized. There are two types of LazyBuffers: base and view. A base computes into a contiguous buffer, while a view is a reinterpretation of a base (specified by a ShapeTracker). Inputs to a base can be either bases or views; the input to a view can only be a single base.
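The base/view invariant can be sketched with a toy class (this `Buf` is hypothetical, not tinygrad's actual `LazyBuffer`):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Buf:
    # a base owns compute into a contiguous buffer; a view reinterprets a base
    op: Optional[str] = None        # compute op for a base, None for a view
    base: Optional["Buf"] = None    # a view points at exactly one base

    @property
    def is_view(self) -> bool:
        return self.base is not None

a = Buf(op="ADD")   # base
v = Buf(base=a)     # view of that base (in tinygrad, shaped by a ShapeTracker)
assert v.is_view and not a.is_view
```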
## Scheduling
The [scheduler](https://github.com/tinygrad/tinygrad/tree/master/tinygrad/engine/schedule.py) converts the graph of LazyBuffers into a list of `ScheduleItem`. One `ScheduleItem` is one kernel on the GPU, and the scheduler is responsible for breaking the large compute graph into subgraphs that can fit in a kernel. `ast` specifies what compute to run, and `bufs` specifies what buffers to run it on.
::: tinygrad.engine.schedule.ScheduleItem
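As a rough mental model (field names follow the description above; the real class lives in `tinygrad/engine/schedule.py`), a `ScheduleItem` pairs the compute with its buffers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScheduleItemSketch:
    ast: tuple   # what compute the kernel runs
    bufs: tuple  # which buffers it runs on (outputs first, by convention)

item = ScheduleItemSketch(ast=("STORE", ("ADD", "in0", "in1")),
                          bufs=("out", "in0", "in1"))
assert item.bufs[0] == "out"
```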
## Lowering
The code in [realize](https://github.com/tinygrad/tinygrad/tree/master/tinygrad/engine/realize.py) lowers `ScheduleItem` to `ExecItem` with
::: tinygrad.engine.realize.lower_schedule
There's a ton of complexity hidden behind this, see the `codegen/` directory.
First we lower the AST to UOps, which is a linear list of the compute to be run. This is where the BEAM search happens.
Then we render the UOps into code with a `Renderer`, then we compile the code to binary with a `Compiler`.
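The three stages above can be sketched as a small pipeline; these function names and the hard-coded UOps are illustrative, not tinygrad's actual API:

```python
# AST -> UOps: a linear list of micro-ops (where the BEAM search would happen)
def lower(ast):
    return ["DEFINE_GLOBAL", "LOAD", "ALU_ADD", "STORE"]

# UOps -> source code: the Renderer's job
def render(uops):
    return "/* " + " ".join(uops) + " */"

# source -> binary: the Compiler's job
def compile_src(src):
    return src.encode()

binary = compile_src(render(lower(("ADD", "a", "b"))))
assert isinstance(binary, bytes)
```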
## Execution
Lowering produces an `ExecItem`, which has a run method:
::: tinygrad.engine.realize.ExecItem
options:
members: true
Lists of `ExecItem` can be condensed into a single `ExecItem` with the Graph API (rename to Queue?)
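The condensing idea can be sketched with toy classes (hypothetical names, not the actual Graph API):

```python
class ToyExecItem:
    """Sketch: an ExecItem pairs a runnable with its buffers."""
    def __init__(self, fn, bufs):
        self.fn, self.bufs = fn, bufs
    def run(self):
        return self.fn(*self.bufs)

def graph(items):
    # condense a list of items into one item that runs them all in order
    return ToyExecItem(lambda: [it.run() for it in items], ())

log = []
items = [ToyExecItem(lambda x: log.append(x), (i,)) for i in range(3)]
graph(items).run()
assert log == [0, 1, 2]
```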
## Runtime
Runtimes are responsible for device-specific interactions. They handle tasks such as initializing devices, allocating memory, loading/launching programs, and more. You can find more information about the runtimes API on the [runtime overview page](runtime.md).
All runtime implementations can be found in the [runtime directory](https://github.com/tinygrad/tinygrad/tree/master/tinygrad/runtime).
### HCQ Compatible Runtimes
The HCQ API is a lower-level API for defining runtimes. Interaction with HCQ-compatible devices occurs at a lower level, with commands issued directly to hardware queues. Examples of such backends are [NV](https://github.com/tinygrad/tinygrad/tree/master/tinygrad/runtime/ops_nv.py) and [AMD](https://github.com/tinygrad/tinygrad/tree/master/tinygrad/runtime/ops_amd.py), which are userspace drivers for NVIDIA and AMD devices respectively. You can find more information about the API on the [HCQ overview page](hcq.md).

@@ -0,0 +1,33 @@
::: tinygrad.function
options:
members: [
"Contiguous",
"ContiguousBackward",
"Cast",
"Neg",
"Reciprocal",
"Sin",
"Relu",
"Log",
"Exp",
"Sqrt",
"Sigmoid",
"Sign",
"Less",
"Eq",
"Xor",
"Add",
"Sub",
"Mul",
"Div",
"Where",
"Sum",
"Max",
"Expand",
"Reshape",
"Permute",
"Pad",
"Shrink",
"Flip",
]
show_source: false

docs/developer/hcq.md Normal file

@@ -0,0 +1,145 @@
# HCQ Compatible Runtime
## Overview
The main aspect of HCQ-compatible runtimes is how they interact with devices. In HCQ, all interactions with devices occur in a hardware-friendly manner using [command queues](#command-queues). This approach allows commands to be issued directly to devices, bypassing the overhead of runtimes such as HIP or CUDA. Additionally, by using the HCQ API, these runtimes benefit from various optimizations and features, including [HCQGraph](#hcqgraph) and built-in profiling capabilities.
### Command Queues
To interact with devices, there are two types of queues: `HWComputeQueue` and `HWCopyQueue`. Commands defined in the base `HWCommandQueue` class must be supported by both queues; these include timestamp and synchronization methods such as [signal](#tinygrad.device.HWCommandQueue.signal) and [wait](#tinygrad.device.HWCommandQueue.wait).
For example, the following Python code enqueues a wait, execute, and signal command on the HCQ-compatible device:
```python
HWComputeQueue().wait(signal_to_wait, value_to_wait) \
.exec(program, kernargs_ptr, global_dims, local_dims) \
.signal(signal_to_fire, value_to_fire) \
.submit(your_device)
```
Each runtime should implement the required functions that are defined in the `HWCommandQueue`, `HWComputeQueue`, and `HWCopyQueue` classes.
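The builder pattern these queue classes follow can be sketched in plain Python (a toy, not the actual `HWCommandQueue`): each command method records a command and returns `self` so calls chain, and `submit` hands the recorded commands to a device.

```python
class ToyQueue:
    """Sketch of the command-queue builder pattern."""
    def __init__(self):
        self.cmds = []
    def wait(self, signal, value):
        self.cmds.append(("wait", signal, value)); return self
    def exec(self, prg, args):
        self.cmds.append(("exec", prg, args)); return self
    def signal(self, signal, value):
        self.cmds.append(("signal", signal, value)); return self
    def submit(self, device):
        device.extend(self.cmds); return self  # "device" is just a list here

dev = []
ToyQueue().wait("s0", 1).exec("prg", ()).signal("s0", 2).submit(dev)
assert [c[0] for c in dev] == ["wait", "exec", "signal"]
```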
::: tinygrad.device.HWCommandQueue
options:
members: [
"signal",
"wait",
"timestamp",
"update_signal",
"update_wait",
"submit",
]
show_source: false
::: tinygrad.device.HWComputeQueue
options:
members: [
"memory_barrier",
"exec",
"update_exec",
]
show_source: false
::: tinygrad.device.HWCopyQueue
options:
members: [
"copy",
"update_copy",
]
show_source: false
#### Implementing custom commands
To implement custom commands in the queue, use the `@hcq_command` decorator for your command implementations.
::: tinygrad.device.hcq_command
    options:
        show_source: false
### HCQ Compatible Device
The `HCQCompiled` class defines the API for HCQ-compatible devices. This class serves as an abstract base class that device-specific implementations should inherit from and implement.
::: tinygrad.device.HCQCompiled
options:
show_source: false
#### Signals
Signals are device-dependent structures used for synchronization and timing in HCQ-compatible devices. They should be designed to record both a `value` and a `timestamp` within the same signal. HCQ-compatible backend implementations should use `HCQSignal` as a base class.
::: tinygrad.device.HCQSignal
options:
members: [value, timestamp, wait]
show_source: false
The following Python code demonstrates the usage of signals:
```python
signal = your_device.signal_t()
HWComputeQueue().timestamp(signal) \
.signal(signal, value_to_fire) \
.submit(your_device)
signal.wait(value_to_fire)
signaled_value = signal.value # should be the same as `value_to_fire`
timestamp = signal.timestamp
```
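A toy signal illustrating the value-plus-timestamp contract (hypothetical and host-side only; real signals live in device-visible memory and are fired by the hardware):

```python
import time

class ToySignal:
    """Sketch of a signal: one structure recording both a value and a timestamp."""
    def __init__(self):
        self.value, self.timestamp = 0, 0.0
    def set(self, value):
        self.value, self.timestamp = value, time.monotonic()
    def wait(self, value, timeout=1.0):
        # spin until the signal reaches `value` (real backends poll device memory)
        deadline = time.monotonic() + timeout
        while self.value < value:
            if time.monotonic() > deadline:
                raise TimeoutError("signal wait timed out")

s = ToySignal()
s.set(42)       # stands in for the device firing the signal
s.wait(42)
assert s.value == 42 and s.timestamp > 0
```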
##### Synchronization signals
Each HCQ-compatible device must allocate two signals for global synchronization purposes. These signals are passed to the `HCQCompiled` base class during initialization: an active timeline signal `self.timeline_signal` and a shadow timeline signal `self._shadow_timeline_signal`, which helps to handle signal value overflow issues. You can find more about synchronization in the [synchronization section](#synchronization).
### HCQ Compatible Allocator
The `HCQAllocator` base class simplifies allocator logic by leveraging the [command queues](#command-queues) abstractions. It handles copy and transfer operations efficiently, leaving only the alloc and free functions to be implemented by individual backends.
::: tinygrad.device.HCQAllocator
options:
members: [
"_alloc",
"_free",
]
show_source: false
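The division of labor can be sketched as follows (toy classes; the real base class issues copies through an `HWCopyQueue` rather than a Python slice assignment):

```python
class ToyAllocatorBase:
    """Sketch: the base class drives copies; subclasses only provide _alloc/_free."""
    def alloc(self, size):
        return self._alloc(size)
    def copy(self, dst, src):
        dst[:len(src)] = src  # stands in for an HWCopyQueue copy command
    def _alloc(self, size):
        raise NotImplementedError
    def _free(self, buf):
        raise NotImplementedError

class ToyAllocator(ToyAllocatorBase):
    def _alloc(self, size):
        return bytearray(size)  # a real backend returns device memory
    def _free(self, buf):
        pass

a = ToyAllocator()
buf = a.alloc(4)
a.copy(buf, b"\x01\x02")
assert bytes(buf) == b"\x01\x02\x00\x00"
```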
#### HCQ Allocator Result Protocol
Backends must adhere to the `HCQBuffer` protocol when returning allocation results.
::: tinygrad.device.HCQBuffer
options:
members: true
show_source: false
### HCQ Compatible Program
`HCQProgram` is a helper base class for defining programs that run on HCQ-compatible devices. Currently, the arguments consist of pointers to buffers, followed by `vals` fields. The convention expects a packed struct containing the passed pointers, followed by `vals`, located at `kernargs_args_offset`.
::: tinygrad.device.HCQProgram
options:
members: true
show_source: false
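The packing convention can be sketched with `struct` (the exact layout and types here are illustrative, not tinygrad's actual kernarg struct):

```python
import struct

bufs = (0x1000, 0x2000)   # device pointers, packed first
vals = (16, 4)            # launch-time scalars, packed after the pointers
kernargs = struct.pack(f"<{len(bufs)}Q{len(vals)}i", *bufs, *vals)

# 8 bytes per 64-bit pointer, 4 bytes per 32-bit val
assert len(kernargs) == 8 * len(bufs) + 4 * len(vals)
```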
### Synchronization
HCQ-compatible devices use a global timeline signal to synchronize all operations. This mechanism ensures proper ordering and completion of tasks across the device. By convention, `self.timeline_value` points to the next value to signal, so to wait for all previous operations on the device to complete, wait for the value `self.timeline_value - 1`. The following Python code demonstrates the typical usage of signals to synchronize execution with other operations on the device:
```python
HWComputeQueue().wait(your_device.timeline_signal, your_device.timeline_value - 1) \
.exec(...) \
.signal(your_device.timeline_signal, your_device.timeline_value) \
.submit(your_device)
your_device.timeline_value += 1
# Optionally wait for execution
your_device.timeline_signal.wait(your_device.timeline_value - 1)
```
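The timeline convention can be sketched host-side (a toy; a real device fires the signal asynchronously once the work completes):

```python
class ToyDevice:
    """Sketch: timeline_value always points at the NEXT value to signal,
    so completed work is at timeline_value - 1."""
    def __init__(self):
        self.signaled, self.timeline_value = 0, 1
    def submit_work(self):
        self.signaled = self.timeline_value  # the device fires this when done
        self.timeline_value += 1
    def synchronize(self):
        # wait until everything submitted so far has signaled
        assert self.signaled >= self.timeline_value - 1

d = ToyDevice()
d.submit_work()
d.submit_work()
d.synchronize()
assert d.timeline_value == 3
```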
## HCQGraph
[HCQGraph](https://github.com/tinygrad/tinygrad/tree/master/tinygrad/runtime/graph/hcq.py) is a core feature that implements `GraphRunner` for HCQ-compatible devices. `HCQGraph` builds a static `HWComputeQueue` and `HWCopyQueue` for all operations per device. To optimize enqueue time, only the necessary parts of the queues are updated for each run using the update APIs of the queues, avoiding a complete rebuild.
Optionally, queues can implement a `bind` API, which allows further optimization by eliminating the need to copy the queues into the device ring.

docs/developer/runtime.md Normal file

@@ -0,0 +1,51 @@
# Runtime Overview
## Overview
A typical runtime consists of the following parts:
- [Compiled](#device)
- [Allocator](#allocator)
- [Program](#program)
- [Compiler](#compiler)
### Compiled
The `Compiled` class is responsible for initializing and managing a device.
::: tinygrad.device.Compiled
options:
members: [
"synchronize"
]
show_source: false
### Allocator
The `Allocator` class is responsible for managing memory on the device. There is also a subclass called `LRUAllocator`, which caches freed buffers for reuse to optimize performance.
::: tinygrad.device.Allocator
options:
members: true
show_source: false
::: tinygrad.device.LRUAllocator
options:
members: true
show_source: false
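The caching idea behind `LRUAllocator` can be sketched as follows (a toy keyed only by size; the real class also accounts for allocation options and evicts cached buffers when memory runs low):

```python
class ToyLRUAllocator:
    """Sketch: freed buffers are kept per-size and reused by later allocations
    instead of going back to the device allocator."""
    def __init__(self):
        self.cache = {}  # size -> list of free buffers
    def alloc(self, size):
        if self.cache.get(size):
            return self.cache[size].pop()
        return bytearray(size)  # a real backend allocates device memory
    def free(self, buf):
        self.cache.setdefault(len(buf), []).append(buf)

a = ToyLRUAllocator()
b1 = a.alloc(16)
a.free(b1)
b2 = a.alloc(16)
assert b2 is b1  # reused from the cache, no fresh allocation
```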
### Program
The `Program` class is created for each loaded program. It is responsible for executing the program on the device. As an example, here is the `ClangProgram` implementation, which loads a program and runs it.
::: tinygrad.runtime.ops_clang.ClangProgram
options:
members: true
### Compiler
The `Compiler` class compiles the source produced by the `Renderer` into a device-specific binary format.
::: tinygrad.device.Compiler
options:
members: true