start hcq docs (#5411)

* start hcq docs

* more hcq docs

* docs

* docs

* linter

* correct args

* linter

* ts returns int
nimlgen
2024-07-15 21:31:11 +03:00
committed by GitHub
parent 9a7d5a148e
commit c9ec7ce070
7 changed files with 371 additions and 32 deletions


@@ -44,3 +44,13 @@ Creating `ExecItem`, which has a run method
members: true
Lists of `ExecItem` can be condensed into a single ExecItem with the Graph API (rename to Queue?)
## Runtime
Runtimes are responsible for device-specific interactions. They handle tasks such as initializing devices, allocating memory, loading/launching programs, and more. You can find more information about the runtimes API on the [runtime overview page](runtime/overview.md).
All runtime implementations can be found in the [runtime directory](https://github.com/tinygrad/tinygrad/tree/master/tinygrad/runtime).
### HCQ Compatible Runtimes
The HCQ API is a lower-level API for defining runtimes. Interaction with HCQ-compatible devices occurs at a lower level, with commands issued directly to hardware queues. Some examples of such backends are [NV](https://github.com/tinygrad/tinygrad/tree/master/tinygrad/runtime/ops_nv.py) and [AMD](https://github.com/tinygrad/tinygrad/tree/master/tinygrad/runtime/ops_amd.py), which are userspace drivers for NVIDIA and AMD devices respectively. You can find more information about the API on the [HCQ overview page](runtime/hcq.md).

137
docs/runtime/hcq.md Normal file

@@ -0,0 +1,137 @@
# HCQ Compatible Runtime
## Overview
The main aspect of HCQ-compatible runtimes is how they interact with devices. In HCQ, all interactions with devices occur in a hardware-friendly manner using [command queues](#command-queues). This approach allows commands to be issued directly to devices, bypassing the overhead of runtimes such as HIP or CUDA. Additionally, by using the HCQ API, these runtimes can benefit from various optimizations and features, including [HCQGraph](#hcqgraph) and built-in profiling capabilities.
### Command Queues
To interact with devices, there are two types of queues: `HWComputeQueue` and `HWCopyQueue`. Commands defined in the base `HWCommandQueue` class should be supported by both queues. These are the timestamp and synchronization methods, such as [signal](#tinygrad.device.HWCommandQueue.signal) and [wait](#tinygrad.device.HWCommandQueue.wait).
For example, the following Python code enqueues a wait, execute, and signal command on the HCQ-compatible device:
```python
HWComputeQueue().wait(signal_to_wait, value_to_wait) \
.exec(program, kernargs_ptr, global_dims, local_dims) \
.signal(signal_to_fire, value_to_fire) \
.submit(your_device)
```
Each runtime should implement the required functions that are defined in the `HWCommandQueue`, `HWComputeQueue`, and `HWCopyQueue` classes.
::: tinygrad.device.HWCommandQueue
options:
members: [
"signal",
"wait",
"timestamp",
"update_signal",
"update_wait",
"submit",
]
show_source: false
::: tinygrad.device.HWComputeQueue
options:
members: [
"memory_barrier",
"exec",
"update_exec",
]
show_source: false
::: tinygrad.device.HWCopyQueue
options:
members: [
"copy",
"update_copy",
]
show_source: false
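As a rough illustration of the chaining style used by the queues above, here is a standalone sketch. It does not use tinygrad: `SketchQueue` and `device_log` are hypothetical names, and commands are recorded as plain tuples instead of hardware packets.

```python
# Standalone sketch: none of these names come from tinygrad.
class SketchQueue:
  """Toy stand-in for a backend queue: commands are recorded as tuples."""
  def __init__(self): self.q = []

  def wait(self, signal, value):
    self.q.append(("wait", signal, value))
    return self  # returning self is what makes chaining possible

  def exec(self, prg, kernargs, global_size, local_size):
    self.q.append(("exec", prg, kernargs, global_size, local_size))
    return self

  def signal(self, signal, value):
    self.q.append(("signal", signal, value))
    return self

  def submit(self, device):
    device.extend(self.q)  # a real backend would hand its packets to the hardware here
    return self

device_log = []  # hypothetical "device": a list that receives the submitted commands
SketchQueue().wait("sig_a", 1) \
             .exec("kernel", 0x1000, (4, 1, 1), (32, 1, 1)) \
             .signal("sig_b", 2) \
             .submit(device_log)
print([cmd[0] for cmd in device_log])
```

Nothing reaches the device until `submit` is called, which is why the whole chain can be built first and sent in one step.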
#### Implementing custom commands
To implement custom commands in the queue, use the `@hcq_command` decorator for your command implementations.
::: tinygrad.device.hcq_command
options:
show_source: false
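The bookkeeping such a decorator performs can be sketched as follows. This is a simplified, standalone approximation: `sketch_command`, `SketchQueue`, and the packet contents are hypothetical, and the real `hcq_command` may differ in detail.

```python
import functools

def sketch_command(func):
  """Hypothetical decorator modelled on the bookkeeping described for hcq_command."""
  @functools.wraps(func)
  def wrapper(self, *args, **kwargs):
    self.cmds_offset.append(len(self.q))                       # where this command starts in the queue
    func(self, *args, **kwargs)                                # the command appends its words to self.q
    self.cmds_len.append(len(self.q) - self.cmds_offset[-1])   # how many words it added
    self.cmds_meta.append(func.__name__)                       # lets update methods check the command type
    return self                                                # allow chaining
  return wrapper

class SketchQueue:
  def __init__(self): self.q, self.cmds_offset, self.cmds_len, self.cmds_meta = [], [], [], []

  @sketch_command
  def signal(self, signal, value): self.q += [0x1, signal, value]  # made-up 3-word packet

  @sketch_command
  def nop(self): self.q += [0x0]                                   # made-up 1-word packet

q = SketchQueue().nop().signal(7, 42)
print(q.cmds_offset, q.cmds_len, q.cmds_meta)
```

Recording the offset and length of each command is what later makes it possible to address a command by index and rewrite only its words.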
### HCQ Compatible Device
The `HCQCompatCompiled` class defines the API for HCQ-compatible devices. This class serves as an abstract base class that device-specific implementations should inherit from and implement.
::: tinygrad.device.HCQCompatCompiled
options:
members: [
"_alloc_signal",
"_free_signal",
"_read_signal",
"_read_timestamp",
"_set_signal",
"_wait_signal",
"_gpu2cpu_time",
]
show_source: false
#### Signals
Signals are device-dependent structures used for synchronization and timing in HCQ-compatible devices. They should be designed to record both a `value` and a `timestamp` within the same signal. The following Python code demonstrates the usage of signals:
```python
signal = your_device._alloc_signal()
HWComputeQueue().timestamp(signal) \
.signal(signal, value_to_fire) \
.submit(your_device)
your_device._wait_signal(signal, value_to_fire)
timestamp = your_device._read_timestamp(signal)
```
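To make the wait semantics concrete, here is a minimal host-side sketch of polling a signal until it reaches a value. `SketchSignal` and `wait_signal` are hypothetical names, and treating the timeout as milliseconds is an assumption. Real backends read device memory instead of a Python attribute.

```python
import time

class SketchSignal:
  """Hypothetical signal holding both a value and a timestamp."""
  def __init__(self, value=0): self.value, self.timestamp = value, 0

def wait_signal(signal, value, timeout_ms=10000):
  """Poll until signal.value >= value, or raise after timeout_ms milliseconds (assumed unit)."""
  start = time.monotonic()
  while signal.value < value:
    if (time.monotonic() - start) * 1000 > timeout_ms:
      raise RuntimeError(f"wait timed out: signal.value={signal.value}, expected >= {value}")
    time.sleep(0.001)

sig = SketchSignal()
sig.timestamp, sig.value = time.monotonic_ns(), 3   # pretend the device fired the signal
wait_signal(sig, 3)                                   # returns immediately because 3 >= 3
```

The comparison is `>=` rather than `==`, so a waiter is not stuck if the signal has already moved past the requested value.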
##### Synchronization signals
Each HCQ-compatible device must allocate two signals for global synchronization purposes. These signals are passed to the `HCQCompatCompiled` base class during initialization: an active timeline signal `self.timeline_signal` and a shadow timeline signal `self._shadow_timeline_signal`, which helps to handle signal value overflow issues. You can find more about synchronization in the [synchronization section](#synchronization).
### HCQ Compatible Allocator
The `HCQCompatAllocator` base class simplifies allocator logic by leveraging the [command queue](#command-queues) abstractions. This class efficiently handles copy and transfer operations, leaving only the alloc and free functions to be implemented by individual backends.
::: tinygrad.device.HCQCompatAllocator
options:
members: [
"_alloc",
"_free",
]
show_source: false
#### HCQ Allocator Result Protocol
Backends must adhere to the `HCQCompatAllocRes` protocol when returning allocation results.
::: tinygrad.device.HCQCompatAllocRes
options:
members: true
show_source: false
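Because the result type is a protocol, a backend object only needs to expose the `va_addr` and `size` fields. The sketch below shows that idea with standalone, hypothetical names (`AllocResSketch`, `MyBackendBuffer`, `describe`). It does not import tinygrad.

```python
from dataclasses import dataclass
from typing import Protocol

class AllocResSketch(Protocol):
  """Mirrors the two fields the protocol asks for."""
  va_addr: int
  size: int

@dataclass
class MyBackendBuffer:
  """Hypothetical backend allocation result; `handle` is an extra backend-specific field."""
  va_addr: int
  size: int
  handle: int = 0

def describe(res: AllocResSketch) -> str:
  # Only the protocol fields are used, so any conforming object works here.
  return f"{res.size} bytes at {res.va_addr:#x}"

print(describe(MyBackendBuffer(va_addr=0x7f0000000000, size=4096)))
```

`MyBackendBuffer` never inherits from the protocol class. A static type checker accepts it because the attribute names and types match.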
### Synchronization
HCQ-compatible devices use a global timeline signal for synchronizing all operations. This mechanism ensures proper ordering and completion of tasks across the device. By convention, `self.timeline_value` points to the next value to signal. So, to wait for all previous operations on the device to complete, wait for the value `self.timeline_value - 1`. The following Python code demonstrates the typical usage of signals to synchronize execution with other operations on the device:
```python
HWComputeQueue().wait(your_device.timeline_signal, your_device.timeline_value - 1) \
.exec(...) \
.signal(your_device.timeline_signal, your_device.timeline_value) \
.submit(your_device)
your_device.timeline_value += 1
# Optionally wait for execution
your_device._wait_signal(your_device.timeline_signal, your_device.timeline_value - 1)
```
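The timeline convention can be traced with a small simulation. This is a sketch under simplifying assumptions: `SketchDevice` is hypothetical, the signal is a one-element list, and commands run immediately instead of being queued to hardware.

```python
class SketchDevice:
  """Hypothetical device: the 'hardware' is simulated by running commands immediately."""
  def __init__(self):
    self.timeline_signal = [0]   # one-element list stands in for a device signal
    self.timeline_value = 1      # next value to signal, following the convention above
    self.executed = []

  def run(self, name):
    # wait: everything submitted before us must have signalled timeline_value - 1
    assert self.timeline_signal[0] >= self.timeline_value - 1
    self.executed.append(name)                       # exec
    self.timeline_signal[0] = self.timeline_value    # signal
    self.timeline_value += 1                         # advance to the next value to signal

  def synchronize(self):
    # all prior work is done once the signal has reached timeline_value - 1
    return self.timeline_signal[0] >= self.timeline_value - 1

dev = SketchDevice()
dev.run("kernel_a")
dev.run("kernel_b")
print(dev.timeline_signal[0], dev.timeline_value, dev.synchronize())
```

After two submissions the signal holds 2 and `timeline_value` is 3, so waiting for `timeline_value - 1` covers everything submitted so far.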
## HCQGraph
[HCQGraph](https://github.com/tinygrad/tinygrad/tree/master/tinygrad/runtime/graph/hcq.py) is a core feature that implements `GraphRunner` for HCQ-compatible devices. `HCQGraph` builds a static `HWComputeQueue` and `HWCopyQueue` for all operations per device. To optimize enqueue time, only the necessary parts of the queues are updated for each run using the update APIs of the queues, avoiding a complete rebuild.
Optionally, queues can implement a `bind` API, which allows further optimization by eliminating the need to copy the queues into the device ring.
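The "update instead of rebuild" idea can be sketched with a queue stored as 32-bit words. The class name `SketchStaticQueue` and the packet layout (opcode, address, value) are made up for illustration. Only the slice-patching approach mirrors the `_patch` helper in `HWCommandQueue`.

```python
import array

class SketchStaticQueue:
  """Hypothetical queue stored as 32-bit words, built once and patched between runs."""
  def __init__(self): self.q, self.cmds_offset = array.array('I'), []

  def signal(self, signal_addr, value):
    self.cmds_offset.append(len(self.q))
    self.q.extend([0xAA, signal_addr, value])   # made-up packet layout: opcode, address, value
    return self

  def _patch(self, cmd_idx, offset, data):
    st = self.cmds_offset[cmd_idx] + offset
    self.q[st:st+len(data)] = array.array('I', data)

  def update_signal(self, cmd_idx, value):
    self._patch(cmd_idx, 2, [value])            # only the value word is rewritten
    return self

q = SketchStaticQueue().signal(0x100, 1).signal(0x200, 1)
q.update_signal(1, 5)                           # next run: bump one value, keep everything else
print(list(q.q))
```

Rewriting a few words in place keeps the per-run cost proportional to what changed, not to the size of the whole graph.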

51
docs/runtime/overview.md Normal file

@@ -0,0 +1,51 @@
# Runtime Overview
## Overview
A typical runtime consists of the following parts:
- [Compiled](#compiled)
- [Allocator](#allocator)
- [Program](#program)
- [Compiler](#compiler)
### Compiled
The `Compiled` class is responsible for initializing and managing a device.
::: tinygrad.device.Compiled
options:
members: [
"synchronize"
]
show_source: false
### Allocator
The `Allocator` class is responsible for managing memory on the device. There is also a version called the `LRUAllocator`, which caches allocated buffers to optimize performance.
::: tinygrad.device.Allocator
options:
members: true
show_source: false
::: tinygrad.device.LRUAllocator
options:
members: true
show_source: false
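The caching behaviour can be sketched in a few lines. This is a standalone toy, not the tinygrad class: `SketchLRUAllocator`, `real_allocs`, and the use of `bytearray` as a buffer are assumptions made for illustration.

```python
from collections import defaultdict

class SketchLRUAllocator:
  """Hypothetical allocator that keeps freed buffers for reuse instead of releasing them."""
  def __init__(self):
    self.cache = defaultdict(list)   # (size, options) -> list of idle buffers
    self.real_allocs = 0             # counts how often the expensive path is taken

  def _alloc(self, size, options):   # stand-in for an expensive device allocation
    self.real_allocs += 1
    return bytearray(size)

  def alloc(self, size, options=None):
    if len(c := self.cache[(size, options)]): return c.pop()   # reuse a cached buffer
    return self._alloc(size, options)

  def free(self, buf, size, options=None):
    self.cache[(size, options)].append(buf)                    # keep it around for later

  def free_cache(self):
    self.cache.clear()   # a real backend would release device memory here

allocator = SketchLRUAllocator()
a = allocator.alloc(1024)
allocator.free(a, 1024)
b = allocator.alloc(1024)            # served from the cache, no new allocation
print(a is b, allocator.real_allocs)
```

Keying the cache by `(size, options)` means a buffer is only reused for a request with exactly the same size and options.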
### Program
The `Program` class is created for each loaded program. It is responsible for compiling and executing the program on the device. As an example, here is a `ClangProgram` implementation which loads a program and runs it.
::: tinygrad.runtime.ops_clang.ClangProgram
options:
members: true
### Compiler
The `Compiler` class compiles the output of the `Renderer` into a device-specific format.
::: tinygrad.device.Compiler
options:
members: true


@@ -18,7 +18,11 @@ nav:
- dtypes: dtypes.md
- nn (Neural Networks): nn.md
- Environment Variables: env_vars.md
- Developer: developer.md
- Developer:
- developer.md
- Runtime:
- runtime/overview.md
- HCQ: runtime/hcq.md
- tinybox: tinybox.md
#- tinygrad: reference/


@@ -140,6 +140,10 @@ class Allocator:
def copyout(self, dest:memoryview, src): raise NotImplementedError("need copyout")
class LRUAllocator(Allocator): # pylint: disable=abstract-method
"""
The LRU Allocator is responsible for caching buffers.
It ensures that buffers are not freed until it is absolutely necessary, optimizing performance.
"""
def __init__(self): self.cache: Dict[Tuple[int, Optional[BufferOptions]], Any] = defaultdict(list)
def alloc(self, size:int, options:Optional[BufferOptions]=None):
if len(c := self.cache[(size, options)]): return c.pop()
@@ -182,19 +186,25 @@ class Compiled:
def __init__(self, device:str, allocator:Allocator, renderer:Optional[Renderer], compiler:Optional[Compiler], runtime, graph=None):
self.dname, self.allocator, self.compiler, self.runtime, self.graph = device, allocator, compiler or Compiler(), runtime, graph
self.renderer = renderer or Renderer()
def synchronize(self): pass # override this in your device
def synchronize(self):
"""
Synchronize all pending operations on the device.
This method ensures that all previously queued operations on the device have been completed before proceeding.
"""
# override this in your device implementation
# **************** for HCQ Compatible Devices ****************
def hcq_command(func):
"""
Decorator for HWCommandQueue commands.
Decorator for HWCommandQueue commands. Enables command indexing and stores metadata for command updates.
Enables command indexing and stores metadata for command updates.
Usage:
@hcq_command
def command_method(self, ...): ...
For example:
```python
@hcq_command
def command_method(self, ...): ...
```
"""
def __wrapper(self, *args, **kwargs):
self.cmds_offset.append(len(self.q))
@@ -205,49 +215,121 @@ def hcq_command(func):
return __wrapper
class HWCommandQueue:
"""
A base class for hardware command queues in the HCQ (Hardware Command Queue) API.
Both compute and copy queues should have the following commands implemented.
"""
def __init__(self): self.q, self.binded_device, self.cmds_offset, self.cmds_len, self.cmds_meta = [], None, [], [], []
def __len__(self): return len(self.cmds_offset)
def _patch(self, cmd_idx, offset, data): self.q[(st:=self.cmds_offset[cmd_idx]+offset):st+len(data)] = array.array('I', data)
@hcq_command
def signal(self, signal, value): self._signal(signal, value)
def signal(self, signal, value):
"""
Enqueues a signal command which sets the signal to the given value, ensuring all previous operations are completed.
Args:
signal: The signal to set
value: The value to set the signal to
"""
self._signal(signal, value)
def _signal(self, signal, value): raise NotImplementedError("backend should overload this function")
@hcq_command
def wait(self, signal, value): self._wait(signal, value)
def wait(self, signal, value):
"""
Enqueues a wait command which halts execution until the signal is greater than or equal to a specific value.
Args:
signal: The signal to wait on
value: The value to wait for
"""
self._wait(signal, value)
def _wait(self, signal, value): raise NotImplementedError("backend should overload this function")
@hcq_command
def timestamp(self, signal): self._timestamp(signal)
def timestamp(self, signal):
"""
Enqueues a timestamp command which records the current time in a signal after all previously enqueued commands are completed.
Args:
signal: The signal to store the timestamp
"""
self._timestamp(signal)
def _timestamp(self, signal): raise NotImplementedError("backend should overload this function")
def update_signal(self, cmd_idx, signal=None, value=None):
"""
Updates a previously queued signal command.
Args:
cmd_idx: Index of the signal command to update
signal: New signal to set (if None, keeps the original)
value: New value to set (if None, keeps the original)
"""
if self.cmds_meta[cmd_idx] != "signal": raise RuntimeError("called update_signal not on a signal command")
self._update_signal(cmd_idx, signal, value)
return self
def _update_signal(self, cmd_idx, signal, value): raise NotImplementedError("backend should overload this function")
def update_wait(self, cmd_idx, signal=None, value=None):
"""
Updates a previously queued wait command.
Args:
cmd_idx: Index of the wait command to update
signal: New signal to wait on (if None, keeps the original)
value: New value to wait for (if None, keeps the original)
"""
if self.cmds_meta[cmd_idx] != "wait": raise RuntimeError("called update_wait not on a wait command")
self._update_wait(cmd_idx, signal, value)
return self
def _update_wait(self, cmd_idx, signal, value): raise NotImplementedError("backend should overload this function")
def submit(self, device:HCQCompatCompiled):
"""
Submits the command queue to a specific device for execution.
Args:
device: The device to submit the queue to
"""
self._submit(device)
return self
def _submit(self, device:HCQCompatCompiled): raise NotImplementedError("backend should overload this function")
class HWComputeQueue(HWCommandQueue):
@hcq_command
def memory_barrier(self): self._memory_barrier()
def memory_barrier(self):
"""
Enqueues a memory barrier command to ensure memory coherence between agents.
"""
self._memory_barrier()
def _memory_barrier(self): pass
@hcq_command
def exec(self, prg, kernargs, global_size, local_size): self._exec(prg, kernargs, global_size, local_size)
def exec(self, prg, kernargs, global_size, local_size):
"""
Enqueues an execution command for a kernel program.
Args:
prg: The program to execute
kernargs: The pointer to kernel arguments
global_size: The global work size
local_size: The local work size
"""
self._exec(prg, kernargs, global_size, local_size)
def _exec(self, prg, kernargs, global_size, local_size): raise NotImplementedError("backend should overload this function")
def update_exec(self, cmd_idx, global_size, local_size):
"""
Updates a previously queued execution command.
Args:
cmd_idx: Index of the execution command to update
global_size: New global work size
local_size: New local work size
"""
if self.cmds_meta[cmd_idx] != "exec": raise RuntimeError("called update_exec not on an exec command")
self._update_exec(cmd_idx, global_size, local_size)
return self
@@ -255,10 +337,27 @@ class HWComputeQueue(HWCommandQueue):
class HWCopyQueue(HWCommandQueue):
@hcq_command
def copy(self, dest, src, copy_size): self._copy(dest, src, copy_size)
def copy(self, dest, src, copy_size):
"""
Enqueues a copy command to transfer data.
Args:
dest: The destination of the copy
src: The source of the copy
copy_size: The size of data to copy
"""
self._copy(dest, src, copy_size)
def _copy(self, dest, src, copy_size): raise NotImplementedError("backend should overload this function")
def update_copy(self, cmd_idx, dest=None, src=None):
"""
Updates a previously queued copy command.
Args:
cmd_idx: Index of the copy command to update
dest: New destination of the copy (if None, keeps the original)
src: New source of the copy (if None, keeps the original)
"""
if self.cmds_meta[cmd_idx] != "copy": raise RuntimeError("called update_copy not on a copy command")
self._update_copy(cmd_idx, dest, src)
return self
@@ -284,36 +383,68 @@ class HCQCompatProgram:
def fill_kernargs(self, kernargs_ptr:int, bufs:Tuple[Any, ...], vals:Tuple[int, ...]=()): raise NotImplementedError("need fill_kernargs")
class HCQCompatCompiled(Compiled):
"""
A base class for devices compatible with the HCQ (Hardware Command Queue) API.
"""
def __init__(self, device:str, allocator:Allocator, renderer:Renderer, compiler:Compiler, runtime, comp_queue_t, copy_queue_t, timeline_signals):
self.hw_compute_queue_t, self.hw_copy_queue_t = comp_queue_t, copy_queue_t
self.timeline_value: int = 1
self.timeline_value:int = 1
self.timeline_signal, self._shadow_timeline_signal = timeline_signals
self.sig_prof_records: List[Tuple[Any, Any, str, bool]] = []
self.raw_prof_records: List[Tuple[int, int, str, bool]] = []
self.sig_prof_records:List[Tuple[Any, Any, str, bool]] = []
self.raw_prof_records:List[Tuple[int, int, str, bool]] = []
if PROFILE: self._prof_setup()
from tinygrad.runtime.graph.hcq import HCQGraph
super().__init__(device, allocator, renderer, compiler, runtime, HCQGraph)
@classmethod
def _read_signal(self, sig): raise NotImplementedError("need _read_signal") # reads a value for a signal
def _read_signal(cls, signal:Any) -> Any:
"""
Read a value for a signal.
"""
raise NotImplementedError("_read_signal needs to be implemented")
@classmethod
def _read_timestamp(self, sig): raise NotImplementedError("need _read_timestamp") # reads a timestamp for a signal
def _read_timestamp(cls, signal:Any) -> int:
"""
Read a timestamp for a signal.
"""
raise NotImplementedError("_read_timestamp needs to be implemented")
@classmethod
def _set_signal(self, sig, value): raise NotImplementedError("need _set_signal") # sets a value for a signal
def _set_signal(cls, signal:Any, value:Any) -> None:
"""
Set a value for a signal.
"""
raise NotImplementedError("_set_signal needs to be implemented")
@classmethod
def _alloc_signal(self, value=0, **kwargs): raise NotImplementedError("need _alloc_signal") # allocates a new signal
def _alloc_signal(cls, value:Any = 0, **kwargs) -> Any:
"""
Allocate a new signal.
"""
raise NotImplementedError("_alloc_signal needs to be implemented")
@classmethod
def _free_signal(self, sig): raise NotImplementedError("need _free_signal") # frees a signal
def _free_signal(cls, signal:Any) -> None:
"""
Free a signal.
"""
raise NotImplementedError("_free_signal needs to be implemented")
@classmethod
def _wait_signal(self, signal, value=0, timeout=10000): raise NotImplementedError("need _wait_signal") # waits for a signal value
def _wait_signal(cls, signal:Any, value:Any = 0, timeout:int = 10000) -> None:
"""
Wait for a signal to reach a specific value.
"""
raise NotImplementedError("_wait_signal needs to be implemented")
def _gpu2cpu_time(self, gpu_time, is_copy): raise NotImplementedError("need _gpu2cpu_time")
def _gpu2cpu_time(self, gpu_time:float, is_copy:bool) -> float:
"""
Convert GPU time to CPU time. The `is_copy` flag indicates whether the timestamp comes from a copy queue.
"""
raise NotImplementedError("_gpu2cpu_time needs to be implemented")
def _prof_setup(self):
if not hasattr(self, 'profile_logger'): atexit.register(self._prof_finalize)
@@ -347,6 +478,12 @@ class HCQCompatCompiled(Compiled):
class HCQCompatAllocRes(Protocol): va_addr: int; size: int # noqa: E702
class HCQCompatAllocator(LRUAllocator): # pylint: disable=abstract-method
"""
A base allocator class compatible with the HCQ (Hardware Command Queue) API.
This class implements basic copy operations following the HCQ API, utilizing both `HWComputeQueue` and `HWCopyQueue`.
"""
def __init__(self, device, batch_size=(2 << 20), batch_cnt=32):
self.device = device
self.b = [self._alloc(batch_size, BufferOptions(host=True)) for _ in range(batch_cnt)]


@@ -410,13 +410,13 @@ class AMDDevice(HCQCompatCompiled):
kio.free_memory_of_gpu(self.kfd, handle=mem.handle)
@classmethod
def _read_signal(self, sig): return sig.value
def _read_signal(self, signal): return signal.value
@classmethod
def _read_timestamp(self, sig): return sig.start_ts
def _read_timestamp(self, signal): return signal.start_ts
@classmethod
def _set_signal(self, sig, value): sig.value = value
def _set_signal(self, signal, value): signal.value = value
@classmethod
def _alloc_signal(self, value=0, **kwargs) -> hsa.amd_signal_t:
@@ -428,7 +428,7 @@ class AMDDevice(HCQCompatCompiled):
return ret
@classmethod
def _free_signal(self, sig): self.signals_pool.append(sig)
def _free_signal(self, signal): self.signals_pool.append(signal)
@classmethod
def _wait_signal(self, signal:hsa.amd_signal_t, value=0, timeout=10000):


@@ -552,13 +552,13 @@ class NVDevice(HCQCompatCompiled):
NVDevice.devices.append(self)
@classmethod
def _read_signal(self, sig): return sig[0]
def _read_signal(self, signal): return signal[0]
@classmethod
def _read_timestamp(self, sig): return sig[1]
def _read_timestamp(self, signal): return signal[1]
@classmethod
def _set_signal(self, sig, value): sig[0] = value
def _set_signal(self, signal, value): signal[0] = value
@classmethod
def _alloc_signal(self, value=0, **kwargs) -> memoryview:
@@ -566,7 +566,7 @@ class NVDevice(HCQCompatCompiled):
return sig
@classmethod
def _free_signal(self, sig): self.signals_pool.append(sig)
def _free_signal(self, signal): self.signals_pool.append(signal)
@classmethod
def _wait_signal(self, signal, value=0, timeout=10000):