diff --git a/docs/developer/kernelize.md b/docs/developer/kernelize.md
deleted file mode 100644
index b38db222b4..0000000000
--- a/docs/developer/kernelize.md
+++ /dev/null
@@ -1,109 +0,0 @@
-# Kernel Creation
-
-Tinygrad lazily builds up a graph of Tensor operations. The Tensor graph includes a mix of:
-
-- Buffer and Assignment Ops: `BUFFER`, `BUFFER_VIEW`, `COPY`, `ASSIGN`
-- Movement Ops: `RESHAPE`, `EXPAND`, `PERMUTE`, `PAD`, `SHRINK`, `FLIP`
-- Compute Ops: `ADD`, `MUL`, `REDUCE_AXIS`, ...
-
-`Tensor.kernelize` creates the kernels and buffers needed to realize the output Tensor(s).
-
-## Kernelize flow
-
-Let's see how a multiply-add Tensor graph becomes a fused elementwise kernel.
-
-```py
-# initialize 3 input buffers on the device
-a = Tensor([1]).realize()
-b = Tensor([2]).realize()
-c = Tensor([3]).realize()
-
-# create the Tensor graph
-mul = a*b
-out = mul+c
-
-print(mul) # <Tensor <UOp METAL (1,) int (<Ops.MUL: ...>, None)> on METAL with grad None>
-print(out) # <Tensor <UOp METAL (1,) int (<Ops.ADD: ...>, None)> on METAL with grad None>
-
-out.kernelize()
-
-print(mul) # <Tensor <UOp METAL (1,) int (<Ops.MUL: ...>, None)> on METAL with grad None>
-print(out) # <Tensor <UOp METAL (1,) int (<Ops.ASSIGN: ...>, None)> on METAL with grad None>
-```
-
-The multiply Tensor stays the same because it is fused. The output Tensor's UOp becomes a new ASSIGN UOp:
-
-```py
-print(out.uop)
-```
-
-The first source is the output BUFFER:
-
-```
-UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
-  UOp(Ops.DEVICE, dtypes.void, arg='METAL', src=()),
-  UOp(Ops.UNIQUE, dtypes.void, arg=6, src=()),))
-```
-
-And the second source is the KERNEL and its 4 buffer edges (output_buffer, a, b, c):
-
-```
-UOp(Ops.KERNEL, dtypes.void, arg=<Kernel ... SINK(<Ops.STORE: ...>,) (__add__, __mul__)>, src=(
-  UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
-    x1:=UOp(Ops.DEVICE, dtypes.void, arg='METAL', src=()),
-    UOp(Ops.UNIQUE, dtypes.void, arg=6, src=()),)),
-  UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
-    x1,
-    UOp(Ops.UNIQUE, dtypes.void, arg=1, src=()),)),
-  UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
-    x1,
-    UOp(Ops.UNIQUE, dtypes.void, arg=3, src=()),)),
-  UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
-    x1,
-    UOp(Ops.UNIQUE, dtypes.void, arg=5, src=()),)),))
-```
-
-KERNEL describes the compute AST, its metadata, and its memory dependencies.
-
-BUFFER holds a reference to the device memory where the output will be stored.
-
-Once a Tensor is kernelized, all children will LOAD its BUFFER instead of fusing it:
-
-```py
-child = out+2
-child.kernelize()
-print(child.uop.src[1].arg.ast)
-```
-
-```
-UOp(Ops.SINK, dtypes.void, arg=None, src=(
-  UOp(Ops.STORE, dtypes.void, arg=None, src=(
-    UOp(Ops.DEFINE_GLOBAL, dtypes.int.ptr(1), arg=0, src=()),
-    x2:=UOp(Ops.VIEW, dtypes.void, arg=ShapeTracker(views=(View(shape=(1,), strides=(0,), offset=0, mask=None, contiguous=True),)), src=()),
-    UOp(Ops.ADD, dtypes.int, arg=None, src=(
-      UOp(Ops.LOAD, dtypes.int, arg=None, src=(
-        UOp(Ops.DEFINE_GLOBAL, dtypes.int.ptr(1), arg=1, src=()),
-        x2,)),
-      UOp(Ops.CONST, dtypes.int, arg=2, src=(
-        x2,)),)),)),))
-```
-
-`Tensor.realize` will execute the kernels and write the outputs to memory:
-
-```py
-Tensor.realize(out)
-print(out) # <Tensor <UOp METAL (1,) int (<Ops.BUFFER: ...>, <buf ...>)> on METAL with grad None>
-print(out.item()) # 5
-```
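-
-`kernelize` can also be called repeatedly as a graph grows, fixing the kernel boundaries one stage at a time. Here is a minimal sketch of that incremental flow (assuming `kernelize`, like `realize`, returns the Tensor itself so the calls chain):
-
-```py
-from tinygrad import Tensor
-
-x = Tensor([1, 2]).realize()
-
-# first call: the multiply is split off into its own kernel
-y = (x * 2).kernelize()
-
-# second call: only the new add is kernelized; it will LOAD y's buffer
-z = (y + 1).kernelize()
-
-# nothing has executed yet; realize runs both kernels
-z.realize()
-print(z.tolist()) # [3, 5]
-```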
-
-**Summary**
-
-- The large Tensor graph is built from a mix of data, compute and movement Ops.
-
-- `Tensor.kernelize` splits the Tensor graph into data (BUFFER) and compute (KERNEL), and links the dependencies with ASSIGN.
-
-- `Tensor.realize` executes KERNELs on device and replaces the Tensor graph with just a BUFFER.
-
-- Kernelize can be called multiple times on a Tensor, as in the sketch above. This allows for incrementally building the kernel fusion layout of a large Tensor graph without having to call `realize` or `schedule`.
diff --git a/mkdocs.yml b/mkdocs.yml
index 1dd77b6867..ed3ee04251 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -25,8 +25,6 @@ nav:
   - Layout: developer/layout.md
   - Speed: developer/speed.md
   - UOp: developer/uop.md
-  - Grouper:
-    - developer/kernelize.md
   - Runtime:
     - developer/runtime.md
     - HCQ: developer/hcq.md