* start work on target
* add test
* update actions to use DEV
* update docs
* update readmes
* tests need that too
* update example
* update tests (comments)
* fix that test
* ruff
* mypy
* oops
* remove getenvs
* don't add Target yet
* and the test
* lint
* and docs
* more stuff
* assert
* few more fixes
* test assert
* bump to dagre 2.0.0
* transform to call
* cleanup names
* get kernel graph
* dagre recursion fix + better error
* add toggle to hide sink nodes
* no sink by default
* revert that
* only hide final sinks
* lol
* remove ExecItem and merge it with ScheduleItem
* less diff
* fix issues
* min diff
* don't change bufs in _lower
* min diff
* update
* revert
* fixes
* diff
* opt transforms the ast into an optimized ast
* fix get_kernel order and to_function_name
* function_name property
* update docs
* copy from kernel.py
* improve docs
* ci didn't trigger?
`AMDComputeQueue.__del__` frees `hw_page`. This is meant to be safe
because `AMDAllocator._free` calls `self.dev.synchronize()`, which is
supposed to wait for the IB to finish executing. That wait does not
happen if the AMDComputeQueue is dropped right after submit, before the
timeline signal is incremented, which is the case in most places. The
result is a race when .bind() is also used (required for multi-xcc
because a bug in the mec fw treats all PACKET3_PRED_EXECs outside IBs
as if they had an EXEC_COUNT of zero).
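
The ordering problem described above can be sketched with a toy model. This
is not tinygrad code: `FakeDevice`, its methods, and the two helper functions
are hypothetical names made up for illustration, and the only assumption is
that `synchronize()` waits for work the timeline signal already covers.

```python
class FakeDevice:
    """Hypothetical stand-in for a device with a timeline signal."""
    def __init__(self):
        self.untracked = set()  # submitted, timeline signal not yet incremented
        self.tracked = set()    # submitted and covered by the timeline signal
        self.hazards = []       # pages freed while the GPU could still read them

    def submit(self, page):
        # The IB referencing `page` is now in flight.
        self.untracked.add(page)

    def increment_timeline(self):
        # From here on, synchronize() knows about the submitted work.
        self.tracked |= self.untracked
        self.untracked.clear()

    def synchronize(self):
        # Waits only for work the timeline already covers.
        self.tracked.clear()

    def free(self, page):
        self.synchronize()
        if page in self.untracked:
            # synchronize() could not wait for this submission.
            self.hazards.append(page)


def drop_before_increment(dev):
    dev.submit("hw_page")
    dev.free("hw_page")        # queue dropped right after submit
    dev.increment_timeline()


def drop_after_increment(dev):
    dev.submit("hw_page")
    dev.increment_timeline()
    dev.free("hw_page")        # synchronize() now covers the submission


racy, safe = FakeDevice(), FakeDevice()
drop_before_increment(racy)
drop_after_increment(safe)
```

In this model only the first ordering records a hazard, which mirrors the
case where the queue is dropped before the timeline signal is incremented.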
* add some docs about speed [pr]
* better torch gemm
* enable locals on llvm/clang
* disable locals for beam speed on LLVM/CLANG
* 0x20 alignment in llvm allows ymm use
* start uop docs
* only need show_labels
* sink comes first
* hotfix: invalid
* touchups
* 2 space indent works
* limit some buffer uops
* better BARRIER doc, Op -> UOp when it makes sense.
* make KernelInfo optional
* more work
relative links don't work
* this can be local in multi reduce+pads
* add UOps.SHAPETRACKER details
* UOps.CONST both types
* nit: local buffer isn't device Buffer, habit
* nit2: dtype -> DType