`AMDComputeQueue.__del__` frees `hw_page`. This is meant to be safe
because `AMDAllocator._free` calls `self.dev.synchronize()`, which is
supposed to wait for the IB to finish executing. However, that wait
does not happen if the AMDComputeQueue is dropped right after submit,
before the timeline signal is incremented, which is the case in most
places. This leads to a race if .bind() is also used (required for
multi-XCC, because a bug in the MEC firmware treats all
PACKET3_PRED_EXECs outside IBs as if they had an EXEC_COUNT of zero).
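
The ordering problem described above can be illustrated with a toy,
single-threaded model. This is only a sketch and not the driver code:
`FakeDevice`, its fields, and `run` are hypothetical names. It assumes
that a submission signals the current timeline value and that
`synchronize()` waits for the signal to reach the timeline value minus
one.

```python
class FakeDevice:
    """Hypothetical stand-in for a device with a timeline signal."""
    def __init__(self):
        self.timeline_value = 1      # value the next submission will signal
        self.signal = 0              # last value signalled by the fake GPU
        self.in_flight = []          # (signal_value, page) not yet executed
        self.freed = set()
        self.use_after_free = False

    def submit(self, page):
        # Queue work that reads `page` and signals the current timeline value.
        self.in_flight.append((self.timeline_value, page))

    def gpu_step(self):
        # Execute the oldest submission; note whether its page was already freed.
        value, page = self.in_flight.pop(0)
        if page in self.freed:
            self.use_after_free = True
        self.signal = value

    def synchronize(self):
        # Assumption: wait until everything before timeline_value has signalled.
        while self.signal < self.timeline_value - 1:
            self.gpu_step()

    def free(self, page):
        # Mirrors the idea of "synchronize, then free" from the text above.
        self.synchronize()
        self.freed.add(page)


def run(increment_before_drop):
    dev = FakeDevice()
    dev.submit("hw_page")
    if increment_before_drop:
        dev.timeline_value += 1      # timeline advanced first: wait covers the IB
        dev.free("hw_page")
    else:
        dev.free("hw_page")          # queue dropped first: wait returns at once
        dev.timeline_value += 1
    dev.synchronize()                # drain remaining work
    return dev.use_after_free


print("drop before increment:", run(False))
print("increment before drop:", run(True))
```

In this model the page is only freed while still in use when the drop
happens before the increment, which is the ordering the paragraph above
identifies as the source of the race.
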
* add some docs about speed [pr]
* better torch gemm
* enable locals on llvm/clang
* disable locals for beam speed on LLVM/CLANG
* 0x20 alignment in llvm allows ymm use
* start uop docs
* only need show_labels
* sink comes first
* hotfix: invalid
* touchups
* 2 space indent works
* limit some buffer uops
* better BARRIER doc, Op -> UOp when it makes sense.
* make KernelInfo optional
* more work
relative links don't work
* this can be local in multi reduce+pads
* add UOps.SHAPETRACKER details
* UOps.CONST both types
* nit: local buffer isn't device Buffer, habit
* nit2: dtype -> DType