* WebGPU on Windows
* Fix dawn-python install
* New test
* pydeps
* Minor fix
* Only install dawn-python on windows webgpu
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* opt transforms the ast into an optimized ast
* fix get_kernel order and to_function_name
* function_name property
* update docs
* copy from kernel.py
* improve docs
* ci didn't trigger?
* add kernelize
* remove that
* kernelize returns self
* update abstractions2.py
* kernelize in test_schedule
* temp: assert BUFFER_VIEW's existence
* ASSIGN must have a buffer or subbuffer target
* assert and shrink
* fix
* padded setitem
* var
* toposort once
* extra
* base_buffer
* end with BUFFER_VIEW
* setitem for disk
* test_setitem_becomes_subbuffer
* mul slice test
* torch backend fix 1
* non-deterministic
* keep subbuffer
`AMDComputeQueue.__del__` frees `hw_page` which is safe because
`AMDAllocator._free` does `self.dev.synchronize()` which is supposed
to wait for execution of IB to finish, however that doesn't happen if
AMDComputeQueue is dropped right after submit before timeline signal is
incremented, which it is in most places leading to a race if .bind() is
also used (required for multi-xcc because bug in mec fw treats all
PACKET3_PRED_EXECs outside IBs as if they had EXEC_COUNT of zero).
* why does max_unpool2d feel slower than out.gradient ...
* slightly cleaner
* what happened to ruff
* need to think about this some more
* slightly faster now?
* clean up, 1 more failing edge case
* ok good
* working TINY_BACKEND
* nit doc wording
* retry CI
* poc
* repeated values fail, sigh
* is this being timed out?
* fix up down names
* bitonic v2, does this run?
* bitonic v3, faster
* bitonic v3.1, faster
* bitonic v3.1.1, same speed unlucky
* support dim and indices
* bitonic v3.2, simpler code, TODO repeated indices
* bruv gimme green for once cmon
* cat (stack) implementation, slow but maybe one day when cat is fast meow
* revert to v3.2
* bitonic v4, who let the cats out edition
* clean up variable names
* figured out repeated indices :D
* ruff check --fix
* use sort for topk
* add Tensor.sort everywhere
* fix docs and add some types
* slightly better variable names
* am I doing torch inplace correctly?
* delegate sort to values_stable
* add a contig, faster first sort
* maybe don't test_inplace
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* terrible but somewhat working impl
* linux behaves differently than macos?
* slightly better impl
* small clean up; haven't figured this out yet
* better
* torch has different behavior on linux and macos for duplicated values
* add sum docs
* fix test
* add torch return_type test
* add an exception test
* wrap_fxn instead, and move op lower in order
* better repeated values test
* rerun ci