* poc
* repeated values fail, sigh
* is this being timed out?
* fix up/down direction names
* bitonic v2, does this run?
* bitonic v3, faster
* bitonic v3.1, faster
* bitonic v3.1.1, same speed, unlucky
* support dim and indices
* bitonic v3.2, simpler code, TODO repeated indices
* bruv gimme green for once cmon
* cat (stack) implementation, slow but maybe one day when cat is fast meow
* revert to v3.2
* bitonic v4, who let the cats out edition
* clean up variable names
* figured out repeated indices :D
* ruff check --fix
* use sort for topk
* add Tensor.sort everywhere
* fix docs and add some types
* slightly better variable names
* am I doing torch inplace correctly?
* delegate sort to values_stable
* add a contiguous() call, faster first sort
* maybe don't test_inplace
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
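
The commits above iterate on a bitonic sorting network for `Tensor.sort`/`topk`. A minimal pure-Python sketch of the compare-and-swap pattern it relies on (illustrative only; the real implementation is written with tensor ops, and bitonic networks assume a power-of-two length):

```python
def bitonic_sort(xs: list) -> list:
  xs, n = list(xs), len(xs)
  assert n & (n - 1) == 0, "bitonic networks need a power-of-two length"
  k = 2
  while k <= n:              # size of the bitonic runs being merged
    j = k // 2
    while j > 0:             # compare-and-swap distance inside each merge
      for i in range(n):
        partner = i ^ j
        # (i & k) == 0 selects ascending blocks, otherwise descending
        if partner > i and (xs[i] > xs[partner]) == ((i & k) == 0):
          xs[i], xs[partner] = xs[partner], xs[i]
      j //= 2
    k *= 2
  return xs

print(bitonic_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```
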
* terrible but somewhat working impl
* Linux behaves differently than macOS?
* slightly better impl
* small clean up; haven't figured this out yet
* better
* torch has different behavior on Linux and macOS for duplicated values
* add sum docs
* fix test
* add torch return_type test
* add an exception test
* wrap_fxn instead, and move op lower in order
* better repeated values test
* rerun ci
* add `Tensor.isclose()`
* support `equal_nan` to match PyTorch's behavior
* update unit tests
* remove some tests temporarily
* re-enable one test
* re-enable other test
* try to fix failing tests during CI
* save one line of code
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
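
For `Tensor.isclose` with `equal_nan`, the targeted semantics (mirroring PyTorch's `torch.isclose`, default tolerances assumed) reduce to a simple scalar rule:

```python
import math

# Scalar sketch: close means |a - b| <= atol + rtol * |b|;
# equal_nan=True additionally makes NaN compare close to NaN.
def isclose(a, b, rtol=1e-5, atol=1e-8, equal_nan=False):
  if math.isnan(a) or math.isnan(b):
    return equal_nan and math.isnan(a) and math.isnan(b)
  return abs(a - b) <= atol + rtol * abs(b)

assert isclose(1.0, 1.0 + 1e-9)
assert not isclose(float("nan"), float("nan"))
assert isclose(float("nan"), float("nan"), equal_nan=True)
```
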
* pytorch scatter -> scatter_reduce
* WIP scatter_reduce implementation
* _pre_scatter return type hint
* split out src, mask to satisfy linter
* Add src cast back in
* dict of lambdas instead of ifs
* sum and prod reduction ops with include_self
* add reduce arg error message
* add amax and amin reduction ops
* Fix include_self for higher dims
* Simplify
* Simplify amax and amin too
* Pull include_self logic out into _inv_mask function
* reduce arg cannot be None for scatter_reduce
* Fix self-mask issue
* Add mean reduce op
* Add tests
* any() not needed here
* remove comment
* End support for Tensor src with reduce arg in tinygrad scatter
* Process index, dim inside actual functions
* Add scatter_reduce to onnx
* Add excluded onnx ScatterElements reduction tests back in
* Save 2 lines on the mask helpers
* Update docs
* Add include_self=False tests
* cleanup
* Remove unneeded helper function
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
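
A usage sketch of the `scatter_reduce` these commits build up, assuming it mirrors PyTorch's `scatter_reduce` semantics (the expected outputs below follow that assumption):

```python
from tinygrad import Tensor

# Scatter src into t along dim 0 at positions from index, combining with
# the named reduction; include_self=False drops t's original value from
# the reduction at scattered positions.
t     = Tensor([1.0, 2.0, 3.0, 4.0])
src   = Tensor([10.0, 20.0, 30.0])
index = Tensor([0, 0, 2])
print(t.scatter_reduce(0, index, src, reduce="sum").tolist())
# [31.0, 2.0, 33.0, 4.0]   (1+10+20, 2, 3+30, 4)
print(t.scatter_reduce(0, index, src, reduce="amax", include_self=False).tolist())
# [20.0, 2.0, 30.0, 4.0]   (max(10,20), untouched, 30, untouched)
```
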
* add some docs about speed [pr]
* better torch gemm
* enable locals on llvm/clang
* disable locals for beam speed on LLVM/CLANG
* 0x20 alignment in llvm allows ymm use
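
On the 0x20 alignment commit: AVX `ymm` registers are 32 bytes wide, so 32-byte-aligned buffers let LLVM emit full-width vector loads and stores. An illustrative Python sketch of rounding a base address up to that boundary (not tinygrad's actual allocator):

```python
import ctypes

ALIGN = 0x20
raw = ctypes.create_string_buffer(1024 + ALIGN)  # over-allocate; keep a reference alive
addr = (ctypes.addressof(raw) + ALIGN - 1) & ~(ALIGN - 1)  # round up to 32 bytes
assert addr % ALIGN == 0
```
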
* Switch to dawn, all tests passing locally
* Use dawn-python
* Skip failing test
* Skip midcast and fix timestamp on Metal CI
* Autogen webgpu
* Try fetch dawn lib again
* /usr/lib
* Without lib prefix
* Test autogen diff
* Delete webgpu support, move everything to ops_webgpu
* mypy fix
* Simplify, refactor
* Line savings
* No ResultContainer
* Type annotation for result
* Some more simplifications
* Why was this explicit sync used at all?
* Refactor: delete functions that are only used once
* Create shader module inline
* Clear unit tests cache, maybe that solves it
* That wasn't it
* Try deleting cache to pass failing weight compare
* weights_only=False for pytorch 2.6
* Simplify ctype array creation
* Remove nanosecond precision timestamps
* Simplify error handling
* Refactor, add back type annotations
* Deleted custom submit function, refactor
* read_buffer simplify
* Fix use after free, refactor
* Simplify supported_features
* Runtime docs
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
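
A hedged usage sketch for the Dawn-based runtime above: with `dawn-python` providing the WebGPU native library, selecting the backend is assumed to work like other tinygrad devices, via an environment variable:

```python
# Assumed invocation:  WEBGPU=1 python3 script.py
from tinygrad import Tensor

print((Tensor([1.0, 2.0]) + Tensor([3.0, 4.0])).tolist())  # [4.0, 6.0]
```
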
* LLVM JIT
* Autogen LLVM
* Update autogen
* Move things around
* even more non-determinism
* windows
* more autogen weirdness
* more windows stuff
* blind windows development try 2
* more blind windows development
* even more blind windows development
* maybe I should just set up a Windows VM...
* why can't everyone just use the SysV ABI?
* cleanup debugging stuff
* unused import
* icache flushing isn't required on x86
* merge jit_nt and jit_unix
* more
* Temporary hack to not segfault
* better error
* bad conflict resolution
* Attempt to simplify support/llvm.py
* More refactoring
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
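
A sketch of the last step of an LLVM JIT like the one above, and of the SysV ABI complaint: the JIT hands back a raw function address, which Python wraps into a callable via ctypes. On x86-64, unix targets use the SysV calling convention while Windows uses its own, so the emitted code and the caller have to agree. `jit_addr` here is hypothetical; in the real runtime it would come from the autogenerated LLVM bindings:

```python
import ctypes

CFUNC_T = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int, ctypes.c_int)  # int f(int, int)

def wrap_jit_function(jit_addr: int):
  # call the JITed code as wrap_jit_function(addr)(2, 3)
  return CFUNC_T(jit_addr)
```
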
it's a Python-style mod; possibly cleaner with a floor div.
relaxed the vmin for MOD slightly for the C-style negative mod; it's more correct and might fix other bugs
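
Concretely: Python's `%` is floored (the result takes the divisor's sign), while C-style `%` truncates (the result takes the dividend's sign), so the two disagree exactly when the operand signs differ, and a C-style mod with a negative dividend can go negative, hence the relaxed vmin:

```python
def c_mod(a: int, b: int) -> int:
  return a - int(a / b) * b  # truncated division, like C's a % b

for a, b in [(7, 3), (-7, 3), (7, -3)]:
  print(a, b, a % b, c_mod(a, b))
# 7 3 1 1
# -7 3 2 -1
# 7 -3 -2 1
```
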
* remove uop mutability [pr]
* test fixups
* most tests pass
* more tests pass
* lil test fixups
* them too
* fix test
* unneeded
* err, that
* fix test_hcq
* fix test failures
* fix that test
* tensor universe
* does this pass test
* Revert "does this pass test"
This reverts commit ed516b3169.
* Revert "tensor universe"
This reverts commit c21301852a.
* proper spidering for uops
* cleanups
* all tensors
* all tensors
* slow but correct
* fast
* no WeakSet
* faster
* no need for list
* revert that
* working I think
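
A sketch of the "proper spidering" idea from the commits above: with uops immutable, the graph can be walked iteratively from the roots, collecting each node exactly once without recursion or a `WeakSet`. The `Node` type here is illustrative, not tinygrad's `UOp`:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
  name: str
  src: tuple = ()   # parents of this node

def spider(roots):
  order, visited, stack = [], set(), list(roots)
  while stack:
    u = stack.pop()
    if id(u) in visited: continue
    visited.add(id(u))
    order.append(u)
    stack.extend(u.src)
  return order

a = Node("a"); b = Node("b", (a,)); c = Node("c", (a, b))
print([n.name for n in spider([c])])  # each reachable node exactly once
```
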
* where are my onnx scatter tests??
* forward_only for now
* try if the NaN hack fixes NV
* looks like the issue is different... CUDA, why?
* oops that was wrong. Try if this fixes CUDA
* simpler multiply
* actually finish this up tomorrow morning :x
* fix tests?
* improve tests
* improve test and implementation
* fix ruff
* complete but lots of expected failure...
* reviewed tests
* add onnx tests
* is this a processing op?
* add return type to indicate that it's not in-place
* final cleanups
* use `or` and improve tests a little
* add masked_index_select
* call it masked_setitem instead
* try
* FIXED
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
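
A hedged sketch of what the `masked_setitem` helper above amounts to (semantics assumed): writing a value into a tensor only where a boolean mask is set, which is expressible as a `where` over the mask:

```python
from tinygrad import Tensor

# keep t where the mask is False, take the new value where it is True
t    = Tensor([1.0, 2.0, 3.0, 4.0])
mask = Tensor([True, False, True, False])
print(mask.where(Tensor(0.0), t).tolist())  # [0.0, 2.0, 0.0, 4.0]
```
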
* implement inverse trig functions
* guess we should still test nans?
* magnitude as variable name :D
* reorder onnx_ops ops
* approximation -> x for consistency
* address feedback
* simpler acos
* improvement?
* actually just have asin depend on atan
* actually this is nicer
* remove a comment
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
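
The identities the commits converge on, checked here against `math`: `asin` expressed through `atan` (which the backend already has), then `acos` through `asin`:

```python
import math

def asin(x: float) -> float:
  return math.atan(x / math.sqrt(1.0 - x * x))  # valid for |x| < 1

def acos(x: float) -> float:
  return math.pi / 2 - asin(x)

assert math.isclose(asin(0.5), math.asin(0.5))
assert math.isclose(acos(0.5), math.acos(0.5))
```
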
* initial implementation and test
* some other places that can use meshgrid
* revert the onnx_ops change
* add to docs
* revert interpolate too
* update
* improve edge case test
* might as well test grad
* add to tests and improve docs
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
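
A usage sketch of the new `Tensor.meshgrid`, assuming PyTorch-style semantics (the method-call shape and the `indexing` keyword are assumptions here):

```python
from tinygrad import Tensor

# 1-D inputs expand to coordinate grids over their cartesian product
x, y = Tensor([1, 2, 3]), Tensor([4, 5])
gx, gy = x.meshgrid(y, indexing="ij")
print(gx.tolist())  # [[1, 1], [2, 2], [3, 3]]
print(gy.tolist())  # [[4, 5], [4, 5], [4, 5]]
```
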