* Make logcumsumexp numerically stable
* Refactor
* Refactor for special case ndim=0
* Refactor
* Use the correct device for mask
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
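The standard way to make this numerically stable is the log-sum-exp trick: keep a running maximum and factor it out, so `exp()` never sees a large positive argument. Below is a minimal pure-Python sketch of that idea for a 1-D list. It assumes finite inputs and illustrates the technique only. It is not tinygrad's tensor implementation, which works along an axis.

```python
import math

def logcumsumexp(xs: list[float]) -> list[float]:
    """Running log(sum(exp(x))) over a 1-D list, using a running max for stability."""
    out: list[float] = []
    m = -math.inf  # running max of the inputs seen so far
    s = 0.0        # running sum of exp(x_i - m)
    for x in xs:
        if x > m:
            # new max: rescale the existing sum to it, then add exp(x - x) = 1
            s = s * math.exp(m - x) + 1.0
            m = x
        else:
            s += math.exp(x - m)
        out.append(m + math.log(s))
    return out
```

For `[1000.0, 1000.0]`, a naive `math.log(math.exp(1000.0) + ...)` raises `OverflowError`. The sketch instead returns `1000.0` and `1000 + log 2`.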
* pytorch scatter -> scatter_reduce
* WIP scatter_reduce implementation
* _pre_scatter return type hint
* split out src, mask to satisfy linter
* Add src cast back in
* dict of lambdas instead of ifs
* sum and prod reduction ops with include_self
* add reduce arg error message
* add amax and amin reduction ops
* Fix include_self for higher dims
* Simplify
* Simplify amax and amin too
* Pull include_self logic out into _inv_mask function
* reduce arg cannot be None for scatter_reduce
* Fix self-mask issue
* Add mean reduce op
* Add tests
* any() not needed here
* remove comment
* End support for Tensor src with reduce arg in tinygrad scatter
* Process index, dim inside actual functions
* Add scatter_reduce to onnx
* Add excluded onnx ScatterElements reduction tests back in
* Save 2 lines on the mask helpers
* Update docs
* Add include_self=False tests
* cleanup
* Remove unneeded helper function
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
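The reduction semantics these bullets describe (sum/prod/amax/amin/mean, `include_self`, and a mandatory `reduce` argument) can be pinned down with a small 1-D reference. The sketch below follows PyTorch's documented `scatter_reduce` behavior for float inputs. `scatter_reduce_1d` and `_REDUCERS` are hypothetical names, not tinygrad code, and the reductions are kept in a dict of lambdas to mirror the bullet above.

```python
import math

# Hypothetical reduction table (dict of lambdas rather than a chain of ifs)
_REDUCERS = {
    "sum": sum,
    "prod": math.prod,
    "amax": max,
    "amin": min,
    "mean": lambda vals: sum(vals) / len(vals),
}

def scatter_reduce_1d(self_vals, index, src, reduce, include_self=True):
    """Reduce src[j] into out[index[j]]; positions never indexed keep self_vals."""
    if reduce not in _REDUCERS:
        raise ValueError(f"reduce must be one of {sorted(_REDUCERS)}, got {reduce!r}")
    groups: dict[int, list[float]] = {}
    for i, s in zip(index, src):
        groups.setdefault(i, []).append(s)
    out = list(self_vals)
    for i, vals in groups.items():
        if include_self:
            # the original value takes part in the reduction (and in mean's count)
            vals = [self_vals[i]] + vals
        out[i] = _REDUCERS[reduce](vals)
    return out
```

Consider `self=[1., 2., 3.]`, `index=[0, 0, 2]`, `src=[5., 6., 7.]` with `"sum"`. With `include_self=True` this gives `[12., 2., 10.]`. With `include_self=False` it gives `[11., 2., 7.]`, because scattered positions are reduced over `src` only.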
* Switch to dawn, all tests passing locally
* Use dawn-python
* Skip failing test
* Skip midcast and fix timestamp on Metal CI
* Autogen webgpu
* Try fetch dawn lib again
* /usr/lib
* Without lib prefix
* Test autogen diff
* Delete webgpu support, move everything to ops_webgpu
* mypy fix
* Simplify, refactor
* Line savings
* No ResultContainer
* Type annotation for result
* Some more simplifications
* Why was this explicit sync used at all?
* Refactor: delete functions that are only used once
* Create shader module inline
* Clear unit tests cache, maybe that solves it
* That wasn't it
* Try deleting cache to pass failing weight compare
* weights_only=False for pytorch 2.6
* Simplify ctype array creation
* Remove nanosecond precision timestamps
* Simplify error handling
* Refactor, add back type annotations
* Deleted custom submit function, refactor
* read_buffer simplify
* Fix use after free, refactor
* Simplify supported_features
* Runtime docs
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* more conditions for shift rewrite mul/idiv
* make ptx test uint so the new condition is true
* delete idiv test
* rewrite to 0 is wrong for idiv, as denominator is cast to 0 before division
* mul/div by 2**(large count) is unsupported anyway
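On signed integers, rewriting `x // 2**k` as `x >> k` is not generally valid. C-style division truncates toward zero, but an arithmetic right shift rounds toward negative infinity. The sketch below shows why the rewrite needs some guard such as an unsigned dtype or a non-negative lower bound. The exact conditions tinygrad checks are not spelled out here, and both helper names are hypothetical. Python's `>>` on ints behaves as an arithmetic shift.

```python
def c_idiv(a: int, b: int) -> int:
    """C-style integer division: truncates toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b > 0) else -q

def shift_div_matches(x: int, k: int) -> bool:
    """Does x >> k agree with C-style x / 2**k? Guaranteed only for x >= 0."""
    return c_idiv(x, 1 << k) == (x >> k)
```

For example, `c_idiv(-7, 4)` is `-1` but `-7 >> 2` is `-2`. For every non-negative `x` the two agree.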
* implemented in tensor
* apply onnx tests to asymmetrical pads
* better onnx op ordering
* correct ceil_mode asymmetrical
* fix onnx_ops comments
* a few more TODOs and fix some stupidity
* fix some typing
* fix test
* mypy still a little messed up
* refactor out pad struct transformation
* add simple docs for now
* add whatever tests possible
* add tests for _resolve_pool_pads
* better err msg
* whoops didn't mean to include this
* retry CI
* enable asymmetric pads onnx tests
* better docs
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
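The ONNX spec lays out `pads` as all begin values followed by all end values: `[x1_begin, x2_begin, ..., x1_end, x2_end, ...]`. Supporting asymmetric pads therefore starts by pairing each axis's begin and end amounts. The sketch below does that pairing. The helpers are hypothetical and are not the actual `_resolve_pool_pads`, whose signature isn't shown here.

```python
def onnx_pads_to_pairs(pads: list[int]) -> list[tuple[int, int]]:
    """Split ONNX pads [b1, b2, ..., e1, e2, ...] into per-axis (begin, end) pairs."""
    if len(pads) % 2:
        raise ValueError(f"pads must have an even length, got {len(pads)}")
    n = len(pads) // 2
    return [(pads[i], pads[i + n]) for i in range(n)]

def is_symmetric(pairs: list[tuple[int, int]]) -> bool:
    """True when every axis pads the same amount on both sides."""
    return all(b == e for b, e in pairs)
```

For a 2-D pool, `pads=[1, 2, 3, 4]` means `(1, 3)` on the first spatial axis and `(2, 4)` on the second, which is asymmetric.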
* _padding2d -> _resolve_pool_pads
* rephrase err msg
* even better error msg
* check asymmetric first so people don't hit the error twice
* test against torch
it's a Python-style mod. Possibly could be cleaner with a floor div.
relaxed the vmin for MOD slightly for C-style mod of negative numbers; it's more correct and might fix other bugs
* implemented
* this implementation is now correct
* this is fine I guess
* better variable names
* finally correct gathernd
* add a note
* eh just leave it at this for now
* teeny adjustment
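ONNX GatherND with `batch_dims=0` treats the innermost dimension of `indices` as one index tuple of length `k` into `data`. Each tuple selects a sub-array of rank `r - k`. The sketch below is a reference on nested Python lists, written for illustration; it is not the tinygrad implementation and does not handle `batch_dims > 0`.

```python
def gather_nd(data, indices):
    """ONNX GatherND (batch_dims=0) on nested lists."""
    if indices and isinstance(indices[0], list):
        # not yet at an index tuple: map over this indices dimension
        return [gather_nd(data, idx) for idx in indices]
    out = data
    for i in indices:
        out = out[i]  # negative indices count from the end, as ONNX allows
    return out
```

Take `data=[[0, 1], [2, 3]]`. `indices=[[0, 0], [1, 1]]` picks scalars and gives `[0, 3]`, while `indices=[[1], [0]]` picks whole rows and gives `[[2, 3], [0, 1]]`.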
* start work on new gradient
* more correct
* working tests
* more tests
* work
* add (failing) gradient test
* add view and reduce gradient
* test_add works, many failing test_ops
* add max and reduce max
* add max and reduce max
* 129 failing
* 108 failed
* better view drawing
* 101 failed
* I got 99 failures
* 94 failures
* it's tons of terrible code, but only 50 tests fail
* only 19 failures
* same 19 but shorter
* minimal doesn't matter
* shorter
* lil simpler
* simpler
* simpler
* simpler
* 13 test failures
* nine tests fail
* all ops tests pass
* add contiguous gradient + fix sched tests
* faster by removing toposort calls
* missed one
* add jax to testing
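The core of any reverse-mode gradient is the same: sort the graph topologically once, then walk it in reverse, accumulating each node's gradient into its parents. The sketch below is a generic scalar version, including a `max` rule, and is not tinygrad's gradient code. `Node`, `add`, `mul`, `maximum`, and `backward` are hypothetical names. Sending the whole gradient to the first input on ties is an assumption of this sketch. Computing the topological order a single time per backward pass mirrors the "removing toposort calls" speedup.

```python
from dataclasses import dataclass

@dataclass(eq=False)  # identity-based hashing so nodes can live in a set
class Node:
    value: float
    parents: tuple = ()  # (parent_node, local_derivative) pairs
    grad: float = 0.0

def add(a: Node, b: Node) -> Node:
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a: Node, b: Node) -> Node:
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def maximum(a: Node, b: Node) -> Node:
    # the gradient flows only to the larger input (ties go to a in this sketch)
    a_wins = a.value >= b.value
    return Node(max(a.value, b.value), ((a, 1.0 if a_wins else 0.0), (b, 0.0 if a_wins else 1.0)))

def backward(root: Node) -> None:
    # one topological sort of the whole graph, computed once up front
    order: list[Node] = []
    seen: set = set()
    def visit(n: Node) -> None:
        if n in seen:
            return
        seen.add(n)
        for p, _ in n.parents:
            visit(p)
        order.append(n)
    visit(root)
    root.grad = 1.0
    for n in reversed(order):  # every node is finished before its parents
        for p, d in n.parents:
            p.grad += d * n.grad
```

For `z = x*y + x` with `x=3`, `y=4`, `backward(z)` leaves `x.grad == 5.0` and `y.grad == 3.0`.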
* hacky fix for cast
* only float to uint8
* limit to float -> uint8
* touchup alu cast test
* improve tests and support more float to unsigned casts
* del one repeated test
* del 1 more repeated test
* try removing expected failure test
* hmmm try 1 more
* skip tests for flakiness
* uint64 super flaky
* clean up
* grammar
* just match numpy
* why is CI numpy different from local numpy
* increase verbosity
* try
* try2
* try3
* try4
* yeah idk
* new direction
* try again
* just don't support uint32 and uint64
* done?
* oops
* comment
* documentation
* it is what it is
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* wip pool
* check CI for remove alternative implementation
* Revert "check CI for remove alternative implementation"
This reverts commit 7b1bb900e5.
* fix test
* tests tests tests
* slap a resolve on it
* fix comment
* a little simpler pool
* check CI for removal again
* Revert "check CI for removal again"
This reverts commit be798b7857.
* small
* update
* some ez tests
* english
* clean up code
* fix ruff
* how did I +25 lines?
* small clean ups
* moar clean ups
* try test_avgpool2d_failure2 in CI
* final clean up
* exclude bug fix
* avg underscore pool
* no more edge case stuff
* add better comments for explanation
* add test cases for decreasing end padding
* address feedback
* improve test coverage
* a tiny bit more polish while we wait for lines :D
* more readable code ordering
* add to documentation
* oops
* set to False instead
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
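When the pool is tested against torch, the output length must follow PyTorch's pooling rule. That rule applies floor division normally and ceil division with `ceil_mode`, but drops a last window that would start beyond the input plus the begin padding. The sketch below states the rule with separate begin/end padding. `pool_out_len` is a hypothetical helper and does not reproduce tinygrad's pool code, and it assumes the padded input is at least one dilated kernel long.

```python
def pool_out_len(L: int, k: int, s: int, pad_begin: int, pad_end: int,
                 dilation: int = 1, ceil_mode: bool = False) -> int:
    """Output length of a 1-D pool under PyTorch's rule (assumes span >= 0)."""
    span = L + pad_begin + pad_end - dilation * (k - 1) - 1
    if not ceil_mode:
        return span // s + 1
    out = -(-span // s) + 1  # ceil division
    # a window starting past the input (i.e. only in end padding) is dropped
    if (out - 1) * s >= L + pad_begin:
        out -= 1
    return out
```

With `L=5, k=2, s=2` and no padding, floor mode gives 2 windows and `ceil_mode` gives 3. With `L=5, k=3, s=3` and padding `(1, 1)`, `ceil_mode` would produce a third window starting at offset 6. That offset is past the input, so the window is dropped and the result is 2.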
* split into another branch
* polish
* try this
* Revert "try this"
This reverts commit 84f711b13e.
* try
* Revert "try"
This reverts commit 89c7a7649b.
* idk anymore
* it is what it is
---------
Co-authored-by: chenyu <chenyu@fastmail.com>