* add DynamicDequantizeLinear and corresponding tests
* wow, qlinearops round away from zero (rounding sketch after this list)
* this passes locally...
* again
* try
* try separate test
* round to even again
* also add QLinearMul (dequantize, op, requantize; sketch after this list)
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
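The commits above flip between two rounding modes: round half away from zero vs round half to even (numpy's default). A minimal numpy sketch of the difference; `requantize_away_from_zero` is a hypothetical helper, not tinygrad's code:

```python
import numpy as np

# Hypothetical helper: requantize to uint8 rounding half away from zero,
# in contrast to numpy's default round-half-to-even.
def requantize_away_from_zero(x, scale, zero_point):
  q = x / scale + zero_point
  q = np.sign(q) * np.floor(np.abs(q) + 0.5)  # round half away from zero
  return np.clip(q, 0, 255).astype(np.uint8)

x = np.array([0.5, 1.5, 2.5, 3.5])
print(np.round(x))                           # [0. 2. 2. 4.]  half to even
print(requantize_away_from_zero(x, 1.0, 0))  # [1 2 3 4]      half away from zero
```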
* boom
* fix webgpu
* use exact variable names in tests so that AI can read them more easily
* add a tag for a specific test name, e.g. to test a specific dtype
* fix ruff
* astype everything
* dtype in array creation
* just arange
* is 67% considered fixed?
* move test up
* small cleanups
* share function
* add qgemm as well
* add qgemm too
* make sure qgemm comes out as int (int32 accumulation; sketch after this list)
* take out qgemm for now
* fixed test
* add correct qgemm
* addressing feedback here too, early naive fix for now
* simplify bias and c to be minimalistic enough to test correctness
* refactored qlinearops
* maybe these asserts aren't the best...
* fix test
* updated tests to cover new ops
* try to add to CI
* move test_onnx_ops into testextra/
* more attention tests
* qlinear_add atol=1
* attention is still not fully correct
* it is what it is
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
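QLinearAdd/QLinearMul follow a dequantize, apply the float op, requantize pattern. A minimal numpy sketch assuming uint8 tensors and scalar scales/zero points; names are illustrative, not tinygrad's implementation:

```python
import numpy as np

# Sketch of the dequantize -> op -> requantize pattern behind QLinearAdd/QLinearMul.
def qlinear_binop(op, a, a_scale, a_zp, b, b_scale, b_zp, c_scale, c_zp):
  fa = (a.astype(np.float32) - a_zp) * a_scale  # dequantize inputs
  fb = (b.astype(np.float32) - b_zp) * b_scale
  fc = op(fa, fb)                               # run the op in float
  q = np.round(fc / c_scale) + c_zp             # requantize (rounding mode matters, see above)
  return np.clip(q, 0, 255).astype(np.uint8)

a = np.array([10, 20], dtype=np.uint8)
b = np.array([30, 40], dtype=np.uint8)
print(qlinear_binop(np.add, a, 0.1, 0, b, 0.1, 0, 0.2, 0))       # [20 30]
print(qlinear_binop(np.multiply, a, 0.1, 0, b, 0.1, 0, 0.2, 0))  # [15 40]
```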
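And "make sure qgemm comes out as int" because the quantized matmul accumulates (a - a_zp) @ (b - b_zp) in int32 before any output scaling. A sketch under the same assumptions:

```python
import numpy as np

# Sketch: qgemm's core accumulation happens in int32, so the raw output is int.
def qgemm_int(a, a_zp, b, b_zp):
  return (a.astype(np.int32) - a_zp) @ (b.astype(np.int32) - b_zp)

a = np.array([[2, 4]], dtype=np.uint8)
b = np.array([[1], [3]], dtype=np.uint8)
y = qgemm_int(a, 0, b, 0)
print(y, y.dtype)  # [[14]] int32
```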
* pytorch scatter -> scatter_reduce
* WIP scatter_reduce implementation
* _pre_scatter return type hint
* split out src, mask to satisfy linter
* Add src cast back in
* dict of lambdas instead of ifs (see the sketch after this list)
* sum and prod reduction ops with include_self
* add reduce arg error message
* add amax and amin reduction ops
* Fix include_self for higher dims
* Simplify
* Simplify amax and amin too
* Pull include_self logic out into _inv_mask function
* reduce arg cannot be None for scatter_reduce
* Fix self-mask issue
* Add mean reduce op
* Add tests
* any() not needed here
* remove comment
* End support for Tensor src with reduce arg in tinygrad scatter
* Process index, dim inside actual functions
* Add scatter_reduce to onnx
* Add excluded onnx ScatterElements reduction tests back in
* Save 2 lines on the mask helpers
* Update docs
* Add include_self=False tests
* cleanup
* Remove unneeded helper function
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
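A plain-Python sketch of the semantics these commits implement: the reduce modes as a dict of lambdas instead of ifs, and include_self deciding whether self's value seeds the reduction (mean, which also needs a count, is omitted). 1-D only and illustrative, not the tinygrad code:

```python
import numpy as np

REDUCES = {"sum": lambda a, b: a + b, "prod": lambda a, b: a * b,
           "amax": np.maximum, "amin": np.minimum}

def scatter_reduce(self_, index, src, reduce, include_self=True):
  fn, out = REDUCES[reduce], self_.copy()
  touched = np.zeros(self_.shape, dtype=bool)
  for i, idx in enumerate(index):
    # with include_self=False, the first scatter to a slot replaces self's value
    out[idx] = src[i] if not include_self and not touched[idx] else fn(out[idx], src[i])
    touched[idx] = True
  return out

t = np.array([1., 1., 1.])
print(scatter_reduce(t, [0, 0, 2], [5., 6., 7.], "sum"))                      # [12.  1.  8.]
print(scatter_reduce(t, [0, 0, 2], [5., 6., 7.], "sum", include_self=False))  # [11.  1.  7.]
```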
* benchmark kernel launch
* don't realize unneeded
* faster
* faster metal
* fix mypy
* new objc message style [pr]
* without sync
* no div 0
* lru cache that (functools.lru_cache; sketch after this list)
* no sync in the profile
* fix
* update all to new style
* remove comment
* graph one kernel
* fix graph one kernel
* remove that sync
* dsp simulator
* progress
* fix
* close on test_tiny
* working
* less waste
* line savings
* Device DSP compiler
* mock DSP at the bottom
* DSP tests
* docker caching
* test update
* need load
* skip that test for CI DSP
* last touch
* ugh
* rename Opt amt to arg
* ignore_beam_cache for test_tiny
* move ignore_beam_cache to test_tiny
* move to separate pr
* revert space change
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
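"lru cache that": memoize a conversion that used to run on every kernel launch. A generic functools.lru_cache sketch; the cached function is hypothetical, not the actual call site:

```python
import functools

@functools.lru_cache(maxsize=None)
def _encoded_name(name: str) -> bytes:
  return name.encode()  # stands in for any repeated per-launch conversion

for _ in range(1000):
  _encoded_name("E_4_4n2")  # computed once, then served from the cache
```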
* this
* clean up
* more clean ups and improve debug msg
* more correct training toggler (save/restore sketch after this list)
* remove manual training toggling
* change some variable names
* actually just add the training toggle for LIMIT envvar too
* more refinement
* __call__ and OnnxRunner
* fix half of the pylint errors; the other half comes from importing from onnx while this file is onnx.py, figure out later
* ahhhh found another mistake
* remove limit from __call__
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
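"more correct training toggler": save and restore Tensor.training around the run instead of flipping it manually, so the caller's state is not clobbered. A sketch assuming tinygrad's Tensor.training class flag; the context manager is illustrative, not OnnxRunner's actual code:

```python
from tinygrad import Tensor

class training_off:
  def __enter__(self): self.prev, Tensor.training = Tensor.training, False
  def __exit__(self, *exc): Tensor.training = self.prev

with training_off():
  ...  # run the model; ops that branch on Tensor.training see False
```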
* start
* progress
* fixes
* something
* mini fixes
* fix2
* ugh, need this for now
* faster
* cleanups
* tiny linters
* make mypy happier
* test & free pts
* oops
* linter
* cleanup vm
* fix
* remove map_from
* tiny fixes
* add test to ci
* actually that note is not right
* remove qgemm (I did it wrong) and add it later lol.