* different way to write torch backend
* both backends
* more work
* simpler code
* more work
* test both
* imply unwrap/wrap
* FORWARD_ONLY=1 TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_add works
* ready to start making test_ops work in torch backend
* backward pass, TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_add works
* FORWARD_ONLY=1 TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_simple_conv2d works
* matmul backward is broken with as_strided
* pytorch scatter -> scatter_reduce
* WIP scatter_reduce implementation
* _pre_scatter return type hint
* split out src, mask to satisfy linter
* Add src cast back in
* dict of lambdas instead of ifs
* sum and prod reduction ops with include_self
* add reduce arg error message
* add amax and amin reduction ops
* Fix include_self for higher dims
* Simplify
* Simplify amax and amin too
* Pull include_self logic out into _inv_mask function
* reduce arg cannot be None for scatter_reduce
* Fix self-mask issue
* Add mean reduce op
* Add tests
* any() not needed here
* remove comment
* End support for Tensor src with reduce arg in tinygrad scatter
* Process index, dim inside actual functions
* Add scatter_reduce to onnx
* Add excluded onnx ScatterElements reduction tests back in
* Save 2 lines on the mask helpers
* Update docs
* Add include_self=False tests
* cleanup
* Remove unneeded helper function
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
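Note on the scatter_reduce bullets above: a minimal 1-D sketch of the `reduce` / `include_self` behaviour, in pure Python. It follows my understanding of the semantics documented for `torch.Tensor.scatter_reduce`; it is not the tinygrad implementation. `REDUCERS` and `scatter_reduce_1d` are hypothetical names used only for illustration.

```python
from math import prod

# Hypothetical table of reducers, in the spirit of "dict of lambdas instead of ifs".
REDUCERS = {
    "sum": sum,
    "prod": prod,
    "amax": max,
    "amin": min,
    "mean": lambda xs: sum(xs) / len(xs),
}

def scatter_reduce_1d(dest, index, src, reduce, include_self=True):
    """Reduce src[j] into position index[j] of a copy of dest (1-D only)."""
    if reduce not in REDUCERS:
        raise ValueError(f"reduce must be one of {sorted(REDUCERS)}, got {reduce!r}")
    if len(index) > len(src):
        raise ValueError("index must not be longer than src")
    # Each output slot starts with its own value only when include_self is set.
    buckets = [[v] if include_self else [] for v in dest]
    for i, s in zip(index, src):
        buckets[i].append(s)
    # Slots that received nothing keep their original value.
    return [REDUCERS[reduce](b) if b else d for b, d in zip(buckets, dest)]

dest = [1.0, 2.0, 3.0]
index = [0, 0, 2]
src = [10.0, 20.0, 30.0]
with_self = scatter_reduce_1d(dest, index, src, "sum")
without_self = scatter_reduce_1d(dest, index, src, "sum", include_self=False)
```

With `include_self=False` the original value at a scattered-to position is dropped from the reduction, while untouched positions (index 1 here) are left unchanged.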
For reasons I don't understand, running "Llama 3.2 1B" fails with the error below unless this change is applied. I don't know the performance impact of this change. I still can't get exo running, but this change lets me get further before hitting a separate issue (possibly with VRAM allocation; a story for another day).
error:
```
Failed to fetch completions: Error processing prompt (see logs with DEBUG>=2): Nvrtc Error 6, NVRTC_ERROR_COMPILATION <null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
*((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
^
<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
*((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
^
<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
*((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
^
<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
*((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
^
4 errors detected in the compilation of "<null>".
```
* add some docs about speed [pr]
* better torch gemm
* enable locals on llvm/clang
* disable locals for beam speed on LLVM/CLANG
* 0x20 alignment in llvm allows ymm use
* Switch to dawn, all tests passing locally
* Use dawn-python
* Skip failing test
* Skip midcast and fix timestamp on metal ci
* Autogen webgpu
* Try fetch dawn lib again
* /usr/lib
* Without lib prefix
* Test autogen diff
* Delete webgpu support, move everything to ops_webgpu
* mypy fix
* Simplify, refactor
* Line savings
* No ResultContainer
* Type annotation for result
* Some more simplifications
* Why was this explicit sync used at all?
* Refactor: delete functions that are only used once
* Create shader module inline
* Clear unit tests cache, maybe that solves it
* That wasn't it
* Try deleting cache to pass failing weight compare
* weights_only=False for pytorch 2.6
* Simplify ctype array creation
* Remove nanosecond precision timestamps
* Simplify error handling
* Refactor, add back type annotations
* Deleted custom submit function, refactor
* read_buffer simplify
* Fix use after free, refactor
* Simplify supported_features
* Runtime docs
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* start
* log severity
* only change this
* change abstraction so it's more usable for huggingface
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* dsp simulator
* progress
* fix
* close on test tiny
* working
* less waste
* line savings
* Device DSP compiler
* mock DSP at the bottom
* DSP tests
* docker caching
* test update
* need load
* skip that test for CI DSP
* last touch
* ugh
* new lines that exist in codegen/ops
* update tests
* update sops.gz (13071 -> 13070 asts)
* fix viz too
* remove that TODO
* diff pruning
* mask assert + device
* work
* diff pruning
* re: fix viz too
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* rename Opt amt to arg
* ignore_beam_cache for test_tiny
* move ignore_beam_cache to test_tiny
* move to separate pr
* revert space change
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* this
* clean up
* more clean ups and improve debug msg
* more correct training toggler
* remove manual training toggling
* change some variable names
* actually just add the training toggle for LIMIT envvar too
* more refinement
* __call__ and OnnxRunner
* fix half pylint, other half is importing from onnx while this file is onnx.py, figure out later
* ahhhh found another mistake
* remove limit from __call__
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* is 67% considered fixed?
* move test up
* share function
* add qgemm too
* make sure qgemm comes out as int
* actually that note is not right
* remove qgemm (I did it wrong) and add it later lol.
* QLinearEverything
* ok ort verify passes
* this should be int instead
* cast to int then char to do wraparound
* cleaner
* move contrib ops to microsoft ops
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
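Note on the "cast to int then char to do wraparound" bullet above: a small sketch of what 8-bit wraparound means, written with plain Python integers. The helper names are hypothetical and this only illustrates the arithmetic of a two's-complement narrowing cast; it is not the code from the commit.

```python
def wrap_to_int8(x: int) -> int:
    """Keep the low 8 bits of x and reinterpret them as a signed value."""
    return ((x + 128) % 256) - 128

def wrap_to_uint8(x: int) -> int:
    """Keep the low 8 bits of x as an unsigned value."""
    return x % 256

def saturate_to_int8(x: int) -> int:
    """Clamp to the int8 range instead of wrapping, shown for contrast."""
    return max(-128, min(127, x))

wrapped = [wrap_to_int8(v) for v in (127, 128, 200, -129)]
clamped = [saturate_to_int8(v) for v in (127, 128, 200, -129)]
```

Values already inside the int8 range are unchanged by either function; the two only differ once the input overflows.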