* Solve dims too large errors on webgpu
* Simplify divisor find
* Test square root divisor
* Fix lint
* Refactor into group_dims and split_dims
* Refactor
* Fix lint
* Add back max check in _group_dims
* Prefer grouping over split
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* Make logcumsumexp numerically stable
* Refactor
* Refactor for special case ndim=0
* Refactor
* Use the correct device for mask
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* pytorch scatter -> scatter_reduce
* WIP scatter_reduce implementation
* _pre_scatter return type hint
* split out src, mask to satisfy linter
* Add src cast back in
* dict of lambdas instead of ifs
* sum and prod reduction ops with include_self
* add reduce arg error message
* add amax and amin reduction ops
* Fix include_self for higher dims
* Simplify
* Simplify amax and amin too
* Pull include_self logic out into _inv_mask function
* reduce arg cannot be None for scatter_reduce
* Fix self-mask issue
* Add mean reduce op
* Add tests
* any() not needed here
* remove comment
* End support for Tensor src with reduce arg in tinygrad scatter
* Process index, dim inside actual functions
* Add scatter_reduce to onnx
* Add excluded onnx ScatterElements reduction tests back in
* Save 2 lines on the mask helpers
* Update docs
* Add include_self=False tests
* cleanup
* Remove unneeded helper function
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* update rmsnorm to match torch implementation
* run all tests
* formatting
* formatting
* oneline
* default to 1e-6
* restore old test
* formatting
* don't save elementwise_affine
* your message
* ignore webgpu
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* float uop in sym_infer
* break line :(
* rerun mypy
* update GlobalCounters types
* revert type change and cast assignments to mem and ops
* cast inferred value to UOp in reshape
* cast hcq, update view reshape to handle inferred float
* rm extra space
* update error
* no type updates
* add patches
* add osx test in ci
* macos specific uvm, gpfifo mask
* only do that for now
* Revert "add patches"
This reverts commit 80d3112a57.
* use fork for now
* workflow only one worker
* merge osxtests with tests
* Revert "merge osxtests with tests"
This reverts commit 3461c8f46c.
* macos pagesize 16384
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* WebGPU f16 support
* Don't enable f16 yet
* dtype tests passing after bitcast fix
* Maybe all WebGPU green?
* Require shader-f16 in examples
* Minor wgsl touchup
* 1 line shorter
* Simpler
* Add transcendetal support
* log2 nan location mismatch on Vulkan
* Nan skips
* fix tensor realization bug in #8975
* that's a reshape now
* work
* works
* give those tests better names
* test when multiple mops result in the same ShapeTracker
* test_become_existing_buf_complex is enough
* that too
* add some docs about speed [pr]
* better torch gemm
* enable locals on llvm/clang
* disable locals for beam speed on LLVM/CLANG
* 0x20 alignment in llvm allows ymm use