* add Tensor.split (#2677)
* fix mypy errors
* add list support for Tensor.split
* fix ruff comments
* match tensor.split api
* simplify split and test_split
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
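A quick illustration of the API added above, mirroring `torch.Tensor.split` (the exact signature is from memory and may differ slightly):

```python
from tinygrad.tensor import Tensor

t = Tensor.arange(10)
a, b = t.split(5)              # int: equal chunks of size 5
x, y, z = t.split([2, 3, 5])   # list support: explicit chunk sizes
print(a.shape, b.shape)        # (5,) (5,)
```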
* remove type check for LazyOp.src now that it's always LazyOp
also matched the MULACC criteria between interpreted and compiled (that probably needs to be refactored somewhere else)
* disable that test
* print DEBUG for TC=2 in CI
* enable TC=2
* no need to check src type
* LOAD has side effect
* don't push any local buffer
* update comment
* and BARRIER
* cleanup llama apply_rotary_emb and other helpers
used ellipsis and other higher-level tensor functions.
disabled the half @ half -> half tensor core as it fails uop dtype checks
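A hedged sketch of the kind of cleanup described: ellipsis indexing lets a helper slice the last axis without spelling out every leading dim (shapes and names here are made up, not the actual llama helper):

```python
from tinygrad.tensor import Tensor

x = Tensor.ones(2, 8, 4, 64)            # (batch, seq, heads, head_dim), illustrative
even, odd = x[..., 0::2], x[..., 1::2]  # pairwise features, rank-agnostic
```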
* keep hip 8x8->8 wmma
the compiler error was due to `error: call to 'max' is ambiguous` when we have max(int, float) in a kernel.
it was first fixed in 4380ccb1, the non-fp32 math PR, and further solidified with the dtype refactor.
the correct condition is that PADTO cannot be applied to a reduce axis, not that Reduce.MAX is among the ops.
even for Reduce.SUM it's possible that the reduce axis had a div before it, so the padded 0 becomes inf and the sum over it is incorrect.
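A worked toy example of that failure mode, with numpy standing in for the padded kernel: PADTO fills the reduce axis with 0s, and a div before the reduce turns those 0s into inf, corrupting the sum.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
padded = np.array([1.0, 2.0, 3.0, 0.0])  # reduce axis padded to length 4
print((1.0 / x).sum())       # ~1.833, the correct result
print((1.0 / padded).sum())  # inf: 1/0 from the padded element poisons the sum
```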
* return bool
* add tests to the type spec
* fix multinomial
* fix tril
* fix round
* fix NegativeLogLikelihoodLoss
* rm debug
* webgpu
* more webgpu
* bitwise or for adding two bools
* onnx ops don't need to cast anymore
* Revert "bitwise or for adding two bools"
This reverts commit b413babffa.
* workaround for metal neg
* just the tests in the type spec
* test dtypes of return values of cumsum, argmax/min, multinomial
cumsum behaves like sum, and functions that return an index return dtypes.default_int
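A minimal sketch of the dtype contract being tested (cumsum/argmax are real tinygrad API; the dtypes import path varies by version, older ones use tinygrad.helpers):

```python
from tinygrad.tensor import Tensor
from tinygrad import dtypes  # older versions: from tinygrad.helpers import dtypes

t = Tensor([1.0, 2.0, 3.0])
assert t.cumsum(0).dtype == t.dtype            # cumsum keeps the input dtype, like sum
assert t.argmax().dtype == dtypes.default_int  # index-returning ops use default_int
```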
* because webgpu is different
* these asserts should pass
* fix that assert
* ALU dtypes
* acc dtype for group_for_reduce
* cast image ALUs to the base dtype
* remove all casts from linearizer
* fix argmax
* fix multinomial
* fix __getitem__
* Revert "fix __getitem__"
This reverts commit 62ad719bfa.
* fix MemBuffer outputs being wrong when there is an arange + ALU with a different dtype
e.g. fancy slicing (int, float), bert embeddings (int, long)
this should be fixed in lazy instead of having to break the kernel
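A hedged repro of the mixed-dtype pattern named above: fancy slicing gathers from a float tensor with int indices, so an int arange feeds a float ALU in the same kernel (illustrative shapes, not the failing AST itself):

```python
from tinygrad.tensor import Tensor
from tinygrad import dtypes  # import path varies by version

w = Tensor.randn(10, 4)                      # float data
idx = Tensor([1, 3, 5], dtype=dtypes.int32)  # int indices -> arange-based gather
print(w[idx].shape)                          # (3, 4): int/float mix in one kernel
```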
* cleanup argmax fix
* fix matmul in ints
cast at the end
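What the fix guarantees, as I read it (an illustrative assertion, not the diff itself): an int @ int matmul comes back int, with any cast applied at the end rather than on the inputs.

```python
from tinygrad.tensor import Tensor
from tinygrad import dtypes  # import path varies by version

a = Tensor([[1, 2], [3, 4]], dtype=dtypes.int32)
assert (a @ a).dtype == dtypes.int32  # stays int; cast happens at the end
```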
* fix llama
* skip wrong hardcoded ASTs in the worlds dataset
* fix llama p2
* cleanup missing parts of the diff
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* ww/Fixed Tensor.randint() to accept shape tuples
* ww/Wrote a test to cover this typo
* ww/Updated Tensor random functions to optionally take a shape tuple (,) or splatted *() args, to be more consistent (example below)
* ww/no lint no worries
* ww/Made peace with linter
* ww/Added a newline; can't reduce line length without reducing readability
* ww/reverted to using .mul
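The calling conventions these commits make consistent (illustrative call, grounded in the commit messages above):

```python
from tinygrad.tensor import Tensor

a = Tensor.randint(2, 3)    # splatted ints
b = Tensor.randint((2, 3))  # shape tuple, the case the typo broke
assert a.shape == b.shape == (2, 3)
```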
* space removal in formula and a single test to cover it
* space in torch einsum as well
* replacing spaces in the formula variable to support stripping all the spaces
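What the space handling enables (illustrative): formulas may contain spaces, as torch.einsum allows, and they are stripped before parsing.

```python
from tinygrad.tensor import Tensor

x, y = Tensor.ones(2, 3), Tensor.ones(3, 4)
out = Tensor.einsum("i j , j k -> i k", x, y)  # spaces in the formula are fine
assert out.shape == (2, 4)
```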
* better support for platform dependent flags
* osx test support
* removed unused import and made line length <150
* changed osx ci shm
* lstrip in case SharedMemory._name is passed
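A hedged sketch of that lstrip: on POSIX platforms, `multiprocessing.shared_memory` stores the name with a leading '/' in `SharedMemory._name`, so a passed-through name needs it stripped (the surrounding device code is assumed, not quoted):

```python
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=16)
clean = shm._name.lstrip("/")  # "psm_..." instead of "/psm_..." on macOS/Linux
shm.close(); shm.unlink()
```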