* add a failing test for LR scheduler when using multigpu
* fix calculation order and avoid an unnecessary tensor being created for a float
* min_lr is no longer a tensor
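A minimal sketch of the idea in the LR scheduler commits above (the class and attribute names are illustrative, not the actual extra/lr_scheduler code): keep `min_lr` as a plain Python float so no extra tensor has to be created or sharded across GPUs, and only write the final value back into the optimizer's existing `lr` tensor.

```python
# hypothetical sketch, not tinygrad's real scheduler
class ReduceLROnPlateauSketch:
  def __init__(self, optim, factor=0.1, min_lr=1e-6):
    self.optim, self.factor = optim, factor
    self.min_lr = min_lr  # stays a plain float, never wrapped in a Tensor

  def reduce_lr(self):
    # do the math in Python floats, then write once into the optimizer's lr tensor
    new_lr = max(self.optim.lr.numpy().item() * self.factor, self.min_lr)
    self.optim.lr.assign(self.optim.lr * 0 + new_lr)
```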
we previously only upcast uint and int, and half was accumulating in half.
change to accumulating in float for precision, but cast the result back to half to match the torch/jax output dtype
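A numpy illustration of the accumulation change described above (not the tinygrad kernel code): summing float16 inputs with a float16 accumulator loses precision badly, while accumulating in float32 and casting the result back to float16 keeps the torch/jax output dtype with a far more accurate value.

```python
import numpy as np

x = np.full(100_000, 0.1, dtype=np.float16)

acc_half = np.float16(0)
for v in x: acc_half += v            # half accumulator: stalls once 0.1 falls below half an ulp

acc_float = np.float16(x.sum(dtype=np.float32))  # float accumulator, result cast back to half

print(acc_half, acc_float)           # acc_half is far below the true sum; acc_float is ~10000
```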
* updated most dtype hacks in onnx_ops
* temporarily revert dequantizelinear change
* I think this is right...
* MORE FIXES WOOOO NEW DTYPE IS AWESOME
* ok
* oops missed a print
* half -> float32 for CI
* is npdtype
* some more
* fix if ordering
* more clean ups
* final cleanups
* casting to half not allowed
* k nvm
* revert ArgMax change
* only GPU
* llvm begone
* teeny tiny change
* fix: attempt to add cast tests
* try this
* fix dequantizelinear
* revert some stuff
* tests pass pls
* less lines in onnx_tests
* oops missed string tensor tests
* clean up
* try: revert default behavior changes
* fix: disabled Cast and CastLike tests
* docs: small changes
* fix: fixed IsNaN op and enabled associated tests
* fix: forgot about float16
* done
* update disabled test
* gah missed another float16
* disable rest of failing tests
* rm extra line
* try...
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* remove AndNode.__floordiv__
AndNode produces a Node whose min/max is bounded by [0, 1], so `//` on top of that is almost always 0.
we don't really use it either
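A plain-Python sketch of that reasoning (not the symbolic Node classes themselves): a value bounded by [0, 1] floordiv'd by any divisor of 2 or more is always 0, so `__floordiv__` on an AndNode adds nothing.

```python
# the only values an AndNode can take are 0 and 1
for and_value in (0, 1):
  for divisor in range(2, 10):
    assert and_value // divisor == 0  # always 0 for divisors >= 2
```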
* keep the test
* simple multitensor API
* test multitensor
* mt work
* new api
* copies
* all but data parallel
* allreduce there
* works, but axis sharded
* fix all mt tests
* features/multi
* work
* backprop
* fix tests
* tests passing
* mt progress
* cleanups
* less lines
* tensor cleanup
* save more lines
* mypy passes
* fix tests
* skip for cuda too
* bump download cache
* add Tensor.split (#2677)
* fix mypy errors
* add list support for Tensor.split
* fix ruff comments
* match tensor.split api
* simplify split and test_split
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
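A usage sketch of the `Tensor.split` added in the commits above, assuming it matches torch's `tensor.split` semantics as those commits state: an int gives equal-sized chunks along the chosen dimension, and a list gives chunks of exactly those sizes.

```python
from tinygrad.tensor import Tensor

t = Tensor.arange(10)
a, b = t.split(5)                 # two chunks of 5
x, y, z = t.split([3, 3, 4])      # chunks of sizes 3, 3, 4
print(a.shape, b.shape)           # (5,) (5,)
print(x.shape, y.shape, z.shape)  # (3,) (3,) (4,)
```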
the correct condition is that PADTO cannot be applied to a reduce axis, not that Reduce.MAX is in ops.
even for Reduce.SUM it's possible that the reduce axis had a div before it, so the padded 0 becomes inf and summing over it is incorrect.
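A numpy illustration of that reasoning (not the PADTO optimization itself): padding a reduce axis with zeros is harmless for a plain sum, but if a div sits between the pad and the reduce, the padded zeros become inf and the sum is wrong.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])
padded = np.pad(x, (0, 1))                  # pad the reduce axis to length 4 with a 0

print(padded.sum(), x.sum())                # 7.0 7.0 -> padding alone is fine
with np.errstate(divide="ignore"):
  print((1 / padded).sum(), (1 / x).sum())  # inf vs 1.75 -> a div before the reduce breaks it
```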
* return bool
* add tests to the type spec
* fix multinomial
* fix tril
* fix round
* fix NegativeLogLikelihoodLoss
* rm debug
* webgpu
* more webgpu
* bitwise or for adding two bools
* onnx ops dont need to cast anymore
* Revert "bitwise or for adding two bools"
This reverts commit b413babffa.
* workaround for metal neg
* just the tests in the type spec
* test dtypes of return values of cumsum, argmax/min, multinomial
cumsum behaves like sum, and functions that return an index return dtypes.default_int
* because webgpu is different
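A hedged sketch of the dtype behaviour those tests describe, assuming the current `from tinygrad import dtypes` import path: cumsum promotes like sum, and ops that return indices or samples return `dtypes.default_int`.

```python
from tinygrad.tensor import Tensor
from tinygrad import dtypes

print(Tensor([True, False, True]).cumsum().dtype)                     # promoted like sum, not bool
print(Tensor([1.0, 3.0, 2.0]).argmax().dtype == dtypes.default_int)   # True
print(Tensor([0.1, 0.9]).multinomial(1).dtype == dtypes.default_int)  # True
```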
* ww/Fixed Tensor.randint() to accept shape tuples
* ww/Wrote a test to cover this typo
* ww/Updated Tensor random functions to optionally take a shape tuple or unpacked *shape to be more consistent
* ww/no lint no worries
* ww/Made peace with linter
* ww/Added a new line; can't reduce line length without reducing readability
* ww/reverted to using .mul
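A usage sketch of the consistency change in the ww/ commits above, assuming `randint`'s torch-like `low`/`high` keywords: the random constructors accept either unpacked dims or a single shape tuple.

```python
from tinygrad.tensor import Tensor

a = Tensor.randint(2, 3, low=0, high=10)    # unpacked dims
b = Tensor.randint((2, 3), low=0, high=10)  # shape tuple, same result shape
print(a.shape, b.shape)                     # (2, 3) (2, 3)
```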
* space removal in formula and a single test to cover it
* space in torch einsum as well
* replace spaces in the formula variable to support stripping all spaces
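A sketch of what the space-handling commits above enable: an einsum formula written with spaces behaves the same as its compact form (internally this amounts to a `formula.replace(" ", "")` before parsing).

```python
from tinygrad.tensor import Tensor

a, b = Tensor.ones(2, 3), Tensor.ones(3, 4)
spaced  = Tensor.einsum("i j , j k -> i k", a, b)  # spaces in the formula
compact = Tensor.einsum("ij,jk->ik", a, b)         # compact form, same result
print(spaced.shape, compact.shape)                 # (2, 4) (2, 4)
```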
* better support for platform dependent flags
* osx test support
* removed unused import and made line length <150
* changed osx ci shm
* lstrip in case SharedMemory._name is passed
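A generic Python illustration of that last fix (not tinygrad's disk/shm code): the private `SharedMemory._name` keeps a leading `/` on POSIX platforms, so a name taken from there is normalized with `lstrip("/")` before being reused as a key.

```python
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=16, name="tiny_demo")
try:
  raw = shm._name          # may be "/tiny_demo" depending on the platform
  key = raw.lstrip("/")    # "tiny_demo" either way
  print(raw, "->", key)
finally:
  shm.close(); shm.unlink()
```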