* add a failing test for LR scheduler when using multigpu
* fix calculation order and unnecessary tensor created for float
* min_lr is no longer tensor
we previously only upcast uint and int, and half was using half for acc.
change to acc in float for precision. but cast the result back to half to match torch/jax output dtype
* updated most dtype hacks in onnx_ops
* temporarily revert dequantizelinear change
* I think this is right...
* MORE FIXES WOOOO NEW DTYPE IS AWESOME
* ok
* oops missed a print
* half -> float32 for CI
* is npdtype
* some more
* fix if ordering
* more clean ups
* final cleanups
* casting to half not allowed
* k nvm
* revert ArgMax change
* only GPU
* llvm begone
* teeny tiny change
* fix: attempt to add cast tests
* try this
* fix dequantizelinear
* revert some stuff
* tests pass pls
* less lines in onnx_tests
* oops missed string tensor tests
* clean up
* try: revert default behavior changes
* fix: disabled Cast and Castlike tests
* docs: small changes
* fix: fixed isNaN op and enabled associated tests
* fix: forgot about float16
* done
* update disabled test
* gah missed another float16
* disable rest of failing tests
* rm extra line
* try...
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* remove AndNode.__floordiv__
AndNode produces a Node that min/max is bounded by [0, 1] so `//` on top of that is almost always 0.
we don't really use that either
* keep the test
* simple multitensor API
* test multitensor
* mt work
* new api
* copies
* all but data parallel
* allreduce there
* works, but axis sharded
* fix all mt tests
* features/multi
* work
* backprop
* fix tests
* tests passing
* mt progress
* cleanups
* less lines
* tensor cleanup
* save more lines
* mypy passes
* fix tests
* skip for cuda too
* bump download cache