* simple multitensor API
* test multitensor
* mt work
* new api
* copies
* all but data parallel
* allreduce there
* works, but axis sharded
* fix all mt tests
* features/multi
* work
* backprop
* fix tests
* tests passing
* mt progress
* cleanups
* fewer lines
* tensor cleanup
* save more lines
* mypy passes
* fix tests
* skip for cuda too
* bump download cache