* start uop emu
* tiny_add passes
* more ops
* emulate the whole warp
* test_gemm passes
* metal gemm test pass
* works on big gemm
* works on big gemm
* more tests pass
* touch ups
* fix mypy
* cleanups
* exp2 mypy
* arch is where it belongs
* actually emulate tensor cores
* fix test
* new style
run on TORCH since it's the fastest one on CI.
caught a bug in multinomial, and update the behavior of fancy index and gather to move the indices Tensor to same device as self.
left ones in conv2d and wino, no E501 elsewhere in tensor.
three functions need general readability improvement: getitem and gather, conv2d and wino, and pow
fix when correction is too big. it seems to only work when input size is 0 though.
torch can output -inf in var when correction is too big, which does not make sense.
* fix Tensor.mean to compute the mean correctly with 0-length axes are selected
* add a regression test
* rename sum variable to sum_t to avoid conflict with built it function
* refactor Tensor.mean to has less lines
* skip matacc opt if the all src buffers of mul op are const buffers
* add noqa directive for long test
* unskip MALACC opt
* ensure that a_axes at least includes summation axes in order to perform np.einsum correctly
* add regression test for mulacc op
* compute a_slices using a_axes
* refactor helper of function to retrieve axes and slices for nonzero strides as well as summation axes
* include a regression test that uses and to test the behaviour indirectly
* PoC faster wino compile by catting consts across data expand dim
* fix fusions
* faster + golf it
* noqa 501
* implicit broadcast
* Revert "implicit broadcast"
This reverts commit 5915a9083d045ec1e6be84dcb492333325d48666.
* shorter
* shorter
* oops
* 216 upcasts is probably fine
* wino kernel count test
* test winograd number of sts
* specify device for apply_matrix mat elements
* shrink MLB on sharded axis
use onehot structure to store the real partition. goal is unsynced batchnorm2d that can be run on multigpu for training.
draft version in https://github.com/chenyuxyz/tinygrad/pull/109
* SYNCBN flag
* test unclean shrinks
* UnsyncedBatchNorm reuses BatchNorm
* more robust pad arg check
* better types
* more tests!
* 6 gpus in benchmark
* disable slow GPUS=6 benchmark
- removed noop a=0
- fixed integer div test
- added test for both python expression and Tensor method call
- reordered for consistency and added some spaces
* mockhip->hipcpu
* allocate buffers
* launch a kernel
read_asm api
* run remu in CI
* remu 0.0.2, real test ops
* simple driver
* 0.0.3, all test_ops
* run the latest emulator
* 9 minutes is way too long, drop backprop in CI
* bring back the backward pass
* Revert "bring back the backward pass"
This reverts commit 3781e1bc56.
* Print slowest tests
* emulated device directly in ops_hip
* fix ruff, override mypy for specific rules
* test in the same code path
- hip backend env variables
- install packages and verify autogen
- run certain tests
- remove the other hip tests path
- verify Device.DEFAULT
* remove the emulated hip in extra
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* Reapply "take merge views from corsix branch" (#3278)
This reverts commit d298916232.
* reintroduce merge views
* update second any
* isinstance -> not
* 25% less same but unequal