* support same uidx in multiple shape positions
* rename var
* update comment
* add contiguous index check to global_store too
* update comment
* small change
* is this better?
* smh
* smaller change?
* get rid of more changes
* get rid of more changes
* is this even making anything better?
* comment
* fix test
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* remove float cast
* cast scalars to the correct value in creation time
* cast scalar in the correct place
* wrong, use y_dtype
* make consts have a unique cache key
* add cast_scalar back
* test_load_cache_const_bufs
* add bool dtype
* test_const_dtype
* fix linters
Fully UNROLLing the first_reduce should not change the number of
local_dims.
Fully UNROLLing a GROUP dim should reduce the number of
group_for_reduces by one.
Also changed group_for_reduces to be a count, since the axis numbers
aren't used anywhere (they are always the first reduce dims).
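A minimal test-style sketch of those two invariants, assuming tinygrad's Kernel API (`apply_opt`, `Opt`, `OptOps`, `local_dims`, `group_for_reduces`); `build_reduce_kernel` is a hypothetical helper and the exact `Opt` arguments are assumptions, not part of this change:

```python
# Hedged sketch only: the UNROLL invariants above written as assertions.
# build_reduce_kernel() is hypothetical; the Opt/OptOps usage is assumed, not verified here.
from tinygrad.codegen.kernel import Opt, OptOps

def check_full_unroll_invariants(build_reduce_kernel):
  k = build_reduce_kernel()                # hypothetical: a kernel with a plain reduce
  locals_before = k.local_dims
  k.apply_opt(Opt(OptOps.UNROLL, 0, 0))    # amt=0: fully unroll the first reduce
  assert k.local_dims == locals_before     # full UNROLL must not change local_dims

  g = build_reduce_kernel(grouped=True)    # hypothetical: a kernel with one GROUP dim
  groups_before = g.group_for_reduces      # now a count, not a list of axes
  g.apply_opt(Opt(OptOps.UNROLL, 0, 0))    # fully unroll the grouped reduce dim
  assert g.group_for_reduces == groups_before - 1
```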
* ops_python: add HIP tensor core mock and refactor METAL
* Add tests to CI
* add DEBUG=2 to full tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* start uop emu
* tiny_add passes
* more ops
* emulate the whole warp
* test_gemm passes
* metal gemm test pass
* works on big gemm
* works on big gemm
* more tests pass
* touch ups
* fix mypy
* cleanups
* exp2 mypy
* arch is where it belongs
* actually emulate tensor cores
* fix test
* new style
* PoC faster wino compile by catting consts across data expand dim
* fix fusions
* faster + golf it
* noqa 501
* implicit broadcast
* Revert "implicit broadcast"
This reverts commit 5915a9083d045ec1e6be84dcb492333325d48666.
* shorter
* shorter
* oops
* 216 upcasts is probably fine
* wino kernel count test
* test winograd number of sts
* specify device for apply_matrix mat elements
* extra/gemm: add a simple_conv.py along with correctness check
The goal is to make it easy to test situations that trigger tensor cores
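A rough sketch of what such a correctness check can look like (not the actual extra/gemm/simple_conv.py; shapes and tolerances here are arbitrary choices for illustration):

```python
# Illustration only: compare tinygrad's conv2d against torch on random data.
import numpy as np
import torch
from tinygrad.tensor import Tensor

BS, CIN, COUT, H, W, K = 4, 16, 32, 32, 32, 3
rng = np.random.default_rng(0)
x = rng.standard_normal((BS, CIN, H, W), dtype=np.float32)
w = rng.standard_normal((COUT, CIN, K, K), dtype=np.float32)

out_tiny = Tensor(x).conv2d(Tensor(w)).numpy()
out_torch = torch.nn.functional.conv2d(torch.from_numpy(x), torch.from_numpy(w)).numpy()
np.testing.assert_allclose(out_tiny, out_torch, atol=1e-4, rtol=1e-4)
print("conv matches torch")
```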
* test: add tests for acc_dtype handling and fix typing
* wmma: enable METAL half tensor cores and clean up cstyle
* revert simple_matmul rand changes and break line in tensor
* added metal fp16->fp32 tensor core
We previously only upcast uint and int, so half was accumulated in half.
Change to accumulate in float for precision, but cast the result back to half to match the torch/jax output dtype.
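A quick numpy-only illustration of why the accumulator dtype matters (not tinygrad code):

```python
# fp16 accumulation stalls once the running sum reaches 2048, where the fp16 spacing is 2.
import numpy as np

x = np.ones(4096, dtype=np.float16)
acc_half, acc_float = np.float16(0), np.float32(0)
for v in x:
  acc_half = np.float16(acc_half + v)    # half accumulator: 2048 + 1 rounds back to 2048
  acc_float = np.float32(acc_float + v)  # float accumulator: exact in this range

print(acc_half)               # 2048.0 -- wrong
print(np.float16(acc_float))  # 4096.0 -- accumulate in float, cast back to half at the end
```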
The correct condition is that PADTO cannot be applied to a reduce axis, not that the reduce op is Reduce.MAX.
Even for Reduce.SUM it's possible that the reduce axis had a div before it, so the padded 0 becomes inf and the sum over it is incorrect.
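The failure mode in a nutshell (numpy illustration only, not tinygrad kernel code):

```python
# Padding the reduce axis with 0 is only safe if the pad stays 0 all the way to the reduce.
# With a div before the sum, the padded zeros become inf and poison the result.
import numpy as np

x = np.array([1.0, 2.0, 3.0])       # real data along the reduce axis
x_pad = np.pad(x, (0, 1))           # PADTO-style padding with 0 -> [1., 2., 3., 0.]

print((1.0 / x).sum())              # ~1.8333 (correct)
print((1.0 / x_pad).sum())          # 1/0 -> inf, so the padded sum is inf (wrong)
```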
* lazy rewrite, try 2
* min fix tests
* pass contig test
* put broken pads back
* move that to realize
* no contig child fixes array packing
* so wrong
* now that's correct
* base children
* fix bind issues
* disable to_image_idx
* fix tests
* that failure shouldn't break other tests
* more fixes
* fix torch
* skip failing tests in CI
* 1e-7
* half is broken
* 1e-6 margin of error
* upcast the other way
* Revert "upcast the other way"
This reverts commit 355692ba79.
* remove uop cast, this should have never been there
* add regression test
* now fuzz it
correct test
* the accumulator is always the output type
lint
* fuzz all reduce ops
* MULACC upcast_dtype could be half too
OpenCL supports it: https://man.opencl.org/mad.html
* cast to the same dtype is a noop
* internal casting support for MULACC
* fuzz test mulacc internal casting
* get_reduce_dtype
handle vectorized acc
update get_reduce_acc calls with the correct dtype
update tests
* pending _complete_ implementation of a function that gets the dtype based on self.reduceop
+more failing tests
* get_reduce_dtype try 2
add TODO
* get_lazyop_info already does it
* cleanup
* bring back internal casting support for mulacc
* use the scalar version of the acc dtype
* conceptual diff cleanup
* one extra line to a cleaner linearizer
* correct test assumptions - these should promote?
* rm mulacc cast, the cast of vins happens with the acc dtype promotion
linearizer hacks
* Revert "rm mulacc cast, the cast of vins happens with the acc dtype promotion"
This reverts commit afdd540733.
Revert "correct test assumptions - these should promote?"
This reverts commit 49ae2206ed.
* skip tests blocked by MULACC->lazyop cleanup
* final changes to add back internal casting for MULACC and update skip test logic, upcast works but downcast does not
* only test the linearizer abstraction layer
we want to ensure that the linearizer matches whatever lazy is returning
* remove unused hypothesis module
* remove mulacc related changes, those will move to the lazy pr
* remove midcast test
* move to helpers
* Revert "remove midcast test"
This reverts commit 86e74d7960.
add TODO with skip
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are really broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* remove force_wait
* refactor
* get rid of stupid ASTRunner
* fix del in diskbuffer
* BufferOps.FROM_UNDERLYING
* put offset in the rawbuffer
* fix bugs
* use exec