* remove old index reorder
* new style folder
* works better
* dedup
* one failure
* this is fine now...
* expander_rewrite
* images broken, but all else should work
* cleanups
* make tests work with old
* fix images
* cleanups + bugfix
* minor fixes
* fix gated store folding
* flip gate_creator and expander
* fix gated store
* remove unneeded rules
* lines getting close
* line count good
* multireduce no-opts works
* passed test_var_multireduce
* cleanup
* double reduce
* extra check for range_group
* more checking for range_groups
* cleaning up debug prints
* cleanup diff
* linters
* revert kernel changes
* these are uops toposort
---------
Co-authored-by: timmy <timmy0x@proton.me>
* render lidx starting with 0
changed from
```
int gidx0 = gid.x; /* 4096 */
int lidx4 = lid.x; /* 8 */
int gidx1 = gid.y; /* 7 */
int lidx5 = lid.y; /* 8 */
int gidx2 = gid.z; /* 7 */
int lidx6 = lid.z; /* 2 */
```
to
```
int gidx0 = gid.x; /* 4096 */
int lidx0 = lid.x; /* 8 */
int gidx1 = gid.y; /* 7 */
int lidx1 = lid.y; /* 8 */
int gidx2 = gid.z; /* 7 */
int lidx2 = lid.z; /* 2 */
```
the existing numbering started from the pre-limited global dims, which skips numbers if there are more than 3 global dims
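a minimal sketch of the numbering change, not the actual renderer code. `index_decls` and its `local_start` parameter are hypothetical names used only for illustration; passing `local_start=4` is assumed to mimic the old behavior, where the local indices picked up after the pre-limited global dim count
```python
def index_decls(global_sizes, local_sizes, local_start=0):
    """Build index declaration lines; hypothetical helper, not tinygrad API."""
    axes = "xyz"
    lines = []
    # global indices are always numbered from 0
    for i, size in enumerate(global_sizes):
        lines.append(f"int gidx{i} = gid.{axes[i]}; /* {size} */")
    # local indices are numbered from local_start (0 after this change)
    for i, size in enumerate(local_sizes):
        lines.append(f"int lidx{local_start + i} = lid.{axes[i]}; /* {size} */")
    return lines

new_style = index_decls([4096, 7, 7], [8, 8, 2])
old_style = index_decls([4096, 7, 7], [8, 8, 2], local_start=4)
```
with these inputs `new_style` holds lidx0..lidx2 and `old_style` holds lidx4..lidx6, matching the names in the two snippets above (the sketch lists all globals first instead of interleaving them)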
* don't need start_dim
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
* indexing getting better [run_process_replay] [no_assert]
* fix test
* test_arange_2_reduce is a simpler test
* put that print back, NOOPT
* don't merge reduces (they could be different reduces)
* FUSE_AS_ONE_KERNEL
* fix tests
* fix test_var_multireduce
* w/e put that there
* fails on others too
* fix test, revert UNMUL change
* in case order matters
* one kernel indexing works
* one kernel indexing works (test other)
* st to uops function
* lowerer
* uops reduce
* uops reduce
* acc_number correct
* reduce unroll
* complete unroll
* do upcasts
* handle multioutput
* define_accs
* fix valid
* get grouped dims
* revert lin
* minor
* fixup_ast
* group for reduce
* group works now
* all forwards pass
* all ops tests pass
* fix clang
* mypy
* lil cleanups, no image yet
* ugh, variables everywhere
* bugfix
* counters and name fix
* use symbolic, not uops
* cleanups
* Fix tests
* linearizer tests
* expands
* float4 expand load
* tests pass
* woooo, float4 test
* test ops works again
* one more lin test
* more lin tests
* bypass
* fix tests
* something like this
* const in defineacc
* uops get_reduce_acc
* move around
* allow consts in the LOAD/STORE
* each axis should only appear once, 21 failures
* 16 failures
* fix some image
* optional float4
* onnx tests
* gate the stores
* add reorder
* fix terrible skip function
* tc work
* opt add/mul merge
* fix float4 tests
* tiny tweak, 9 failing
* 7 test failures
* start tc, but i don't think this will work
* progress on tensorcores
* note
* fix ops tests
* closer on tc
* weeee...one tensor core works
* still works, more generic
* large WMMA works
* tc test passes
* use WMMA as accumulator
* basic tc tests passing
* small gemm padded works
* 4 failures
* 3 tests failing
* super barrier
* now two tests failing
* one test failing
* cleanups, add reduce to UopGraph
* remove the linearizer
* remove unused
* lil cleanups
* Lowerer everywhere
* remove test that doesn't exist now
* image indexing
* llvm fix
* fix metal
* fix image
* fix images
* might fix ptx
* fix image type mismatch
* more tests pass
* CAST -> VECTORIZE
* forgot that one
* fix TestOps.test_flip_eye_crash
* locals shouldn't be image dtype
* change less files
* test fix
* fix recursive expands
* touches
* MULACC support in python
* delete unneeded
* alu before contract
* bug fixes
* tests
* no var multireduce
* simpler tc
* metal works in new style
* working on AMD and METAL
* fix amd
* shot in the dark, fix amd
* something for CUDA
* CUDA WORKS from the docs
* comment
* correct merge
* cleanups + ptx fix + get_reduce_acc
* local alias isn't used anymore
* add store sanity check
* fix for AMD
* cleanups and single expand pass
* more correct with acc_cache
* tests should pass
* block on WMMA
* tests pass
* merge contract and reduce
* contractor fixes issue
* multicontract
* pre expand wmma (same as a reduce)
* expand wmma and only take one
* all expands
* comments and whitespace
fix broken PTX tests in test_linearizer and test_uops. some tests were skipped and had broken because they only ran with CUDA=1, and we run PTX with NV=1 now
* Add UOps.VECTORIZE to core
* Update vectorized cast tests
* Addresses code review comments
- Removes VECTORIZE from LLVMRenderer
- Add line breaks to unduly long lines
- Add noop CAST rule back
- Update asserts and add render_vectorize in
CStyleLanguage renderer
* Add missing const folding rule for VECTORIZE
Also adds corresponding test
* Fix test_const_vectorize_fold and add assert
- Use sane types with VECTORIZE in test_const_vectorize_fold
- Add assert that sanity checks the types for VECTORIZE
* Rename test_cast_vectorized_fold
Renames test_cast_vectorized_fold to test_noop_vectorize_fold
because the test targets a very specific rule and there are
other tests for VECTORIZE.
* Revert unrelated changes
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>