* real strides with uops [run_process_replay]
* compare with old
* Revert "compare with old"
This reverts commit f53a8d4276.
* make those @unittest.expectedFailure
* try
* pass
* clean up
* done
* I'm becoming dumber
* clean up 2
* remove useless max
* useless but make computer brrr [run_process_replay]
* try process replay
* try again
* 1 less line, just use pad2d
* Revert "late gate creation for STORE [run_process_replay] (#6373)"
This reverts commit c26744de9f.
* Revert "gated store rewrite to UOps.IF (#5976)"
This reverts commit 48061e8400.
* almost working with relu, even hackable... but acc size is wrong, fix needed
* upcast based on threads, change thread size to 4x4
* revert wrongfully commented assert
* fix tc load indexing
* modify for size 8
* fix bug for size 8
* Revert "fix bug for size 8"
This reverts commit cdb3f5df85.
* Revert "modify for size 8"
This reverts commit 3ef0904bd9.
* good kernel with changes in lowerer
* revert "good kernel with changes in lowerer"
This reverts commit 975e2b5a4e.
* good kernel for relu!
* refactor lowerer changes
* add amx context var to helper
* clean up amx flag
* improve lowerer changes readability
* improve check for amx
* revert lowerer if
* add float4 type rendering for clang
* add amx definitions
* enable indexing for clang if amx
* working amx example, wrong because of dims
* almost works for float 16, need to spot using double load in amx
* cleaner render_kernel
* revert changes in simple_matmul and delete env
* add new var upcast_offset to get_optimized_ast
* change axis for axes
* invert if in rendering phi
* fix some bugs
* fix linearizer tests
* fix vec/get pat for amx
* remove clang tc if amx is disabled
* add ops_python support
* refactor into one complementary function in ops_python
* add job for EMULATE_AMX
* improve checking for AMX in UPCAST and TC extra ops
* fix lint issue
* commit before refactor into self-contained AMX
* start refactor by removing special rendering for AMX
* all ready for amx handcoded kernel
* working poc, most straightforward amx support
* avoid local opts for tc if amx
* fix merge bugs
* skip test for clang
* skip tc hand-coded opts if amx
* remove hardcoded ops_python values
* remove hardcoded sizes for amx kernel
* fix ops_python bug where dim was hard-coded
* change contract for vectorize
* working without changes in lowerer
* revert changes in gep rendering
* fix ops_python
* modify comment
* skip test if clang for different type accumulation
* move rename and bug for separate pr
* fix wrong path for test
* addmm not implemented in torch for cpu
* change struct for vector; equally slow but cleaner
* revert modified test
* simplify wmma rendering
* minor change
* noqa:501
* add length 16 for AMX
* fix vectorized half issue
* fix error
* remove comment
* change set for dedup
* split test of tensor_core_extra_ops so that cases that don't require locals run for AMX
* add amx reference
* load acc into amx registers
* fix dtype rendering and remove noqa
* moved tests change into another pr
* add real AMX job for CI and fix bug
* fix ops_python bug
* fix test class
* remove real AMX tests and fix uops_stats test
* remove wrong test
* acc folding
* hotfix: bug
* fix float4 tests for amx
* hack for fixing flops counting
* hotfix: mypy
* add flop counts test for amx
* improve test_float4_multidim_amx
* improve test_float4_multidim_amx
* improve test_float4_multidim_unaligned_load_amx
* nits tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* document UOps.IF [run_process_replay]
* this will be a block of STOREs after merge_gates
* now i can enable the assert
* more docs
* raw code block
* cname
* cleanup
* Revert "cname"
This reverts commit d823f87561.