* minimum new style expand [run_process_replay]
* float4 folding works
* fix uop graph
* if means or
* dype.count idx overload
* fix test arange
* expand nope
* fix expand contract
* fix amd tensor core
* oh, that's a good test with a real failure
* remove prints
* early reduce
* tomorrow, we remove sorted on expand args
* fix wmma issue
* that makes test_arange pass
* vectorized folding
* no check
* broadcast
* fix clang with self assign rule
* unwrap_dtype maybe
* uopgraph stuff that hardcoded None
* test_ops passes
* dtypes.py fixups
* update test_linearizer and friends
* more ast updates
* test_beam and test_schedule too
* add void type to uop [run_process_replay]
* remove dumb casts
* start making it green
* more cast cleanups
* more cls methods to fix
* regenerate dataset
* split UOp and NOp const
* maybe that too
* fix docs
* update test_uop_symbolic
* test_verify_ast
* new sops with no diff
* meh, type_ignore is alright
* remove that assert
---------
Co-authored-by: qazal <qazal.software@gmail.com>
* Revert "late gate creation for STORE [run_process_replay] (#6373)"
This reverts commit c26744de9f.
* Revert "gated store rewrite to UOps.IF (#5976)"
This reverts commit 48061e8400.
* Core change to gate stores in IFs
* Updates to cstyle renderer to handle IFs around STOREs
* Make uops asserts happy
* Add tests and fix newly broken tests
* make ruff happy
* make mypy happy
* Simplify renderer to have all gated stores use IF
* Revert some changes
* Make test_where_fold happy
* Revert unnecessary handling of ifs rendering. Was included before when changes weren't fully built out
* Rewrite graph to have IFs be dependent on RANGEs if STORE is already dependent on RANGE
* Re-change broken test
* Make ifs be grouped together
* get non-merged IFs working. ALl tests pass except grouping related ifs together
* Fix tests by making the IF UOp dependent on the correct node of the STORE UOp
* Changes to uopgraph
* Simplify graph rewrite logic
* Changes to get test_padto_where_multireduce working
* Simplify uops.store renderer
* Make test_padto_where_multireduce pass but now other tests fail
* Clean up uopgraph from scrach work
* Ignore sudo IF srcs when rendering
* Attempt to fix llvm tests
* rm comment
* reduce lines
* Add line to make mypy happy :(
* llvmir fix pt 1
* Mods after rebasing to master
* Fix llvmir
* Fix ptx tests
* Fix other ptx tests
* Move changes from uops.py to ops.py
* rm uops.py
* Fix TestGateStoreRewrite tests
* Get multireduce tests working
* reset to remote branch
* Fix linearizer tests
* uop_graph test patch
* Add comment to create_gate
* hotfix: uncomment those tests
* Attempt to fix ptx tests by including whitespace inside if block
* Patch from remote tinybox. Tests passing here
* Min changes to get some ptx tests passsing
* Changes after rebase
* Exclude ifs and endifs from ptx
* IF conditional branching within ptx
* Save lines on delete_redundant_gates
* Simplify merge_gates
* rm noqa
* Remove unnecessary checks when merging gates
* Fix ops error msg
* Smarter check for if/endif in llvmir
* simplify delete redundant gates to only have 2 returns
* spacing
* Smarter check at beginning of merge_gates
* patches from comments
* Remove need for merge_gates
* include proper srcs in IF from the get-go
* test expand ifs dumb will result in 4 ifs, not 1 now
* Make tests happy
* Fix uops stats
* rm merge_gates method. Will add back in separate PR
* Spacing
* cleaner error msg
* Fix uops rendering when expanding. test_failure_43
* patch tests
* undo changes in delete_redundant_gates
* process replay attempt
* re-intro deletion of redundant gates
* fix addition of gates when they get nested in stores and loads
* patch tests
* smarter init of IF srcs when adding gate to STORE
* make ruff happy
* Resp to comment
* include all src[2]'s srcs in IF for gated store
* add reference of the storing value to the gate's src
* minor patch after rebasing
* change ptx renderer
---------
Co-authored-by: qazal <qazal.software@gmail.com>
* loooking into graph rewrite speed
* track, replace is slow
* if all same, no permutations [run_process_replay]
* types so compile works
* no implied comprehension
* TRACK_MATCH_STATS=2
* rewrite bool ADD to OR and MUL to AND
fixed running `tinyphysics.onnx`, which contains a getitem from a boolean tensor.
only can repro through BEAM_COMPARE, which i think is a different bug in test_linearizer_failure
* fold those, and fix tests
* only for bool
* move dtypes.bool
* test: uop and lazyop have the same compare
* typings
* self.assert_equiv_uops -> assertEqual
* hash dtype
* test nop too
* TestPatternMatcher never used this compare anyway
* nop eq and ne tests
* remove test_const_vectorize_fold
* remove const folding UPat for VECTORIZE
* refactor cstyle render_const
* remove calls to dtype.scalar() in render_const
* add assert
* add vectorized const to UOp.const
* add UPat GEP-VECTORIZE-CONST -> CONST
* render_vectorize for DEFINE_ACC in cstyle
* add back missing render_cast in render_const
* generate vectorized consts as UOps for DEFINE_ACC
* update asserts for DEFINE_ACC with VECTORIZE src
* add UPats for PHI with VECTORIZE src
* use prev rendered vectorize in DEFINE_ACC render
* update DEFINE_ACC in python runtime
* update vectorized DEFINE_ACC in PTXRenderer
* rebase DEFINE_ACC changes on lowerer
* verbose rewrite of bad UPats
* simplify UOps.CONST implementation in ops_python
* update sum_collapse UPats for DEFINE_ACC-VECTORIZE
* revert linearizer to TOT
* fix DEFINE_ACC implementation in ops_python
* simplify DEFINE_ACC in cstyle
* Fix linter error
* support VECTORIZE in fold gated load/store UPat
* support VECTORIZE in other fold gated load UPats
* rewrite VECTORIZE in UPat for no input DEFINE_ACC
* simplify DEFINE_ACC render in cstyle
* make VECTORIZE rules more concise
* add more vectorize fold tests
* inline VECTORIZE-CONSTs in cstyle render
* revert VECTORIZE/GEP rule refactor
* revert cstyle render_const refactor
* inline VECTORIZE-CONSTs in cstyle render
* implicitly vectorized const rendering -> explicit
* WMMA VECTORIZE CONST process replay hacks
* VECTORIZE CONST NAN process_replay hacks
* more VECTORIZE CONST NAN hacks
* cleanup process_replay hacks
* isnan() -> not isfinite() cstyle VECTORIZE CONST
* tweak isnan and isfinite checks VECTORIZE CONST
* tweak for positive vs negative infinity VECTORIZE CONST
* add assert to PTX CONST render
* process_replay VECTORIZE CONST render parity for PTX STORE
* vmin/vmax for VECTORIZE'd CONST
* update WMMA folding rules
* add tests for WMMA VECTORIZE fold
* hack for cstyle half4 CONST zero process_replay parity
* revert PTX backend changes
* add back minimal DEFINE_ACC PTX change
* remove cstyle process_replay hacks
* remove dead code in PTX CONST render
* cleanup vmin/vmax logic for VECTORIZE'd CONSTs
* update vectorize fold tests to use DEFINE_VAR
* fix long line formatting in test
* remove unwanted merge artifact
* more vmin/vmax cleanup
* remove unnecessary asserts
* yet more vmin/vmax cleanup
* get rid of explicit VECTORIZE CONST logic in _min_max
* reuse CONST instead of creating a new one
* remove unneeded cast
* handle DType correctly in sconst
* improve readability of tests
* save a line
* save another line
* tuplize pats in src
* remove GEP-VECTORIZE pats
* add vec +0 fold
* HACK: fold only vec8 +0
* remove vectorized ALU fold hack
---------
Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
* remove old index reorder
* new style folder
* works better
* dedup
* one failure
* this is fine now...
* expander_rewrite
* images broken, but all else should work
* cleanups
* make tests work with old
* fix images
* cleanups + bugfix
* minor fixes
* fix gated store folding
* flip gate_creator and expander
* fix gated store
* remove unneeded rules
* lines getting close
* line count good