* rtoposort is fast, can replace rangeify with this
* fast rangeify
* work
* fast rangeify works for mnist
* should work
* progress
* pad fix
* FAST
* tests passing
* don't delete those shape ops
* put in rangeify map
* ending ranges fix
* tests
* mstack/mselect no hacks
* move to indexing.py
* touch up tests + add comments
* disable failing test
* actually make the file readable
* failing
* error
* add ordering
* fix some tests
* fix more tests
* shorten comment
* update test
* add rule and test
* add rule and test
* remove check
* use fold_divmod_congruence instead of simplify
* adjust tests
* shorten line
* new algo
* add test
* add function to un-nest the div
* add UOp.factor
* test UOp.factor
* uop_given_valid tries to factor simplex expression
* shorten line
* symbolic_flat is back
* change that back
* fix those new tests
* new rule for ordering
* factor multiple factors
* no symbolic_flat
* symbolic_flat to there
* move that back
* fix imports
* merge correctly
* linter happy
* add rule
* add a test
* cleanup
* revert that for now
* UOp.factor returns self instead of None
* try all_candidates
* remove or_else
* post index symbolic
* add test
* make this closer to the original
* increase mac hlb_cifar min step time
* add some ordering tests
* cleanup
* increase pytest timeout time
* check dtype
* enable cleanup_dead_axes
* don't mess with user contig
* correct tag behavior
* double reshape isn't correct
* block on assign too
* skip messing with symbolic
* Fix tests
* disable RANGEIFY=2
* test w rangeify
* use deque instead of list
* increase ctx.progress and max stack_len
* add openpilot
* prevent placing uops on stack many times
* revert increasing ctx.progress and stack length limit
* don't block adding to the stack there
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* enable RANGEIFY=1 test_assign
* work
* rangeify=0 asserts this ast
* remove that
* beta test, it's correct though
* skip multi
* matches torch/np output
* memcopy without memcopy
* can remove this
* rangeify isn't silently wrong anymore
* diff cleanup
* use UOp toposort instead of global tags
* actual assert TestRangeifyAssign
* step
* work
* this isn't optimizing away now
* some todos
* test fusion schedule
* typo
* dedup idxs
* cleaner
* pre
* work
* diff
* ci
* extract mops
* work
* assert early
* port this?
* can realize shard
* allreduce passing
* notes
* better handling of shard
* err
* outerworld allreduce twice
* work
* don't tag movement ops
* don't tag movement ops
* delete old logic
* 19 failing + ram
* cleanup
* reset stuff
* simplest failing test
* diff
* test_ones
* allreduce work
* allreduce more work
* down to 22 failing tests
* port _device_num
* replace creates a new UOp here
* pour symbolic everywhere
* 7 failing
* focus on allreduce
* work
* cleanup
* more ci
* fix test_schedule_ring
* post index const shape
* much better
* diff cleanup