* new pm_lower_index_dtype
* load_store_indexing after index lowering
* shorten line
* separate rule for long removal
* fix test
* fix index_to_concrete_int
* minor fixes
* add sink there
* update types in linearizer test
* use deque instead of list
* increase ctx.progress and max stack_len
* add openpilot
* prevent placing uops on stack many times
* revert increasing ctx.progress and stack length limit
* don't block adding to the stack there
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* it doesn't realize it when i reshape
* cleaner graph
* map out
* REDUCE_AXIS also gives the wrong answer
* maybe
* work
* back here
* try
* more
* refactor tests
* check MultiBuffer
* or copy
* fine with this
* don't need graph_rewrite_map in rangeify
* enable RANGEIFY=1 test_assign
* work
* rangeify=0 asserts this ast
* remove that
* beta test, it's correct though
* skip multi
* matches torch/np output
* memcopy without memcopy
* can remove this
* rangeify isn't silently wrong anymore
* diff cleanup
* use UOp toposort instead of global tags
* actual assert TestRangeifyAssign
* step
* work
* this isn't optimizing away now
* some todos
* test fusion schedule
* typo
* dedup idxs
* cleaner
* pre
* work
* diff
* ci
* extract mops
* work
* assert early
* port this?
* can realize shard
* allreduce passing
* notes
* better handling of shard
* err
* outerworld allreduce twice
* work
* don't tag movement ops
* don't tag movement ops
* delete old logic
* 19 failing + ram
* cleanup
* reset stuff
* simplest failing test
* diff
* test_ones
* allreduce work
* allreduce more work
* down to 22 failing tests
* port _device_num
* replace creates a new UOp here
* pour symbolic everywhere
* 7 failing
* focus on allreduce
* work
* cleanup
* more ci
* fix test_schedule_ring
* post index const shape
* much better
* diff cleanup
* remove check
* use fold_divmod_congruence instead of simplify
* adjust tests
* shorten line
* new algo
* add test
* cleanup
* update tests
* ALLOWED_GATED_READ_IMAGE from 16 -> 12
* only remove the call to simplify
* add option to simplify with factor_remainder
* ALLOWED_GATED_READ_IMAGE back to 16
* lowering invalid gate is part of lower_index_dtype
* update test
* remove import
* put that back
* reduce_collapse uses invalid
* fix that pattern to use invalid_pat
* valid creates the right dtype count
* separate rule for lowering invalid gate
* don't unvectorize Invalid gate
* image_fixup uses Invalid
* update tests
* cleanup
* update split_load_store
* add .scalar() there