* remove ExecItem and merge it with ScheduleItem
* less diff
* fix issues
* min diff
* don't change bufs in _lower
* min diff
* update
* revert
* fixes
* diff
* work on shape property
* reshape causing issues
* more mops
* all mops
* need to cache it
* _shape is like _device
* mostly works
* shape is good
* const uses _shape
* fix tests
* size doesn't use st
* close
* test is broken
* one less st
* hack for 3 op assign
* oops, I didn't mean to change that
* support emulate in the NullDevice
* reproed failure in emulation
* fix wmma
* it doesn't realize it when I reshape
* cleaner graph
* map out
* REDUCE_AXIS also gives the wrong answer
* maybe
* work
* back here
* try
* more
* refactor tests
* check MultiBuffer
* or copy
* fine with this
* don't need graph_rewrite_map in rangeify
* RANGEIFY=1 test_jit
* don't do any of that
* disk
* simple disk tensor
* more work
* run more tests
* it also doesn't copy every time
* skip tests that hang everything
* move device tests to test/device
* test speedups
* test device
* linalg to unit
* upd
* so pytest just works
* more divide and skip
* speed
* test devectorize
* add pillow
* viz: non-blocking UOp tracing
* u.arg
* no if Ops.KERNEL
* drop replace
* switch to weakref.WeakKeyDictionary
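For context, a minimal sketch of the `weakref.WeakKeyDictionary` caching pattern this switches to (generic Python, not tinygrad's actual cache): entries disappear when the key object is garbage collected, so the cache never keeps otherwise-dead objects alive.
```python
import weakref

class Node:
  """Stand-in for a UOp-like object (illustrative only)."""

# keys are held weakly: when the last strong reference to a key dies,
# its entry is dropped from the dictionary automatically
cache: "weakref.WeakKeyDictionary[Node, int]" = weakref.WeakKeyDictionary()

n = Node()
cache[n] = 42
assert cache[n] == 42
del n                    # drop the last strong reference
assert len(cache) == 0   # entry is gone (immediate under CPython refcounting)
```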
* back
* remove ram usage skips, viz works here
* cache on reconstruct
* move high level stuff to unit tests [pr]
* process replay on unit tests
* fix pr, less compute
* set omp num threads
* set 200MB buffer size limit
* delete junk
* fix tests
* faster
* move test_indexing to unit
* faster
`test_data_parallel_resnet_train_step` is already skipped on LLVM/CPU:
```python
@unittest.skipIf(CI and REAL_DEV in ("CUDA", "NV", "LLVM", "CPU"), "slow, and flaky on LLVM/CPU")
@unittest.skipIf(REAL_DEV == "WEBGPU" and not OSX, "WEBGPU Vulkan can only run kernels with up to 10 buffers")
def test_data_parallel_resnet_train_step(self):
```
It looks like `test_data_parallel_resnet` (no `_train_step`) is flaky in a similar way:
https://github.com/tinygrad/tinygrad/actions/runs/15472667248/job/43560773882?pr=10642#step:9:64
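If it is, one plausible fix is to extend the same skip decorator to the sibling test; a sketch (not the actual change):
```python
@unittest.skipIf(CI and REAL_DEV in ("CUDA", "NV", "LLVM", "CPU"), "slow, and flaky on LLVM/CPU")
def test_data_parallel_resnet(self):
  ...
```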
* prevent huge waste of multi ram
* fix ram usage
* only define var
* add resolve
* fix tests
* fix cifar training
* remove that logic
* fix test without long
This should fix the remote CPU test flakiness (the segfaults were in
`test_data_parallel_resnet_train_step`, which is skipped on CPU but wasn't
skipped on remote CPU).
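A hedged sketch of the idea: resolve the device behind the remote proxy before evaluating the skip, so a CPU-only skip also fires when CPU is reached via REMOTE (the `REMOTE_DEV` environment lookup here is illustrative, not the repo's actual helper):
```python
import os
from tinygrad import Device

def backing_device() -> str:
  # when the default device is the remote proxy, check which device the
  # remote worker actually runs on (env-var resolution is an assumption)
  dev = Device.DEFAULT
  if dev == "REMOTE": dev = os.getenv("REMOTE_DEV", dev)
  return dev

# skip conditions then test the backing device, not the proxy
SKIP_SLOW = backing_device() in ("CUDA", "NV", "LLVM", "CPU")
```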
* multi is O(1)
* allreduce
* no new uops needed
* junk
* something
* simple
* that's really what I want
* closer
* inject _device_num
* pretty print
* cleanups
* this
* early dnum
* ops allreduce is good
* ish
* device is the tuple and this is fine
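That matches tinygrad's public sharding behavior: a multi-device tensor's `.device` is the tuple of devices it lives on. A small illustration (device names are placeholders):
```python
from tinygrad import Tensor

# shard a tensor across two devices along axis 0
t = Tensor.ones(256, 256).shard(("CPU:0", "CPU:1"), axis=0)
print(t.device)  # ("CPU:0", "CPU:1") -- the device *is* the tuple
```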
* simpler
* progress
* copy_multi
* work
* more tests
* more tests pass
* work
* no None axis
* tests
* no none multi
* type fixes
* pre commit passes
* lil
* remove this
* mlperf dataloader on mac
* that test was wrong
* unbind
* support DEBUG=2
* realize
* only unbind bound vars
* don't include fixedvars
* graph test
* one test
* fixedvars in hcq
* new ring reduce
* ring reduce
* simpler ring
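For reference, a generic simulation of the ring-allreduce pattern these commits implement (plain Python, not tinygrad's kernels): each device reduces one chunk while passing partial sums around the ring, then the finished chunks are circulated, so each device moves roughly 2(N-1)/N of the data.
```python
def ring_allreduce(vals: list[list[float]]) -> list[list[float]]:
  """Simulate ring allreduce: vals[i][j] is device i's j-th chunk (a scalar here)."""
  n = len(vals)
  acc = [row[:] for row in vals]
  # reduce-scatter: n-1 steps, each device forwards a running partial sum
  for step in range(n - 1):
    sends = [(i, (i - step) % n, acc[i][(i - step) % n]) for i in range(n)]
    for i, c, v in sends:
      acc[(i + 1) % n][c] += v
  # device i now holds the complete sum of chunk (i + 1) % n
  # allgather: n-1 steps circulating the finished chunks
  for step in range(n - 1):
    sends = [(i, (i + 1 - step) % n, acc[i][(i + 1 - step) % n]) for i in range(n)]
    for i, c, v in sends:
      acc[(i + 1) % n][c] = v
  return acc

# every device ends with the elementwise sum [12.0, 15.0, 18.0]
print(ring_allreduce([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]))
```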
* mselect
* mselect doesn't work
* Revert "mselect doesn't work"
This reverts commit c78b77bd7d.
* Revert "mselect"
This reverts commit bb2e430ac3.
* simpler
* fixups
* no optional
* fix jit
* move things around
* cleanup multi
* simpler multi
* simpler reshape