* work on shape property
* reshape causing issues
* more mops
* all mops
* need to cache it
* _shape is like _device
* mostly works
* shape is good
* const uses _shape
* fix tests
* size doesn't use st
* close
* test is broken
* one less st
* hack for 3 op assign
* oops, i didn't mean to change that
* support emulate in the NullDevice
* reproed failure in emulation
* fix wmma
* it doesn't realize it when i reshape
* cleaner graph
* map out
* REDUCE_AXIS also gives the wrong answer
* maybe
* work
* back here
* try
* more
* refactor tests
* check MultiBuffer
* or copy
* fine with this
* don't need graph_rewrite_map in rangeify
* RANGEIFY=1 test_jit
* don't do any of that
* disk
* simple disk tensor
* more work
* run more tests
* it also doesn't copy every time
* skip tests that hang everything
* move device tests to test/device
* test speedups
* test device
* linalg to unit
* upd
* so pytest just works
* more divide and skip
* speed
* test devectorize
* add pillow
* viz: non blocking UOp tracing
* u.arg
* no if Ops.KERNEL
* drop replace
* switch to weakref.WeakKeyDictionary
* back
* remove ram usage skips, viz works here
* cache on reconstruct
* move high level stuff to unit tests [pr]
* process replay on unit tests
* fix pr, less compute
* set omp num threads
* set 200MB buffer size limit
* delete junk
* fix tests
* faster
* move test_indexing to unit
* faster
`test_data_parallel_resnet_train_step` is already skipped on LLVM/CPU:
```python
@unittest.skipIf(CI and REAL_DEV in ("CUDA", "NV", "LLVM", "CPU"), "slow, and flaky on LLVM/CPU")
@unittest.skipIf(REAL_DEV == "WEBGPU" and not OSX, "WEBGPU Vulkan can only run kernels with up to 10 buffers")
def test_data_parallel_resnet_train_step(self):
```
It looks like `test_data_parallel_resnet` (no `_train_step`) is flaky in a similar way:
https://github.com/tinygrad/tinygrad/actions/runs/15472667248/job/43560773882?pr=10642#step:9:64
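If it needs the same guard, a minimal sketch would be to reuse the decorators above on the non-`_train_step` variant (assuming `CI`, `REAL_DEV`, and `OSX` are the same test helpers as in the snippet above; this is an illustration, not necessarily the change that was made):
```python
# sketch: apply the existing skip condition to test_data_parallel_resnet as well
@unittest.skipIf(CI and REAL_DEV in ("CUDA", "NV", "LLVM", "CPU"), "slow, and flaky on LLVM/CPU")
@unittest.skipIf(REAL_DEV == "WEBGPU" and not OSX, "WEBGPU Vulkan can only run kernels with up to 10 buffers")
def test_data_parallel_resnet(self):
  ...
```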
* prevent huge waste of multi ram
* fix ram usage
* only define var
* add resolve
* fix tests
* fix cifar training
* remove that logic
* fix test without long
This should fix the flakiness of the remote CPU tests (the segfaults were in
`test_data_parallel_resnet_train_step`, which is skipped on CPU but wasn't
skipped on remote CPU).
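A hypothetical sketch of making the skip also fire on remote CPU: resolve the backend that actually runs behind `REMOTE` before comparing it to `"CPU"`. The `REMOTE_BACKEND` env var used here is an assumption for illustration, not tinygrad's actual mechanism.
```python
import os
from tinygrad import Device

def real_device() -> str:
  # when tests run against the REMOTE backend, the compute happens on whatever
  # device the remote server uses; resolve that so CPU-only skips also apply there
  dev = Device.DEFAULT
  if dev == "REMOTE":
    # assumption: CI exposes the remote server's backend via an env var
    dev = os.getenv("REMOTE_BACKEND", "CPU")
  return dev

REAL_DEV = real_device()  # "CPU" when the remote server is CPU-backed
```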
* multi is O(1)
* allreduce
* no new uops needed
* junk
* something
* simple
* that's really what i want
* closer
* inject _device_num
* pretty print
* cleanups
* this
* early dnum
* ops allreduce is good
* ish
* device is the tuple and this is fine
* simpler
* progress
* copy_multi
* work
* more tests
* more tests pass
* work
* no None axis
* tests
* no none multi
* type fixes
* pre commit passes
* lil
* remove this
* mlperf dataloader on mac
* that test was wrong
* unbind
* support DEBUG=2
* realize
* only unbind bound vars
* don't include fixedvars
* graph test
* one test
* fixedvars in hcq
* new ring reduce
* ring reduce
* simpler ring
* mselect
* mselect doesn't work
* Revert "mselect doesn't work"
This reverts commit c78b77bd7d.
* Revert "mselect"
This reverts commit bb2e430ac3.
* simpler
* fixups
* no optional
* fix jit
* move things around
* cleanup multi
* simpler multi
* simpler reshape
* Less messy workaround for broken graph on paravirtualized Metal
GitHub CI macOS runners use paravirtualized Metal, which is broken with
graph (some comments say that ICB in particular is broken, but in my
testing it was sometimes fine and other times hit an assert inside
Metal's code related to resources, so I'm not sure).
> Assertion failed: (resource != nil), function -[IOGPUMetalResource initWithResource:], file IOGPUMetalResource.m, line 458.
This can be reproduced locally with any virtualization software (like UTM)
that can create macOS VMs with Apple's own virtualization framework.
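A hedged sketch of detecting the paravirtualized device so the graph/ICB path can be avoided on such runners; the PyObjC `Metal` bindings and the reported device name are assumptions here, not necessarily how the workaround is implemented:
```python
# sketch: detect a paravirtualized Metal device and fall back to per-kernel
# command buffers instead of the graph (ICB) path there
import Metal  # PyObjC Metal bindings, assumed available

device = Metal.MTLCreateSystemDefaultDevice()
# GitHub's macOS runners report a name like "Apple Paravirtual device"
use_graph = "paravirtual" not in device.name().lower()
```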
* unused import