* failed test case for getitem with leading Nones
torch matched numpy so tinygrad is incorrect.
another repro
```
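import numpy as np
import torch
from tinygrad import Tensor
# numpy reference result
t = np.arange(12).reshape((3, 4))
print(t[None, None, np.array([1, 2])])
# torch agrees with numpy
t = torch.arange(12).reshape((3, 4))
print(t[None, None, torch.tensor([1, 2])].numpy())
# tinygrad: the failing case
t = Tensor.arange(12).reshape(3, 4)
print(t[None, None, Tensor([1, 2])].numpy())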
```
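For reference, a NumPy-only sketch of the result the repro above should produce. Each leading None adds a size-1 axis, and the index array then selects rows 1 and 2 along the original first axis. This only shows the NumPy side; it does not exercise tinygrad.

```python
import numpy as np

t = np.arange(12).reshape((3, 4))
out = t[None, None, np.array([1, 2])]

# Two new leading axes of size 1, then the two selected rows of length 4.
print(out.shape)     # (1, 1, 2, 4)
print(out.tolist())  # [[[[4, 5, 6, 7], [8, 9, 10, 11]]]]
```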
* # noqa
* basic tests
* cleanup
* pylint
* ruff
* use define acc as a proxy for rendered reductions
* use define acc as a proxy for rendered reductions
* recursive reduceop rendering via ast_parse
* linters + cleanup
* fixing late buf loading
* plus linters
* removing extra line
* linters
* does this break ci?
* added tests and if add end change
* typo in add_ends
* linters
* removing comments
* allow endifs to be inserted before the end of the graph
* find add ENDIF before next BARRIER
* removing tests with manual ENDIF + linters
* specifically the next barrier after the store of the local result
* Revert "specifically the next barrier after the store of the local result"
This reverts commit b288a5c3ce.
* keeping up to date
* linters + merge changes
* cleaning up old bad decisions
* linters and opts
* merged linearizer tests
* fixing merge issues
* removing the big ugly uop test (functionality tested end-to-end by test_linearizer additions)
* small diff fixes
* updating linearizer to work without uops.add( ... cachable)
* linters
* comment in multireduce tests
* skipping tests without locals
* full tests
* linters
* load_cache[key] fix for multiple accs
* linters
* assert only one reduceop
* fix loop_scope test to actually cause an issue
* self.load_cache[key] key for DEFINE_ACC changed to use a string to make sure each acc is unique
* updated tests
* fixing merge
* removing debug prints
* complete merge fix
* linters
* diff cleanup
* adding tests in
* give each reduce its own local buffer
* gpu=1 changes
* store and load locals with upcasting
* modifying test?
* make multireduce_netsted_local_upcast test match single reduce shapes
* removing todo
* cleaning up the diff
* unroll test
* unroll and upcast tests
* fix gpu
* seq and self.load_cache[key] cleaning
* linters
* padto works
* merge fixes
* fixes
* add skips for amd
* linters + seq
* cleaning & more tests
* softmax tests
* linters
* [run_process_replay]
* add new tests back
This reverts commit 19dec22e01.
* more hardcoded -1s
* fix ptx
* Fix name for loop in ptx
* cleaning up the diff
* cleaning up the uops diff
* nv ci is too slow
---------
Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: Szymon Ożóg <58388001+SzymonOzog@users.noreply.github.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
* don't use uops.add while constructing
* rebase
* bugfixes
* have to use BFS
* prove it's late
* simpler uop symbolic test (why we did this)
* use dict, not set
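The log does not say why a dict replaced the set. One common motivation, assumed here, is that a Python dict preserves insertion order while a set gives no ordering guarantee, so iteration stays deterministic. A minimal sketch of using a dict as an ordered set (the helper name `dedup_ordered` is hypothetical):

```python
def dedup_ordered(items):
    """Drop duplicates while keeping first-seen order, using dict keys as an ordered set."""
    return list(dict.fromkeys(items))

# Order follows first appearance, which a plain set would not guarantee.
print(dedup_ordered(["c", "a", "c", "b", "a"]))  # ['c', 'a', 'b']
```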
default [low, high] changed from [-1.5, 1.5] to [-2, 2] (except tan).
dropped several explicit atol values that were unnecessarily larger than the default 1e-6.
tested on mac, tinybox red / green
* revert the .detach() in layernorm
it's only correct in LayerNorm, where the input is the data; it's not correct in GroupNorm and InstanceNorm, which reuse layernorm.
Added backward tests for weights, bias and input for these norms.
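As an illustration of what a backward test for the input checks, here is a NumPy-only sketch that compares an analytic layernorm input gradient against central finite differences. It is not the test code added in this change; the function names, shapes, and tolerance are assumptions, and the affine weight and bias are left out for brevity.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    """Normalize over the last axis (no affine weight or bias)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def layernorm_grad_input(x, g, eps=1e-5):
    """Analytic gradient of sum(layernorm(x) * g) with respect to x."""
    std = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    y = (x - x.mean(axis=-1, keepdims=True)) / std
    return (g - g.mean(axis=-1, keepdims=True) - y * (g * y).mean(axis=-1, keepdims=True)) / std

def numerical_grad(f, x, h=1e-6):
    """Central finite differences, one element at a time."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        orig = x[idx]
        x[idx] = orig + h
        hi = f(x)
        x[idx] = orig - h
        lo = f(x)
        x[idx] = orig  # restore the perturbed element
        grad[idx] = (hi - lo) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(3, 5))  # float64 inputs in [-2, 2]
g = rng.uniform(-2, 2, size=(3, 5))  # upstream gradient

analytic = layernorm_grad_input(x, g)
numeric = numerical_grad(lambda v: float((layernorm(v) * g).sum()), x)
max_err = float(np.abs(analytic - numeric).max())
print(max_err)
```

If the input were detached inside the normalization, part of this input gradient would be dropped, which is the kind of error a comparison like this would surface.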
* bigger atol for llvm
* relax backward more
* tqdm replacement almost
* formatting
* formatting
* imports
* line len
* fix
* removed set description :(
* removed set description :(
* fix
* fix
* green check?
* rewrote as class, fixed several bugs
* types spacing
* removed imports
* fix
* iterable
* typing
* mypy disagreement
* imports
* more e2e tests vs tqdm
* removed seed setting
* robustness against time.sleep() flakiness
* flaky fix
* automatic bar closing when count==total
* cleanup
* clang error with tqdm
* tqdm back
* use os lib, print to stderr (fixes the clang bug, where the bar was leaking into the generated C program)
* back to shutil
* unit_scale + unit_scale test
* custom unit to tests
* pretty
* clean
* removed flaky test
* less test iters
* empty line
* remove disable
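The behaviours named in the tqdm replacement entries above (printing to stderr, closing automatically when count==total, unit_scale) can be sketched roughly as follows. This is a hypothetical minimal bar, not the class added in this change; `SimpleBar`, `si`, and all parameters are illustrative names, and it assumes total > 0.

```python
import shutil
import sys

def si(x):
    """Format a number with an SI prefix, e.g. 1500 -> '1.50k'."""
    for prefix in ["", "k", "M", "G"]:
        if abs(x) < 1000:
            return f"{x:.2f}{prefix}"
        x /= 1000
    return f"{x:.2f}T"

class SimpleBar:
    """Hypothetical minimal progress bar; assumes total > 0."""
    def __init__(self, total, desc="", unit="it", unit_scale=False, out=None):
        self.total, self.desc, self.unit, self.unit_scale = total, desc, unit, unit_scale
        # Write to stderr by default so the bar never mixes into stdout output.
        self.out = sys.stderr if out is None else out
        self.count, self.closed = 0, False

    def update(self, n=1):
        if self.closed:
            return
        self.count += n
        self._render()
        if self.count >= self.total:
            self.close()  # close automatically once the total is reached

    def _render(self):
        width = max(shutil.get_terminal_size().columns - 40, 10)
        filled = int(width * min(self.count / self.total, 1.0))
        fmt = si if self.unit_scale else str
        self.out.write(f"\r{self.desc}|{'#' * filled}{' ' * (width - filled)}| "
                       f"{fmt(self.count)}/{fmt(self.total)}{self.unit}")
        self.out.flush()

    def close(self):
        if not self.closed:
            self.closed = True
            self.out.write("\n")
            self.out.flush()

if __name__ == "__main__":
    bar = SimpleBar(total=5, desc="demo ")
    for _ in range(5):
        bar.update()
```

Passing an explicit `out` stream keeps the sketch easy to test without capturing stderr.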
* start work
* more tests passing
* more tests passing
* more
* 34 failures
* expect the failures
* remove broken rule
* render is fine in just the test
* simplify and put in test
* amd support kernels with dispatch_ptr
* fixes
* line savings
* one line
* try
* Revert "try"
This reverts commit 5f340dfdd4.
* not used will be back when hsa is gone
* gone will be back
* add this as well