* don't use uops.add while constructing
* rebase
* bugfixes
* have to use BFS
* prove it's late
* simpler uop symbolic test (why we did this)
* use dict, not set
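The log does not say why BFS or a dict was needed. One plausible reading is that the traversal order had to be deterministic: a Python dict preserves insertion order, while a set gives no ordering guarantee. A minimal sketch under that assumption follows; `bfs_order` and the toy `graph` are hypothetical names, not code from this change.

```python
from collections import deque

def bfs_order(root, children):
    """Return the nodes reachable from root in breadth-first order."""
    seen = {root: None}  # dict used as an insertion-ordered set (assumption)
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:
                seen[child] = None  # first visit fixes the node's position
                queue.append(child)
    return list(seen)

# Toy graph: "d" is reachable through both "b" and "c" but is visited once.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
order = bfs_order("a", graph)
```

Iterating over `seen` replays the visit order exactly, which a set would not promise.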
Changed the default [low, high] from [-1.5, 1.5] to [-2, 2] (except for tan).
Dropped several explicit atol values that were unnecessarily larger than the default 1e-6.
Tested on mac and tinybox red / green.
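The tan exception is presumably because tan has a pole at pi/2 (about 1.5708), which lies inside [-2, 2] but outside [-1.5, 1.5]. A small sketch of why a 1e-6 absolute tolerance cannot hold near that pole; `sample_inputs` and `all_close` are hypothetical helpers, not the project's test utilities.

```python
import math
import random

DEFAULT_ATOL = 1e-6  # default tolerance named in the log

def sample_inputs(n, low=-2.0, high=2.0, seed=0):
    """Draw n test inputs uniformly from [low, high]."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n)]

def all_close(got, want, atol=DEFAULT_ATOL):
    """True if every pair differs by at most atol."""
    return all(abs(g - w) <= atol for g, w in zip(got, want))

xs = sample_inputs(100)
eps = 1e-7  # stand-in for a small float rounding difference in the input

# sin is well conditioned on the whole range: a tiny input error stays tiny.
sin_ok = all_close([math.sin(x + eps) for x in xs], [math.sin(x) for x in xs])

# Just below the pole of tan, the same input error is amplified far past atol.
near_pole = math.pi / 2 - 1e-4
tan_ok = all_close([math.tan(near_pole + eps)], [math.tan(near_pole)])
```

Here `sin_ok` is true and `tan_ok` is false, which is consistent with keeping a narrower range for tan.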
* revert the .detach() in layernorm
It is only correct in LayerNorm, where the input is the data; it is not correct in GroupNorm and InstanceNorm, which reuse layernorm.
Added backward tests for weights, bias and input for these norms.
* bigger atol for llvm
* relax backward more
* tqdm replacement almost
* formatting
* formatting
* imports
* line len
* fix
* removed set description :(
* removed set description :(
* fix
* fix
* green check?
* rewrote as class, fixed several bugs
* types spacing
* removed imports
* fix
* iterable
* typing
* mypy disagreement
* imports
* more e2e tests vs tqdm
* removed seed setting
* robustness against time.sleep() flakiness
* flaky fix
* automatic bar closing when count==total
* cleanup
* clang error with tqdm
* tqdm back
* use os lib, print to stderr (fixes the clang bug, where the bar was leaking into the generated C program)
* back to shutil
* unit_scale + unit_scale test
* custom unit to tests
* pretty
* clean
* removed flaky test
* less test iters
* empty line
* remove disable
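The tqdm-replacement commits above mention a class-based bar that accepts an iterable, writes to stderr, sizes itself with shutil, supports unit_scale and a custom unit, and closes automatically once count reaches total. A minimal sketch of those ideas, assuming a hypothetical `ProgressBar` class; the names, parameters and layout are illustrative and not the project's actual implementation.

```python
import io
import shutil
import sys
import time

class ProgressBar:
    """Hypothetical tqdm-like bar; illustrative only."""

    def __init__(self, iterable=None, total=None, desc="", unit="it",
                 unit_scale=False, file=None):
        self.iterable = iterable
        if total is None and hasattr(iterable, "__len__"):
            total = len(iterable)
        self.total = total
        self.desc, self.unit, self.unit_scale = desc, unit, unit_scale
        # stderr by default, so the bar never mixes with output on stdout
        self.file = file if file is not None else sys.stderr
        self.count, self.closed = 0, False
        self.start = time.perf_counter()

    def __iter__(self):
        for item in self.iterable:
            yield item
            self.update(1)
        self.close()  # no-op if update() already closed the bar

    @staticmethod
    def scale(x):
        """Format x with an SI prefix, e.g. 1500 -> '1.50k'."""
        for prefix in ("", "k", "M", "G", "T"):
            if abs(x) < 1000:
                return f"{x:.2f}{prefix}"
            x /= 1000
        return f"{x:.2f}P"

    def update(self, n=1):
        if self.closed:
            return
        self.count += n
        self.render()
        # close automatically once count reaches total
        if self.total is not None and self.count >= self.total:
            self.close()

    def render(self):
        cols = shutil.get_terminal_size().columns  # falls back to 80 columns
        fmt = self.scale if self.unit_scale else str
        total = fmt(self.total) if self.total is not None else "?"
        frac = min(self.count / self.total, 1.0) if self.total else 0.0
        text = f"{self.desc}{fmt(self.count)}/{total}{self.unit} "
        width = max(cols - len(text) - 2, 1)
        filled = int(frac * width)
        bar = "#" * filled + " " * (width - filled)
        print(f"\r{text}|{bar}|", end="", file=self.file, flush=True)

    def close(self):
        if not self.closed:
            self.closed = True
            print(file=self.file, flush=True)  # finish the line

# Demo writes to an in-memory stream instead of the real stderr.
buf = io.StringIO()
bar = ProgressBar(range(3), desc="demo: ", file=buf)
items = list(bar)
```

Writing to stderr keeps the bar out of anything captured from stdout, which matches the fix described for the bar leaking into the generated C program.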
* start work
* more tests passing
* more tests passing
* more
* 34 failures
* expect the failures
* remove broken rule
* render is fine in just the test
* simplify and put in test
* amd support kernels with dispatch_ptr
* fixes
* line savings
* one line
* try
* Revert "try"
This reverts commit 5f340dfdd4.
* not used, will be back when hsa is gone
* gone, will be back
* add this as well
* support symbolic reshape with non-contiguous
Prerequisite for symbolic arange (make symbolic ones that can be folded).
* test cases
* typo
* shorter
* atomic load/store test
* tests for nested & unrolled
* check barriers
* linters
* cleaning up diff
* fix assert in _temp_create_multireduce_ast changes
* cleaning up the check for redundant barriers
* minor cleanups for the assert
* always seed randn, helps with debuggability
---------
Co-authored-by: qazal <qazal.software@gmail.com>
* add input to unit tests [run_process_replay]
* add setup [run_process_replay]
* run tests [run_process_replay]
* add cuda and amd [run_process_replay]
* run everything but BEAM=2 [run_process_replay]
* skip export_model [run_process_replay]
* fix amd CI
* add concurrency back
* padto test
* expanded multireduce padto tests
* cuda doesn't run on CI
* moving padto_where_multireduce test to SUM so that we can check the reduce axis
* cleaning up tests some more
* add wanna_outputs
* refactor test_padto_sum_multireduce
* fix max and refactor where
* fix axis
---------
Co-authored-by: qazal <qazal.software@gmail.com>
Const folding a symbolic shape is currently not supported. I think it's possible with a refactor to Tensor.from_node.
Also added some currently failing tests that symbolic arange requires.
* add pattern matcher regression tests
* Remove test for dtype str after rebasing
* Make test uops match type spec
* leave const const, add const alu vin test
* correct uops
* actually correct uops