* refactor to render_block
* move rendering the reduce into its own function
* add todo and cleanups [run_process_replay]
* inplace update of idxs [run_process_replay]
* support symbolic reshape with non-contiguous
prerequisite for symbolic arange (make symbolic ones that can be folded).
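A minimal sketch of the kind of case this enables, assuming tinygrad's public Tensor/Variable API; this is illustrative, not code from this PR:

```python
from tinygrad import Tensor, Variable

i = Variable("i", 1, 10).bind(3)      # symbolic dimension bound to 3
t = Tensor.rand(4, 3).permute(1, 0)   # permuted view -> non-contiguous, shape (3, 4)
out = t.reshape(i, 4)                 # reshape with a symbolic dim on the non-contiguous view
print(out.shape)                      # symbolic shape carrying the Variable i
```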
* test cases
* typo
* shorter
* atomic load/store test
* tests for nested & unrolled
* check barriers
* linters
* cleaning up diff
* fix assert in _temp_create_multireduce_ast changes
* cleaning up the check for redundant barriers
* minor cleanups for the assert
* always seed randn, helps with debuggability
---------
Co-authored-by: qazal <qazal.software@gmail.com>
* update acc key
* refactor return type
* remove return type
* run all reduces
* set acc key [run_process_replay]
* local_idxs are copied in render_reduceop [run_process_replay]
* add input to unit tests [run_process_replay]
* add setup [run_process_replay]
* run tests [run_process_replay]
* add cuda and amd [run_process_replay]
* run everything but BEAM=2 [run_process_replay]
* skip export_model [run_process_replay]
* fix amd CI
* add concurrency back
* testing dataloader
* matching dataloader implementation for unet3d
* remove comments
* clean up dataloader
* add cookie and cleanup
* use shm_path when creating SharedMemory
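A hedged sketch of the named-SharedMemory pattern this refers to, using only the Python stdlib; the segment name, path handling, and size are assumptions for illustration, not the dataloader's actual values:

```python
import os
from multiprocessing import shared_memory

shm_name = "batch_0"                               # hypothetical segment name
shm_path = f"/dev/shm/{shm_name}"                  # on Linux, named segments appear here
if os.path.exists(shm_path): os.unlink(shm_path)   # drop any stale segment first
shm = shared_memory.SharedMemory(name=shm_name, create=True, size=64 * 1024 * 1024)
try:
  shm.buf[:4] = b"data"                            # producer/consumer share this buffer by name
finally:
  shm.close()
  shm.unlink()                                     # remove the segment when done
```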
* add support for testing resnet and unet3d dataloaders
* update dataset test to return preprocessed data directory in prep for dataloader testing
* pass preprocessed dataset directory properly
* update loader function for dataloader
* add shuffling on indices
* update shm name
* more cleanup for unet3d dataloader
* remove changes to tests
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* padto test
* expanded multireduce padto tests
* cuda doesn't run on CI
* moving padto_where_multireduce test to SUM so that we can check the reduce axis
* cleaning up tests some more
* add wanna_outputs
* refactor test_padto_sum_multireduce
* fix max and refactor where
* fix axis
---------
Co-authored-by: qazal <qazal.software@gmail.com>