* bring FUSE_AS_ONE_KERNEL back
* operands need reshape?
* fused but arange didn't fold
* something deeply wrong
* yay, fused
* derive broadcasts
* s/input/reduce_input
* _fixup_ones proved a point
* this is what it takes
* down to 3 required reshapes:
1. output_shape
2. the second reduce merge dims
3. remove dims for above reshape
* start real reshapes
* resolve shape in the edges pre lazyop
* outputs are the same shape
* rewrite1: just the reduce
* more correct
* fuse_as_one_kernel
* closer
* this passes
* don't rerun info
* don't need these
* not needed
* multireduce no-opts works
* passed test_var_multireduce
* cleanup
* double reduce
* extra check for range_group
* more checking for range_groups
* cleaning up debug prints
* cleanup diff
* linters
* revert kernel changes
* these are uops toposort
---------
Co-authored-by: timmy <timmy0x@proton.me>
* test/test_linearizer_failures: add a new beautiful_mnist one
this one is from a DEPTH=2 fuzz_linearizer search
* add GPU to test_failure_40
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* test/external/fuzz_linearizer: fix for new AST changes
also add beautiful_mnist failures
* add CLANG and LLVM to test_failure_35 failed_platforms
* fix test_linearizer_failure names
* minor cleanups
* docs and logs
* shorter
* comma
* s/print/logging.info [run_process_replay]
* use logging.warn
* process name is noise
* revert lowerer change [run_process_replay]
* render lidx starting with 0
changed from
```
int gidx0 = gid.x; /* 4096 */
int lidx4 = lid.x; /* 8 */
int gidx1 = gid.y; /* 7 */
int lidx5 = lid.y; /* 8 */
int gidx2 = gid.z; /* 7 */
int lidx6 = lid.z; /* 2 */
```
to
```
int gidx0 = gid.x; /* 4096 */
int lidx0 = lid.x; /* 8 */
int gidx1 = gid.y; /* 7 */
int lidx1 = lid.y; /* 8 */
int gidx2 = gid.z; /* 7 */
int lidx2 = lid.z; /* 2 */
```
the existing numbering started from the pre-limited global dims, which skips numbers if there are more than 3 global dims
* don't need start_dim
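a minimal sketch of the lidx renumbering above, in Python. `index_names` and its `local_start` parameter are hypothetical names for illustration, not the renderer's real API, and the four pre-limited global dims (64, 64, 7, 7) are an assumption chosen only to match the rendered sizes and the old lidx4 start

```python
# Hypothetical helper: not tinygrad's API, only an illustration of the naming.
def index_names(global_sizes, local_sizes, local_start):
    """Return (name, size) pairs for global dims, then local dims."""
    names = [(f"gidx{i}", s) for i, s in enumerate(global_sizes)]
    names += [(f"lidx{local_start + i}", s) for i, s in enumerate(local_sizes)]
    return names

# Assumed shapes: 4 global dims before limiting, collapsed to 3 (64 * 64 = 4096).
pre_limited_global = [64, 64, 7, 7]
limited_global = [4096, 7, 7]
local = [8, 8, 2]

# Old behavior: local numbering continued after the pre-limited global count.
old = index_names(limited_global, local, local_start=len(pre_limited_global))
# New behavior: local numbering always starts at 0.
new = index_names(limited_global, local, local_start=0)

old_local = [name for name, _ in old[len(limited_global):]]
new_local = [name for name, _ in new[len(limited_global):]]
print(old_local)
print(new_local)
```

with a fixed start of 0 the local names no longer depend on how many global dims existed before limiting, which is why a start offset is not needed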
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
* render lidx starting with 0
changed from
```
int gidx0 = gid.x; /* 4096 */
int lidx4 = lid.x; /* 8 */
int gidx1 = gid.y; /* 7 */
int lidx5 = lid.y; /* 8 */
int gidx2 = gid.z; /* 7 */
int lidx6 = lid.z; /* 2 */
```
to
```
int gidx0 = gid.x; /* 4096 */
int lidx0 = lid.x; /* 8 */
int gidx1 = gid.y; /* 7 */
int lidx1 = lid.y; /* 8 */
int gidx2 = gid.z; /* 7 */
int lidx2 = lid.z; /* 2 */
```
the existing numbering started from the pre-limited global dims, which skips numbers if there are more than 3 global dims
* don't need start_dim
* add changed
* env var
* more early exit
* simpler?
* Revert "Merge branch 'lidx0' into process_replay_limit"
This reverts commit cbadcfa5e9, reversing
changes made to fc9bf37ee7.
* minor cleanup
---------
Co-authored-by: chenyu <chenyu@fastmail.com>