* rebase from #10468
* fixup metadata 2
* that too
* comments for metadata
* remove_gbarrier is not needed anymore
* skip that
* break metadata more
* delete more metadata fixups
* err, fix kernelize diamond
* unskip metadata
* new_map
* roots
* replace metadata of roots
* check empty
* replace globals is better
* Add mmapeak implementation for 7900 XTX
* Change indentation
* Use a template instead of multiple assembly files
* Fix output formatting
* Reduce register file bank conflicts
* More accurate measurement for quick instructions
* Add support for gfx1201
* RDNA4 wmma requires fewer VGPRs
* RDNA4 does not have s_cmpk instructions
* Add v_wmma_i32_16x16x32_iu4 for gfx1201
* Add sparse wmma instructions
* split to tinybox red MLPerf Benchmark
---------
Co-authored-by: Panagiotis Kourouklidis <panagiotis.kourouklidis@gmail.com>
* Add mmapeak implementation for 7900 XTX
* Change indentation
* Use a template instead of multiple assembly files
* Fix output formatting
* Reduce register file bank conflicts
* More accurate measurement for quick instructions
* Add support for gfx1201
* RDNA4 wmma requires fewer VGPRs
* RDNA4 does not have s_cmpk instructions
* Add v_wmma_i32_16x16x32_iu4 for gfx1201
* Add sparse wmma instructions
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* add mselect op
* more work
* that shouldn't be contiguous
* remove junk
* it segfaults...
* more correct
* test fail
* inserting a contiguous fixes it
* fix children in mselect
* complain
* error
* push RESHAPE through MSELECT
* no copy arg, use mselect
hlb cifar is fast so added it, can add bert too if you think it's ok
6 real gpus to test multigraph and transfers + accuracy validation
should probably be added to tinystats too, i don't know how though
Co-authored-by: chenyu <chenyu@fastmail.com>
* insert contiguous into graph
* exclude contiguous from kernels
* and copy
* not needed on copy
* gbarrier
* gbarrier closer
* gb
* gb
* fix double realize logic bug
* remove gbarrier
* del that
* uop tags
* tag
* fix setitem, flaky
* no ctx there
* flip rewrite
* revert order until metadata is fixed