* HWCopyQueue in KFD
* hw compute queue
* test
* move test
* more tests
* fix wait
* fix multimap
* mes crash
* tests pass but slow
* stuff is working
* one more test
* fuzz_linearizer: reduce debug verbosity and make easier for CI usage
* rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset)
* skip simple ASTs (easier to use with LOGOPS output)
* don't fuzz a previously seen AST
* add options to allow non-zero --expected-failures
* clean up naming and use set
* debug: add optional detailed BEAM_LOG logging
show uop count, compile and run times for each candidate in search
also add --timing to verify_kernel.py to make it easier to explore
hand-crafted applied opts
* fix linter
* search: add BEAM_VERIFY option to validate search results
refactor fuzz_linearizer comparison to allow it to be used in for
BEAM_VERIFY in device.py
* search: fix to verify the beam_search result and not the fastest
* search: fix typing and clean up
* device: remove imports from test and add LOGKERN options
LOGKERN output can be used with test/external/verify_kernel.py
to validate correctness
* fix example in verify_kernel.py
* cleanup fixes
* fix to use f-strings
the goal is that big enough beam should be faster than hcopt/tc
also this failed on tc opt
NUM=2 FILTER_REDUCE=1 TEST_N=20 BEAM=4 DEBUG=2 python test/external/speed_beam_v_hcopt.py
* add FUZZ_NTH to fuzz_linearizer
also update tests in test_linearizer_failures to not just run on METAL
* update failures for HIP/HSA
* test_failure_21 LLVM PADTO
* working PolynomialDecayWithWarmup + tests.......
add lars_util.py, oops
* keep lars_util.py as intact as possible, simplify our interface
* whitespace
* clean up
* clean up
* asserts
* test polylr for full resnet training run
* add comment
* rename
* fix do_optim
* don't cast lr
* info
* calculate from train_files
* skip it
included non-reduce kernel and kernel with variables. green msg when everything passed
it's possible that creating rawbufs failed due to memory error, included that in failure cases
* lars optimizer + tests
* fix skip list!
* use id to compare in skip list
* go back to using set
* Tensor(bool) * Tensor(bool) is and
* don't lint external/mlperf_resnet
* whitespace
* add external_test_optim to opencl tests
* give mlperf task a name
* mlperf under onnx
* remove track_gnorm
* contiguous instead of realize
* assert momentum and weight decay positive
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* this mem fault still happening
* smaller
* that print doesn't work
* overflows test
* hip doesn't uses_ptr_arithmetic
* only with locals
* test overflow new name
* it's not ptr arith
* simpler
* simple repro
* old compiler
* simpler
* put that back
* test/external/fuzz_linearizer: add a FUZZ_MAX_SIZE option
this allows us to limit the size of the kernel and reduce running
times by avoiding ones that take a long time
* fix spacing and re-order to put parameters together