* HWCopyQueue in KFD
* hw compute queue
* test
* move test
* more tests
* fix wait
* fix multimap
* mes crash
* tests pass but slow
* stuff is working
* one more test
* uops const fold rules to prevent tautological compare warnings
`bool < false` is false, `true < bool` is false, `a == a` is true, `a != a` is false
* not true for nan
* and nan does not work with llvm
* full truth table test
* revert a==a
* comments and indents
* Embedding is in one kernel
* embedding is one kernel
* rm extra line
* newline
* bert test counts state vars?
* add a test?
* move items around
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
* don't call contiguous for unpadded const into multi tensor
fixed multi const folding for sharded const.
still wip, need to be careful that this does not break multi device cache somewhere
* ehh need a memory test for that
* simple sharded memory test
* fix _to_const_val and const folding around it
is_unrealized_contiguous_const is too strict and almost never hit if const is expanded.
suffice to check if there's no pad
* that test is folded
* test_const_folding
* always use f32 for source of randn
fixed bfloat16 randn to not have inf.
don't really care about float64. threefry is float32 based too
* HSA is broken
* test case Tensor.randn should be finite
there's a hack to fix float16, need a generic solution that works with bf16 and threefry
* skip not supported
* bfloat16 local is wrong
* skip RHIP
* Shape changing bitcast
* only support it on disk
* basic test
* more tests
* RuntimeError instead of assert
* create unique temp files
* move tests that use disk to test_disk_tensor
* linter
* remove assert on error messages
* that's RuntimeError now
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* first commit
* state back to orig
* mamba comparisions
* rm file
* rename file
* use Tensor.einsum and mke default model 370M
* Cleaned code and made a comparision test
* Simplyfy pull request. Only has 1 mamba implementation now.
* Update prompt
* rm whitespaces
* last space
* remove Einops dependency
* rm unused code
* add tests
* rm print statement
* rm imports
* skip CLANG
* Update skipIf description
* skip model test in CI and add CLANG fix
* rm Device import
* don't be stupid
* Fix conv assign
When the prompt is too short, the logic for conv_state assign messes up. This can be fixed when padding the tokenized array to min length of 4. I padded using the empty string token, but idk if proper practice is to use the PAD token
* fix p1
* temp
* fix jit import
---------
Co-authored-by: schlimeszn <schlimeszn@gmail.com>
Co-authored-by: reddyn <nikidsniper@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* wmma: refactor to remove wmma_func and create TC funcs as needed
* test_linearizer: disable bf16 CUDA during emulation testing
* cstyle: clean up creation of CUDA vec dtypes
* extra/gemm: add option to accumulate to bfloat16
* cleanups
* benchmark: add CUDA bfloat16 matmul
* more cleanups
* fuzz_linearizer: reduce debug verbosity and make easier for CI usage
* rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset)
* skip simple ASTs (easier to use with LOGOPS output)
* don't fuzz a previously seen AST
* add options to allow non-zero --expected-failures
* clean up naming and use set