fix broken PTX tests in test_linearizer and test_uops. there are tests that were skipped and broken because it runs only with CUDA=1 and we run PTX with NV=1 now
* Add UOps.VECTORIZE to core
* Update vectorized cast tests
* Addresses code review comments
- Removes VECTORIZE from LLVMRenderer
- Add line breaks to unduly long lines
- Add noop CAST rule back
- Update asserts and add render_vectorize in
CSytleLanguage renderer
* Add missing const folding rule for VECTORIZE
Also adds corresponding test
* Fixes test_const_vectorize_fold and add assert
- Use sane types with VECTORIZE in test_const_vectorize_fold
- Add assert that sanity checks the types for VECTORIZE
* Rename test_cast_vectorized_fold
Renames test_cast_vectorized_fold to test_noop_vectorize_fold
because the test targets a very specific rule and there are
other tests for VECTORIZE.
* Revert unrelated changes
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
* [Patch] added an option not to ignore view replacing when doing bitcast
* added the testcase
* [Add] reproduced bitcast cannot be fused into a single kernel in the unittest
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
* start writing openelm
* progress...hit bug
* repeat_interleave support
* gqa
* add rotary embedding
* spp
* i think it runs correctly
* broken
* output is good now
* cleanups
* no io_uring on android
* fix tqdm unit_scale and support hours in time
previously it only supports MM:SS.
more chars to unitscales, strip trailing "." and " " in formatting, and more tests
* simpler
* valid is always bool
* prevent NumNode to begin with
* part 2
* test: disable pattern matchers, asserts should pass
* test: store without cast
* test: if (0)
* cleanup time
* only pattern match bool literal
* better for upstream debug
* find all places
* test gates
* test
* gate based on depths
* add ctx
* that cache was so wrong
* delete useless things
* dont double write if
* self.if_cond
* move UOps.IF to gated store
* test_padto_where_multioutput
* test_padto_group
* minor cleanup
* hmm this actually works?
* need a good barrier
* merge 2
* delete ctx
* p1
* maybe p2
* p3
* minor fixup
* fixup 2
* smart thing from the Lowerer branch
* refactoring
* refactoring 2
* maybe before graph_rewrite
* slightly more acceptable Linearizer diff
* more correct
* [run_process_replay] [no_assert]
it's possible to support TinyJit inside TinyJit, but there are edge cases like two TinyJit functions shared another TinyJit function. so just give a more precise error for now
* single pass rewrite
* claude cleanups
* claude cleanups
* skip those tests
* restrict that to ints
* comment
* asserts i don't expect to fail do fail
* simplest...rewrite...ever
* simplest...rewrite...ever
* add that rule back
* tests pass?
* only collapse reduce loops
* second SHL/SHR arg must be 4 bytes
* fix verify
* no SHL/SHR in ptx
* put that back
* skip them in PTX...bad tests