* first commit
* state back to orig
* mamba comparisions
* rm file
* rename file
* use Tensor.einsum and mke default model 370M
* Cleaned code and made a comparision test
* Simplyfy pull request. Only has 1 mamba implementation now.
* Update prompt
* rm whitespaces
* last space
* remove Einops dependency
* rm unused code
* add tests
* rm print statement
* rm imports
* skip CLANG
* Update skipIf description
* skip model test in CI and add CLANG fix
* rm Device import
* don't be stupid
* Fix conv assign
When the prompt is too short, the logic for conv_state assign messes up. This can be fixed when padding the tokenized array to min length of 4. I padded using the empty string token, but idk if proper practice is to use the PAD token
* fix p1
* temp
* fix jit import
---------
Co-authored-by: schlimeszn <schlimeszn@gmail.com>
Co-authored-by: reddyn <nikidsniper@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* fp16 resnet
* cast running mean and var back to default float
* extra cast
* check symbolic no overflow
* add linearizer failure
* loss scaler after grad contig
* oops
* i think this works
* don't loss scale fp32
* remove overflow test case
* remove symbolic bounds check
* loss scaler should be float
* temporarily disable padto cuz bug
shruggie
* make running stats in batchnorm float32?
* calculate lars stuff in fp32?
* oops
* remove most changes
* move loss scaler out of optimizer
* no more FP16 var
* oops
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* wmma: refactor to remove wmma_func and create TC funcs as needed
* test_linearizer: disable bf16 CUDA during emulation testing
* cstyle: clean up creation of CUDA vec dtypes
* extra/gemm: add option to accumulate to bfloat16
* cleanups
* benchmark: add CUDA bfloat16 matmul
* more cleanups
* fuzz_linearizer: reduce debug verbosity and make easier for CI usage
* rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset)
* skip simple ASTs (easier to use with LOGOPS output)
* don't fuzz a previously seen AST
* add options to allow non-zero --expected-failures
* clean up naming and use set
* Track pointer provenance in load/store through ALU
Previously load/store could be incorrectly rendered into
ld.global/st.global when the input was an ALU op that performed an
address computation with DEFINE_LOCAL on one of the arguments.
* Simplify the load provenance workaround
The issue is that we can render the same code twice, and on the second
run the opstream is already modified so that vin[0] isn't a DEFINE_*,
which overwrites initially correct .shared wth .global.
* Add a couple tests for basic local use
* Skip local tests on LLVM since it doesn't implement DEFINE_LOCAL