make it read nicer, clean up some movement methods, and simplify some math.
790m, 1.4b, and 2.8b models do not really run.
sampling is not implemented.
jit is incorrect.
some dead code / wrong code paths and stuff copied from torch.
* always use f32 for source of randn
fixed bfloat16 randn to not have inf.
don't really care about float64. threefry is float32 based too
* HSA is broken
* test case Tensor.randn should be finite
there's a hack to fix float16, need a generic solution that works with bf16 and threefry
* skip not supported
* bfloat16 local is wrong
* skip RHIP
* Shape changing bitcast
* only support it on disk
* basic test
* more tests
* RuntimeError instead of assert
* create unique temp files
* move tests that use disk to test_disk_tensor
* linter
* remove assert on error messages
* that's RuntimeError now
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* first commit
* state back to orig
* mamba comparisons
* rm file
* rename file
* use Tensor.einsum and make default model 370M
* Cleaned code and made a comparison test
* Simplify pull request. Only has 1 mamba implementation now.
* Update prompt
* rm whitespaces
* last space
* remove Einops dependency
* rm unused code
* add tests
* rm print statement
* rm imports
* skip CLANG
* Update skipIf description
* skip model test in CI and add CLANG fix
* rm Device import
* don't be stupid
* Fix conv assign
When the prompt is too short, the conv_state assign logic breaks. This can be fixed by padding the tokenized array to a minimum length of 4. I padded using the empty string token, but idk if the proper practice is to use the PAD token instead
* fix p1
* temp
* fix jit import
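
The padding fix from "Fix conv assign" above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `pad_tokens`, the `pad_id` argument, and the choice to pad on the left are all assumptions, since the note only says the tokenized array is padded to a minimum length of 4.

```python
from typing import List

MIN_PROMPT_LEN = 4  # minimum length mentioned in the commit note


def pad_tokens(tokens: List[int], pad_id: int, min_len: int = MIN_PROMPT_LEN) -> List[int]:
    """Return a copy of `tokens` with at least `min_len` entries.

    Assumption: padding goes on the left so the real prompt stays at the
    end of the sequence. `pad_id` is whatever id the tokenizer assigns to
    the filler token (the empty-string token in the note above).
    """
    missing = max(0, min_len - len(tokens))
    return [pad_id] * missing + list(tokens)


# Hypothetical token ids, for illustration only.
short_prompt = [11, 22]
padded = pad_tokens(short_prompt, pad_id=0)  # -> [0, 0, 11, 22]
```

Prompts that already have 4 or more tokens pass through unchanged, so the helper can be applied unconditionally before the first forward pass.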
---------
Co-authored-by: schlimeszn <schlimeszn@gmail.com>
Co-authored-by: reddyn <nikidsniper@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* fp16 resnet
* cast running mean and var back to default float
* extra cast
* check symbolic no overflow
* add linearizer failure
* loss scaler after grad contig
* oops
* i think this works
* don't loss scale fp32
* remove overflow test case
* remove symbolic bounds check
* loss scaler should be float
* temporarily disable padto cuz bug
shruggie
* make running stats in batchnorm float32?
* calculate lars stuff in fp32?
* oops
* remove most changes
* move loss scaler out of optimizer
* no more FP16 var
* oops
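
For context on the loss scaler bullets above, here is a generic sketch of why fp16 training scales the loss and why the scaler itself is kept as a float. It is not the resnet training code: `to_fp16`, `LOSS_SCALER`, and the gradient value are made up for illustration, and fp16 rounding is emulated with the standard library `struct` module.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


LOSS_SCALER = 1024.0  # hypothetical scale factor, kept as a plain float

# A gradient smaller than the smallest fp16 subnormal (about 6e-8).
tiny_grad = 1e-8

# Without scaling, the gradient underflows to zero when stored as fp16.
unscaled = to_fp16(tiny_grad)  # -> 0.0

# Scaling the loss scales every gradient by the same factor (chain rule),
# which lifts the value into the representable fp16 range.
scaled = to_fp16(tiny_grad * LOSS_SCALER)

# Dividing by the scaler in higher precision recovers the gradient.
recovered = scaled / LOSS_SCALER
```

The unscale step happens outside fp16, which matches the idea of keeping the scaler a float and skipping scaling entirely when training in fp32.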
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* wmma: refactor to remove wmma_func and create TC funcs as needed
* test_linearizer: disable bf16 CUDA during emulation testing
* cstyle: clean up creation of CUDA vec dtypes
* extra/gemm: add option to accumulate to bfloat16
* cleanups
* benchmark: add CUDA bfloat16 matmul
* more cleanups