* fix mean underflow for half tensor
divide only by the reduce factor. added a unit test and a non-NaN assertion in resnet training. also added a failing test case for symbolic shape vars
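A minimal numpy stand-in for the failure mode (illustrative only, not the actual tinygrad diff): multiplying the sum by a reciprocal stored in half flushes to zero once 1/n falls below the smallest fp16 subnormal, while accumulating the reduce in a wider dtype and dividing by the reduce factor keeps the true mean.

```python
import numpy as np

n = 1 << 25                                  # 1/n == 2**-25 is below the smallest fp16 subnormal
x = np.full(n, 2**-10, dtype=np.float16)     # true mean is 2**-10 ~= 0.00098

s = x.astype(np.float32).sum()               # accumulate the reduce in a wider dtype (~32768.0)

bad  = np.float16(s) * np.float16(1.0 / n)   # the half-precision reciprocal rounds to 0 -> mean is 0.0
good = np.float16(s / np.float32(n))         # divide by the reduce factor instead, then cast back

print(bad, good)                             # prints 0.0 and ~0.00098
```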
* skip for python backend
* kernel: change PADTO check to allow up to 4x padding
also optionally remove PADTO from the search action space with BEAM_PADTO=0.
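A hedged sketch of the two behaviors named above; `padto_ok` and its arguments are placeholders rather than the actual kernel code, and only the BEAM_PADTO variable comes from the commit itself.

```python
import os

def padto_ok(axis_size: int, pad_to: int) -> bool:
  padded = ((axis_size + pad_to - 1) // pad_to) * pad_to  # round the axis up to a multiple of pad_to
  return padded <= 4 * axis_size                          # accept at most 4x padding

SEARCH_PADTO = int(os.getenv("BEAM_PADTO", "1")) != 0     # BEAM_PADTO=0 drops PADTO from the action space

assert padto_ok(9, 32) and not padto_ok(7, 32)            # 9->32 (~3.6x) is allowed, 7->32 (~4.6x) is not
```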
* fix test_linearizer test_tensor_cores_padded tests
* update resnet runs to use SPLIT_REDUCEOP=1
* fix up search TC axis and amt checking
* fix up the dimensions of the TC tests
* add support for train/val datasets for kits19
* split dataset into train and val sets
* add tests for kits19 dataloader
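A hedged sketch of a deterministic case-level train/val split for kits19; the directory layout and the 0.2 ratio are placeholders, not necessarily what the PR uses.

```python
from pathlib import Path

def get_train_val_cases(base_dir="extra/datasets/kits19/data", val_split=0.2):
  # split the sorted case directories into disjoint train and val sets
  cases = sorted(p.name for p in Path(base_dir).iterdir() if p.name.startswith("case_"))
  n_val = int(len(cases) * val_split)
  return cases[n_val:], cases[:n_val]  # (train_cases, val_cases)
```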
* add MLPerf dataset tests to CI
* update unet3d model_eval script
* fix linting
* add nibabel
* fix how mock dataset gets created
* update ref implementation with permalink and no edits
* clean up test and update rand_flip implementation
* cleanups
we want different BEAM values for resnet train and eval, and the global JITBEAM cannot do this. added a flag to change beam behavior at cnt=0 (so it behaves the same by default with or without TinyJit); for cnt=1 it uses the existing BEAM.value.
Also moved the BEAM context var in resnet outside of TinyJit, which saves about 3 minutes of compile time.
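Roughly what this enables in the resnet scripts, sketched with placeholder beam widths and step functions; only the "BEAM context outside TinyJit" part is shown, not the new flag itself.

```python
from tinygrad.helpers import Context

TRAIN_BEAM, EVAL_BEAM = 4, 2  # illustrative values, not the ones from the run scripts

def run(train_step, eval_step, batches):
  for x, y in batches:
    with Context(BEAM=TRAIN_BEAM):  # entered outside the TinyJit-wrapped step, so capture (cnt=0) sees it
      train_step(x, y)
  with Context(BEAM=EVAL_BEAM):     # eval can search with a different beam width
    eval_step()
```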
* add DICE loss and metrics
* update dice to include reference implementation's link
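A minimal soft Dice loss sketch on tinygrad Tensors; the smoothing term, axis layout, and reduction are common defaults, not necessarily the exact signature added for unet3d.

```python
from tinygrad import Tensor

def dice_loss(pred: Tensor, label: Tensor, smooth: float = 1e-6) -> Tensor:
  # pred/label: (batch, classes, ...spatial); pred already softmaxed, label one-hot
  spatial = tuple(range(2, len(pred.shape)))
  intersection = (pred * label).sum(axis=spatial)
  union = pred.sum(axis=spatial) + label.sum(axis=spatial)
  dice = (2 * intersection + smooth) / (union + smooth)
  return 1 - dice.mean()  # average over batch and classes
```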
* remove unused imports
* remove unnecessary test file and update pred + label for metrics and losses test
* add tests to CI + add exclusion of mlperf_unet3d
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* resnet individual layer benchmarks!
* small
* 1 and 2
* mem_used
* no ci
* better conv print
* defaults
* prints
* adjust
* adjust
* adjust
* benchmark only one layer example
* tensor.training, zero_grad, sum instead of mean, last mem, last kernel count
* default jitcnt=1
* scale flops/kernels with jitcnt
* add note about jitcnt memory
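A simplified stand-in for how the per-layer numbers can be normalized when the step is replayed JITCNT times inside one timed region; GlobalCounters is real tinygrad, the bench wrapper itself is illustrative.

```python
import time
from tinygrad.helpers import GlobalCounters

def bench(step, jitcnt=1):
  GlobalCounters.reset()
  st = time.perf_counter()
  for _ in range(jitcnt): step()
  et = time.perf_counter() - st
  return {"ms_per_step": et * 1e3 / jitcnt,
          "gflops": GlobalCounters.global_ops / et / 1e9,           # throughput, independent of jitcnt
          "kernels_per_step": GlobalCounters.kernel_count / jitcnt,  # scale kernel count by jitcnt
          "mem_used_GB": GlobalCounters.mem_used / 1e9}              # memory is not divided by jitcnt
```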
* touchup
* write llm.c and add a few new methods to tensor
* training works
* add jit
* tests for new functions
* test tolist
* simple fix for onnx test failures (#4186)
* write llm.c and add a few new methods to tensor
* training works
* add jit
* tests for new functions
* bump line count to 7500
* simplest fix
* safenumpy tolist for now
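A tiny check of the behavior the tolist tests cover; whether it goes through numpy (per the "safenumpy tolist for now" bullet) is an implementation detail this does not show.

```python
from tinygrad import Tensor

# nested Python lists come back out, matching the original data
assert Tensor([[1, 2], [3, 4]]).tolist() == [[1, 2], [3, 4]]
assert Tensor([1.5, 2.5]).tolist() == [1.5, 2.5]
```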
---------
Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
---------
Co-authored-by: geohotstan <135171913+geohotstan@users.noreply.github.com>
* rewrite the jit in the context of new schedule
* mypy better
* fix placeholder
* tests
* all functionality should work
* fix tests
* no CacheCollector
make it read nicer, clean up some movement methods, and simplify some math.
the 790M, 1.4B, and 2.8B models do not really run.
sampling is not implemented.
jit is incorrect.
some dead code / wrong code paths and stuff copied from torch remain.
* first commit
* state back to orig
* mamba comparisons
* rm file
* rename file
* use Tensor.einsum and make default model 370M
* Cleaned code and made a comparison test
* Simplify pull request. Only has 1 mamba implementation now.
* Update prompt
* rm whitespaces
* last space
* remove Einops dependency
* rm unused code
* add tests
* rm print statement
* rm imports
* skip CLANG
* Update skipIf description
* skip model test in CI and add CLANG fix
* rm Device import
* don't be stupid
* Fix conv assign
When the prompt is too short, the conv_state assign logic breaks. This is fixed by padding the tokenized array to a minimum length of 4. I padded with the empty string token, but I'm not sure whether proper practice is to use the PAD token.
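A hedged sketch of the workaround described above; the pad direction and the token id standing in for the "empty string" token (0 here) are assumptions, only the minimum length of 4 comes from the commit.

```python
MIN_PROMPT_LEN, EMPTY_TOKEN = 4, 0  # EMPTY_TOKEN is a stand-in id, not necessarily the tokenizer's

def pad_prompt(tokens: list[int]) -> list[int]:
  # left-pad short prompts so the conv_state assign always sees a full window
  if len(tokens) < MIN_PROMPT_LEN:
    tokens = [EMPTY_TOKEN] * (MIN_PROMPT_LEN - len(tokens)) + tokens
  return tokens

assert pad_prompt([42]) == [0, 0, 0, 42]
```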
* fix p1
* temp
* fix jit import
---------
Co-authored-by: schlimeszn <schlimeszn@gmail.com>
Co-authored-by: reddyn <nikidsniper@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>