* metal indirect command buffers
* sub-1ms gpt
* metal batch exec is good
* remove whitespace
* input_replace
* fix ci
* useResources
* very simple cacheallocator
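Roughly the idea, sketched with hypothetical alloc_fn/free_fn hooks standing in for the real device allocator:
```python
from collections import defaultdict

class CacheAllocator:
  def __init__(self, alloc_fn, free_fn):
    self.alloc_fn, self.free_fn = alloc_fn, free_fn
    self.cache = defaultdict(list)  # size -> free buffers of that size
  def alloc(self, size:int):
    # reuse a previously freed buffer of the same size when possible
    return self.cache[size].pop() if self.cache[size] else self.alloc_fn(size)
  def free(self, buf, size:int):
    # keep the buffer around instead of returning it to the device
    self.cache[size].append(buf)
```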
* update_stats
* fix CI
* minor
* remove that from jit
* var_vals are global
* working with global-ish
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
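The numpy analogue of the trick (names illustrative): select through the mask instead of adding a -inf tensor:
```python
import numpy as np

def masked_scores(scores: np.ndarray, kvmask: np.ndarray) -> np.ndarray:
  # keep scores at valid kv cache slots, -inf elsewhere so softmax zeroes them
  return np.where(kvmask, scores, -np.inf)
```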
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kinda works
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* Change linearizer to parse CAST
* Oneliner renders for cstyle and triton
* LLVM cast and ALU implementation
* pylint fixes
* cast in gep
* remove printbufs
* use cast for post-load ops
* get rid of parse_cast
* partially supported vectorized dtypes for initial dev
* render phi as the dtype
* Revert "partially supported vectorized dtypes for initial dev"
This reverts commit 1bf1a818a3.
* Revert "render phi as the dtype"
This reverts commit d08cb270b4.
* reenable triton tests
* no vstore_half if dtype is already half
* upcast max
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* For cuda, get current free space from the device, and retry alloc failures
* type ignore for mypy
* add init to get free mem in cuda
* Move retry logic into common lib.
Fix typo in override _get_cur_free_space
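The shape of the shared retry path, sketched with a hypothetical _do_alloc hook next to the real _get_cur_free_space override:
```python
class RawBufferRetry:
  def _do_alloc(self, size:int): raise NotImplementedError          # device alloc
  def _get_cur_free_space(self) -> int: raise NotImplementedError   # device query
  def alloc(self, size:int):
    try:
      return self._do_alloc(size)
    except MemoryError:
      # ask the device what is actually free before giving up, then retry once
      if size <= self._get_cur_free_space(): return self._do_alloc(size)
      raise
```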
* linter error fix in test file
* Don't catch all exceptions, as that would also catch KeyboardInterrupt
* fix unintended line changes
* fix test ops
* decompose the err from test_ops
* skipTest skips the entire test; we don't want that
* handle cases with the same priority
* add int16 to torch map
* fuzz linearizer transformation
* no standard normal for fp16
* work
* Interpreted start
* CPU and TORCH work
* fix MemBuffer with same idx
* id for failed kernels
* no image and variable for Interpreted
* symbolic shape
* IMAGE only for GPU
* Interpreted almost all good
* cleanup
* fix bufs_from_lin
* zero size
* some failed examples
* just Exception
* just tests that don't pass
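The core loop, sketched with generic callables rather than the real Linearizer actions:
```python
import random
from typing import Any, Callable, List

def fuzz_transform(make_input: Callable[[], Any], transforms: List[Callable[[Any], Any]],
                   run: Callable[[Any], Any], n_tries:int=10) -> List[int]:
  baseline = run(make_input())
  failed = []
  for i in range(n_tries):
    x = make_input()
    # apply a random chain of transformations, then compare to the baseline run
    for t in random.sample(transforms, k=random.randint(1, len(transforms))):
      x = t(x)
    if run(x) != baseline: failed.append(i)
  return failed
```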
* move metal+clang to compile api
* all to the new style
* remove binary arg
* fix triton
* fixup tests
* fix clang
* diskcache is generic
* __wrapped__
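A sketch of a generic sqlite-backed decorator; functools.wraps is what supplies __wrapped__, and the table layout/path here are illustrative:
```python
import functools, hashlib, pickle, sqlite3

_db = sqlite3.connect("/tmp/cache.db")
_db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, val BLOB)")

def diskcache(fn):
  @functools.wraps(fn)  # sets __wrapped__ so the undecorated fn stays reachable
  def wrapper(*args, **kwargs):
    key = hashlib.sha256(pickle.dumps((fn.__name__, args, kwargs))).hexdigest()
    row = _db.execute("SELECT val FROM cache WHERE key=?", (key,)).fetchone()
    if row is not None: return pickle.loads(row[0])
    val = fn(*args, **kwargs)
    _db.execute("INSERT INTO cache VALUES (?,?)", (key, pickle.dumps(val)))
    _db.commit()
    return val
  return wrapper
```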
* compile_gpu
* fix thneed
* keep the src in the ASTRunner
* lib
* move compile_gpu
* compile_gpu in device
* put compiler in astrunner
* test reverts
* triton compiler
* ugh, that too
* merge kernel and optimizer
* linearize is reentrant
* move global/local size
* clean up linearizer copy
* remove unneeded lin copies
* stop linearizing twice
* oops, that should be None
* refactor unit tests for dtypes
* add missing dtypes in llvmir.py and lib.py
* skip torch tests
* webgpu
* cleaner skips
* fix llvm bool casting issue using compare
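The fix in llvmlite terms: compare against zero instead of truncating to i1, so any nonzero value maps to true:
```python
from llvmlite import ir

def cast_to_bool(builder: ir.IRBuilder, val: ir.Value) -> ir.Value:
  # a plain trunc to i1 keeps only the low bit; icmp ne 0 is the correct bool cast
  return builder.icmp_unsigned("!=", val, ir.Constant(val.type, 0))
```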
* llvm 100% passing
* llvm segfault
* TEMP decrease timeout mins to 11
debug
* add bf16 to setup
* skip half tests in cuda cpu
* check for CUDACPU instead
* add int16 to triton dtypes
* u16 for triton
* remove debug - diff is still hard to read
* derive from base class TestDType
* enhance test_upcast and downcast by running on every possible combination
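i.e. a loop over every (src, dst) pair, shown here in plain numpy:
```python
import itertools
import numpy as np

DTYPES = [np.float32, np.float16, np.int32, np.int16, np.int8, np.uint8]

def test_all_cast_pairs():
  for src, dst in itertools.product(DTYPES, DTYPES):
    a = np.array([1, 2, 3], dtype=src)
    out = a.astype(dst)
    assert out.dtype == dst
    # small integer values survive every upcast/downcast exactly
    np.testing.assert_array_equal(out.astype(np.float64), a.astype(np.float64))
```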
* dummy commit to rerun the flaky test
* skip the correct tests for CUDA
* bf16 should be skipped in the common TestDType cases
* re-enable bf16
* more consistent structure
* tiny changes to is_dtype_supported 1
* tiny changes 2
add reason
* fuzz
* fuzzer p2
* run fp32 twice
* remove duplicate fp32 run
* clang: use stdbool
* skip triton on bool casts
* merge and resolve conflicts
* Enable Multi-Output Export
* Add test
* Update examples and lint
* fix padding
* test ops
* dummy commit to rerun test
* revert cuda lint
* Enforce tuple/list of tensors
* subscripted generics
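isinstance can't take subscripted generics (Tuple[Tensor, ...] raises TypeError at check time), so the enforcement has to be against the bare types:
```python
def validate_outputs(out):
  # isinstance(out, Tuple[Tensor, ...]) -> TypeError: subscripted generics cannot
  # be used with class and instance checks, so check tuple/list directly
  if not isinstance(out, (tuple, list)): out = (out,)
  return tuple(out)
```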
* put back webgpu test
* Re-enable WebGPU Efficientnet test
* optimizer: simplify GROUP and LOCAL to have one of each
Now that tensor cores only use LASTLOCAL, we can simplify to use
only that op everywhere.
The only use of GROUP is in the matvec hand-coded opts, and it doesn't
make a performance difference, so switch to using only the top
behavior.
Also adds asserts to prevent tensor core dims from being altered,
which would cause bad kernels to be generated.
* search: remove duplicated actions
* stable diffusion < 324ms
* revert swap action
* fix tests due to more sum splitting
* REDUCEOP_SPLIT_THRESHOLD env var
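Read like any other tinygrad env knob; the default shown here is illustrative:
```python
import os
REDUCEOP_SPLIT_THRESHOLD = int(os.getenv("REDUCEOP_SPLIT_THRESHOLD", "32768"))

def should_split_reduce(reduce_size:int) -> bool:
  # only split a sum into two kernels once it is large enough to pay off
  return reduce_size >= REDUCEOP_SPLIT_THRESHOLD
```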
* added from unaligned np test (#2134)
* align cpu buffer before copy into cl buffer (#2135)
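The usual over-allocate-and-offset trick, sketched in numpy with an assumed 64-byte alignment:
```python
import numpy as np

def aligned_empty(shape, dtype=np.float32, align=64):
  nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
  raw = np.empty(nbytes + align, dtype=np.uint8)
  ofs = (-raw.ctypes.data) % align  # bytes to the next aligned boundary
  return raw[ofs:ofs + nbytes].view(dtype).reshape(shape)
```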
* remove shelve from handcode_resnet50_opt.py (#2139)
* Add dictionary keys to reduce db size (#2131)
* work
* ignore beam cache
* dictionary keys are generic
* minor db cleanups
* fix baseline and extract dataset
* fix training
* log likelihood
* more lin to feats
* sts
* training policynet
* net sort of works
* dedup
* refactor, stupid new actions
* fix uops deduping
* BEAM_ESTIMATE
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
* wmma: refactor tensor cores using existing local dims
* optimizer: fix bad rebase and break after one late local
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* feat: move to hip
* feat: special path for RawBufferTransfer
* feat: initial rawbuffertransfer
* feat: hip ipc
* feat: working hip ipc
* feat: need the base device without args
* feat: close mem handle
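Underneath it's three HIP calls; a ctypes sketch with error checking elided (the real wrapper is cleaner):
```python
import ctypes

class hipIpcMemHandle_t(ctypes.Structure):
  _fields_ = [("reserved", ctypes.c_char * 64)]  # opaque 64-byte handle

hip = ctypes.CDLL("libamdhip64.so")

def export_ptr(dev_ptr) -> bytes:
  handle = hipIpcMemHandle_t()
  hip.hipIpcGetMemHandle(ctypes.byref(handle), dev_ptr)
  return bytes(handle)  # ship these 64 bytes to the other process

def import_ptr(raw: bytes) -> ctypes.c_void_p:
  ptr = ctypes.c_void_p()
  hip.hipIpcOpenMemHandle(ctypes.byref(ptr), hipIpcMemHandle_t.from_buffer_copy(raw),
                          1)  # 1 = hipIpcMemLazyEnablePeerAccess
  return ptr

def close_ptr(ptr): hip.hipIpcCloseMemHandle(ptr)  # on teardown
```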
* feat: modified test
* feat: more multihip stuff
* clean: cleanup
* feat: cleaner
* feat: don't crash
* feat: test more
* clean: way cleaner hip wrapper
* feat: barrier
* feat: barrier
* feat: this breaks stuff
* feat: we can use empty here
* feat: maybe fix tests
* feat: maybe fix tests again?
* fix: probably fix tests
* feat: no waiting here
* feat: wait here
* feat: much larger test
* feat: need to sync here
* feat: make this async
* feat: no waiting!
* feat: cut here
* feat: sync copy
* feat: random imports
* feat: much cleaner world
* feat: restore this
* feat: restore this
* clean: cleanup
* feat: set this