* Symbolic Shape JIT
update tests
2 variables symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
* Implement scaled_dot_product_attention and test
* Support attn_mask
* Support is_causal too
* Use in llama
* Don't forget to reshape
* Set requires_grad=False for causal
* Remove staticmethod
* Remove extra spaces
* Fixes + improved test coverage for helpers.py
- added exception handling in `proc`; previously, if an exception was thrown, the thread would hang
- made `_early_exec_process` catch any Exception; previously, if an exception was thrown before the process was started, it would hang the thread
* Made `_early_exec_process` catch any Exception
Otherwise, if an exception was thrown before the process was started, it would hang the thread. For example, a TypeError for an argument passed to `subprocess.check_output`.
* Fixed `from tinygrad.helpers import Timing` import
oops, for some reason my IDE cleaned that import from extra/helpers.
* Fixed import in llama.py
Another one that I skipped by accident, my bad
* Extracted a class for tests of early exec
* Normalize line endings, Windows uses \r\n
* Made `cross_process` not a daemon
* safetensors test
* safe_save
* load back with real safetensors
* bugfix in device name. add simple torch_load
* it works for llama, but it's slower...
* mmap
* no intermediate
* load mmaped
* readinto speed
* not ready yet
* revert that
* feat: promote Embedding to nn
* fix: fix failing test
* feat: add test with jit
* feat: rewrite embedding to no longer need stacked for loops
* clean+fix: don't know how that happened
* Make GPU the default device
* Compile EfficientNet with CPU
* don't print device
* use METAL and CUDA if possible
* Revert some changes to workflow
* Fix import error when checking device availability
* device lookup is now optional
* hopefully fix linter and tests
* fix workflow
* Skip device if not available
* don't change default if CPU=1
* simplify device selection
* Default to CPU if no GPU
* don't print device name...
* No need to change default in llama
* run github workflow
* Fix logic to select default
* pass if an error occurs
* use separate function for try except
* fix binop, other tests failure
* that was a bad idea
* better layernorm
* inference kernel count tests
* new style reshape pushing
* fixup replacement
* 199 kernels is okay. fix flops
* push reshape through unaryops only
* GRAPH=2 draws the phantom ops
* found resnet issue
* non working test
* mul is cheaper than div
* OPT inflation
* SHUFFLE_PAD_OPS in OPT=2
* building shapetracker
* default ENABLE_METHOD_CACHE
* symbolic compiles
* improve types
* tensor compiles
* oops, that's a bug
* best of both worlds
* find legit typing bugs
* pad2d can take list or tuple
* sub 200ms when compiled
* third try at torch loading
* numpy fixed
* fix enet compile
* load_single_weight supports empty weights
* oops, CPU wasn't the default
* so many bugs
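
For reference, the `scaled_dot_product_attention` entries above (with `attn_mask` and `is_causal` support) can be sketched roughly as follows. This is a minimal NumPy illustration of the technique, not the code from this change; the argument names and the mask conventions (boolean mask keeps `True` positions, float mask is added to the scores) are assumptions.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=False):
    # Assumed shapes: q is (..., L, D), k is (..., S, D), v is (..., S, Dv).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if is_causal:
        # Query position i may only attend to key positions <= i.
        L, S = scores.shape[-2:]
        causal = np.tril(np.ones((L, S), dtype=bool))
        scores = np.where(causal, scores, -np.inf)
    if attn_mask is not None:
        if attn_mask.dtype == bool:
            # Boolean mask: True means "keep", False means "mask out".
            scores = np.where(attn_mask, scores, -np.inf)
        else:
            # Float mask: added to the scores before the softmax.
            scores = scores + attn_mask
    # Numerically stable softmax over the key axis.
    # Note: a row that is fully masked would produce NaN here.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `is_causal=True`, the first query position attends only to the first key, so its output row equals the first row of `v`.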