* no JIT call in TransformerBlock
* idea
* move 2 reshapes to jitted function
shrink inside jitted too, 6.3ms
remove back reshapes, 5.5ms
isinstance -> __class__ 4.99ms
* think
revert ops_gpu.py
revert symbolic.py too
PYOPENCL_COMPILER_OUTPUT=1
* cleanup
* fix cache shape for conversational model
only reshape if start_pos > 0
* small cleanup
* include var_vals.keys() to st.key
* add comments
* llama small update
* everything jitted again, similar structure to gpt2
* fix typing
* add TODO for in place update cache
* cache loads across buffers (since they may share rawbufs)
* typing
* add test
* fix test
* small changes to test
* fix test
* one big cache
* whitespace
* golf a line?
* invalid is RawBuffer(0)[0], valid 1.
* Symbolic Shape JIT
update tests
2 variables symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
* move assembly, assembly_ptx
* successful but broken rendering of ptx asm
* clear ins before render asm
* slightly less broken :')
* we needed thread syncs
* fix float16 loading, rounding modifiers and other casting stuff, passing casts_from_half
* Fix runtime_args for gpuocelot
* our casts were flipped on both ends
* more casting
* add ternary where op
* dealing with storing/loading bool
* add test for casting to bool from negative
* Fix args.valid on ConstOp
* add to CI, TODO: fix runtime_args for test_uops
* fix placement of runtime_args to work with lazy.Device
* undo ci changes so I can push
* fix lints
* start cleanup and fix things we broke fixing lints
* add checks for PTX specifc asm instructions
* revert added test -- doesn't pass on llvm
* skip tests for underflow,overflow
* another fix for how we're setting runtime args
* Less broken cleanup
* add to CI
* add more env variables for ci test
* fix ci to install pycuda for ptx
* ci: copy cuda test command
* cleanup
* assert to make sure we're actually running ptx in ci
* remove test assert
* move is_ptx arg
* move assembly, assembly_ptx back to extras
* fix imports
* initial merge fixes
* clear registers, fix UOps.LOAD with invalid value
* draft merge fixes
* remove prints
* quick lint and merge fixes
* cleanup
* remove PTXProgram wrapper
* final cleanup
* temp change for ci rerun
* ci rerun
* rollback ISA version
* try to run commavq
* fix 0 dim, start implementing new ops
- Implement EmbedLayerNormalization
- Implement Attention
* SkipLayerNormalization and FastGelu
* use original torch model, cast inputs
* fix some ops:
- properly do Cast
- Attention: bi- and unidirectional
- FastGelu: add bias before gelu
* cleanup onnx_ops.py
* add validation option to benchmark
* cleanup imports
* add checks incase onnx2torch implements ops in future
* run onnx instead of original torch
* just skip gpu on m1
* reactivate the other models
* check for strange params & squash whitespace
* cleanup
* fix causal mask Attention
* Range doesn't need int cast
* embedding vocab_counter same dtype as input
* no need to cast
* always validate, fix PosixPath ort
---------
Co-authored-by: George Hotz <george@comma.ai>