* no JIT call in TransformerBlock
* idea
* move 2 reshapes to jitted function
shrink inside jitted too, 6.3ms
remove back reshapes, 5.5ms
isinstance -> __class__, 4.99ms
* think
revert ops_gpu.py
revert symbolic.py too
PYOPENCL_COMPILER_OUTPUT=1
* cleanup
* fix cache shape for conversational model
only reshape if start_pos > 0
* small cleanup
* include var_vals.keys() in st.key
* add comments
* llama small update
* everything jitted again, similar structure to gpt2
* fix typing
* add TODO for in-place cache update