* metal indirect command buffers
* sub 1ms gpt
* metal batch exec is good
* remove whitespace
* input_replace
* fix ci
* useResources
* very simple cacheallocator
* update_stats
* fix CI
* minor
* remove that from jit
* var_vals are global
* working with global ish
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kinda works
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* add mops to graph, refactor IMAGE
* no reshape pushing
* add todo
* fix openpilot model alt
* push reshapes reduces kernels in new op
* IMAGE=2 is a first class citizen now
* start compile2
* tweak
* why are there two more kernels?
* minor cleanups
* don't break onnx tests
* add __metadata__ support to safetensors
* no early realize in onnx
* cleanups
* bugfix
* clean up image type, add optimize
* opt to match old
* try that
* opt work
* run compile2
* optimizer
* prt more
* prerealize
* imp
* NOLOCALS works
* no locals means no locals
* support fractional globals
* all locals welcome
* int that
* cleanups
* show gemv regression
* clean up diff
* use idx for the cond
* nolocals
---------
Co-authored-by: Comma Device <device@comma.ai>
* valid hacks
* valid hacks
* valid hacks
* new method
* new method
* handtune
* is gate load breaking?
* lint
* ruff
* less junk
* new approach?
* maybe this?
* Make it more clear
* Make it more clear
* Will deal with the linter later
* hack for linter
* subs the idx but don't touch the valid
* Updated the mod rules
* lint hack
* I believe this is a bug fix, let's see
* Mod Node left
* revert
* Maybe this won't break?
* revert
* implemented "handtuned garbage"
* revert and use VALIDHACKS
* Let's see the CI
* still broken?
* currently it's a jungle
* maybe this jungle ?
* This works for everything somehow
* Added test for symbolic
* lint
* final touch
* This still works
* lint
* midway clean
* less garbage
* lint
* final form
* Slow but working way
* lint and other stuff
* lint
* mypy
* Make sure CI tests Openpilot valid checks
* test if CI breaks
* Convert back
* refactor
* refactor
* Managed to reduce openpilot time from 30 secs to 5 secs
* Refactor
* Substitute a node with variable
* flake8
* Comment and refactor
* More comprehensive mod
* refactor
* bug fix
* More shave off
* remove not sure part
* Symbolic Shape JIT
* update tests
* 2 variables symbolic ops, adding more tests
* test passing
* cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
* global -> group
* allow None for local_size in custom function
* lil local
* comment on shape
* fix cuda
* smart local cast
* better local heuristic
* fix ptx, and work_dim cleanup
* fix metal
* fix ops test
* fix openpilot jit
* no more optlocal
* might fix metal tests
* try metal now
* see generated metal code
* test free removal. REVERT THIS
* mergable
* fix binop, other tests failure
* that was a bad idea
* better layernorm
* inference kernel count tests
* new style reshape pushing
* fixup replacement
* 199 kernels is okay. fix flops
* push reshape through unaryops only
* GRAPH=2 draws the phantom ops
* found resnet issue
* non working test
* mul is cheaper than div
* OPT inflation
* SHUFFLE_PAD_OPS in OPT=2
* runs one metal kernel
* conv2d works
* ops tests are passing
* const folding
* all ops work
* pre commit always passes
* torch works
* working still
* fix graph test
* tests passing
* image almost works
* image conv works
* most images
* fix custom
* fix assignment
* fix compile enet
* clean up comments
* fix realize return value
* include shapetracker in LB repr
* copy should make a copy
* reenable method cache
* fix lna
* dtypes in graph
* forward only for IMAGE=2
* simple realize
* getting close
* fixup new api, it's good except the kernel count
* back to 197 kernels
* tests should pass
* go to a real float
* no type_on_cpu
* fix the docs
* put shapetracker back in its proper place
* clean up opt
* don't let global kernels get too small
* 8192 -> 1024
* disable local shape for clang
* fix can_merge
* unroll the 5x5 depthwise convs in op
* load float4 check
* Refactor getenv into helpers
* Remove unused os
* Fix default value
* Fix more defaults for CI
* Fix bracket
* Revert changes to openpilot/compile.py
* Use getenv from helpers when possible
* working exec ast
* exec_ast is staticmethod
* GenericExecAST
* fold that sometimes
* ExplicitExecAST
* exec_ast for GPU
* gpu working
* get_lazyop_shape
* now gpubuffer is ExplicitExecAST
* dedup
* add a type
* RESHAPE in opencl code
* fix linter
* that too for linter
* cleanups
* remove dead code
* GenericShape is less lines
* add ALLOWED_KERNEL_COUNT to tests
* fix mypy
* that's gotta be recursive
* fix opencl shape processing
* remove unneeded lambda