* use device from LinearizerOptions in kernel search
removed all Device.DEFAULT in search.py
* pass device string for parallel pickle
* device for interpreted backends in LinearizerOptions
* cuda with gpuctypes
* hip gpuctypes
* graphs
* rename + linter happy
* use cpu_time_execution
* no ji in build_kernel_node_params
* remove hip_wrapper
* hip fix
* no arc
* small changes
* no clean module in cudacpu
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are really broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* refactor/ci: delete many `# type: ignore`
* replace `axis.__class__ is int` with `isinstance(axis, int)` to make mypy happy
* add `--warn-unused-ignores` to mypy flags
refs #2240
* ci: move `--warn-unused-ignores` flag to mypy config
refs #2240
* move metal+clang to compile api
* all to the new style
* remove binary arg
* fix triton
* fixup tests
* fix clang
* diskcache is generic
* __wrapped__
* compile_gpu
* fix thneed
* keep the src in the ASTRunner
* lib
* move compile_gpu
* compile_gpu in device
* put compiler in astrunner
* test reverts
* triton compiler
* ugh, that too
* remove arm64, caching for cuda
* caching in llvm
* switch cache_compiled to new cache
* fix clang
* caching for metal
* fix pylint
* cleanups
* perf_counter and binary
* optimizer: add test for correctness of opts
Also added OptOps.UPCASTMID to constrain valid axes for opts with
group_for_reduce.
* llvm: fix LinearizerOptions to correctly disable has_shared
* optimizer: remove premature test scaffold for TC opts
* search: fix the action space
* linearizer outputs something
* working ish
* cstyle codegen
* clang mostly works
* fix load valid
* fix numberless loop
* fancy gen
* working
* fix enet compiler
* cleanups
* float4 upcasting
* less lines
* supports_float4
* constant folding
* mulacc
* internet tests flaky in CI
* 90% image support
* fix image generic
* bugs exposed with shapetracker and single view
* new llvm
* use vload, remove OLD
* that's really poorly done
* ended up being more lines
* runs one metal kernel
* conv2d works
* ops tests are passing
* const folding
* all ops work
* pre commit always passes
* torch works
* working still
* fix graph test
* tests passing
* image almost works
* image conv works
* most images
* fix custom
* fix assignment
* fix compile enet
* clean up comments
* fix realize return value
* include shapetracker in LB repr
* copy should make a copy
* reenable method cache
* fix lna
* dtypes in graph
* forward only for IMAGE=2
* simple realize
* getting close
* fixup new api, it's good except the kernel count
* back to 197 kernels
* tests should pass
* go to a real float
* no type_on_cpu
* fix the docs
* put shapetracker back in its proper place
* add dtype class
* dtypes
* buffers are lazy
* dtype is tracked by lazybuffer and GenericShape
* fix types in llvm
* llvm store
* dtype tests
* fix tests maybe
* fix flop counter
* fix CI
* CI fix and check format
* fix dtype and dtype check
* fix custom test
* fix test graph