as pointed out by #4877, need to add `__init__.py` to trigger pylint. fixed some errors except ops_python (will do in a separate pr, it has a lot of errors), and sub-folders in runtime
* compile raise CompileError and skip only RuntimeError in multiprocess beam
renderer error with multiprocess should not be skipped by beam
* use `==` for dtype to dtype comparison
* that needs to be is
* typo
* refactor ops_gpu ctypes
- remove redundant byref as ctypes automatically handles passing `type` as
`POINTER(type)`
- use walrus operator instead of init_c_var when possible
* clSetKernelArg argtype is POINTER(None)
* generic rendering of half and bf16
hotfix
* fix uops + regression test
* fix the test for metal's half4
* uop.uop fixup
* mypy with --strict-equality, fix ops_gpu
* move gpuctypes in tree
* fix mypy
* regex exclude
* autogen sh
* mypy exclude
* does that fix it
* fix mypy
* add hip confirm
* verify all autogens
* build clang2py
* opencl headers
* gpu on 22.04
* compile cache for several devices
* ops_gpu uses hash to not care about sql
* hip rdna test with device
* linter happy
* no device passed where possible
* arch is optional to compile_{hip|cuda}
* use device from LinearizerOptions in kernel search
removed all Device.DEFAULT in search.py
* pass device string for parallel pickle
* device for interpreted backends in LinearizerOptions
* ops_gpu is go
* fix size 0
* fix image, and add more tests
* nerf openpilot test, doesn't test thneed
* run the schedule
* better
* oops, new inputs
* delete pyopencl
* Update ops_gpu.py
* rewrite 0 size loadop into a CONST
* check alloc size
* EMPTY is better
* Revert "EMPTY is better"
This reverts commit 574fe0f9ed28f1b97da5a81afdfd2cd5d9a94ff9.
* no ast is created
* fix test
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are rly broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* refactor/ci: delete many `# type: ignore`
* replace `axis.__class__ is int` with `isinstance(axis, int)` to make mypy happy
* add `--warn-unused-ignores` to mypy flag
refs #2240
* ci: move `--warn-unused-ignores` flag to mypy config
refs #2240
* For cuda get current free space from device, and rery alloc failures
* type ignore for mypy
* add init to get free mem in cuda
* Move retry logic in common lib.
Fix typo in override _get_cur_free_space
* linter error fix in test file
* Not catch all, as it will catch KeyboardInterrupt
* fix unintened line changes
* move metal+clang to compile api
* all to the new style
* remove binary arg
* fix triton
* fixup tests
* fix clang
* diskcache is generic
* __wrapped__
* compile_gpu
* fix thneed
* keep the src in the ASTRunner
* lib
* move compile_gpu
* compile_gpu in device
* put compiler in astrunner
* test reverts
* triton compiler
* ugh, that too
* remove arm64, caching for cuda
* caching in llvm
* switch cache_compiled to new cache
* fix clang
* caching for metal
* fix pylint
* cleanups
* perf_counter and binary
* start compile2
* tweak
* why are there two more kernels?
* minor cleanups
* don't break onnx tests
* add __metadata__ support to safetensors
* no early realize in onnx
* cleanups
* bugfix
* clean up image type, add optimize
* opt to match old
* try that
* opt work
* run compile2
* optimizer
* prt more
* prerealize
* imp
* NOLOCALS works
* no locals means no locals
* support fractional globals
* all locals welcome
* int that
* cleanups
* show gemv regression
* clean up diff
* use idx for the cond
* nolocals
---------
Co-authored-by: Comma Device <device@comma.ai>