* start cpu threading
* fix
* fix2
* fix
* hacks?
* threads
* minor
* no dsp
* dsp 2
* n
* more
* test
* xm
* cleaner
* readable
* f
* reorder
* when no threads
* rangeify
* typos
* not needed
* reapply
* remove this
* linter
* fixed cpu count in ci
* fix
* fixes
* rm
* typo
* sort based on speed
* test if test works in ci
* Revert "test if test works in ci"
This reverts commit 1f05edb531.
* do not pad thread
* change clang -march flag to -mcpu with fp16 disassembly test
* fix
* add capstone to macos dependencies
* just check no cast in test
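A minimal sketch of what these -mcpu commits are checking (a hypothetical probe, not the repo's actual CI test): on arm64, `-mcpu=native` lets clang emit native fp16 math, and the test asserts no half<->float casts show up in the generated code:

```python
import subprocess

# Hypothetical probe, not the actual test: compile an fp16 add with
# -mcpu=native and assert the assembly contains no fcvt (half<->float cast).
C_SRC = "_Float16 add(_Float16 a, _Float16 b) { return a + b; }"
asm = subprocess.run(["clang", "-mcpu=native", "-O2", "-S", "-o", "-", "-x", "c", "-"],
                     input=C_SRC, capture_output=True, text=True, check=True).stdout
assert "fcvt" not in asm  # fp16 math stayed native, no round trips through float32
```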
* rm import
* whoops
* lets check
* move check
* llvm init before cpu check
* try this
* bump autogen llvm version
* also update libclang?
* revert
* add comment
* skip llvm test and add comment
* linter
* remove cpu and torch backends
* don't copy to cpu
* use clang instead of cpu
* multitensor gathers on the first device
* clang is cpu + use default
* fixup
* bugfix
* skip mulacc opt if all the src buffers of the mul op are const buffers
* add noqa directive for long test
* unskip MULACC opt
* ensure that a_axes at least includes summation axes in order to perform np.einsum correctly
* add regression test for mulacc op
* compute a_slices using a_axes
* refactor helper function to retrieve axes and slices for nonzero strides as well as summation axes
* include a regression test that exercises the behaviour indirectly
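A sketch of the idea behind these einsum commits (a reconstructed helper, not the repo's exact code): axes with stride 0 are normally sliced away before calling np.einsum, but a summation axis has to stay visible to einsum even when its stride is 0, otherwise the reduction is silently dropped:

```python
import numpy as np

def axes_slice(strides, sum_axes):
    # keep axes with nonzero strides, but always keep summation axes
    # so np.einsum still reduces over them (reconstructed sketch)
    axes = [i for i, st in enumerate(strides) if st != 0 or i in sum_axes]
    slices = tuple(slice(None) if i in axes else 0 for i in range(len(strides)))
    return axes, slices

# a broadcast operand has stride 0 on axis 0, yet axis 0 is summed over
a = np.broadcast_to(np.array(2.0), (4,))
axes, slices = axes_slice(a.strides, sum_axes=(0,))
assert 0 in axes  # without the fix this axis would be dropped and the sum wrong
```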
* try
* test: add logical_not tests
* gah, this doesn't match types for const()
* fix: can't we just do this?
* big change: I don't actually know what I'm doing
* WOOO I'M JUST CHANGING EVERYTHING WOW, probably gonna revert later
* BYE BYE noqa: E501
* fix: less lines and add test
* fix: rm 2 redundant tests
* fix: eq with False so we don't unintentionally implicitly upcast, but it's bool anyway so w/e
Now that the input types are matched and checked in lazy, we can remove these output_types.
Also remove the usage of least_upper_dtype in ops.py, since we can just use the input type.
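The upcast concern, illustrated with numpy (the same reasoning, not tinygrad's code): comparing against False keeps the bool dtype, while an arithmetic spelling of logical_not upcasts:

```python
import numpy as np

t = np.array([True, False])
print((t == False).dtype)  # bool: equality keeps the bool dtype
print((t * 1).dtype)       # an arithmetic spelling upcasts to an int dtype
```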
* remove match_type in ops_torch and ops_cpu
input dtypes are aligned and cast in mlops
* dict union only after python3.9
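For context on the dict-union commit: the `|` operator on dicts is PEP 584 and only exists on Python 3.9+, so older Pythons need the unpacking spelling:

```python
a, b = {"x": 1}, {"y": 2}
merged = a | b       # dict union, Python 3.9+ only (PEP 584)
merged = {**a, **b}  # equivalent spelling that also works on 3.8
```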
* fix that
* fix Sigmoid forward cast
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are really broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
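The core trick these allocator commits circle around, as a minimal sketch (hypothetical names, not tinygrad's actual class): instead of returning buffers to the device, keep them in a size-keyed cache and hand them back on the next same-size allocation:

```python
from collections import defaultdict

class LRUAllocatorSketch:
    """Hypothetical sketch of the caching-allocator idea, not the real class."""
    def __init__(self, alloc_fn, free_fn):
        self.alloc_fn, self.free_fn = alloc_fn, free_fn
        self.cache = defaultdict(list)  # size -> buffers free for reuse
    def alloc(self, size):
        # reuse a cached buffer of the same size if one exists
        return self.cache[size].pop() if self.cache[size] else self.alloc_fn(size)
    def free(self, buf, size):
        self.cache[size].append(buf)  # cache instead of freeing on the device
    def free_cache(self):
        # actually release everything, e.g. under memory pressure
        for bufs in self.cache.values():
            for buf in bufs: self.free_fn(buf)
        self.cache.clear()
```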
* all pass
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* clean up the buffers
* remove allocate_output
* functools.lru_cache is methodcache
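What "functools.lru_cache is methodcache" refers to, as a self-contained example (hypothetical class, just the pattern): lru_cache applied to a method caches per (self, args), and cache_clear() (see the commit below) drops every entry:

```python
import functools

class Expensive:
    @functools.lru_cache(maxsize=None)  # caches per (self, x) pair
    def compute(self, x):
        print("computing", x)
        return x * x

e = Expensive()
e.compute(3); e.compute(3)       # second call is a cache hit
Expensive.compute.cache_clear()  # note: cached entries keep instances alive until cleared
```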
* add TestShapeTrackerSize
* cache_clear
* no 0-size buffers, add _ on functions that shouldn't be imported
* fix size
* if -> while
* torch and numpy don't share ops anymore
* that should be filtered out elsewhere
* still const
* graph + enet example cleanup
* hmm, we do still need it because of symbolic