* new upcast works
* float4 try
* fix unaligned float4
* disallow unaligned access
* upcast dim
* maybe good now
* fix gpu half
* vstore_half4
* fix deep image bugs
* improve symbolic to fix issues
* fix symbolic
* cl test
* this maybe
* gcd of 1 is 1
* real fix for old python
* improve fuzzer
* test speed llama
* oops, put it back
* uses the real device codegen
* just do it on the mac
* pp
* is faster?
* Revert "is faster?"
This reverts commit 42db542010.
* disable docker again for less load on CI
* global -> group
* allow None for local_size in custom function
* lil local
* comment on shape
* fix cuda
* smart local cast
* better local heuristic
* fix ptx, and work_dim cleanup
* fix metal
* fix ops test
* fix openpilot jit
* no more optlocal
* might fix metal tests
* try metal now
* see generated metal code
* test free removal. REVERT THIS
* mergable
* matrix strategy
* push env to GITHUB_ENV
* use printf instead of echo
* use temp helper function for cross os paths
* use path join
* switched to using temp helper function
* skip test on windows due to memory limit
* small fix
* removed semi
* touchups
* clean up
* seperate tests
* test changes to test_utils on windows
* small refactor
* more cleanups
* undo helpers change
* only skip if in CI and WINDOWS
* e2e testing
* min failure
* no affine on bn, still fails
* why did i think i could detach that?
* allow more kernels for bn
* some test issue i don't understand
* fix binop, other tests failure
* that was a bad idea
* better layernorm
* inference kernel count tests
* new style reshape pushing
* fixup replacement
* 199 kernels is okay. fix flops
* push reshape through unaryops only
* GRAPH=2 draws the phantom ops
* found resnet issue
* non working test
* mul is cheaper than div
* OPT inflation
* SHUFFLE_PAD_OPS in OPT=2
* runs one metal kernel
* conv2d works
* ops tests are passing
* const folding
* all ops work
* pre commit always passes
* torch works
* working still
* fix graph test
* tests passing
* image almost works
* image conv works
* most images
* fix custom
* fix assignment
* fix compile enet
* clean up comments
* fix realize return value
* include shapetracker in LB repr
* copy should make a copy
* reenable method cache
* fix lna
* dtypes in graph
* forward only for IMAGE=2
* simple realize
* getting close
* fixup new api, it's good except the kernel count
* back to 197 kernels
* tests should pass
* go to a real float
* no type_on_cpu
* fix the docs
* put shapetracker back in it's proper place
* add dtype class
* dtypes
* buffers are lazy
* dtype is tracked by lazybuffer and GenericShape
* fix types in llvm
* llvm store
* dtype tests
* fix tests maybe
* fix flop counter
* fix CI
* CI fix and check format
* fix dtype and dtype check
* fix custom test
* fix test graph
* behavior is correct without VALIDHACKS
* simple div and mod
* fix tests
* no negative variables
* alt form is correct
* still correct
* bug in mulnode
* at least validhacks works now
* cleanups
* test validhacks, and to_image_idx
* cache compare key
* tests and __neg__
* conv2d is an hlop
* shorter conv
* KOPT=-1
* alt imp
* MULACC
* smarter mulacc
* pop conv
* 7x7 -> 5x5
* didn't fix, that's not going to work
* this is faster and matches old behavior
* oh, non lazy just won't work with mulacc
* mulacc in torch
* bool types were creeping in
* optimizer is actually better with hlop conv
* fix pushing permutes issue
* refactor einsum_mulacc
* fix up readme
* update readme
* _image_conv2d
* fix bias addition location
* pushing permutes gets back to 200 kernels
* conv cleanup
* disable hlop conv
* don't hide that in helpers
* start clang backend
* mostly working
* no group for reduce w clang
* it compiles
* compiles
* a11y
* minor fixups
* formatting
* add a test
* rename test
* add image
* load + store + boring stuff:
* image tests pass
* thneed print GFLOPS
* op conv test
* more debugging
* hack for multiview image
* shapetracker creates less views
* disable image tests
* working better
* ugh, lkey not key
* print in DEBUG, and allow views
* works
* simple padding conv2d
* use index for image
* that was bad code
* debug print
* fix types
* less lines
* save lines