* Make GPU the default device
* Compile EfficientNet with CPU
* don't print device
* use METAL and CUDA if possible
* Revert some changes to workflow
* Fix import error when checking device availability
* device lookup is now optional
* hopefully fix linter and tests
* fix workflow
* Skip device if not available
* don't change default if CPU=1
* simplify device selection
* Default to CPU if no GPU
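The net effect of these device-selection commits can be sketched roughly as follows; the function and its `is_available` probe are illustrative assumptions, not tinygrad's actual API:
```python
import os
from typing import Callable

def pick_default_device(is_available: Callable[[str], bool]) -> str:
    # Respect an explicit CPU=1: don't change the default in that case.
    if os.getenv("CPU", "") == "1":
        return "CPU"
    # Prefer METAL or CUDA if possible; a failed probe just means we skip that device.
    for dev in ("METAL", "CUDA", "GPU"):
        try:
            if is_available(dev):
                return dev
        except Exception:
            pass  # "pass if an error occurs": treat probe failures as unavailable
    return "CPU"  # default to CPU if no GPU-class device is found
```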
* don't print device name...
* No need to change default in llama
* run github workflow
* Fix logic to select default
* pass if an error occurs
* use separate function for try except
* fix binop and other test failures
* that was a bad idea
* better layernorm
* inference kernel count tests
* new style reshape pushing
* fixup replacement
* 199 kernels is okay. fix flops
* push reshape through unaryops only
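The reshape-pushing idea relies on unary ops being elementwise, so a reshape can commute past them and meet other movement ops. A toy sketch under that assumption; the `Node` type here is invented for illustration and is not tinygrad's lazy-buffer graph:
```python
from dataclasses import dataclass

@dataclass
class Node:
    op: str                 # e.g. "RESHAPE", "EXP", "RECIP", "LOAD"
    src: tuple = ()
    arg: object = None      # new shape for RESHAPE

UNARY_OPS = {"EXP", "LOG", "NEG", "RECIP"}

def push_reshape_through_unary(n: Node) -> Node:
    # RESHAPE(unary(x)) -> unary(RESHAPE(x)): valid because unary ops are
    # elementwise, so they commute with pure movement ops like reshape.
    if n.op == "RESHAPE" and n.src and n.src[0].op in UNARY_OPS:
        unary = n.src[0]
        moved = Node("RESHAPE", unary.src, n.arg)
        return Node(unary.op, (moved,))
    return n
```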
* GRAPH=2 draws the phantom ops
* found resnet issue
* non working test
* mul is cheaper than div
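"mul is cheaper than div" is the usual strength reduction: dividing by a constant becomes multiplying by its reciprocal, which most hardware executes faster (at the cost of a possible last-ulp difference). A generic illustration, not the actual codegen change:
```python
import numpy as np

def scale_slow(x: np.ndarray, c: float) -> np.ndarray:
    return x / c            # one divide per element

def scale_fast(x: np.ndarray, c: float) -> np.ndarray:
    return x * (1.0 / c)    # one divide up front, then cheaper multiplies
```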
* OPT inflation
* SHUFFLE_PAD_OPS in OPT=2
* add int64 as supported dtype from numpy
Without this change, examples/transformer.py didn't run; with it, the example runs successfully.
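In practice, "supported dtype from numpy" means the numpy-to-framework dtype lookup now has an int64 entry. A hedged sketch of that mapping; the table below is illustrative, not tinygrad's actual dtypes helpers:
```python
import numpy as np

# Illustrative mapping only; the real project keeps this in its dtypes helpers.
SUPPORTED_FROM_NUMPY = {
    np.dtype(np.float32): "float32",
    np.dtype(np.int32):   "int32",
    np.dtype(np.int64):   "int64",  # newly accepted, which unblocks examples/transformer.py
}

def dtype_from_numpy(arr: np.ndarray) -> str:
    try:
        return SUPPORTED_FROM_NUMPY[arr.dtype]
    except KeyError:
        raise TypeError(f"unsupported numpy dtype: {arr.dtype}") from None
```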
* Update helpers.py
* Update transformer.py
* Update training.py
* runs one metal kernel
* conv2d works
* ops tests are passing
* const folding
* all ops work
* pre-commit always passes
* torch works
* working still
* fix graph test
* tests passing
* image almost works
* image conv works
* most images
* fix custom
* fix assignment
* fix compile enet
* clean up comments
* fix realize return value
* include shapetracker in LB repr
* copy should make a copy
* reenable method cache
* fix lna
* dtypes in graph
* forward only for IMAGE=2
* simple realize
* getting close
* fixup new api, it's good except the kernel count
* back to 197 kernels
* tests should pass
* go to a real float
* no type_on_cpu
* fix the docs
* put shapetracker back in its proper place
* building shapetracker
* default ENABLE_METHOD_CACHE
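The method cache referenced here (and re-enabled earlier) is, in spirit, memoization of the expensive compile step keyed on the kernel's AST, gated by ENABLE_METHOD_CACHE, which now defaults to on. A generic sketch under those assumptions; `compile_ast` and its key are hypothetical names:
```python
import os
from functools import lru_cache

METHOD_CACHE_ON = os.getenv("ENABLE_METHOD_CACHE", "1") == "1"  # assumed default: enabled

def _compile(ast_key: str) -> str:
    # stand-in for the real codegen + compiler invocation
    return f"compiled<{ast_key}>"

@lru_cache(maxsize=None)
def _compile_cached(ast_key: str) -> str:
    return _compile(ast_key)

def compile_ast(ast_key: str) -> str:
    # identical ASTs compile once when the cache is enabled
    return _compile_cached(ast_key) if METHOD_CACHE_ON else _compile(ast_key)
```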
* symbolic compiles
* improve types
* tensor compiles
* oops, that's a bug
* best of both worlds
* find legit typing bugs
* pad2d can take list or tuple
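One simple way to accept either container type for the padding argument, as the commit above describes; a sketch, not the actual pad2d implementation:
```python
def normalize_padding(padding):
    # accept a list or a tuple of ints and hand the rest of the code one canonical type
    if not isinstance(padding, (list, tuple)):
        raise TypeError(f"padding must be a list or tuple, got {type(padding).__name__}")
    return tuple(int(p) for p in padding)
```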
* sub 200ms when compiled
* third try at torch loading
* numpy fixed
* fix enet compile
* load_single_weight supports empty weights
* oops, CPU wasn't the default
* so many bugs
* Add AvgPool2d as a layer
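"AvgPool2d as a layer" usually means wrapping the stateless pooling call in a small class so it can sit in a model's layer list. A sketch assuming the tensor type exposes an avg_pool2d-style method (an assumption, not a confirmed signature):
```python
class AvgPool2d:
    """Stateless pooling packaged as a layer so it composes with stateful ones."""
    def __init__(self, kernel_size=(2, 2)):
        self.kernel_size = kernel_size

    def __call__(self, x):
        # assumed tensor method; the point is the layer wrapper, not the call itself
        return x.avg_pool2d(kernel_size=self.kernel_size)
```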
* Clean up a bit
* Remove stateless layers in yolo_nn
* More cleanup
* Save label for test
* Add test for YOLO
* Test without cv2
* Don't fail if cv2 not installed
* Better import
* Fix image read
* Use opencv :)
* Don't download the file
* Fix errors
* Use same version
* Set higher confidence
* Why is the confidence so low?
* Start over
* Remove stateless layers
* Remove extra lines
* Revert changes
* Save a few more lines
* add restrict qualifier for clang backend convolution inputs/outputs
see https://godbolt.org/z/Tb9jMxWfx for generated assembly
* enable more checks
* inline fmax to motivate the compiler to inline some more
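Both of these tweaks target the C that the clang backend emits rather than Python-level logic. A rough illustration of the shape of such output, rendered as a Python string template; the kernel name, signature, and body are placeholders, not the backend's real codegen (the godbolt link above shows the actual assembly):
```python
def render_example_kernel(name: str = "example_conv") -> str:
    # `restrict` tells clang the out/in/weight pointers never alias, which unlocks
    # better vectorization; a static inline fmax nudges the compiler to inline it.
    return f"""
static inline float fmax_(float a, float b) {{ return a > b ? a : b; }}

void {name}(float* restrict out, const float* restrict in,
            const float* restrict weight, int n) {{
  for (int i = 0; i < n; i++)
    out[i] = fmax_(out[i] + in[i] * weight[i], 0.0f);  /* placeholder body */
}}
"""
```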
* fix if else binding power