* Fix openpilot kernel count from 209 to 206
1. Use push_movement_ops conditions in _movement_op. Don't push
PAD or check if the ops are safe to be pushed with PAD
2. Don't push if all the op.buffers are realized
* change ALLOWED_KERNEL_COUNT to 206 for openpilot
* don't push through sourceless buffers
* change the tests to adjust kernel counts for new behaviour
* restore pushing of movement ops through childless buffer
* don't push EXPAND, causes OOM
* allow push of intermediate movement ops
* adding new test behaviour
* modifying external_test_opt for new behaviour
* restore old tests
* Reenable push of EXPAND and introduce new tests
I was wrong initially in thinking EXPAND can cause OOM, and hence I had
disabled it. Since it is 0 stride and doesn't allocate memory, it's fine
* Don't push EXPAND above LoadOps LB. This is causing OOM
* Push should be decided on movement root of bufs
To check if ast.op.buffers is sourceless/realized, go to the movement
root and then decide if pushing should be done or not
* refactor for readability
* use .base instead
* don't push expand, bad memory/compute consumption
* restrict push of reshape, seeing improvement
* push reshape if unary without further check
* disable PAD; this solves the convnext kernel count increase
* reenable test_cache_binaryop_transpose
* small nit
* init compiled cache
* clang not compile to stdout
* use kwargs in compile
* remove some useless lines
* slimmer
* fix
* tabs
* retry
* remove decorator
* no race in hip
* smaller hip
* unused import
* unused pathlib
* path to str
* add test
* fix linter
* less lines?
* decorator is back
* update tests
* no hip version
* better comments
* a bit better test
* linter
* work w/o decorator
* linter happy
* simpler return type
* more tests
* better comment
* readable
* readable
* readable
* compile returns bytes
* no unused imports
* readable
* load weights in fp16
* add dtype option in nn
* fix test
* no need for dtype in nn
* add option to load weights in FP16, but NaN
* change loss scaler
* cast to float32 for norm layer
* add a todo for the forward pass padding
* fix transform
* start work on auto opt
* lin failure
* not beating hcopt
* greedy
* timing is fast
* codegen.search
* greedy search in handcode_opt
* track running gflops
* clean up those files
* no failure
* testing with the test_ops pattern
* add assign test
* flake8 complaining about single line fn
* slice 2d and minor cleanup
* make assign_slice a one-liner
* we don't need to repeat the same lambda twice; default tinygrad_fxn to np_fxn
* back assign fn for np array
* implement __setitem__ in tensor.py
* don't re-slice the ret tensor
* one liner assign
* drop the permute test
* Allow multi-input model export
* Add model export unit test
* Fix efficientnet compilation
* Only run model export test on JIT supported devices
* Skip export model test if not EXPORT_SUPPORTED_DEVICE
* allow local + grouped reduce in hand_coded
* allowed loop size based on global_dims
* fix const
* fix const one more time
* better divisor
* a bit fix
* can take 2, why not
* fix linter
* better comments
* start with 2
* not always pick group reduce
* fix images
* better images
* better