* create cache for q learning
* make linter happy
* global beam
* where it belongs
* bugfix
* ditch the kopt, use the beam
* faster lin and DEBUG=2 okay
* remove kopt, move search to features
* load weights in fp16
* add dtype option in nn
* fix test
* no need for dtype in nn
* add option to load weights in FP16, but NaN
* change loss scaler
* cast to float32 for norm layer (see the float32-norm sketch after this list)
* add a todo for the forward pass padding
* fix transform
* start work on auto opt
* lin failure
* not beating hcopt
* greedy
* timing is fast
* codegen.search
* greedy search in handcode_opt
* track running gflops
* clean up those files
* no failure
* Allow multi-input model export
* Add model export unit test
* Fix efficientnet compilation
* Only run model export test on JIT supported devices
* Skip export model test if not EXPORT_SUPPORTED_DEVICE
* simplify gpt2 example
* kernel_jitted_count and jit tests
* Revert "kernel_jitted_count and jit tests"
This reverts commit 31a3c26dd0.
* all_jitted test in test_real_world
* added missing colon
* bug fixes for cifar10 dataset loading
needed a reshape to work with conv layers, and the fetched tensor is resolved to numpy since further code expects a numpy array (see the reshape sketch after this list)
* lazy cleanups
* ast functions take in LazyOps
* op instead of self.op
* _base for mops
* fix contiguous
* start schedule
* test_schedule
* fix openpilot
* more tests
* bugfix and test skip
* work
* make sure things get freed
* fix zerosized tensors
* fix failing test
* fix ceil and friends
* fix openpilot
* disable training
* disable test collectives
* small helps
* got something working
* faster?
* faster yes
* cleanup
* cleanup
* cleanup
* Fix non jit
* Fix fp16 and some cleanup
* Fix fp16 and some cleanup
* cleanup
* similar to master
* cleanup
* change reduceop heuristics
* add model ema and jit hack
* add ema eval
* have to create a duplicate eval function for jit
* remove manual seed
* 94% achievable with normal eval
* ema is outputting the same results as normal
* fix ema bug
* ema achieves 94% with fixed seed
* multigpu tested
* constant fold decay, fix jit, adjust message for multigpu
* pull SpeedyResNet out of train_cifar()
* patch to remove hack from stable_diffusion.py
* sorry linter
* realize after assign?
* float16 broken in llvmlite, use float64 for now
* int32
* idiot forgot to change test array dtype
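The FP16 items above ("add option to load weights in FP16, but NaN" and "cast to float32 for norm layer") describe a common mixed-precision fix: keep weights and activations in float16 but compute the normalization statistics in float32 so they don't overflow to NaN. A minimal numpy sketch of that pattern, not tinygrad's actual implementation; the function name and shapes are made up for illustration:

```python
import numpy as np

def layernorm_fp16_safe(x: np.ndarray, weight: np.ndarray, bias: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # x, weight, bias are stored in float16; the normalization statistics are
    # computed in float32 to avoid the NaNs seen when everything stays in fp16.
    x32 = x.astype(np.float32)
    mean = x32.mean(axis=-1, keepdims=True)
    var = x32.var(axis=-1, keepdims=True)
    y = (x32 - mean) / np.sqrt(var + eps)
    y = y * weight.astype(np.float32) + bias.astype(np.float32)
    return y.astype(np.float16)  # cast back so downstream layers keep running in fp16

# hypothetical usage: fp16 activations and parameters
x = np.random.randn(2, 8).astype(np.float16)
w = np.ones(8, dtype=np.float16)
b = np.zeros(8, dtype=np.float16)
print(layernorm_fp16_safe(x, w, b).dtype)  # float16
```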
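The cifar10 loading fix above notes that the fetched tensor had to be reshaped for conv layers and resolved to numpy. A minimal sketch of what such a preparation step could look like, assuming the standard CIFAR-10 flat layout of 3072 values (3 x 32 x 32) per image; the helper name is hypothetical and this is not the repo's actual loader:

```python
import numpy as np

def prepare_cifar_batch(flat_images, labels):
    # The fetched data may be a framework tensor; downstream code expects numpy,
    # so convert first (hypothetical helper, not the repo's actual loader).
    x = np.asarray(flat_images, dtype=np.float32)
    y = np.asarray(labels, dtype=np.int32)
    # CIFAR-10 stores each image as a flat 3072-value row (3 channels * 32 * 32);
    # conv layers expect NCHW, so reshape to (N, 3, 32, 32).
    x = x.reshape(-1, 3, 32, 32)
    return x, y

# hypothetical usage with a fake batch of two flattened images
x, y = prepare_cifar_batch(np.zeros((2, 3072)), [0, 1])
print(x.shape, y.shape)  # (2, 3, 32, 32) (2,)
```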