* stable diffusion < 324ms
* revert swap action
* fix tests due to more sum splitting
* REDUCEOP_SPLIT_THRESHOLD env var
* added from unaligned np test (#2134)
* align cpu buffer before copy into cl buffer (#2135)
* remove shelve from handcode_resnet50_opt.py (#2139)
* Add dictionary keys to reduce db size (#2131)
* work
* ignore beam cache
* dictionary keys are generic
* minor db cleanups
* fix baseline and extract dataset
* fix training
* log likelihood
* more lin to feats
* sts
* training policynet
* net sort of works
* dedup
* refactor, stupid new actions
* fix uops deduping
* BEAM_ESTIMATE
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
* feat: move to hip
* feat: special path for RawBufferTransfer
* feat: initial rawbuffertransfer
* feat: hip ipc
* feat: working hip ipc
* feat: need to base device without args
* feat: close mem handle
* feat: modified test
* feat: more multihip stuff
* clean: cleanup
* feat: cleaner
* feat: don't crash
* feat: test more
* clean: way cleaner hip wrapper
* feat: barrier
* feat: barrier
* feat: this breaks stuff
* feat: we can use empty here
* feat: maybe fix tests
* feat: maybe fix tests again?
* fix: probably fix tests
* feat: no waiting here
* feat: wait here
* feat: much larger test
* feat: need to sync here
* feat: make this async
* feat: no waiting!
* feat: cut here
* feat: sync copy
* feat: random imports
* feat: much cleaner world
* feat: restore this
* feat: restore this
* clean: cleanup
* feat: set this
* create cache for q learning
* make linter happy
* global beam
* where it belongs
* bugfix
* ditch the kopt, use the beam
* faster lin and DEBUG=2 okay
* remove kopt, move search to features
* load weights in fp16
* add dtype option in nn
* fix test
* no need for dtype in nn
* add option to load weights in FP16, but it gives NaN
* change loss scaler
* cast to float32 for norm layer
* add a todo for the forward pass padding
* fix transform
* start work on auto opt
* lin failure
* not beating hcopt
* greedy
* timing is fast
* codegen.search
* greedy search in handcode_opt
* track running gflops
* clean up those files
* no failure
* Allow multi-input model export
* Add model export unit test
* Fix efficientnet compilation
* Only run model export test on JIT supported devices
* Skip export model test if not EXPORT_SUPPORTED_DEVICE
* simplify gpt2 example
* kernel_jitted_count and jit tests
* Revert "kernel_jitted_count and jit tests"
This reverts commit 31a3c26dd0.
* all_jitted test in test_real_world
* added missing colon
* bug fixes for cifar10 dataset loading
needed a reshape to work with conv layers, and resolved the fetched tensor to numpy since later code expects a numpy array
* lazy cleanups
* ast functions take in LazyOps
* op instead of self.op
* _base for mops
* fix contiguous
* start schedule
* test_schedule
* fix openpilot
* more tests
* bugfix and test skip
* work
* make sure things get freed
* fix zerosized tensors
* fix failing test
* fix ceil and friends
* fix openpilot
* disable training
* disable test collectives
* small helps
* got something working
* faster?
* faster yes
* cleanup
* cleanup
* cleanup
* Fix non jit
* Fix fp16 and some cleanup
* Fix fp16 and some cleanup
* cleanup
* similar to master
* cleanup
* change reduceop heuristics
* add model ema and jit hack
* add ema eval
* have to create a duplicate eval function for jit
* remove manual seed
* 94% achievable with normal eval
* ema is outputting the same results as normal
* fix ema bug
* ema achieves 94% with fixed seed
* multigpu tested
* constant fold decay, fix jit, adjust message for multigpu
* pull SpeedyResNet out of train_cifar()
* patch to remove hack from stable_diffusion.py
* sorry linter
* realize after assign?
* float16 broken in llvmlite, use float64 for now
* int32
* idiot forgot to change test array dtype