* option for matmul
* fixups
* fast like a nascar
* running
* thneed runner
* no buffer id makes no backing buffer
* move constant folding to the top
* runs on mac
* folded biases
* was v slow
* maybe just that
* elu touchup
* speed and float32
Co-authored-by: Comma Device <device@comma.ai>
* simple lazy
* simple
* fix graph and make realize simpler
* SHUFFLE_MOVEMENT_OPS already works
* MERGE_MOVEMENT_OPS and REMOVE_MOVEMENT_NOPS
* it works, but it's slow
* constant inlining
* cache misses are the reason for loss
* fix non determinism
* cleanup, a few tests fail
* profile
* cache lazyop
* cleanups
* create namedtuple once
* bunch of caches
* it's not deleting
* nograd
* caching allocator
* reduce_op
* fromCPU if you want fromCPU
* complain
* nvidia fix
* realized on Tensor
* numpy is very slow
* no loads in second run
* caching in View
* 10ms speedups on batman
* remove old profiler
* bunch of refactors
* contiguous on view
* elementwise_op_compile for conv
* support ewop after processing op
* this still works
* conv folding works
* all we do is conv conv conv no matter what
* all args to the conv
* still works
* unify conv and ewop
* ops_gpu cleanup
* move around ops_gpu
* remove caching allocator
* remove unused
* find_conv shorten
* gpu refactors
* simpler gpu
* and that
* cmp is fast
* 18ms on mac
* it's a lot of lines, but it's faster
* minor
* tests pass
* LoadOps.CONTIGUOUS
* remove dups
* torch converter doesn't support slice
* move lazy out for merge
* LoadOps are only for lazy
* mergeable without this
* ops torch
* accelerated opencl
* it's running, it's just wrong
* bugfix
* model is correct in opencl
* lazy image convert
* add padding support to convolution
* that stuff was all upstreamed
* remove HEAD
* oops
* test_simple_conv2d_4 passes, add dilation support
* put logic in ops_opencl
* fix crash
* hmm, stride seems okay
* padding for batched inputs
* just an issue now with cout%4
* op model still passes
* fix startPackedInputChannel
* pre and post processing ops for graph
* don't break other llops
* shapetrackering
* reshapes are free
* lazy movement ops
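The lazy-evaluation idea these commits circle around (ops build a graph instead of executing, and movement nops get merged or removed before anything is realized) can be sketched roughly as below. This is a hypothetical toy, not the repo's actual `LazyBuffer`; the class and op names here are illustrative only.

```python
# Toy sketch of lazy movement ops (hypothetical, not the real classes):
# nothing computes until realize(), and reshape-of-reshape collapses to
# one node while a reshape to the same shape is dropped entirely.
import numpy as np

class LazyBuffer:
    def __init__(self, op, src=(), arg=None, data=None):
        self.op, self.src, self.arg, self.data = op, src, arg, data

    def reshape(self, shape):
        # remove movement nops: reshaping to the shape we already have
        if self.op == "RESHAPE" and self.arg == shape:
            return self
        # merge movement ops: RESHAPE(RESHAPE(x)) -> RESHAPE(x)
        if self.op == "RESHAPE":
            return LazyBuffer("RESHAPE", self.src, shape)
        return LazyBuffer("RESHAPE", (self,), shape)

    def realize(self):
        # walk the graph only when a concrete result is demanded
        if self.op == "LOAD":
            return self.data
        if self.op == "RESHAPE":
            return self.src[0].realize().reshape(self.arg)

def fromCPU(x):
    # loads are lazy too: just wrap the numpy array in a graph node
    return LazyBuffer("LOAD", data=x)
```

Under this sketch, `fromCPU(np.arange(6)).reshape((2, 3)).reshape((3, 2))` builds a single RESHAPE node over the load rather than two, and only `realize()` touches numpy.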