* remove ctx from gpu ops
* ctx for the others
* this is okay
* mlops are not static. fix lazy
* cl is property, _processing_op is class method
* kernel_name
* contiguous_op
* simple lazy
* simple
* fix graph and make realize simpler
* SHUFFLE_MOVEMENT_OPS already works
* MERGE_MOVEMENT_OPS and REMOVE_MOVEMENT_NOPS (see the sketch after this list)
* it works, but it's slow
* constant inlining
* cache misses are the reason for loss
* fix non determinism
* cleanup, a few tests fail
* profile
* cache lazyop
* cleanups
* create namedtuple once
* bunch of caches
* it's not deleting
* nograd
* caching allocator
* reduce_op
* fromCPU if you want fromCPU
* complain
* nvidia fix
* realized on Tensor
* numpy is very slow
* no loads in second run
* caching in View
* 10ms speedups on batman
* remove old profiler
* bunch of refactors
* contiguous on view
* elementwise_op_compile for conv
* support ewop after processing op
* this still works
* conv folding works
* all we do is conv conv conv no matter what
* all args to the conv
* still works
* unify conv and ewop
* ops_gpu cleanup
* move around ops_gpu
* remove caching allocator
* remove unused
* find_conv shorten
* gpu refactors
* simpler gpu
* and that
* cmp is fast
* 18ms on mac
* it's a lot of lines, but it's faster
* minor
* tests pass
* LoadOps.CONTIGUOUS
* remove dups
* torch converter doesn't support slice
* move lazy out for merge
* LoadOps are only for lazy
* mergeable without this
* ops torch
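The movement-op rewrites named above (SHUFFLE_MOVEMENT_OPS, MERGE_MOVEMENT_OPS, REMOVE_MOVEMENT_NOPS) are graph rewrites on the lazy buffer DAG. A minimal sketch of the idea; the LazyBuffer class and op names here are illustrative stand-ins, not tinygrad's actual internals:

```python
class LazyBuffer:
    # Toy lazy node: an op applied to a source buffer, evaluated later.
    def __init__(self, shape, op=None, src=None, arg=None):
        self.shape, self.op, self.src, self.arg = shape, op, src, arg

def reshape(x, new_shape):
    # REMOVE_MOVEMENT_NOPS: a reshape to the same shape is dropped entirely.
    if new_shape == x.shape:
        return x
    # MERGE_MOVEMENT_OPS: reshape(reshape(y)) collapses into one reshape of y.
    if x.op == "RESHAPE":
        return LazyBuffer(new_shape, "RESHAPE", x.src, new_shape)
    # SHUFFLE_MOVEMENT_OPS: push the reshape through an elementwise op so the
    # elementwise chain stays adjacent (and fusable) in the graph.
    if x.op == "RELU":
        return LazyBuffer(new_shape, "RELU", reshape(x.src, new_shape))
    return LazyBuffer(new_shape, "RESHAPE", x, new_shape)

base = LazyBuffer((2, 3, 4))
a = reshape(reshape(base, (6, 4)), (2, 12))  # one RESHAPE node, not two
assert a.op == "RESHAPE" and a.src is base
assert reshape(base, (2, 3, 4)) is base      # the no-op never enters the graph
```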
We can replace += with = since we only change tmp once.
Now np.empty() can replace np.zeros(), which might be slightly faster.
This saves a few milliseconds, ~60ms in the best case.
(However, most of the time in ops_cpu.processing_op() seems to be spent on np.reshape().)
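A hedged sketch of that change; the shape and variable names are stand-ins for the real ops_cpu code:

```python
import numpy as np

shape = (64, 128)
contribution = np.random.rand(*shape).astype(np.float32)  # stand-in value

# Before: the buffer had to start zeroed because it was accumulated into.
tmp = np.zeros(shape, dtype=np.float32)
tmp += contribution

# After: tmp is only written once, so plain assignment is equivalent and the
# zero-fill can be skipped by allocating uninitialized memory instead.
tmp = np.empty(shape, dtype=np.float32)
tmp[:] = contribution
```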
* accelerated opencl
* it's running, it's just wrong
* bugfix
* model is correct in opencl
* lazy image convert
* add padding support to convolution
* that stuff was all upstreamed
* remove HEAD
* oops
* test_simple_conv2d_4 passes, add dilation support
* put logic in ops_opencl
* fix crash
* hmm, stride seems okay
* padding for batched inputs
* just an issue now with cout%4
* op model still passes
* fix startPackedInputChannel
* pre and post processing ops for graph
* don't break other llops
* shapetrackering
* reshapes are free
* lazy movement ops (see the View sketch below)
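The shapetracker commits at the end are what make reshapes free: movement ops only rewrite shape/stride metadata, never the data. A toy sketch of the idea, assuming a contiguous row-major buffer (the real ShapeTracker also handles permutes, pads, and stacked views):

```python
from dataclasses import dataclass
from math import prod

def strides_for(shape):
    # Row-major strides: the last axis is contiguous.
    out, acc = [], 1
    for s in reversed(shape):
        out.append(acc)
        acc *= s
    return tuple(reversed(out))

@dataclass(frozen=True)
class View:
    shape: tuple
    strides: tuple

    def reshape(self, new_shape):
        # Only the metadata changes; the underlying buffer is never copied.
        assert prod(new_shape) == prod(self.shape)
        return View(new_shape, strides_for(new_shape))

    def offset(self, idxs):
        # Map an n-d index to a flat offset into the unchanged buffer.
        return sum(i * st for i, st in zip(idxs, self.strides))

v = View((2, 3, 4), strides_for((2, 3, 4)))
w = v.reshape((6, 4))  # no data movement, just new shape and strides
assert v.offset((1, 2, 3)) == w.offset((5, 3)) == 23
```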