* simple lazy
* simple
* fix graph and make realize simpler
* SHUFFLE_MOVEMENT_OPS already works
* MERGE_MOVEMENT_OPS and REMOVE_MOVEMENT_NOPS (rewrites sketched after this list)
* it works, but it's slow
* constant inlining (sketched after this list)
* cache misses are the reason for loss
* fix non-determinism
* cleanup, a few tests fail
* profile
* cache lazyop (sketched after this list)
* cleanups
* create namedtuple once
* bunch of caches
* it's not deleting
* nograd
* caching allocator (sketched after this list)
* reduce_op
* fromCPU if you want fromCPU
* complain
* nvidia fix
* realized on Tensor
* numpy is very slow
* no loads in second run
* caching in View
* 10ms speedups on batman
* remove old profiler
* bunch of refactors
* contiguous on view
* elementwise_op_compile for conv
* support ewop after processing op
* this still works
* conv folding works (fusion sketched after this list)
* all we do is conv conv conv no matter what
* all args to the conv
* still works
* unify conv and ewop
* ops_gpu cleanup
* move around ops_gpu
* remove caching allocator
* remove unused
* find_conv shorten
* gpu refactors
* simpler gpu
* and that
* cmp is fast
* 18ms on mac
* it's a lot of lines, but it's faster
* minor
* tests pass
* LoadOps.CONTIGUOUS
* remove dups
* torch converter doesn't support slice
* move lazy out for merge
* LoadOps are only for lazy
* mergeable without this
* ops torch
* accelerated opencl
* it's running, it's just wrong
* bugfix
* model is correct in opencl
* lazy image convert
* add padding support to convolution
* that stuff was all upstreamed
* remove HEAD
* oops
* test_simple_conv2d_4 passes, add dilation support
* put logic in ops_opencl
* fix crash
* hmm, stride seems okay
* padding for batched inputs
* just an issue now with cout%4
* op model still passes
* fix startPackedInputChannel
* pre and post processing ops for graph
* don't break other llops
* shapetrackering
* reshapes are free
* lazy movement ops
* start shapetracker
* that late reshape is crushing our hopes
* simple failure
* DumbShapeTracker passes tests
* improve st tests
* stacked view tracker works
* flip works
* tests pass
* shapetracker works (sketched after this list)
* use ShapeTracker in ops_gpu
* a couple lines
* fix 0 shape
* less lines
* use shapetracker for new_shape in ops.py
* simpler still
* padding with a ZeroView (sketched after this list)
* gamed it a little
* replace broadcasting with expand (sketched after this list)
* Tensor, not self
* remove broadcasting from mlops
* delete useless A operator
* expand, not repeat
* remove A op
* expand on gpu
* binary_op doesn't broadcast anymore
* expand is still total junk, but the tests should pass
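
A minimal sketch of the MERGE_MOVEMENT_OPS / REMOVE_MOVEMENT_NOPS idea from the lazy graph work above, assuming a toy `Node` class (the class, the `simplify` helper, and the string op names are invented for illustration, not tinygrad's actual code):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Toy lazy-graph node; the real LazyBuffer is much richer.
@dataclass
class Node:
    op: str                       # "LOAD", "RESHAPE", ...
    shape: Tuple[int, ...]
    src: Optional["Node"] = None

def simplify(n: Node) -> Node:
    if n.src is None:
        return n
    src = simplify(n.src)
    # REMOVE_MOVEMENT_NOPS: a reshape to the same shape does nothing
    if n.op == "RESHAPE" and n.shape == src.shape:
        return src
    # MERGE_MOVEMENT_OPS: reshape(reshape(x, s1), s2) == reshape(x, s2)
    if n.op == "RESHAPE" and src.op == "RESHAPE":
        return Node("RESHAPE", n.shape, src.src)
    return Node(n.op, n.shape, src)

x = Node("LOAD", (4, 4))
assert simplify(Node("RESHAPE", (16,), Node("RESHAPE", (2, 8), x))) == Node("RESHAPE", (16,), x)
```

SHUFFLE_MOVEMENT_OPS is the companion rewrite that pushes movement ops past elementwise ops so more of them end up adjacent and mergeable.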
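
How constant inlining pays off at codegen, as a hedged sketch: a scalar constant becomes a literal in the generated kernel source instead of another buffer argument plus a load. The tiny `render` walker and the dict node encoding here are invented for illustration:

```python
def render(node) -> str:
    if node["op"] == "CONST":
        return f"{node['val']}f"                 # inlined literal, no buffer load
    if node["op"] == "LOAD":
        return f"data{node['buf']}[gid]"         # real input: stays a load
    if node["op"] == "MUL":
        return f"({render(node['a'])}*{render(node['b'])})"
    raise NotImplementedError(node["op"])

expr = {"op": "MUL", "a": {"op": "LOAD", "buf": 0}, "b": {"op": "CONST", "val": 2.0}}
assert render(expr) == "(data0[gid]*2.0f)"
```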
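
A hedged sketch of what caching lazyops buys: identical (op, srcs, arg) combinations hand back the same node, so repeated subexpressions become shared DAG nodes instead of duplicated trees. The tuple encoding and `lazyop` helper are illustrative:

```python
_cache = {}

def lazyop(op, srcs, arg=None):
    key = (op, srcs, arg)
    if key not in _cache:
        _cache[key] = ("LazyOp", op, srcs, arg)   # build each unique op once
    return _cache[key]

a = lazyop("ADD", ("x", "y"))
b = lazyop("ADD", ("x", "y"))
assert a is b   # cache hit: the graph shares one node
```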
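
The caching allocator that appears (and is later removed) above follows a standard pattern: freed device buffers go into a per-size free list and get reused on the next allocation of that size, skipping the driver round trip. A sketch under that assumption; `raw_alloc`/`raw_free` stand in for whatever backend calls are actually used:

```python
from collections import defaultdict

class CachingAllocator:
    def __init__(self, raw_alloc, raw_free):
        self.raw_alloc, self.raw_free = raw_alloc, raw_free
        self.free_lists = defaultdict(list)      # size -> [buffer, ...]

    def alloc(self, size: int):
        if self.free_lists[size]:
            return self.free_lists[size].pop()   # reuse: no driver call
        return self.raw_alloc(size)

    def free(self, buf, size: int):
        self.free_lists[size].append(buf)        # defer the real free

    def empty_cache(self):
        for size, bufs in self.free_lists.items():
            for buf in bufs:
                self.raw_free(buf)
        self.free_lists.clear()
```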
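
What the conv folding commits amount to, reduced to a hedged 1-D pure-Python sketch: the elementwise op is applied to the conv accumulator before the store, so a conv followed by e.g. a ReLU is one pass instead of two. Channels, strides, and the real codegen are elided; `conv1d_fused` is invented for illustration:

```python
def conv1d_fused(x, w, ewop=lambda v: v):
    K = len(w)
    out = []
    for i in range(len(x) - K + 1):
        acc = sum(x[i + j] * w[j] for j in range(K))
        out.append(ewop(acc))   # folded elementwise op runs on the accumulator
    return out

relu = lambda v: max(v, 0.0)
assert conv1d_fused([1, -2, 3, -4], [1, 1], ewop=relu) == [0.0, 1.0, 0.0]
```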
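
The core ShapeTracker idea, as a deliberately dumb sketch (close in spirit to the DumbShapeTracker mentioned above, but invented here): movement ops only rewrite shape, strides, and offset, so indexing changes while the underlying buffer never moves; that is why "reshapes are free".

```python
class MiniShapeTracker:
    def __init__(self, shape):
        self.shape, self.offset = list(shape), 0
        self.strides, acc = [0] * len(shape), 1
        for i in range(len(shape) - 1, -1, -1):
            self.strides[i], acc = acc, acc * shape[i]

    def permute(self, order):
        self.shape = [self.shape[i] for i in order]
        self.strides = [self.strides[i] for i in order]

    def flip(self, axis):
        # read from the far end: offset jumps to the last element, stride negates
        self.offset += (self.shape[axis] - 1) * self.strides[axis]
        self.strides[axis] = -self.strides[axis]

    def index(self, idxs) -> int:
        # where a logical index lands in the (untouched) underlying buffer
        return self.offset + sum(i * s for i, s in zip(idxs, self.strides))

st = MiniShapeTracker((2, 3))
st.permute((1, 0))               # logically 3x2 now; no data moved
assert st.index((2, 1)) == 5     # original element [1][2] of the 2x3 buffer
```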
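
"padding with a ZeroView" extends the same trick to padding: the view answers 0 for indices that fall in the pad region instead of materializing a padded buffer. A 1-D hedged sketch of the idea; `zero_view` is invented for illustration:

```python
def zero_view(buf, pad_before: int, pad_after: int):
    n = len(buf)
    def get(i: int):
        j = i - pad_before
        return buf[j] if 0 <= j < n else 0   # pad region reads as zero
    return get, n + pad_before + pad_after

get, size = zero_view([1, 2, 3], pad_before=2, pad_after=1)
assert [get(i) for i in range(size)] == [0, 0, 1, 2, 3, 0]
```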
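
The closing run of commits swaps implicit broadcasting for an explicit expand movement op, so binary_op only ever sees equal shapes. A hedged sketch of that lowering, with numpy standing in for the backend and `lower_broadcast` invented for illustration; a size-1 axis expands into a stride-0 view, so no data is copied:

```python
import numpy as np

def lower_broadcast(x: np.ndarray, y: np.ndarray):
    nd = max(x.ndim, y.ndim)
    xs = (1,) * (nd - x.ndim) + x.shape     # reshape: pad with leading 1s
    ys = (1,) * (nd - y.ndim) + y.shape
    out = tuple(max(a, b) for a, b in zip(xs, ys))
    # expand: size-1 axes become stride-0 reads of the target shape
    return np.broadcast_to(x.reshape(xs), out), np.broadcast_to(y.reshape(ys), out)

a, b = lower_broadcast(np.ones((3, 1)), np.ones((4,)))
assert a.shape == b.shape == (3, 4)         # binary_op now sees matching shapes
assert a.strides[1] == 0                    # expanded axis repeats via stride 0
```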