We can replace += with = since tmp is only written once.
Because the initial contents no longer matter, np.empty() can replace np.zeros(), which might be slightly faster.
This saves a few milliseconds, ~60ms in the best case.
(However, most of the time in ops_cpu.processing_op() seems to be spent on np.reshape().)
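
A minimal sketch of the idea, not the actual ops_cpu code: the function names, shapes, and the `blocks` argument are made up for illustration. It only shows why plain assignment into an np.empty() buffer gives the same result as accumulating into an np.zeros() buffer when every slot is written exactly once.

```python
import numpy as np

def fill_with_accumulate(blocks):
    # Old pattern (hypothetical): zero-initialised buffer, then +=.
    # The zero fill is required here because += reads the old value.
    tmp = np.zeros((len(blocks),) + blocks[0].shape, dtype=blocks[0].dtype)
    for i, block in enumerate(blocks):
        tmp[i] += block
    return tmp

def fill_with_assign(blocks):
    # New pattern (hypothetical): uninitialised buffer, then =.
    # Safe only because every slot of tmp is written exactly once,
    # so the garbage left by np.empty() is never read.
    tmp = np.empty((len(blocks),) + blocks[0].shape, dtype=blocks[0].dtype)
    for i, block in enumerate(blocks):
        tmp[i] = block
    return tmp

rng = np.random.default_rng(0)
blocks = [rng.standard_normal((4, 8)).astype(np.float32) for _ in range(3)]
same = np.array_equal(fill_with_accumulate(blocks), fill_with_assign(blocks))
```

If any slot were skipped or written more than once with +=, the np.empty() version would no longer be equivalent, so the swap depends on the single-write property stated above.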
* accelerated opencl
* it's running, it's just wrong
* bugfix
* model is correct in opencl
* lazy image convert
* add padding support to convolution
* that stuff was all upstreamed
* remove HEAD
* oops
* test_simple_conv2d_4 passes, add dilation support
* put logic in ops_opencl
* fix crash
* hmm, stride seems okay
* padding for batched inputs
* just an issue now with cout%4
* op model still passes
* fix startPackedInputChannel
* pre and post processing ops for graph
* don't break other llops
* shapetrackering
* reshapes are free
* lazy movement ops
* start shapetracker
* that late reshape is crushing our hopes
* simple failure
* DumbShapeTracker passes tests
* improve st tests
* stacked view tracker works
* flip works
* tests pass
* shapetracker works
* use ShapeTracker in ops_gpu
* a couple lines
* fix 0 shape
* less lines
* use shapetracker for new_shape in ops.py
* simpler still
* padding with a ZeroView
* gamed it a little
* replace broadcasting with expand
* Tensor, not self
* remove broadcasting from mlops
* delete useless A operator
* expand, not repeat
* remove A op
* expand on gpu
* binary_op doesn't broadcast anymore
* expand is still total junk, but the tests should pass