* add image
* load + store + boring stuff:
* image tests pass
* thneed print GFLOPS
* op conv test
* more debugging
* hack for multiview image
* shapetracker creates less views
* disable image tests
* working better
* ugh, lkey not key
* print in DEBUG, and allow views
* works
* simple padding conv2d
* use index for image
* that was bad code
* debug print
* fix types
* less lines
* save lines
* working exec ast
* exec_ast is staticmethod
* GenericExecAST
* fold that sometimes
* ExplicitExecAST
* exec_ast for GPU
* gpu working
* get_lazyop_shape
* now gpubuffer is ExplicitExecAST
* dedup
* add a type
* RESHAPE in opencl code
* fix linter
* that too for linter
* cleanups
* remove dead code
* GenericShape is less lines
* add ALLOWED_KERNEL_COUNT to tests
* fix mypy
* that's gotta be recursive
* fix opencl shape processing
* remove unneeded lambda
* in progress
* big conv test works
* that's unneeded
* fix opencl with reduce
* rewrite contiguous_view_constant_fold
* clean up mids in loop code
* subidx
* print cl kernel before run
* no reduce, no loop
* Revert "no reduce, no loop"
This reverts commit 92777e40e9.
* option for matmul
* fixups
* fast like a nascar
* running
* thneed runner
* no buffer id makes no backing buffer
* move constant folding to the top
* runs on mac
* folded biases
* was v slow
* maybe just that
* elu touchup
* speed and float32
Co-authored-by: Comma Device <device@comma.ai>
* accelerated opencl
* it's running, it's just wrong
* bugfix
* model is correct in opencl
* lazy image convert
* add padding support to convolution
* that stuff was all upstreamed
* remove HEAD
* oops
* test_simple_conv2d_4 passes, add dilation support
* put logic in ops_opencl
* fix crash
* hmm, stride seems okay
* padding for batched inputs
* just an issue now with cout%4
* op model still passes
* fix startPackedInputChannel
* pre and post processing ops for graph
* don't break other llops
* shapetrackering
* reshapes are free
* lazy movement ops
* replace broadcasting with expand
* Tensor, not self
* remove broadcasting from mlops
* delete useless A operator
* expand, not repeat
* remove A op
* expand on gpu
* binary_op doesn't broadcast anymore
* expand is still total junk, but the tests should pass