* allow for general broadcasting of binary operations. can handle any situation where corresponding dimensions between the tensors match, or at least one of them is of size 1. if a tensor has fewer dimensions than the other, its shape is padded with 1s until both have the same number of dimensions. also refactored buffer_zeros() by creating a function buff() that makes a buffer from a numpy array
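A rough sketch of that shape rule (the `broadcast_shape` helper here is illustrative, not the actual tinygrad code):

```python
# illustrative helper, not the tinygrad implementation
def broadcast_shape(shape_x, shape_y):
    # pad the shorter shape with 1s on the left so both have the same rank
    n = max(len(shape_x), len(shape_y))
    shape_x = (1,) * (n - len(shape_x)) + tuple(shape_x)
    shape_y = (1,) * (n - len(shape_y)) + tuple(shape_y)
    out = []
    for a, b in zip(shape_x, shape_y):
        # dims must match, or at least one of them must be 1
        if a != b and a != 1 and b != 1:
            raise ValueError(f"shapes {shape_x} and {shape_y} do not broadcast")
        out.append(max(a, b))
    return tuple(out)

assert broadcast_shape((3, 1, 5), (4, 5)) == (3, 4, 5)
```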
* remove extra tabs
* messy loop unrolling
* fix loop unrolling bugs
* revert loop unrolling changes, new plan here
* binary_op(): avoid having a loop in the GPU C code, instead compute indices with nested expressions. simple broadcasts should have a similar level of performance to the simple-broadcast-specific code that was there before. broke out codegen and compilation into get_binop_prg(), which has a larger cache and depends only on the operation type and complist (this avoids doing a bunch of python string ops every time we want to compile something we've already compiled). the larger cache is needed since there will end up being quite a few possible types of broadcasts (sum_i^N 3**i is a loose upper bound, N being the maximum number of dimensions). I assumed 5 kinds of binary operations when sizing the cache here: +, -, *, /, and **. More may be needed in the future.
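A minimal sketch of that caching shape, assuming a get_binop_prg(cl_ctx, code, complist) signature and a simplified kernel body (the real codegen derives the broadcast index expressions from complist):

```python
import functools
import pyopencl as cl

# illustrative sketch; the real codegen builds per-dimension index
# expressions from complist, omitted here for brevity
@functools.lru_cache(maxsize=2 ** 12)  # room for many (op, broadcast pattern) pairs
def get_binop_prg(cl_ctx, code, complist):
    # python string work and compilation only happen on a cache miss
    src = """
    __kernel void binop(__global const float *x_g, __global const float *y_g, __global float *res_g) {
      int gid0 = get_global_id(0);
      float a = x_g[gid0];
      float b = y_g[gid0];
      res_g[gid0] = """ + code + """;
    }"""
    return cl.Program(cl_ctx, src).build()
```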
* add .cl to binop arguments
* solved edge case where len(dimlist)==0. still has problems when len(dimlist) > CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS
* pyopencl can't handle more than 3 gids, so we just use 1 gid and compute the indices into the returned tensor in the kernel. this means more computation for the individual indices, but less for the index into the flattened tensor (last line of kernel), since it's just gid0
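In python terms, the index recovery the kernel does from that single gid looks roughly like this (illustrative, mirroring the arithmetic the generated kernel inlines as nested expressions):

```python
# illustrative: recover per-dimension indices from the single flat gid
def unflatten(gid0, shape):
    idxs = []
    for dim in reversed(shape):
        idxs.append(gid0 % dim)
        gid0 //= dim
    return tuple(reversed(idxs))

assert unflatten(7, (2, 3, 4)) == (0, 1, 3)  # 7 == (0*3 + 1)*4 + 3
```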
* trim some lines
Co-authored-by: phillip <phillip_bement@reedbement.com>
* number of categories for efficientnet
* need layer_init_uniform
* merge fail
* merge fail
* batchnorms
* needs work
* needs work: how to determine training mode
* pow
* needs work
* reshape was needed
* sum with axis
* sum with axis and tests
* broken
* works again
* clean up
* Update test_ops.py
* using sum
* don't always update running_stats
* space
* self
* default return running_stats
* passes test
* need to use mean
* merge
* testing
* fixing pow
* test_ops had a line dropped
* undo pow
* rebase
* Consistent GPU classes
Convert the existing GPU classes into one standard format.
Remove duplicated functions in `test_mnist` and create a TestMNISTGPU
class. This reduces line count and ensures consistency.
Use `@unittest.skipUnless(GPU, "Requires GPU")` instead of `if GPU:` to
skip GPU testing. This will ensure that skipped tests are displayed
accordingly in the pytest output.
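For reference, the skip pattern looks roughly like this (class body and import path are illustrative):

```python
import unittest
from tinygrad.tensor import GPU  # assumed location of the GPU availability flag

@unittest.skipUnless(GPU, "Requires GPU")
class TestMNISTGPU(unittest.TestCase):
    # skipped tests show up in the pytest output instead of silently disappearing
    def test_conv_gpu(self):
        ...
```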
* Optim Testing now supports GPU
* Tensor testing now supports GPU
jacobian and gradcheck are auto-skipped until GPU float64 support is added.
* GPU support for custom constructor methods
* Remove GPU flag from Model constructors
It was requested that the `gpu` kwarg be removed from the model
constructor. GPU conversion is now handled in the train function.
This also required converting the Optimizer parameters, since they are
constructed prior to execution of the `train` function and are dependent
on the model's GPU state.
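A minimal sketch of that flow, assuming the `get_parameters` helper mentioned below and a hypothetical in-place GPU conversion method:

```python
# illustrative only: move model and optimizer parameters to the GPU inside
# train(); get_parameters() usage and gpu_() are stand-ins, not the exact API
def train(model, optim, steps, gpu=False):
    if gpu:
        # the optimizer captured the model's parameters when it was constructed,
        # before train() ran, so its references need converting as well;
        # dedup by identity since both lists share tensor objects
        seen = {id(p): p for p in get_parameters(model) + get_parameters(optim)}
        for p in seen.values():
            p.gpu_()
    for _ in range(steps):
        ...  # forward pass, loss.backward(), optim.step()
```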
* Fix typo: float32->float64
* Clean `get_parameters` utility
Just a quick refactor w/ the new support for optimizers.
* Remove GPU kwarg from TinyNet
Remove `gpu` kwarg from TinyNet to match the test_mnist `train` function.