* Pad2d backward pass on GPU
* Faster Pad2D GPU backward pass (no zeroing needed)
* Fix out of bounds error
* Don't save prg
* Let compiler optimize division by 1
* More generic broadcasting (1s at the start)
* Bug fix
* Add comment
* Try to fix flaky test with other method
* Add mixed broadcast support
* 1kernel
* Separate broadcast tests
Co-authored-by: holonomicjl <58403584+holonomicjl@users.noreply.github.com>
* no trailing whitespace
* GPU MaxPool2D.backward(); TinyConvNet train passes!
* Fix GPU avgpool.forward() init_val
Doesn’t change result but is simpler.
* Fix MaxPool GPU init_val
Tests only cover random non-negative inputs. This fixes issues if negative inputs are fed to GPU MaxPool2D. Test update to follow.
* to make it work locally
* definitely not working
* Conv2D GPU passes some of the tests
* Conv2D GPU passes more of the tests
* passes some tests and mnist
* removed unecessary code
* Conv2D Backpass works
* wrong test_ops.py
* white space + test backward
* ereased useless code
* removed default argument
* long lines
Strided CPU Pooling was introduced but assumes small kernel size
(<=(10,10)), but efficientnet.py feeds kernel_size=(112,112).
This causes a huge array buffer allocation in stack_for_pool() that
hangs inference for a long time or until system OOM.
Revert CPU Pooling for now, and re-introduce #74 later with a new
global-average-pooling op that can be used instead of avgpool2d with
large kernel size for efficientnet inference.
Co-authored-by: Ryan Neph <ryanneph@google.com>
* copy tensors to and from gpu
* add on GPU
* adding works
* we stick shapes in
* works on cpu and gpu
* test changes, not passing yet
* something else
* op tests pass
* add, mean, and sum have working forward/backward
* mul ops test
* no gpu support, no problem
* test pass, clean up later
* gpu cleanup
* cleanup test ops, don't let div fail
* revert more
* aimpler dispatcher
* clean up grad
* GPU and
* grad is a Tensor now
* gate test on GPU
* cleanups
* late loading gpu
* GPU as input option
* last cleanups