* triton can add
* print stuff from triton
* write out file
* ops triton working
* reduce ops
* sort of works
* Triton bugfixes & implementation of remaining ops (#490)
* padding
* support pow, max, relu, gt0
* allocate return buffer
* Fix reduce
* Add tests for power op
* Fix triton illegal memory accesses and memory leak (#512)
* Fix mypy issue
* Add triton to setup.py
* Replace torch with pycuda
* Use one cuda stream for data transfer and kernels
* Remove triton submodule
* Fix memory leak by using weakrefs for caching
* Fix memory access by adding valid as mask for load
* Fix invalid kernel launches by flattening the grid (#515)
---------
Co-authored-by: Martin Loretz <20306567+martinloretzzz@users.noreply.github.com>