* init
* removed mulacc
* is uoptimize the problem?
* lol hax make work temporarily fix l8er
* revert extra/ changes
* clean up
* flaky metal tests?
* add back mulacc for metal
* revert last commit
* try skipping linearizer_failure tests
* skip flammit tests... cuz tests all work locally
* try narrow down exact linearizer failure test
* try 2
* try 4
* generated code is the exact same wtf why CI fails
* code for 15 and 17 are exact same with or without mulacc, this should pass
* try only 1 failure
* try garbage collecting lol...
* try del variables lol
* try gcing after del lol...
* is diskcache the problem???
* try disabling opts cache idk
* try remove hack
* try disable github metal cache...
* try CACHELEVEL=0 :D idk anymore
* try increase newCommandQueueWithMaxCommandBufferCount_, im almost out of ideas...
* revert
* actually not a HACK
* oops
* Fix numpy uint/int overflow
* lol
* Works
* Update
* Move overflow test to float64/float32
* One line
* Update
* One more
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
* Cast correctly in python emulator
* Update test yml and fix lint
* make ruff pass
* mypy passes
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
* remove cpu and torch backends
* don't copy to cpu
* use clang instead of cpu
* multitensor gathers on the first device
* clang is cpu + use default
* fixup
* bugfix
* fix OverflowError in UnaryOps.EXP2
* avoid accessing outputs for void uops
* skip execution for UOps.IF and UOps.ENDIF
* initialize bytearray to the correct size in UOps.DEFINE_LOCAL
* validate len of input that has .sz > 1
* remove comment in code
* reinitialize loop of already iterated
* validate first value in input to be a list for inputs with .sz > 1
* add python ops tests to CI
* skip long runtime tests for PYTHON backend
* respect dtype.sz arg in UOps.CONST, and remove incorrect validation in UOps.STORE
* use math.inf instead of float('int')
* handle 0 args to UnaryOPs.LOG2
* handle load op with default of .sz > 1
* initialize the loop correctly using UOps.LOOP arg
* remove unnecessary TODO comment
* remove newline
* select a subset of 22 ops tests to skip in CI when PYTHON=1
* handle gated UOps.LOAD referencing values that have .sz > 1
* Revert "select a subset of 22 ops tests to skip in CI when PYTHON=1"
This reverts commit 7674fee81d.
* skip tests in python backend CI command
* push fix lost in conflict resolve
* Revert "skip long runtime tests for PYTHON backend"
This reverts commit 5dd2a0376e.
* clear loop state after last iteration
* rename .sz for .count on dtype (and ANETensor for completeness)
* revert the changes to extra, as per review
* try to make linter happier
* remove the change to extra
* generic rendering of half and bf16
hotfix
* fix uops + regression test
* fix the test for metal's half4
* uop.uop fixup
* mypy with --strict-equality, fix ops_gpu
* set metal fast math default to 0 (disabled)
It's a correctness fix because we use inf and nan. Let's see how slow it is
* skip failed onnx tests
* tmp DISABLE_COMPILER_CACHE=1 in metal benchmark
* Revert "tmp DISABLE_COMPILER_CACHE=1 in metal benchmark"
This reverts commit 22267df380.
* env var METAL_FAST_MATH to disable fastmath for metal
use this to test impact of fast math. might need to disable compiler cache with DISABLE_COMPILER_CACHE
* failed onnx test with fast math
METAL_FAST_MATH=0 DISABLE_COMPILER_CACHE=1 NOOPT=1 python -m pytest -n=auto test/external/external_test_onnx_backend.py -k test_MaxPool3d_stride_padding_cpu
* ops_python: add HIP tensor core mock and refactor METAL
* Add tests to CI
* add DEBUG=2 to full tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* start uop emu
* tiny_add passes
* more ops
* emulate the whole warp
* test_gemm passes
* metal gemm test pass
* works on big gemm
* works on big gemm
* more tests pass
* touch ups
* fix mypy
* cleanups
* exp2 mypy
* arch is where it belongs
* actually emulate tensor cores
* fix test
* new style
* add gated load support to PYTHON
* out of bounds error message
* cleaner
* start uop emu
* tiny_add passes
* more ops
* emulate the whole warp
* test_gemm passes
* metal gemm test pass
* works on big gemm
* works on big gemm
* more tests pass
* touch ups
* fix mypy
* cleanups
* exp2 mypy
* arch is where it belongs
* actually emulate tensor cores
* fix test
* new style
* skip matacc opt if the all src buffers of mul op are const buffers
* add noqa directive for long test
* unskip MALACC opt
* ensure that a_axes at least includes summation axes in order to perform np.einsum correctly
* add regression test for mulacc op
* compute a_slices using a_axes
* refactor helper of function to retrieve axes and slices for nonzero strides as well as summation axes
* include a regression test that uses and to test the behaviour indirectly
* mockhip->hipcpu
* allocate buffers
* launch a kernel
read_asm api
* run remu in CI
* remu 0.0.2, real test ops
* simple driver
* 0.0.3, all test_ops
* run the latest emulator
* 9 minutes is way too long, drop backprop in CI
* bring back the backward pass
* Revert "bring back the backward pass"
This reverts commit 3781e1bc56.
* Print slowest tests
* emulated device directly in ops_hip
* fix ruff, override mypy for specific rules
* test in the same code path
- hip backend env variables
- install packages and verify autogen
- run certain tests
- remove the other hip tests path
- verify Device.DEFAULT
* remove the emulated hip in extra
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* move gpuctypes in tree
* fix mypy
* regex exclude
* autogen sh
* mypy exclude
* does that fix it
* fix mypy
* add hip confirm
* verify all autogens
* build clang2py
* opencl headers
* gpu on 22.04