* fix LtNode simplification when lhs and rhs contain same variables
`(Variable("a", 1, 5) < Variable("a", 1, 5))` should eval to `NumNode(0)`
* fix with less perf impact
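The rule in the message above (a bounded variable compared with itself folds to a constant) can be sketched with hypothetical, stripped-down stand-ins for the symbolic nodes. `NumNode`, `Variable`, their fields, and the tuple placeholder for an unresolved comparison are assumptions for illustration, not tinygrad's actual classes.

```python
from dataclasses import dataclass

# Hypothetical minimal stand-ins; tinygrad's real symbolic nodes carry far more logic.
@dataclass(frozen=True)
class NumNode:
    b: int

@dataclass(frozen=True)
class Variable:
    expr: str
    vmin: int
    vmax: int

    def __lt__(self, other):
        if isinstance(other, Variable):
            # x < x is false for every value x can take, so fold to 0.
            if other == self:
                return NumNode(0)
            # Disjoint ranges decide the comparison from the bounds alone.
            if self.vmax < other.vmin:
                return NumNode(1)
            if self.vmin >= other.vmax:
                return NumNode(0)
        # Hypothetical placeholder for an LtNode that cannot be simplified.
        return ("LtNode", self, other)
```

With `Variable("a", 1, 5) < Variable("a", 1, 5)` the sketch returns `NumNode(0)` regardless of the overlapping bounds, which is the case the fix targets.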
* Fix numpy uint/int overflow
* lol
* Works
* Update
* Move overflow test to float64/float32
* One line
* Update
* One more
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
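For context on the overflow commits above, a short sketch of general NumPy behavior (not the specific fix): fixed-width integer arrays wrap on overflow while Python ints do not, and float32 overflows to inf where float64 still has room. The values are arbitrary examples.

```python
import numpy as np

a = np.array([250], dtype=np.uint8)
b = np.array([10], dtype=np.uint8)
wrapped = a + b                  # uint8 arithmetic wraps modulo 256
exact = int(a[0]) + int(b[0])    # Python ints are arbitrary precision

with np.errstate(over="ignore"):  # silence the overflow RuntimeWarning
    f32 = np.array([3e38], dtype=np.float32) * np.float32(10)  # exceeds float32 max
f64 = np.array([3e38], dtype=np.float64) * 10.0                # still finite in float64
```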
* Cast correctly in python emulator
* Update test yml and fix lint
* make ruff pass
* mypy passes
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
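A Python emulator has to reproduce C-style casts on top of Python's float64 and arbitrary-precision ints. A minimal sketch of one common approach, round-tripping through `struct`, follows; the helper names are hypothetical and not taken from the emulator.

```python
import struct

def to_float32(x: float) -> float:
    # Round-trip through a 4-byte IEEE-754 single to drop float64-only precision.
    # Note: struct raises OverflowError for values beyond float32 range.
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_int32(x: int) -> int:
    # Wrap into the signed 32-bit range the way a C cast would.
    return struct.unpack("<i", struct.pack("<I", x & 0xFFFFFFFF))[0]

def float_to_int(x: float) -> int:
    # C float -> int conversion truncates toward zero, as Python's int() does.
    return int(x)
```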
It's recommended to use __getnewargs__ to supply the arguments that classes defining __new__ need when they are unpickled.
It's preferred because it does not change the behavior of __new__.
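A minimal sketch of the pattern described above, using a hypothetical `Var` class: with pickle protocol 2 or higher, loading calls `Var.__new__(Var, *obj.__getnewargs__())` and then restores `__dict__`, so `__new__` keeps its required arguments. Without `__getnewargs__`, unpickling would call `Var.__new__(Var)` and fail with a `TypeError`.

```python
import pickle

class Var:
    """Hypothetical class whose __new__ requires arguments."""
    def __new__(cls, name: str, lo: int, hi: int):
        obj = super().__new__(cls)
        obj.name, obj.lo, obj.hi = name, lo, hi
        return obj

    def __getnewargs__(self):
        # pickle passes these back to Var.__new__ when loading
        return (self.name, self.lo, self.hi)

v = pickle.loads(pickle.dumps(Var("a", 1, 5)))
```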
* do not truncate float64 precision
* use l suffix to try to avoid overload confusion
* long line, ruff bloats the function otherwise
* fmt
* remove long double suffix (l), it's sufficient to have the float32 (f) suffix to avoid function overload ambiguity; add test showcasing rtol=1e-12 precision increase, the test fails without the renderer changes
* use more reasonable test values, same as test_int_to_float_unary_func
* disable test for CUDACPU: it does not support half and segfaults on some operations, per the dtypes_alu test
* disable test for HIP: the renderer does not support f64 precision
* do not use noqa E501, break up condition
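The float64 precision commits above come down to how a constant is printed into kernel source. A minimal sketch, assuming a hypothetical `render_const` helper: `repr` gives the shortest literal that round-trips a float64 exactly, while a fixed `%f` format truncates, and float32 constants keep only the `f` suffix.

```python
def render_const(x: float, dtype: str) -> str:
    """Hypothetical helper: render a float literal for C-like kernel source."""
    if dtype == "float64":
        return repr(x)       # shortest literal that round-trips exactly
    if dtype == "float32":
        return f"{x!r}f"     # 'f' suffix makes it a float literal, avoiding double overloads
    raise ValueError(f"unsupported dtype: {dtype}")

third = 1 / 3
truncated = "%f" % third     # '0.333333': precision silently lost
```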
* remove cpu and torch backends
* don't copy to cpu
* use clang instead of cpu
* multitensor gathers on the first device
* clang is cpu + use default
* fixup
* bugfix
* fix OverflowError in UnaryOps.EXP2
* avoid accessing outputs for void uops
* skip execution for UOps.IF and UOps.ENDIF
* initialize bytearray to the correct size in UOps.DEFINE_LOCAL
* validate len of input that has .sz > 1
* remove comment in code
* reinitialize loop if already iterated
* validate first value in input to be a list for inputs with .sz > 1
* add python ops tests to CI
* skip long runtime tests for PYTHON backend
* respect dtype.sz arg in UOps.CONST, and remove incorrect validation in UOps.STORE
* use math.inf instead of float('inf')
* handle 0 args to UnaryOps.LOG2
* handle load op with default of .sz > 1
* initialize the loop correctly using UOps.LOOP arg
* remove unnecessary TODO comment
* remove newline
* select a subset of 22 ops tests to skip in CI when PYTHON=1
* handle gated UOps.LOAD referencing values that have .sz > 1
* Revert "select a subset of 22 ops tests to skip in CI when PYTHON=1"
This reverts commit 7674fee81d.
* skip tests in python backend CI command
* push fix lost in conflict resolution
* Revert "skip long runtime tests for PYTHON backend"
This reverts commit 5dd2a0376e.
* clear loop state after last iteration
* rename .sz to .count on dtype (and ANETensor for completeness)
* revert the changes to extra, as per review
* try to make linter happier
* remove the change to extra
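The EXP2 and LOG2 fixes above deal with Python raising where IEEE hardware returns inf or nan: `2.0 ** 1024` raises `OverflowError` and `math.log2(0)` raises `ValueError`. A minimal sketch of emulating the hardware results; the function names are hypothetical, not the emulator's own.

```python
import math

def exp2(x: float) -> float:
    try:
        return 2.0 ** x
    except OverflowError:
        # Too large for float64: saturate to +inf as the hardware would.
        return math.inf

def log2(x: float) -> float:
    if x == 0:
        return -math.inf      # log2(0) is -inf under IEEE-754
    if x < 0 or math.isnan(x):
        return math.nan       # outside the real domain
    return math.log2(x)
```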
* remove float cast
* cast scalars to the correct value in creation time
* cast scalar in the correct place
* wrong, use y_dtype
* make consts have a unique cache key
* add cast_scalar back
* test_load_cache_const_bufs
* add bool dtype
* test_const_dtype
* fix linters
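One plausible reason consts need a unique cache key: Python treats `1`, `1.0` and `True` as equal with equal hashes, so a cache keyed by value alone merges consts of different dtypes. A minimal sketch, assuming a `(value, dtype)` key with dtype names written as plain strings for illustration:

```python
entries = [(1, "int32"), (1.0, "float32"), (True, "bool")]

by_value = {}
for value, dtype in entries:
    by_value[value] = f"{dtype} const"    # 1, 1.0 and True collide on one key

by_value_and_dtype = {}
for value, dtype in entries:
    by_value_and_dtype[(value, dtype)] = f"{dtype} const"  # dtype keeps them apart
```

The dict keeps the first key object (`1`) but the last value written, so the value-only cache silently reports a bool const for the int.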
* emulated ops_hip infra
* add int4
* include test_indexing in remu
* Revert "Merge branch 'remu-dev-mac'"
This reverts commit 6870457e57, reversing
changes made to 3c4c8c9e16.
* generic rendering of half and bf16
hotfix
* fix uops + regression test
* fix the test for metal's half4
* uop.uop fixup
* mypy with --strict-equality, fix ops_gpu
* set metal fast math default to 0 (disabled)
It's a correctness fix because we use inf and nan. Let's see how slow it is.
* skip failed onnx tests
* tmp DISABLE_COMPILER_CACHE=1 in metal benchmark
* Revert "tmp DISABLE_COMPILER_CACHE=1 in metal benchmark"
This reverts commit 22267df380.