* use function for infinity instead of uniform
* test infinity math locally
* test infinity math in CI
* make pytest available to macOS (WebGPU)
* revert to master except failing webgpu test
* Less messy broken graph on paravirtualized metal workaround
GitHub CI macOS runners use paravirtualized Metal, which is broken with
graph (some comments say that ICB in particular is broken, but in my
testing it was sometimes fine and at other times hit an assert inside
Metal's code related to resources, so not sure).
> Assertion failed: (resource != nil), function -[IOGPUMetalResource initWithResource:], file IOGPUMetalResource.m, line 458.
This can be reproduced locally with any virtualization software (like UTM)
that can create macOS VMs with Apple's own virtualization framework.
* unused import
* range has only one op now
* fix z3 checker
* ci fix
* needs shell
* try pip ensure update
* that ensurepip is useless
* upgrade pip before cache
* windows happy?
* propagate use_tensor_cores
* add use_tensor_core to arg in test and search
* bugfix
* get TC val from ContextVar in search
* revert minor space change
* add tc emulation test to ci and benchmark
* revert
* revert whitespace change
* remove test for ptx
* add comment and remove llvm test run
* init
* add expected failure to correctly track progress
* hotfix
* skip for amd_llvm as well
* add skip
* add pr number
* move comment to amd test
* change reason
A lot more work is required to enable all of them and move them into the
osxtests matrix; for now I created a separate runner for them (copied from
WebGPU). Will add test/test_graph.py to those tests in #9876
* set pad to 3 for amd padded tc test
* change pad for amd regardless of CI
* test tc padded uops and correctness separately
* add test_tensor_cores_padded_uops test to ci
* remove redundant check for amd device
* cleanup
* FastPatternMatcher
* works without that
* fix test pickle
* strict len
* compile match function
* dynamic compile
* fast
* faster
* compile
* track
* a lot faster
* clean up
* dup or
* faster and simpler
* fast match doesn't support store
* plane
* minor refactor
* real speed
* don't imply return None
* upat
* fix test
* heard you wanted more speed
* no generator
* split cf
* early fixup
* fxn fixup
* reconstruct_function
* Revert "reconstruct_function"
This reverts commit 37dac010ab.
* simpler stuff
* too big
* upat compile error
* cleanups
* don't cache that
* cleanups
* 10 -> 15
Had to autogen newer uapi headers for #9746 (dmabuf export ioctl missing).
Submitting just the fix without updating to the newer headers, as they are
only needed for infiniband stuff
* fast idiv with tests and fuzzer
* Add todo comment
* Add env variable to toggle fast_idiv
* Move env check
* Add fuzz fast_idiv to ci
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
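For reference, a minimal sketch of the multiply-and-shift idea that a fast
integer division by a constant usually relies on. This is the standard
textbook construction, not necessarily what the change above implements.
The names `magic_for_divisor`, `fast_idiv_sketch` and `fuzz` are hypothetical,
and the unsigned 32-bit input range is an assumption.

```python
import random

def magic_for_divisor(d, nbits=32):
    """Return (m, s) such that (x * m) >> s == x // d for all 0 <= x < 2**nbits.

    Hypothetical helper: uses s = nbits + ceil(log2(d)) and m = ceil(2**s / d).
    """
    if d <= 0:
        raise ValueError("divisor must be positive")
    l = (d - 1).bit_length()   # ceil(log2(d)) for d >= 1
    s = nbits + l
    m = -(-(1 << s) // d)      # ceiling division without floats
    return m, s

def fast_idiv_sketch(x, m, s):
    # One multiply and one shift replace the division.
    return (x * m) >> s

def fuzz(iterations=10000, nbits=32, seed=0):
    """Compare the sketch against Python's // on random inputs."""
    rng = random.Random(seed)
    for _ in range(iterations):
        d = rng.randint(1, (1 << nbits) - 1)
        x = rng.randint(0, (1 << nbits) - 1)
        m, s = magic_for_divisor(d, nbits)
        if fast_idiv_sketch(x, m, s) != x // d:
            return False
    return True
```

Python integers do not overflow, so the sketch ignores the width of the
intermediate product `x * m`; a fixed-width backend would need to account
for that.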
* don't test bf16 for emulated amd tc
* skip bf16 tc test in ci
* skip bf16 for AMD in test_tensor_cores_codegen
* add simple bf16 gemm test to benchmark
* add default gate in index
* assert store
* add TestRendererFailures
- move test_gated_store_with_alu to new TestRenderFailures class for
tests that fail on multiple renderers
- add test_renderer_failures.py run on python CI
* add test for gated index in 2d
* test TestRenderFailures