Commit Graph

786 Commits

Author SHA1 Message Date
chenyu
0015b3921f sleep more in CI, remove amdgpu (#10261)
see if this is less flaky
2025-05-12 08:13:44 -04:00
hooved
7b4f05fd00 Add test for correctness of Infinity in WebGPU (#10201)
* use function for infinity instead of uniform

* test infinity math locally

* test infinity math in CI

* make pytest available to MacOS (WebGPU)

* revert to master except failing webgpu test
2025-05-08 05:20:05 -07:00
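
For context, a minimal sketch of what such an infinity-correctness check looks like at the Tensor level (assumed usage for illustration, not the PR's actual test):

    import math
    from tinygrad import Tensor

    # push infinity through a device op and make sure it comes back intact
    out = (Tensor([1.0, float("inf"), -float("inf")]) * 2).numpy()
    assert out[0] == 2.0 and math.isinf(out[1]) and math.isinf(out[2])
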
nimlgen
7d6ed1b1e9 hotfix: mac ci (#10210)
* fixed?

* cmnt
2025-05-08 14:13:23 +03:00
nimlgen
ba52fce4b2 usbgpu: benchmark in ci (#10208)
* usbgpu: benchmark

* usbgpu: benchmark
2025-05-08 12:02:04 +03:00
uuuvn
dba073e5c0 Less messy broken graph on paravirtualized metal workaround (#10182)
* Less messy broken graph on paravirtualized metal workaround

GitHub CI macOS runners use paravirtualized Metal, which is broken with
graph (some comments say that ICB in particular is broken, but in my
testing it was sometimes fine and other times hit an assert inside Metal's
code related to resources, so it's not clear).

> Assertion failed: (resource != nil), function -[IOGPUMetalResource initWithResource:], file IOGPUMetalResource.m, line 458.

This can be reproduced locally with any virtualization software (like UTM)
that can create macOS VMs with Apple's own virtualization framework.

* unused import
2025-05-06 20:41:02 +03:00
wozeparrot
10437904cd refactor: ops_cloud -> ops_remote [pr] (#10166) 2025-05-05 15:59:51 -07:00
George Hotz
b68f036551 default on OSX is llvm 19 (#10159) 2025-05-04 18:13:50 -07:00
George Hotz
e07d8b147a hotfix: don't OOM in the osx unit test 2025-05-04 17:53:55 -07:00
George Hotz
a0240d8c2b lil work on llvm speed (#10157)
* lil work on llvm speed

* llvm failing test

* 1e-4

* simpler failing test

* once is fine

* gpt suggests this syntax change

* bump that debug
2025-05-04 16:37:26 -07:00
George Hotz
fe0724eebf prebuild all rewrites [pr] (#10154)
* prebuild all rewrites [pr]

* fix that

* tests pass with linearizer
2025-05-04 13:01:18 -07:00
qazal
230a369708 remove some IGNORE_OOB [pr] (#10142)
* remove some IGNORE_OOB

* remove fuzz_schedule stuff

* test with global

* add for amd ci
2025-05-03 01:16:14 +03:00
nimlgen
16e5376ae8 line limit 12800 for usb (#10130) 2025-05-01 16:57:44 +03:00
George Hotz
ef011ff5f9 flip Ops.COPY order [pr] (#10122)
* flip Ops.COPY order [pr]

* fix copy and support multi device copy in _device
2025-05-01 00:26:24 -04:00
Ignacio Sica
bf5fb97498 fix AMD_LLVM bf16 tc for gfx1100 (#10102)
* fix amd_llvm bf16 tc

* cleanup pattern
2025-04-30 20:06:38 -03:00
chenyu
4a04098389 fix llama3 with nf4 quantize (#10107)
also int8 outputs are wrong
2025-04-29 15:14:36 -04:00
Ignacio Sica
9d5677c12c fix ptx linearizer bug 2 [pr] (#9967)
* check for local buffer

* hotfix

* add test_tensor_cores_emulation run for ptx
2025-04-29 14:30:07 -03:00
George Hotz
c3ff308abb range has only one src now [pr] (#10100)
* range has only one op now

* fix z3 checker

* ci fix

* needs shell

* try pip ensure update

* that ensurepip is useless

* upgrade pip before cache

* windows happy?
2025-04-29 10:31:05 -04:00
Ignacio Sica
58cf8cd493 add support for "shared_mem" for LLVM (#10093)
* init llvm shared

* add test_tensor_cores_emulation run for llvm
2025-04-29 08:56:36 -04:00
Ignacio Sica
bda116d773 fix use_tensor_cores propagation (#10048)
* propagate use_tensor_cores

* add use_tensor_core to arg in test and search

* bugfix

* get TC val from ContextVar in search

* revert minor space change

* add tc emulation test to ci and benchmark

* revert

* revert whitespace change

* remove test for ptx

* add comment and remove llvm test run
2025-04-28 19:30:50 -03:00
chenyu
e996584685 olmoe in mac benchmark (#10077) 2025-04-27 21:07:02 -04:00
George Hotz
b6d2effaf5 assign is contiguous (#10066)
* assign is contiguous

* disable process replay for SDXL
2025-04-27 08:40:33 -04:00
George Hotz
ea5dddc537 reduce collapse generic (#10045)
* reduce collapse generic

* new arange folder

* new range folding

* correct with sym

* all tests pass

* indexing ops passes

* failing tests

* fix tests, remove unused

* revert that

* torch indexing is fast

* skip on webgpu

* touchups

* comments
2025-04-26 09:13:24 -04:00
chenyu
74c6cf8be3 lint mlperf model_train (#10038) 2025-04-24 16:19:44 -04:00
Ignacio Sica
51ca19d061 set test_tensor_cores_padded_amd to expectedFailure (#10036)
* init

* add expected failure to correctly track progress

* hotfix

* skip for amd_llvm as well

* add skip

* add pr number

* move comment to amd test

* change reason
2025-04-24 17:11:40 -03:00
b1tg
914d89fa0b fix tensor cores for gfx1201 (#9838)
* fix tensor cores for gfx1201

* fix typo

* fix python wmma

* AMDLLVMRenderer with arch + AMDLLVM tensor_cores

* fix ci

* clean up

* more tensor cores for RDNA4

* fix half/half, bfloat16/float, bfloat16/bfloat16 for amd_llvm

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-04-24 14:57:41 -04:00
uuuvn
779aa1e2e9 Enable image tests on cloud if clouddev supports image (#9903)
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-04-24 14:30:12 -04:00
uuuvn
29a12b19ea Add macos CLOUD tests (#10033)
A lot more work is required to enable all of them and move them into the
osxtests matrix; for now I created a separate runner for them (copied from WebGPU)

Will add test/test_graph.py to those tests in #9876
2025-04-24 14:14:13 -04:00
uuuvn
754d789f51 Fix and enable jit tests on CLOUD (#10031) 2025-04-24 18:39:31 +03:00
George Hotz
4e2ccfddc6 ci refactor to split AMD/NVIDIA [pr] (#10024)
* ci refactor to split AMD [pr]

* split

* split amd tests

* explicit 0
2025-04-24 08:59:54 -04:00
Sieds Lykles
e75be6eafc [bounty] [pr] index validation with z3 (#9981)
* index validation with z3

* Change comment

* toposort -> toposort()

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-04-24 08:06:08 -04:00
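
For context, a minimal sketch of the idea behind index validation with z3: encode the loop indices and their ranges symbolically, then ask the solver for a counterexample where the computed buffer index escapes the buffer (the ranges and index expression below are illustrative, not tinygrad's actual ones):

    from z3 import Int, Or, Solver, unsat

    gidx, lidx = Int("gidx"), Int("lidx")
    s = Solver()
    s.add(0 <= gidx, gidx < 128, 0 <= lidx, lidx < 32)  # index ranges

    index = gidx * 32 + lidx    # the flat index the kernel would compute
    buf_size = 4096             # number of elements in the buffer

    # search for an in-range (gidx, lidx) whose index goes out of bounds
    s.add(Or(index < 0, index >= buf_size))
    assert s.check() == unsat   # unsat -> access is proven in bounds
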
Ignacio Sica
023b1c28a2 test_tensor_cores_padded refactor (#9724)
* set pad to 3 for amd padded tc test

* change pad for amd regardless of CI

* test tc padded uops and correctness separately

* add test_tensor_cores_padded_uops test to ci

* remove redundant check for amd device

* cleanup
2025-04-18 17:05:54 -03:00
qazal
16dfe0a902 upstream remu (#9921) 2025-04-18 01:57:36 +03:00
George Hotz
44e4934167 fast pattern matcher [pr] (#9737)
* FastPatternMatcher

* works without that

* fix test pickle

* strict len

* compile match function

* dynamic compile

* fast

* faster

* compile

* track

* a lot faster

* clean up

* dup or

* faster and simpler

* fast match doesn't support store

* plane

* minor refactor

* real speed

* don't imply return None

* upat

* fix test

* heard you wanted more speed

* no generator

* split cf

* early fixup

* fxn fixup

* reconstruct_function

* Revert "reconstruct_function"

This reverts commit 37dac010ab.

* simpler stuff

* too big

* upat compile error

* cleanups

* don't cache that

* cleanups

* 10 -> 15
2025-04-14 15:24:41 +01:00
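
For context, the core idea here is to stop interpreting patterns on every match attempt and instead generate a specialized Python match function per pattern once, then compile it. A toy sketch of that technique (not tinygrad's actual UPat machinery; Node and the generated function are illustrative):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Node:
        op: str
        src: tuple = ()

    def compile_matcher(op_name: str, n_src: int):
        # build Python source specialized to this pattern, compile it once
        src = (f"def _match(node):\n"
               f"  if node.op != {op_name!r} or len(node.src) != {n_src}: return None\n"
               f"  return node.src\n")
        ns: dict = {}
        exec(compile(src, "<pattern>", "exec"), ns)
        return ns["_match"]

    match_add2 = compile_matcher("ADD", 2)
    assert match_add2(Node("ADD", (Node("x"), Node("y")))) is not None
    assert match_add2(Node("MUL")) is None
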
George Hotz
355739fc94 switch to universal match [pr] (#9879)
* switch to universal match [pr]

* 10 -> 15
2025-04-14 09:15:37 +01:00
chenyu
6896197978 relax ATOL for TC half tests more (#9847) 2025-04-11 03:20:22 -04:00
George Hotz
f666dd14eb fix get reduce contraction with test (#9834) 2025-04-10 22:24:21 +08:00
chenyu
566e389585 more relaxed ATOL for HALF=1 simple_matmul test (#9823)
it's a function of N, so it's only updated in the test command
2025-04-10 00:46:16 -04:00
chenyu
06a928b341 higher ATOL for half input TC test (#9821)
flaky
2025-04-09 23:57:25 -04:00
uuuvn
3ee317ffed Fix kfd autogen and verify it in ci (#9818)
Had to autogen newer uapi headers for #9746 (dmabuf export ioctl missing);
submitting just the fix without updating to the newer headers, as they are
only needed for the InfiniBand stuff
2025-04-10 09:53:42 +08:00
chenyu
c5db5b83b9 add SHOULD_USE_TC=1 check to simple_matmul (#9802)
* add SHOULD_USE_TC=1 check to simple_matmul

also zero-centered the random input and updated atol for tf32

* ATOL=2e-2 for HALF
2025-04-09 02:24:42 -04:00
George Hotz
78caf55154 Revert "FP8 support on NVIDIA (#8631)"
This reverts commit 2c8e4ea865.
2025-04-09 12:27:41 +08:00
George Hotz
14928fecff Revert "fix TF32 tensor core dropped in tc_sm89 (#9798)"
This reverts commit 7c9a96824f.
2025-04-09 12:27:39 +08:00
chenyu
7c9a96824f fix TF32 tensor core dropped in tc_sm89 (#9798)
also add `SHOULD_USE_TC=1` to verify TC is applied in simple_matmul
2025-04-08 23:20:50 -04:00
pkotzbach
2c8e4ea865 FP8 support on NVIDIA (#8631)
* squashed fp8 commits

* tensorcore start

* minor changes

* pre-commit

* pylint

* Delete fp8mul.cu

* clean

* small bugfix

* fix test_dtype

* fix test_dtype_alu

* add EMULATE_CUDA_SM89

* fix ci

* fix test_linearizer

* fix test_linearizer

* fix swizzle

* add debug to simple_matmul

* fixed swizzle

* python emulator

* refactor python emulator

* setup fix

* numpy setup

* ml_dtypes only in emulate_cuda_sm89

* fix pylint

* fix tests

* fix mypy

* fix mypy

* fix ruff

* done python emulator

* add acc type

* tests

* mypy

* clean code

* add cuda tensor core tests to CI

* minor fix

* clean test_dtype.py

* clean cstyle.py

* clean test_ops.py

* fix test

* fix test

* whitespaces

* pylint

* pylint

* amd?

* amd?

* amd

* reduce lines

* mockgpu remove

* fix

* ruff

* ruff

* fix mypy

* ruff

* test only for cuda

* fixed formatting

* small fixes

* small fix

* least_upper_dtype if fp8s not supported

* log and reciprocal are supported for fp8s

* ops python fixes

* dtypes.fp8s use

* e4m3 + e5m2 result dtype test

* truncate linter fix

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-08 21:54:04 -04:00
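
For context, the two FP8 formats this PR wires up can be inspected directly with ml_dtypes (the package the python emulator path pulls in); a quick sketch of their precision/range trade-off:

    import numpy as np
    import ml_dtypes  # pip install ml_dtypes

    x = np.array([0.1, 1.0, 3.14, 100.0], dtype=np.float32)
    e4m3 = x.astype(ml_dtypes.float8_e4m3fn)  # 4 exponent / 3 mantissa bits: more precision
    e5m2 = x.astype(ml_dtypes.float8_e5m2)    # 5 exponent / 2 mantissa bits: more range
    # casting back shows the rounding each 8-bit format introduced
    print(e4m3.astype(np.float32), e5m2.astype(np.float32))
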
Sieds Lykles
07d1aefaf4 fast idiv (#9755)
* fast idiv with tests and fuzzer

* Add todo comment

* Add env variable to toggle fast_idiv

* Move env check

* Add fuzz fast_idiv to ci

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-07 08:32:24 -04:00
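
For context, a self-contained sketch of the fast-idiv trick plus the kind of fuzz check the PR adds: replace x // d for a fixed divisor d with a multiply and a shift (illustrative constants, not tinygrad's rewrite code):

    import random

    def magic(d: int, n: int = 32):
        """Find (m, s) with (x * m) >> s == x // d for all 0 <= x < 2**n."""
        for s in range(2 * n + 1):
            m = ((1 << s) + d - 1) // d  # ceil(2**s / d)
            # exact when the rounding error e = m*d - 2**s satisfies e * x_max < 2**s
            if (m * d - (1 << s)) * ((1 << n) - 1) < (1 << s):
                return m, s

    # fuzz it, like the PR's fuzzer does for the real rewrite
    for _ in range(10_000):
        d, x = random.randint(1, 10_000), random.randrange(1 << 32)
        m, s = magic(d)
        assert (x * m) >> s == x // d
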
Ignacio Sica
58785181a8 AMD bf16xf32 TC (#9717)
* don't test bf16 for emulated amd tc

* skip bf16 tc test in ci

* skip bf16 for AMD in test_tensor_cores_codegen

* add simple bf16 gemm test to benchmark
2025-04-07 11:41:04 +08:00
George Hotz
cac8bcf8b5 use Ops.REDUCE (#9721)
* decrease bert python time [pr]

* order copies

* Revert "order copies"

This reverts commit 3f62c8693b.

* rewrite count

* Ops.REDUCE

* acc first in the add chain

* Fix tensor core acc

* arange patterns look good

* fix multireduce gate

* reduce rewrite rule

* bump that to 15 minutes

* multiwmma isn't fusing

* gep through wmma is gep pushing

* bump that timeout too, it's all env setup

* add failing test
2025-04-04 10:14:34 +08:00
chenyu
1d25844d44 Revert "disable CI red llama 3 4 gpu beam (#9690)" (#9709)
This reverts commit 6a5eacba8b.
2025-04-03 02:34:39 -04:00
George Hotz
49dafe6d43 add gc tests [pr] (#9718)
* add gc tests [pr]

* del

* more gc tests

* add NullGraph
2025-04-03 14:08:32 +08:00
Ignacio Sica
bc91fffc5d fix gated store with index in python backend (#9703)
* add default gate in index

* assert store

* add TestRendererFailures

- move test_gated_store_with_alu to new TestRenderFailures class for
tests that fail on multiple renderers
- add test_renderer_failures.py run on python CI

* add test for gated index in 2d

* test TestRenderFailures
2025-04-03 12:48:28 +08:00