Commit Graph

10633 Commits

qazal
7045920786 give _apply_map_to_tensors substitutes name [pr] (#9840) 2025-04-11 10:38:57 +08:00
qazal
40ef2f2857 add ast fixup stage to tensor_map [pr] (#9839) 2025-04-11 09:24:01 +08:00
qazal
fbc6aa53d4 script for local process_replay + fix viz name [pr] (#9837) 2025-04-11 00:39:18 +08:00
b1tg
a35b475d18 fix am driver for gfx1201 (#9836) 2025-04-10 19:33:02 +03:00
qazal
16956b79de canonicalize Device.DEFAULT (#9835) 2025-04-10 23:02:11 +08:00
George Hotz
f666dd14eb fix get reduce contraction with test (#9834) 2025-04-10 22:24:21 +08:00
George Hotz
c3fa470852 hotfix: remove tracebacklimit, it persists if you catch the exception and made webgpu flaky 2025-04-10 20:29:25 +08:00
chenyu
7fa5f29582 add test_embedding to test_softmax_fusion (#9832) 2025-04-10 08:25:34 -04:00
chenyu
995d20673a increase bert TRAIN_STEPS for mi300x (#9833)
got a few non-converged runs, so try increasing the steps. we need >= 90% of runs to converge
2025-04-10 08:25:09 -04:00
George Hotz
25e2a3cf5d hotfix: fix get_contraction_with_reduce 2025-04-10 20:18:19 +08:00
George Hotz
53f0b2aad7 fix infinite loop in flash attention (#9827)
* fix infinite loop in flash attention

* get_contraction_with_reduce

* skip that test

* SINGLE_KERNEL_SOFTMAX + fix multi

* default IGNORE_OOB

* print change
2025-04-10 20:06:44 +08:00
qazal
16afe04f45 move process replay to grouper (#9830)
* simpler

* sched
2025-04-10 18:27:42 +08:00
chenyu
c8f47c1d07 not_support_multi_device helper (#9831)
unify the test helper that skips CI devices that do not support multi
2025-04-10 05:25:29 -04:00
chenyu
817746b30e add contiguous to EmbeddingBert output (#9829)
for some reason, with random dropout it creates a different ast on each device, and searching embedding is slow. This workaround saved 6 minutes of setup time on mi300x (25->19) and resulted in similar speed
2025-04-10 04:31:19 -04:00
qazal
fd4f06e623 kernelize prereqs [pr] (#9811)
* kernelize prereqs [pr]

* work

* tensor maps to assign

* unwrap st

* process replay

* grouper changes

* replay
2025-04-10 15:22:20 +08:00
chenyu
c462162db8 update benchmark bert scripts with BS and ACC_DTYPE (#9826)
BS=16, ACC_DTYPE=half for tinybox, BS=128, ACC_DTYPE=float for mi300x
2025-04-10 02:06:02 -04:00
qazal
498a2bf738 add err handling tests to viz + cleanups (#9825)
* cleanup

* add err handling tests to viz + cleanups

* lint
2025-04-10 14:05:05 +08:00
chenyu
a0b72f066a don't free intermediate for bert mi300x (#9824) 2025-04-10 01:48:34 -04:00
chenyu
566e389585 more relaxed ATOL for HALF=1 simple_matmul test (#9823)
it's a function of N, so it is only updated in the test command
2025-04-10 00:46:16 -04:00
Francis Lata
eb2e59db42 RetinaNet model type annotations and loss functions (#9822)
* add type annotations and loss functions for training

* combine sum of multiple dims inside loss functions
2025-04-10 00:31:37 -04:00
chenyu
06a928b341 higher ATOL for half input TC test (#9821)
flaky
2025-04-09 23:57:25 -04:00
Francis Lata
7bb36d71b2 remove openimages iterate (#9820) 2025-04-09 22:54:12 -04:00
chenyu
2e1002e179 EVAL_BS=96 and BEAM=3 for bert green (#9819)
19m -> 13m setup and same end to end time
2025-04-09 22:37:27 -04:00
uuuvn
3ee317ffed Fix kfd autogen and verify it in ci (#9818)
Had to autogen newer uapi headers for #9746 (the dmabuf export ioctl was missing);
submitting just the fix without updating to the newer headers, as they are only
needed for the infiniband stuff
2025-04-10 09:53:42 +08:00
nimlgen
d7330ea6ad amd: refactor sqtt into sep functions (#9816)
* amd: refactor sqtt into sep functions

* fix
2025-04-10 00:39:45 +03:00
nimlgen
0ca98b9f20 amd: gfx9 use cache ctrls in acquire_mem (#9815) 2025-04-09 20:17:02 +03:00
George Hotz
fce432d2e3 Ops.FUSE makes softmax a single kernel (#9808)
* KERNELIZE makes softmax a single kernel

* single kernel works

* softmax works

* broken

* correct

* skip that test

* kernelize tests

* rename to fuse

* better reduce_push_add_ones code

* correct now

* cleanups

* oops

* return None if we can't push ones

* rename + docs

* atol fixes group

* flash attention broken test
2025-04-09 22:56:28 +08:00
nimlgen
1798ce7e52 amd: faster xcc sync (#9783)
* amd: faster xcc sync

* move to same cacheline

* comment

* keep it uncached + better poll timings

* revert this, should be fine

* fixed now?

* minor
2025-04-09 15:56:50 +03:00
qazal
3bd992dc95 multi stage graph_rewrite_map (#9803)
* multistage graph_rewrite_map

* s/merge_map/input_map

* build up kernel_map from the tensor_map
2025-04-09 15:59:45 +08:00
chenyu
57f4bc3fbb add numpy to setup linting (#9806)
this would have caught the mypy error in the fp8 PR. keep ignore_missing_imports set to true, as we also import torch, which is fat
2025-04-09 03:47:03 -04:00
George Hotz
bf769fa5c5 label ranges with their number (#9805) 2025-04-09 14:31:18 +08:00
chenyu
c5db5b83b9 add SHOULD_USE_TC=1 check to simple_matmul (#9802)
* add SHOULD_USE_TC=1 check to simple_matmul

also zero centered the random input and update atol for tf32

* ATOL=2e-2 for HALF
2025-04-09 02:24:42 -04:00
qazal
f27dbc8c35 becomes_map cleanups [pr] (#9790)
* cleanup becomes_map [pr]

* source
2025-04-09 14:11:53 +08:00
qazal
7d2349c827 track_rewrites in scheduler [pr] (#9801) 2025-04-09 12:48:14 +08:00
George Hotz
bb18adb0d5 reduce with a mul chain (#9799)
* reduce with a mul chain

* inside is just 1
2025-04-09 12:42:32 +08:00
George Hotz
78caf55154 Revert "FP8 support on NVIDIA (#8631)"
This reverts commit 2c8e4ea865.
2025-04-09 12:27:41 +08:00
George Hotz
d1505137ad Revert "move TestOpsFp8s skipTest (#9797)"
This reverts commit a3aaf92b21.
2025-04-09 12:27:40 +08:00
George Hotz
14928fecff Revert "fix TF32 tensor core dropped in tc_sm89 (#9798)"
This reverts commit 7c9a96824f.
2025-04-09 12:27:39 +08:00
qazal
1ed4eae510 hotfix: don't add shape to SINK viz node (#9800) 2025-04-09 12:04:33 +08:00
chenyu
7c9a96824f fix TF32 tensor core dropped in tc_sm89 (#9798)
also add `SHOULD_USE_TC=1` to verify TC is applied in simple_matmul
2025-04-08 23:20:50 -04:00
chenyu
a3aaf92b21 move TestOpsFp8s skipTest (#9797)
so get_available_devices is not called when running other tests
2025-04-08 22:44:07 -04:00
pkotzbach
2c8e4ea865 FP8 support on NVIDIA (#8631)
* squashed fp8 commits

* tensorcore start

* minor changes

* pre-commit

* pylint

* Delete fp8mul.cu

* clean

* small bugfix

* fix test_dtype

* fix test_dtype_alu

* add EMULATE_CUDA_SM89

* fix ci

* fix test_linearizer

* fix test_linearizer

* fix swizzle

* add debug to simple_matmul

* fixed swizzle

* python emulator

* refactor python emulator

* setup fix

* numpy setup

* ml_dtypes only in emulate_cuda_sm89

* fix pylint

* fix tests

* fix mypy

* fix mypy

* fix ruff

* done python emulator

* add acc type

* tests

* mypy

* clean code

* add cuda tensor core tests to CI

* minor fix

* clean test_dtype.py

* clean cstyle.py

* clean test_ops.py

* fix test

* fix test

* whitespaces

* pylint

* pylint

* amd?

* amd?

* amd

* reduce lines

* mockgpu remove

* fix

* ruff

* ruff

* fix mypy

* ruff

* test only for cuda

* fixed formatting

* small fixes

* small fix

* least_upper_dtype if fp8s not supported

* log and reciprocal are supported for fp8s

* ops python fixes

* dtypes.fp8s use

* e4m3 + e5m2 result dtype test

* truncate linter fix

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-08 21:54:04 -04:00
hooved
5d85765327 types for WebGPU runtime (#9791)
* add type annotations to ops_webgpu

* rerun CI

* add types to some _run params
2025-04-08 22:52:17 +03:00
chenyu
4c8582a7ce pipe allow_test_size in _time_program (#9789)
* pipe allow_test_size in _time_program

it was dropped a long time ago, so BEAM_ESTIMATE was doing nothing

* revert BEAM_ESTIMATE
2025-04-08 09:07:40 -04:00
chenyu
8fe83385ec add system json for mi300x mlperf (#9786)
* add system json for mi300x mlperf

```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v5.0/tinycorp/systems/tinybox_8xMI300X.json training 4.1.0
INFO -   System description checker passed for tinybox 8xMI300X
```

also removed rocm from tinybox_red since we are not using it

* update mlperf-logging version
2025-04-08 06:36:44 -04:00
chenyu
4a807ee952 remove duplicated z3-solver in setup.py (#9787) 2025-04-08 06:12:58 -04:00
qazal
21e872df44 remove consts from sched_sink [pr] (#9782) 2025-04-08 16:08:24 +08:00
qazal
f13e9cf2d9 move view_left to grouper.py + tiny reorders [pr] (#9780)
* move view_left to grouper.py [pr]

* reorder grouper

* test_schedule
2025-04-08 15:39:28 +08:00
chenyu
7a28133b37 failed test for single softmax backward (#9778)
getting RecursionError with DONT_GROUP_REDUCES=1
2025-04-08 02:36:32 -04:00
George Hotz
fefee5d3ab single kernel softmax (#9776)
* real single kernel softmax

* cleanup

* fix blockend insertion

* add to bert test
2025-04-08 12:35:48 +08:00