Commit Graph

11094 Commits

Alexey Zaytsev
3bce5ad2b4 clang should not emit the .comment section (#9859)
This section gets included in the final image, and we get a lot of garbage with DEBUG=7
2025-04-12 10:59:11 +08:00
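
For context: clang normally embeds its version string in a `.comment` section via the `.ident` directive, and `-fno-ident` suppresses it. A minimal sketch of a clang invocation with that flag (illustrative only, not necessarily the actual change in #9859):

```python
# Illustrative sketch: compile a C file with clang while suppressing the .ident
# directive that would otherwise end up in the .comment section of the object file.
import subprocess

def compile_no_comment(src: str, out: str) -> None:
  subprocess.run(["clang", "-fno-ident", "-c", src, "-o", out], check=True)
```
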
Alexey Zaytsev
7dda6aae7d Skip CLOUD in external_test_example (#9857)
Closes #9814
2025-04-12 10:17:44 +08:00
nimlgen
7919bb4f8a amd: do not use log2 (#9852) 2025-04-11 19:53:06 +03:00
nimlgen
ada0f67d3d am: fix speed of ring copies (#9854) 2025-04-11 17:28:06 +03:00
chenyu
4aab16ca6a bert script cleanup and assert nan loss (#9851) 2025-04-11 05:41:49 -04:00
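
A minimal sketch of what a NaN-loss assert in a training script can look like; the names here are hypothetical and the real bert script may structure this differently:

```python
# Hypothetical NaN-loss guard: abort the run early instead of wasting accelerator
# time on a diverged model.
import math

def check_loss(loss_value: float, step: int) -> None:
  assert not math.isnan(loss_value), f"loss became NaN at step {step}"
```
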
qazal
ad677f8e55 create_ast cleanups from kernelize [pr] (#9849) 2025-04-11 16:10:21 +08:00
qazal
cbc5e7ed45 unbind variables when creating ScheduleItems [pr] (#9846) 2025-04-11 15:23:53 +08:00
chenyu
6896197978 relax ATOL for TC half tests more (#9847) 2025-04-11 03:20:22 -04:00
George Hotz
dd52951dd0 fix single kernel softmax with cast (#9842)
* fix single kernel softmax with cast

* tolerate none

* 3e-4

* skip on dtype
2025-04-11 12:12:02 +08:00
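
Roughly what "softmax with cast" means from the user side, as a hedged sketch using the public tinygrad API (the single-kernel fusion itself happens inside the scheduler):

```python
# Sketch: run the softmax math in float32 and cast the result back to half.
from tinygrad import Tensor, dtypes

x = Tensor.randn(4, 128, dtype=dtypes.half)
y = x.cast(dtypes.float32).softmax(axis=-1).cast(dtypes.half)
print(y.numpy())
```
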
chenyu
8c6299bced move hand_coded_optimizations to heuristic.py [pr] (#9844)
* move hand_coded_optimizations to heuristic.py [pr]

also folded all long lines

* make a copy and rename self -> k

* fix test
2025-04-10 23:40:16 -04:00
chenyu
e0ec8be37d use CPU for test_schedule_ring (#9843)
* use CPU for test_schedule_ring

* why pre-commit is good
2025-04-10 23:20:53 -04:00
qazal
7045920786 give _apply_map_to_tensors substitutes name [pr] (#9840) 2025-04-11 10:38:57 +08:00
qazal
40ef2f2857 add ast fixup stage to tensor_map [pr] (#9839) 2025-04-11 09:24:01 +08:00
qazal
fbc6aa53d4 script for local process_replay + fix viz name [pr] (#9837) 2025-04-11 00:39:18 +08:00
b1tg
a35b475d18 fix am driver for gfx1201 (#9836) 2025-04-10 19:33:02 +03:00
qazal
16956b79de canonicalize Device.DEFAULT (#9835) 2025-04-10 23:02:11 +08:00
George Hotz
f666dd14eb fix get reduce contraction with test (#9834) 2025-04-10 22:24:21 +08:00
George Hotz
c3fa470852 hotfix: remove tracebacklimit; it persists if you catch the exception and it made webgpu flaky 2025-04-10 20:29:25 +08:00
chenyu
7fa5f29582 add test_embedding to test_softmax_fusion (#9832) 2025-04-10 08:25:34 -04:00
chenyu
995d20673a increase bert TRAIN_STEPS for mi300x (#9833)
got a few non-converged runs, so try increasing the steps; we need >= 90% of runs to converge
2025-04-10 08:25:09 -04:00
George Hotz
25e2a3cf5d hotfix: fix get_contraction_with_reduce 2025-04-10 20:18:19 +08:00
George Hotz
53f0b2aad7 fix infinite loop in flash attention (#9827)
* fix infinite loop in flash attention

* get_contraction_with_reduce

* skip that test

* SINGLE_KERNEL_SOFTMAX + fix multi

* default IGNORE_OOB

* print change
2025-04-10 20:06:44 +08:00
qazal
16afe04f45 move process replay to grouper (#9830)
* simpler

* sched
2025-04-10 18:27:42 +08:00
chenyu
c8f47c1d07 not_support_multi_device helper (#9831)
unify the test helper that skips CI devices that do not support multi-device
2025-04-10 05:25:29 -04:00
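
A hypothetical sketch of such a helper; the actual name, device list, and location in the repo may differ, this only illustrates the unittest.skipIf pattern being unified:

```python
# Hypothetical skip helper for tests that need multi-device support.
import unittest
from tinygrad import Device

def not_support_multi_device():
  # device list is illustrative, not the repo's actual list
  return unittest.skipIf(Device.DEFAULT in {"WEBGPU", "DSP"},
                         f"{Device.DEFAULT} does not support multiple devices")
```
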
chenyu
817746b30e add contiguous to EmbeddingBert output (#9829)
for some reason, with random dropout it creates a different AST on each device, and searching the embedding kernel is slow. This workaround saved 6 minutes of setup time on mi300x (25->19) and resulted in similar speed
2025-04-10 04:31:19 -04:00
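
Illustrative sketch of the workaround: forcing a contiguous buffer after the embedding lookup so every device compiles the same AST (the real EmbeddingBert lives in the mlperf example code and may differ):

```python
from tinygrad import Tensor
from tinygrad.nn import Embedding

emb = Embedding(30522, 128)   # vocab size / embed size are illustrative
ids = Tensor([[1, 2, 3]])
# .contiguous() materializes the lookup into its own buffer
out = emb(ids).contiguous()
```
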
qazal
fd4f06e623 kernelize prereqs [pr] (#9811)
* kernelize prereqs [pr]

* work

* tensor maps to assign

* unwrap st

* process replay

* grouper changes

* replay
2025-04-10 15:22:20 +08:00
chenyu
c462162db8 update benchmark bert scripts with BS and ACC_DTYPE (#9826)
BS=16, ACC_DTYPE=half for tinybox; BS=128, ACC_DTYPE=float for mi300x
2025-04-10 02:06:02 -04:00
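
A sketch of how such knobs are typically read through tinygrad's getenv helper; the defaults below just mirror the values quoted in the commit message, not repo code:

```python
from tinygrad.helpers import getenv

BS = getenv("BS", 128)                    # 16 on tinybox, 128 on mi300x
ACC_DTYPE = getenv("ACC_DTYPE", "float")  # "half" on tinybox, "float" on mi300x
```
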
qazal
498a2bf738 add err handling tests to viz + cleanups (#9825)
* cleanup

* add err handling tests to viz + cleanups

* lint
2025-04-10 14:05:05 +08:00
chenyu
a0b72f066a don't free intermediate for bert mi300x (#9824) 2025-04-10 01:48:34 -04:00
chenyu
566e389585 more relaxed ATOL for HALF=1 simple_matmul test (#9823)
it's a function of N, so it's only updated in the test command
2025-04-10 00:46:16 -04:00
Francis Lata
eb2e59db42 RetinaNet model type annotations and loss functions (#9822)
* add type annotations and loss functions for training

* combine sum of multiple dims inside loss functions
2025-04-10 00:31:37 -04:00
chenyu
06a928b341 higher ATOL for half input TC test (#9821)
flaky
2025-04-09 23:57:25 -04:00
Francis Lata
7bb36d71b2 remove openimages iterate (#9820) 2025-04-09 22:54:12 -04:00
chenyu
2e1002e179 EVAL_BS=96 and BEAM=3 for bert green (#9819)
19m -> 13m setup, with the same end-to-end time
2025-04-09 22:37:27 -04:00
uuuvn
3ee317ffed Fix kfd autogen and verify it in ci (#9818)
Had to autogen newer uapi headers for #9746 (the dmabuf export ioctl is missing); submitting just the fix without updating to the newer headers, as those are only needed for the infiniband stuff
2025-04-10 09:53:42 +08:00
nimlgen
d7330ea6ad amd: refactor sqtt into sep functions (#9816)
* amd: refactor sqtt into sep functions

* fix
2025-04-10 00:39:45 +03:00
nimlgen
0ca98b9f20 amd: gfx9 use cache ctrls in acquire_mem (#9815) 2025-04-09 20:17:02 +03:00
George Hotz
fce432d2e3 Ops.FUSE makes softmax a single kernel (#9808)
* KERNELIZE makes softmax a single kernel

* single kernel works

* softmax works

* broken

* correct

* skip that test

* kernelize tests

* rename to fuse

* better reduce_push_add_ones code

* correct now

* cleanups

* oops

* return None if we can't push ones

* rename + docs

* atol fixes group

* flash attention broken test
2025-04-09 22:56:28 +08:00
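
Illustrative usage, assuming only the public Tensor API: realize a softmax and inspect the kernel count (e.g. with DEBUG=2). With the fusion described here the whole softmax can lower to a single kernel, though exact behaviour depends on backend and flags:

```python
from tinygrad import Tensor

x = Tensor.randn(32, 1024)
# run with DEBUG=2 to see how many kernels this launches
x.softmax(axis=-1).realize()
```
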
nimlgen
1798ce7e52 amd: faster xcc sync (#9783)
* amd: faster xcc sync

* move to same cacheline

* comment

* keep it uncached + better poll timings

* revert this, should be fine

* fixed now?

* minor
2025-04-09 15:56:50 +03:00
qazal
3bd992dc95 multi stage graph_rewrite_map (#9803)
* multistage graph_rewrite_map

* s/merge_map/input_map

* build up kernel_map from the tensor_map
2025-04-09 15:59:45 +08:00
chenyu
57f4bc3fbb add numpy to setup linting (#9806)
this would have caught the mypy error in the fp8 PR. keep ignore_missing_imports set to true, as we also import torch, which is fat
2025-04-09 03:47:03 -04:00
George Hotz
bf769fa5c5 label ranges with their number (#9805) 2025-04-09 14:31:18 +08:00
chenyu
c5db5b83b9 add SHOULD_USE_TC=1 check to simple_matmul (#9802)
* add SHOULD_USE_TC=1 check to simple_matmul

also zero-centered the random input and updated atol for tf32

* ATOL=2e-2 for HALF
2025-04-09 02:24:42 -04:00
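
Hedged sketch of the zero-centering part only (the real simple_matmul script and its SHOULD_USE_TC check may differ): symmetric inputs keep the matmul output centered near zero, which makes a fixed ATOL comparison fairer:

```python
import numpy as np
from tinygrad import Tensor

N = 1024
# zero-centered random inputs in [-0.5, 0.5)
a, b = Tensor.rand(N, N) - 0.5, Tensor.rand(N, N) - 0.5
np.testing.assert_allclose((a @ b).numpy(), a.numpy() @ b.numpy(), atol=2e-2)
```
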
qazal
f27dbc8c35 becomes_map cleanups [pr] (#9790)
* cleanup becomes_map [pr]

* source
2025-04-09 14:11:53 +08:00
qazal
7d2349c827 track_rewrites in scheduler [pr] (#9801) 2025-04-09 12:48:14 +08:00
George Hotz
bb18adb0d5 reduce with a mul chain (#9799)
* reduce with a mul chain

* inside is just 1
2025-04-09 12:42:32 +08:00
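
An illustrative example of the pattern this commit targets, a reduce whose input is a chain of elementwise muls (sketch only, not the repo's test case):

```python
from tinygrad import Tensor

a, b, c = Tensor.rand(64), Tensor.rand(64), Tensor.rand(64)
# a sum reduce fed by a mul chain
out = (a * b * c).sum()
print(out.item())
```
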
George Hotz
78caf55154 Revert "FP8 support on NVIDIA (#8631)"
This reverts commit 2c8e4ea865.
2025-04-09 12:27:41 +08:00
George Hotz
d1505137ad Revert "move TestOpsFp8s skipTest (#9797)"
This reverts commit a3aaf92b21.
2025-04-09 12:27:40 +08:00
George Hotz
14928fecff Revert "fix TF32 tensor core dropped in tc_sm89 (#9798)"
This reverts commit 7c9a96824f.
2025-04-09 12:27:39 +08:00
qazal
1ed4eae510 hotfix: don't add shape to SINK viz node (#9800) 2025-04-09 12:04:33 +08:00