qazal
7045920786
give _apply_map_to_tensors substitutes name [pr] ( #9840 )
2025-04-11 10:38:57 +08:00
qazal
40ef2f2857
add ast fixup stage to tensor_map [pr] ( #9839 )
2025-04-11 09:24:01 +08:00
qazal
fbc6aa53d4
script for local process_replay + fix viz name [pr] ( #9837 )
2025-04-11 00:39:18 +08:00
b1tg
a35b475d18
fix am driver for gfx1201 ( #9836 )
2025-04-10 19:33:02 +03:00
qazal
16956b79de
canonicalize Device.DEFAULT ( #9835 )
2025-04-10 23:02:11 +08:00
George Hotz
f666dd14eb
fix get reduce contraction with test ( #9834 )
2025-04-10 22:24:21 +08:00
George Hotz
c3fa470852
hotfix: remove tracebacklimit, it persists even if you catch the exception, and it made webgpu flaky
2025-04-10 20:29:25 +08:00
chenyu
7fa5f29582
add test_embedding to test_softmax_fusion ( #9832 )
2025-04-10 08:25:34 -04:00
chenyu
995d20673a
increase bert TRAIN_STEPS for mi300x ( #9833 )
...
got a few non-converged runs, so try increasing the steps. we need >= 90% of runs to converge
2025-04-10 08:25:09 -04:00
George Hotz
25e2a3cf5d
hotfix: fix get_contraction_with_reduce
2025-04-10 20:18:19 +08:00
George Hotz
53f0b2aad7
fix infinite loop in flash attention ( #9827 )
...
* fix infinite loop in flash attention
* get_contraction_with_reduce
* skip that test
* SINGLE_KERNEL_SOFTMAX + fix multi
* default IGNORE_OOB
* print change
2025-04-10 20:06:44 +08:00
qazal
16afe04f45
move process replay to grouper ( #9830 )
...
* simpler
* sched
2025-04-10 18:27:42 +08:00
chenyu
c8f47c1d07
not_support_multi_device helper ( #9831 )
...
unify the test helper that skips CI devices without multi-device support
2025-04-10 05:25:29 -04:00
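A minimal sketch of what a unified skip helper like this could look like (the predicate and the device list are illustrative assumptions, not the repo's actual code):
```python
# illustrative sketch, not tinygrad's actual helper: one predicate for
# skipping multi-device tests on CI backends that expose a single device
import unittest
from tinygrad import Device

def not_support_multi_device() -> bool:
  return Device.DEFAULT in {"CLANG", "LLVM", "WEBGPU"}  # assumed device list

class TestMultiTensor(unittest.TestCase):
  @unittest.skipIf(not_support_multi_device(), "device does not support multi")
  def test_shard(self): ...
```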
chenyu
817746b30e
add contiguous to EmbeddingBert output ( #9829 )
...
for some reason, with random dropout it creates a different AST on each device, and the embedding kernel search is slow. This workaround saved 6 minutes of setup time on mi300x (25 -> 19) with similar resulting speed
2025-04-10 04:31:19 -04:00
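A hedged sketch of the workaround this describes (the class shape is illustrative; the point is the trailing .contiguous()):
```python
# sketch assuming a standard embedding layer: .contiguous() cuts fusion at the
# embedding output, so downstream kernels see the same AST on every device and
# the slow embedding kernel search runs once instead of per-device
from tinygrad import Tensor
from tinygrad.nn import Embedding

class EmbeddingBert:
  def __init__(self, vocab_size:int, embed_dim:int):
    self.embedding = Embedding(vocab_size, embed_dim)
  def __call__(self, idx:Tensor) -> Tensor:
    return self.embedding(idx).contiguous()
```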
qazal
fd4f06e623
kernelize prereqs [pr] ( #9811 )
...
* kernelize prereqs [pr]
* work
* tensor maps to assign
* unwrap st
* process replay
* grouper changes
* replay
2025-04-10 15:22:20 +08:00
chenyu
c462162db8
update benchmark bert scripts with BS and ACC_DTYPE ( #9826 )
...
BS=16, ACC_DTYPE=half for tinybox, BS=128, ACC_DTYPE=float for mi300x
2025-04-10 02:06:02 -04:00
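A sketch of how the scripts might pick these up (env var names and values are from the commit; the parsing shown is illustrative):
```python
# tinygrad.helpers.getenv reads an env var with a typed default
from tinygrad.helpers import getenv

BS = getenv("BS", 16)                    # 16 on tinybox, 128 on mi300x
ACC_DTYPE = getenv("ACC_DTYPE", "half")  # half on tinybox, float on mi300x
```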
qazal
498a2bf738
add err handling tests to viz + cleanups ( #9825 )
...
* cleanup
* add err handling tests to viz + cleanups
* lint
2025-04-10 14:05:05 +08:00
chenyu
a0b72f066a
don't free intermediate for bert mi300x ( #9824 )
2025-04-10 01:48:34 -04:00
chenyu
566e389585
more relaxed ATOL for HALF=1 simple_matmul test ( #9823 )
...
it's a function of N, so it is only updated in the test command
2025-04-10 00:46:16 -04:00
Francis Lata
eb2e59db42
RetinaNet model type annotations and loss functions ( #9822 )
...
* add type annotations and loss functions for training
* combine sum of multiple dims inside loss functions
2025-04-10 00:31:37 -04:00
chenyu
06a928b341
higher ATOL for half input TC test ( #9821 )
...
flaky
2025-04-09 23:57:25 -04:00
Francis Lata
7bb36d71b2
remove openimages iterate ( #9820 )
2025-04-09 22:54:12 -04:00
chenyu
2e1002e179
EVAL_BS=96 and BEAM=3 for bert green ( #9819 )
...
setup time 19m -> 13m with the same end-to-end time
2025-04-09 22:37:27 -04:00
uuuvn
3ee317ffed
Fix kfd autogen and verify it in ci ( #9818 )
...
Had to autogen newer uapi headers for #9746 (the dmabuf export ioctl was missing);
submitting just the fix without updating to the newer headers, as those are
only needed for the infiniband stuff
2025-04-10 09:53:42 +08:00
nimlgen
d7330ea6ad
amd: refactor sqtt into sep functions ( #9816 )
...
* amd: refactor sqtt into sep functions
* fix
2025-04-10 00:39:45 +03:00
nimlgen
0ca98b9f20
amd: gfx9 use cache ctrls in acquire_mem ( #9815 )
2025-04-09 20:17:02 +03:00
George Hotz
fce432d2e3
Ops.FUSE makes softmax a single kernel ( #9808 )
...
* KERNELIZE makes softmax a single kernel
* single kernel works
* softmax works
* broken
* correct
* skip that test
* kernelize tests
* rename to fuse
* better reduce_push_add_ones code
* correct now
* cleanups
* oops
* return None if we can't push ones
* rename + docs
* atol fixes group
* flash attention broken test
2025-04-09 22:56:28 +08:00
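For context, the decomposition being fused here is the standard numerically stable softmax: two reductions (max and sum) plus elementwise ops, which previously lowered to separate kernels. A generic sketch, not the PR's code:
```python
# stable softmax as a tinygrad expression; with Ops.FUSE the two reductions
# and the elementwise ops collapse into a single kernel
from tinygrad import Tensor

def softmax(x:Tensor, axis:int=-1) -> Tensor:
  m = x.max(axis=axis, keepdim=True)         # reduce 1: max, for stability
  e = (x - m).exp()                          # elementwise
  return e / e.sum(axis=axis, keepdim=True)  # reduce 2, then divide
```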
nimlgen
1798ce7e52
amd: faster xcc sync ( #9783 )
...
* amd: faster xcc sync
* move to same cacheline
* comment
* keep it uncached + better poll timings
* revert this, should be fine
* fixed now?
* minor
2025-04-09 15:56:50 +03:00
qazal
3bd992dc95
multi stage graph_rewrite_map ( #9803 )
...
* multistage graph_rewrite_map
* s/merge_map/input_map
* build up kernel_map from the tensor_map
2025-04-09 15:59:45 +08:00
chenyu
57f4bc3fbb
add numpy to setup linting ( #9806 )
...
this would have caught the mypy error in the fp8 PR. keep ignore_missing_imports set to true since we also import torch, which is fat
2025-04-09 03:47:03 -04:00
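An illustrative setup.py fragment for what this change looks like (the exact extras list is an assumption):
```python
# adding numpy to the linting extra means mypy type-checks numpy usage instead
# of silently skipping it; ignore_missing_imports stays on for torch
extras_require = {
  "linting": ["pylint", "mypy", "ruff", "numpy"],
}
```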
George Hotz
bf769fa5c5
label ranges with their number ( #9805 )
2025-04-09 14:31:18 +08:00
chenyu
c5db5b83b9
add SHOULD_USE_TC=1 check to simple_matmul ( #9802 )
...
* add SHOULD_USE_TC=1 check to simple_matmul
also zero-centered the random input and updated atol for tf32
* ATOL=2e-2 for HALF
2025-04-09 02:24:42 -04:00
qazal
f27dbc8c35
becomes_map cleanups [pr] ( #9790 )
...
* cleanup becomes_map [pr]
* source
2025-04-09 14:11:53 +08:00
qazal
7d2349c827
track_rewrites in scheduler [pr] ( #9801 )
2025-04-09 12:48:14 +08:00
George Hotz
bb18adb0d5
reduce with a mul chain ( #9799 )
...
* reduce with a mul chain
* inside is just 1
2025-04-09 12:42:32 +08:00
George Hotz
78caf55154
Revert "FP8 support on NVIDIA ( #8631 )"
...
This reverts commit 2c8e4ea865.
2025-04-09 12:27:41 +08:00
George Hotz
d1505137ad
Revert "move TestOpsFp8s skipTest ( #9797 )"
...
This reverts commit a3aaf92b21.
2025-04-09 12:27:40 +08:00
George Hotz
14928fecff
Revert "fix TF32 tensor core dropped in tc_sm89 ( #9798 )"
...
This reverts commit 7c9a96824f.
2025-04-09 12:27:39 +08:00
qazal
1ed4eae510
hotfix: don't add shape to SINK viz node ( #9800 )
2025-04-09 12:04:33 +08:00
chenyu
7c9a96824f
fix TF32 tensor core dropped in tc_sm89 ( #9798 )
...
also add `SHOULD_USE_TC=1` to verify TC is applied in simple_matmul
2025-04-08 23:20:50 -04:00
chenyu
a3aaf92b21
move TestOpsFp8s skipTest ( #9797 )
...
so get_available_devices is not called when running other tests
2025-04-08 22:44:07 -04:00
pkotzbach
2c8e4ea865
FP8 support on NVIDIA ( #8631 )
...
* squashed fp8 commits
* tensorcore start
* minor changes
* pre-commit
* pylint
* Delete fp8mul.cu
* clean
* small bugfix
* fix test_dtype
* fix test_dtype_alu
* add EMULATE_CUDA_SM89
* fix ci
* fix test_linearizer
* fix test_linearizer
* fix swizzle
* add debug to simple_matmul
* fixed swizzle
* python emulator
* refactor python emulator
* setup fix
* numpy setup
* ml_dtypes only in emulate_cuda_sm89
* fix pylint
* fix tests
* fix mypy
* fix mypy
* fix ruff
* done python emulator
* add acc type
* tests
* mypy
* clean code
* add cuda tensor core tests to CI
* minor fix
* clean test_dtype.py
* clean cstyle.py
* clean test_ops.py
* fix test
* fix test
* whitespaces
* pylint
* pylint
* amd?
* amd?
* amd
* reduce lines
* mockgpu remove
* fix
* ruff
* ruff
* fix mypy
* ruff
* test only for cuda
* fixed formatting
* small fixes
* small fix
* least_upper_dtype if fp8s not supported
* log and reciprocal are supported for fp8s
* ops python fixes
* dtypes.fp8s use
* e4m3 + e5m2 result dtype test
* truncate linter fix
---------
Co-authored-by: pkotzbach <pawkotz@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-08 21:54:04 -04:00
hooved
5d85765327
types for WebGPU runtime ( #9791 )
...
* add type annotations to ops_webgpu
* rerun CI
* add types to some _run params
2025-04-08 22:52:17 +03:00
chenyu
4c8582a7ce
pipe allow_test_size in _time_program ( #9789 )
...
* pipe allow_test_size in _time_program
it was dropped a long time ago, so BEAM_ESTIMATE was doing nothing
* revert BEAM_ESTIMATE
2025-04-08 09:07:40 -04:00
chenyu
8fe83385ec
add system json for mi300x mlperf ( #9786 )
...
* add system json for mi300x mlperf
```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v5.0/tinycorp/systems/tinybox_8xMI300X.json training 4.1.0
INFO - System description checker passed for tinybox 8xMI300X
```
also removed the rocm from tinybox_red since we are not using it
* update mlperf-logging version
2025-04-08 06:36:44 -04:00
chenyu
4a807ee952
remove duplicated z3-solver in setup.py ( #9787 )
2025-04-08 06:12:58 -04:00
qazal
21e872df44
remove consts from sched_sink [pr] ( #9782 )
2025-04-08 16:08:24 +08:00
qazal
f13e9cf2d9
move view_left to grouper.py + tiny reorders [pr] ( #9780 )
...
* move view_left to grouper.py [pr]
* reorder grouper
* test_schedule
2025-04-08 15:39:28 +08:00
chenyu
7a28133b37
failed test for single softmax backward ( #9778 )
...
getting RecursionError with DONT_GROUP_REDUCES=1
2025-04-08 02:36:32 -04:00
George Hotz
fefee5d3ab
single kernel softmax ( #9776 )
...
* real single kernel softmax
* cleanup
* fix blockend insertion
* add to bert test
2025-04-08 12:35:48 +08:00