8403 Commits

Author SHA1 Message Date
George Hotz
bf769fa5c5 label ranges with their number (#9805) 2025-04-09 14:31:18 +08:00
chenyu
c5db5b83b9 add SHOULD_USE_TC=1 check to simple_matmul (#9802)
* add SHOULD_USE_TC=1 check to simple_matmul

also zero-centered the random input and updated atol for tf32

* ATOL=2e-2 for HALF
2025-04-09 02:24:42 -04:00
qazal
f27dbc8c35 becomes_map cleanups [pr] (#9790)
* cleanup becomes_map [pr]

* source
2025-04-09 14:11:53 +08:00
qazal
7d2349c827 track_rewrites in scheduler [pr] (#9801) 2025-04-09 12:48:14 +08:00
George Hotz
bb18adb0d5 reduce with a mul chain (#9799)
* reduce with a mul chain

* inside is just 1
2025-04-09 12:42:32 +08:00
George Hotz
78caf55154 Revert "FP8 support on NVIDIA (#8631)"
This reverts commit 2c8e4ea865.
2025-04-09 12:27:41 +08:00
George Hotz
d1505137ad Revert "move TestOpsFp8s skipTest (#9797)"
This reverts commit a3aaf92b21.
2025-04-09 12:27:40 +08:00
George Hotz
14928fecff Revert "fix TF32 tensor core dropped in tc_sm89 (#9798)"
This reverts commit 7c9a96824f.
2025-04-09 12:27:39 +08:00
qazal
1ed4eae510 hotfix: don't add shape to SINK viz node (#9800) 2025-04-09 12:04:33 +08:00
chenyu
7c9a96824f fix TF32 tensor core dropped in tc_sm89 (#9798)
also add `SHOULD_USE_TC=1` to verify TC is applied in simple_matmul
2025-04-08 23:20:50 -04:00
chenyu
a3aaf92b21 move TestOpsFp8s skipTest (#9797)
so get_available_devices is not called when running other tests
2025-04-08 22:44:07 -04:00
pkotzbach
2c8e4ea865 FP8 support on NVIDIA (#8631)
* squashed fp8 commits

* tensorcore start

* minor changes

* pre-commit

* pylint

* Delete fp8mul.cu

* clean

* small bugfix

* fix test_dtype

* fix test_dtype_alu

* add EMULATE_CUDA_SM89

* fix ci

* fix test_linearizer

* fix test_linearizer

* fix swizzle

* add debug to simple_matmul

* fixed swizzle

* python emulator

* refactor python emulator

* setup fix

* numpy setup

* ml_dtypes only in emulate_cuda_sm89

* fix pylint

* fix tests

* fix mypy

* fix mypy

* fix ruff

* done python emulator

* add acc type

* tests

* mypy

* clean code

* add cuda tensor core tests to CI

* minor fix

* clean test_dtype.py

* clean cstyle.py

* clean test_ops.py

* fix test

* fix test

* whitespaces

* pylint

* pylint

* amd?

* amd?

* amd

* reduce lines

* mockgpu remove

* fix

* ruff

* ruff

* fix mypy

* ruff

* test only for cuda

* fixed formatting

* small fixes

* small fix

* least_upper_dtype if fp8s not supported

* log and reciprocal are supported for fp8s

* ops python fixes

* dtypes.fp8s use

* e4m3 + e5m2 result dtype test

* truncate linter fix

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-08 21:54:04 -04:00
hooved
5d85765327 types for WebGPU runtime (#9791)
* add type annotations to ops_webgpu

* rerun CI

* add types to some _run params
2025-04-08 22:52:17 +03:00
chenyu
4c8582a7ce pipe allow_test_size in _time_program (#9789)
* pipe allow_test_size in _time_program

it was dropped a long time ago and BEAM_ESTIMATE is doing nothing

* revert BEAM_ESTIMATE
2025-04-08 09:07:40 -04:00
chenyu
8fe83385ec add system json for mi300x mlperf (#9786)
* add system json for mi300x mlperf

```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v5.0/tinycorp/systems/tinybox_8xMI300X.json training 4.1.0
INFO -   System description checker passed for tinybox 8xMI300X
```

also removed the rocm from tinybox_red since we are not using it

* update mlperf-logging version
2025-04-08 06:36:44 -04:00
chenyu
4a807ee952 remove duplicated z3-solver in setup.py (#9787) 2025-04-08 06:12:58 -04:00
qazal
21e872df44 remove consts from sched_sink [pr] (#9782) 2025-04-08 16:08:24 +08:00
qazal
f13e9cf2d9 move view_left to grouper.py + tiny reorders [pr] (#9780)
* move view_left to grouper.py [pr]

* reorder grouper

* test_schedule
2025-04-08 15:39:28 +08:00
chenyu
7a28133b37 failed test for single softmax backward (#9778)
getting RecursionError with DONT_GROUP_REDUCES=1
2025-04-08 02:36:32 -04:00
George Hotz
fefee5d3ab single kernel softmax (#9776)
* real single kernel softmax

* cleanup

* fix blockend insertion

* add to bert test
2025-04-08 12:35:48 +08:00
qazal
9963bb51e0 grouper tests cleanups [pr] (#9777)
* grouper tests cleanups [pr]

* viz

* tuple

* whitespace
2025-04-08 12:33:11 +08:00
chenyu
4cc7422769 use AM driver in bert mlperf (#9775)
we should commit to using AM. it's 7ms slower in python time now
2025-04-07 23:40:27 -04:00
George Hotz
db22094d35 hotfix: update softmax fusion test 2025-04-08 11:23:19 +08:00
Francis Lata
f8fe15e64e move BoxCoder to mlperf helpers (#9773) 2025-04-07 20:27:06 -04:00
Eitan Turok
bb7922b95f Vectorize Transcendental Regression Tests (#9753)
* init test

* cleanup
2025-04-08 01:27:39 +08:00
chenyu
7c4a739fe4 full script for bert mi300x (#9772) 2025-04-07 11:41:31 -04:00
Sieds Lykles
07d1aefaf4 fast idiv (#9755)
* fast idiv with tests and fuzzer

* Add todo comment

* Add env variable to toggle fast_idiv

* Move env check

* Add fuzz fast_idiv to ci

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-07 08:32:24 -04:00
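The commit message above does not spell out how the fast path works. One common way to speed up integer division by a known constant is to replace it with a multiply and a right shift, and the sketch below shows that general technique only; it is not claimed to be what this commit implements. The names `magic` and `fast_idiv` and the 8-bit default are assumptions chosen so the exhaustive check stays small.

```python
def magic(d: int, nbits: int) -> tuple[int, int]:
    """Return (multiplier, shift) so that (x * m) >> s == x // d for 0 <= x < 2**nbits."""
    if d < 1:
        raise ValueError("divisor must be a positive integer")
    log2_ceil = (d - 1).bit_length()   # ceil(log2(d)), 0 when d == 1
    shift = nbits + log2_ceil
    multiplier = -(-(1 << shift) // d)  # ceil(2**shift / d) using integer math
    return multiplier, shift


def fast_idiv(x: int, d: int, nbits: int = 8) -> int:
    """Divide an unsigned nbits-wide x by the constant d without a division at run time."""
    m, s = magic(d, nbits)
    return (x * m) >> s


# Exhaustive check over every 8-bit dividend and every divisor from 1 to 255.
all_match = all(fast_idiv(x, d) == x // d for x in range(256) for d in range(1, 256))
```

In a compiler the multiplier and shift would be computed once per constant divisor, so only the multiply and shift remain in the generated kernel. A fuzzer like the one mentioned in the commit is a natural way to guard the wider bit widths where an exhaustive check is not practical.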
nimlgen
fa888ee077 minor test cleanups (#9770)
* fix test_graph on max

* pcie5
2025-04-07 15:29:12 +03:00
chenyu
3069ebfad1 use BERT_LAYERS=2 in bert init (#9769)
save 5 minutes of scheduling in setup so we can fit more search
2025-04-07 07:46:37 -04:00
qazal
891322fd51 split into grouper.py (#9768)
* split into grouper.py

* update tests

* reorder
2025-04-07 18:40:59 +08:00
qazal
219b8c9e8b return becomes_map in scheduler [pr] (#9766)
* add a graph_rewrite pass for creating asts [pr]

* disk

* benchmark

* return becomes_map in scheduler

* reorder schedule.py into grouper and linearizer [pr]

* comments
2025-04-07 17:36:23 +08:00
qazal
6306dea6e2 add a graph_rewrite pass for creating asts [pr] (#9765)
* add a graph_rewrite pass for creating asts [pr]

* disk

* benchmark
2025-04-07 16:32:11 +08:00
qazal
07eea567d4 reorder tensor_map and grouper parts [pr] (#9764) 2025-04-07 15:36:13 +08:00
qazal
8ddb1357c0 fix UPat.location after pickle (#9763)
* fix UPat.location after pickle [pr]

* named upat test
2025-04-07 15:16:42 +08:00
qazal
4cd27aa0e6 hotfix: viz recenter and unlimited zoom (#9760)
* hotfix: viz recenter and unlimited zoom

* add shapes to the ast graph

* not for COPY
2025-04-07 14:38:03 +08:00
chenyu
d0dace4306 update doc for permute to 3d tensor (#9758)
easier to see if it's permuted to or permuted from
2025-04-07 00:38:05 -04:00
chenyu
b190d85ad7 benchmark script bert softmax (#9759) 2025-04-07 00:31:18 -04:00
Ignacio Sica
58785181a8 AMD bf16xf32 TC (#9717)
* dont test bf16 for emulated amd tc

* skip bf16 tc test in ci

* skip bf16 for AMD in test_tensor_cores_codegen

* add simple bf16 gemm test to benchmark
2025-04-07 11:41:04 +08:00
chenyu
43e4565148 weighted linear in external_benchmark_bert_matmuls (#9757)
include the linear to get qkv, and permute so that stride matches with the real run
2025-04-06 23:35:42 -04:00
George Hotz
28e06d2d44 minor cleanups from patternmatcher [pr] (#9756) 2025-04-07 11:28:14 +08:00
qazal
1ce4912770 viz profiler ui (#9664)
* localhost:8000/prof

* selector + table

* add pid

* on null selection reset filters

* table sort

* charset=utf-8

* clear the rest

* sort by duration

* render table

* format

* nothing in copy thread

* keep starts

* sort back

* less javascript

* diff

* works on firefox
2025-04-07 00:30:17 +08:00
chenyu
8a585dc5c1 benchmark script for matmuls in bert (#9752)
2 main matmuls in the bert layers. getting these to be fast makes bert fast
2025-04-06 19:34:25 +08:00
qazal
139999c6d7 map viz files + query params cleanup [pr] (#9754)
* map viz files + query params cleanup [pr]

* .width + fix
2025-04-06 16:20:00 +08:00
Francis Lata
71b8890dd6 use validation dataloader inside retinanet eval (#9747) 2025-04-05 16:46:55 -04:00
nimlgen
5f7c79676f jit: prune independent copies (#9749)
* jit: prune independent copies

* linter

* check kernel cnt
2025-04-05 20:50:28 +03:00
nimlgen
c2573b247c jit: rename optimize_weights -> replan_buffers_memory_layout (#9751) 2025-04-05 20:35:15 +03:00
uuuvn
493fb315b1 fix RDNA2 support (#9700)
linux amdgpu_discovery.c:amdgpu_discovery_set_ip_blocks is a ton of
switch cases with sometimes weird choices like replacing nbio 3.X with
2.3 while nbio 2.5 is somehow nbio 7.0. `import_module` currently just
tries to replace revision and minor with zeroes if there is no exact
match, but that's not enough to cover all that weirdness
2025-04-05 18:42:47 +03:00
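The fallback described in this commit message (try the exact version, then zero out revision and minor) can be sketched roughly as follows. This is an illustration only: `resolve_version`, the tuple layout, the order in which fields are zeroed, and the `available_nbio` set are all hypothetical and are not taken from the tinygrad source.

```python
from typing import Optional

Version = tuple[int, int, int]  # (major, minor, revision), a hypothetical layout


def resolve_version(wanted: Version, available: set[Version]) -> Optional[Version]:
    """Try the exact version, then revision zeroed, then minor and revision zeroed."""
    major, minor, _revision = wanted
    for candidate in (wanted, (major, minor, 0), (major, 0, 0)):
        if candidate in available:
            return candidate
    return None


# Hypothetical set of nbio versions that have generated modules.
available_nbio = {(2, 3, 0), (7, 0, 0)}

exact = resolve_version((2, 3, 0), available_nbio)    # exact match
zeroed = resolve_version((7, 0, 1), available_nbio)   # found after zeroing the revision
missed_3x = resolve_version((3, 0, 1), available_nbio)  # the kernel would use 2.3 here
missed_25 = resolve_version((2, 5, 0), available_nbio)  # the kernel would use 7.0 here
```

The last two lookups return `None`, which is the gap the message points at: zeroing fields can never turn nbio 3.X into 2.3 or nbio 2.5 into 7.0, so those cases need handling beyond this fallback.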
chenyu
5a04f4d4ba revert bert hparams for green and red (#9744)
did more runs and it's not really better and not worth the change. only useful for BS=1024
2025-04-05 07:38:01 -04:00
chenyu
407ca54382 symbolic fold double where (#9436)
* symbolic fold double where

a.where(b.where(c, d), d) -> (a & b).where(c, d). a pattern in optimizer

* test case
2025-04-05 05:12:17 -04:00
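The rewrite in this commit message, `a.where(b.where(c, d), d) -> (a & b).where(c, d)`, can be checked on scalars by trying every combination of the two conditions. The sketch below uses a plain Python `where` helper as a stand-in for the elementwise tensor operation; the helper and the function names are illustrative and not part of tinygrad.

```python
from itertools import product


def where(cond: bool, x, y):
    """Scalar stand-in for an elementwise where: x if cond is true, otherwise y."""
    return x if cond else y


def nested(a: bool, b: bool, c, d):
    """The original pattern: the inner where is only reached when a is true."""
    return where(a, where(b, c, d), d)


def folded(a: bool, b: bool, c, d):
    """The folded pattern: c is selected only when both conditions hold."""
    return where(a and b, c, d)


# Both forms agree for all four combinations of the two boolean conditions.
all_match = all(
    nested(a, b, "c", "d") == folded(a, b, "c", "d")
    for a, b in product([False, True], repeat=2)
)
```

The fold works because `d` is the fallback in both the inner and the outer `where`, so the only way to reach `c` is for `a` and `b` to be true at the same time.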
Sieds Lykles
9c2fc695b5 cond.logical_not().where(a,b) -> cond.where(b,a) (#9741)
* Add rule for negation in where, simplifies arange patterns

* 0 becomes 0.0 again

* Only if cond is bool

* ne is never None

* Add a test

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-04 19:13:32 -04:00
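The rule in this commit, `cond.logical_not().where(a,b) -> cond.where(b,a)`, can be illustrated on scalars. The sketch below again uses a plain Python `where` helper rather than tinygrad tensors. The second half is an assumption made for illustration: it supposes the negation is expressed as a comparison against `True`, which is one possible reason the rule is restricted to bool conditions.

```python
def where(cond, x, y):
    """Scalar stand-in for an elementwise where: x if cond is truthy, otherwise y."""
    return x if cond else y


def negated_where(cond: bool, a, b):
    """The original pattern: negate the condition, then select."""
    return where(not cond, a, b)


def swapped_where(cond, a, b):
    """The rewritten pattern: keep the condition and swap the branches."""
    return where(cond, b, a)


# For bool conditions the two forms always agree.
bool_cases_agree = all(
    negated_where(c, "a", "b") == swapped_where(c, "a", "b") for c in (False, True)
)


def ne_true_where(cond, a, b):
    """Assumed encoding of negation as `cond != True` (hypothetical, for illustration)."""
    return where(cond != True, a, b)  # noqa: E712


# With a non-bool condition such as 2, `2 != True` is true even though 2 is truthy,
# so swapping the branches would change the result.
int_case_agrees = ne_true_where(2, "a", "b") == swapped_where(2, "a", "b")
```

Under that assumed encoding the integer case disagrees, which matches the "Only if cond is bool" bullet in the commit.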