George Hotz
bf769fa5c5
label ranges with their number ( #9805 )
2025-04-09 14:31:18 +08:00
chenyu
c5db5b83b9
add SHOULD_USE_TC=1 check to simple_matmul ( #9802 )
...
* add SHOULD_USE_TC=1 check to simple_matmul
also zero-centered the random input and updated atol for tf32
* ATOL=2e-2 for HALF
2025-04-09 02:24:42 -04:00
qazal
f27dbc8c35
becomes_map cleanups [pr] ( #9790 )
...
* cleanup becomes_map [pr]
* source
2025-04-09 14:11:53 +08:00
qazal
7d2349c827
track_rewrites in scheduler [pr] ( #9801 )
2025-04-09 12:48:14 +08:00
George Hotz
bb18adb0d5
reduce with a mul chain ( #9799 )
...
* reduce with a mul chain
* inside is just 1
2025-04-09 12:42:32 +08:00
George Hotz
78caf55154
Revert "FP8 support on NVIDIA ( #8631 )"
...
This reverts commit 2c8e4ea865 .
2025-04-09 12:27:41 +08:00
George Hotz
d1505137ad
Revert "move TestOpsFp8s skipTest ( #9797 )"
...
This reverts commit a3aaf92b21 .
2025-04-09 12:27:40 +08:00
George Hotz
14928fecff
Revert "fix TF32 tensor core dropped in tc_sm89 ( #9798 )"
...
This reverts commit 7c9a96824f .
2025-04-09 12:27:39 +08:00
qazal
1ed4eae510
hotfix: don't add shape to SINK viz node ( #9800 )
2025-04-09 12:04:33 +08:00
chenyu
7c9a96824f
fix TF32 tensor core dropped in tc_sm89 ( #9798 )
...
also add `SHOULD_USE_TC=1` to verify TC is applied in simple_matmul
2025-04-08 23:20:50 -04:00
chenyu
a3aaf92b21
move TestOpsFp8s skipTest ( #9797 )
...
so get_available_devices is not called when running other tests
2025-04-08 22:44:07 -04:00
pkotzbach
2c8e4ea865
FP8 support on NVIDIA ( #8631 )
...
* squashed fp8 commits
* tensorcore start
* minor changes
* pre-commit
* pylint
* Delete fp8mul.cu
* clean
* small bugfix
* fix test_dtype
* fix test_dtype_alu
* add EMULATE_CUDA_SM89
* fix ci
* fix test_linearizer
* fix test_linearizer
* fix swizzle
* add debug to simple_matmul
* fixed swizzle
* python emulator
* refactor python emulator
* setup fix
* numpy setup
* ml_dtypes only in emulate_cuda_sm89
* fix pylint
* fix tests
* fix mypy
* fix mypy
* fix ruff
* done python emulator
* add acc type
* tests
* mypy
* clean code
* add cuda tensor core tests to CI
* minor fix
* clean test_dtype.py
* clean cstyle.py
* clean test_ops.py
* fix test
* fix test
* whitespaces
* pylint
* pylint
* amd?
* amd?
* amd
* reduce lines
* mockgpu remove
* fix
* ruff
* ruff
* fix mypy
* ruff
* test only for cuda
* fixed formatting
* small fixes
* small fix
* least_upper_dtype if fp8s not supported
* log and reciprocal are supported for fp8s
* ops python fixes
* dtypes.fp8s use
* e4m3 + e5m2 result dtype test
* truncate linter fix
---------
Co-authored-by: pkotzbach <pawkotz@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-08 21:54:04 -04:00
hooved
5d85765327
types for WebGPU runtime ( #9791 )
...
* add type annotations to ops_webgpu
* rerun CI
* add types to some _run params
2025-04-08 22:52:17 +03:00
chenyu
4c8582a7ce
pipe allow_test_size in _time_program ( #9789 )
...
* pipe allow_test_size in _time_program
it was dropped a long time ago, so BEAM_ESTIMATE is doing nothing
* revert BEAM_ESTIMATE
2025-04-08 09:07:40 -04:00
chenyu
8fe83385ec
add system json for mi300x mlperf ( #9786 )
...
* add system json for mi300x mlperf
```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v5.0/tinycorp/systems/tinybox_8xMI300X.json training 4.1.0
INFO - System description checker passed for tinybox 8xMI300X
```
also removed the rocm from tinybox_red since we are not using it
* update mlperf-logging version
2025-04-08 06:36:44 -04:00
chenyu
4a807ee952
remove duplicated z3-solver in setup.py ( #9787 )
2025-04-08 06:12:58 -04:00
qazal
21e872df44
remove consts from sched_sink [pr] ( #9782 )
2025-04-08 16:08:24 +08:00
qazal
f13e9cf2d9
move view_left to grouper.py + tiny reorders [pr] ( #9780 )
...
* move view_left to grouper.py [pr]
* reorder grouper
* test_schedule
2025-04-08 15:39:28 +08:00
chenyu
7a28133b37
failed test for single softmax backward ( #9778 )
...
getting RecursionError with DONT_GROUP_REDUCES=1
2025-04-08 02:36:32 -04:00
George Hotz
fefee5d3ab
single kernel softmax ( #9776 )
...
* real single kernel softmax
* cleanup
* fix blockend insertion
* add to bert test
2025-04-08 12:35:48 +08:00
qazal
9963bb51e0
grouper tests cleanups [pr] ( #9777 )
...
* grouper tests cleanups [pr]
* viz
* tuple
* whitespace
2025-04-08 12:33:11 +08:00
chenyu
4cc7422769
use AM driver in bert mlperf ( #9775 )
...
we should commit to using AM. it's 7ms slower in python time now
2025-04-07 23:40:27 -04:00
George Hotz
db22094d35
hotfix: update softmax fusion test
2025-04-08 11:23:19 +08:00
Francis Lata
f8fe15e64e
move BoxCoder to mlperf helpers ( #9773 )
2025-04-07 20:27:06 -04:00
Eitan Turok
bb7922b95f
Vectorize Transcendental Regression Tests ( #9753 )
...
* init test
* cleanup
2025-04-08 01:27:39 +08:00
chenyu
7c4a739fe4
full script for bert mi300x ( #9772 )
2025-04-07 11:41:31 -04:00
Sieds Lykles
07d1aefaf4
fast idiv ( #9755 )
...
* fast idiv with tests and fuzzer
* Add todo comment
* Add env variable to toggle fast_idiv
* Move env check
* Add fuzz fast_idiv to ci
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-07 08:32:24 -04:00
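For context on the "fast idiv" entry above: a common way to speed up integer division by a known constant is to replace it with a multiply and a shift. The sketch below shows that general technique for unsigned inputs only. It is not taken from the commit; the names `magic_for` and `fast_idiv` and the choice of multiplier are assumptions made for illustration.

```python
def magic_for(d: int, nbits: int) -> tuple[int, int]:
    """Return (m, s) such that (x * m) >> s == x // d for all 0 <= x < 2**nbits.

    Hypothetical helper: uses the textbook choice s = nbits + ceil(log2(d))
    and m = ceil(2**s / d), which is exact for unsigned x in that range.
    """
    if d < 1:
        raise ValueError("divisor must be positive")
    s = nbits + (d - 1).bit_length()  # nbits + ceil(log2(d))
    m = -(-(1 << s) // d)             # ceil(2**s / d) using integer arithmetic
    return m, s


def fast_idiv(x: int, m: int, s: int) -> int:
    """Divide by the constant encoded in (m, s) using one multiply and one shift."""
    # On fixed-width hardware the product needs up to 2*nbits + 1 bits.
    return (x * m) >> s


# Exhaustive check for 8-bit unsigned inputs and every divisor from 1 to 255.
for d in range(1, 256):
    m, s = magic_for(d, 8)
    assert all(fast_idiv(x, m, s) == x // d for x in range(256))
```

Signed inputs and narrower intermediate widths need extra care, which is the kind of case a fuzzer like the one mentioned in the commit would exercise.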
nimlgen
fa888ee077
minor test cleanups ( #9770 )
...
* fix test_graph on max
* pcie5
2025-04-07 15:29:12 +03:00
chenyu
3069ebfad1
use BERT_LAYERS=2 in bert init ( #9769 )
...
save 5 minutes of scheduling in setup so we can fit more search
2025-04-07 07:46:37 -04:00
qazal
891322fd51
split into grouper.py ( #9768 )
...
* split into grouper.py
* update tests
* reorder
2025-04-07 18:40:59 +08:00
qazal
219b8c9e8b
return becomes_map in scheduler [pr] ( #9766 )
...
* add a graph_rewrite pass for creating asts [pr]
* disk
* benchmark
* return becomes_map in scheduler
* reorder schedule.py into grouper and linearizer [pr]
* comments
2025-04-07 17:36:23 +08:00
qazal
6306dea6e2
add a graph_rewrite pass for creating asts [pr] ( #9765 )
...
* add a graph_rewrite pass for creating asts [pr]
* disk
* benchmark
2025-04-07 16:32:11 +08:00
qazal
07eea567d4
reorder tensor_map and grouper parts [pr] ( #9764 )
2025-04-07 15:36:13 +08:00
qazal
8ddb1357c0
fix UPat.location after pickle ( #9763 )
...
* fix UPat.location after pickle [pr]
* named upat test
2025-04-07 15:16:42 +08:00
qazal
4cd27aa0e6
hotfix: viz recenter and unlimited zoom ( #9760 )
...
* hotfix: viz recenter and unlimited zoom
* add shapes to the ast graph
* not for COPY
2025-04-07 14:38:03 +08:00
chenyu
d0dace4306
update doc for permute to 3d tensor ( #9758 )
...
easier to see if it's permuted to or permuted from
2025-04-07 00:38:05 -04:00
chenyu
b190d85ad7
benchmark script bert softmax ( #9759 )
2025-04-07 00:31:18 -04:00
Ignacio Sica
58785181a8
AMD bf16xf32 TC ( #9717 )
...
* dont test bf16 for emulated amd tc
* skip bf16 tc test in ci
* skip bf16 for AMD in test_tensor_cores_codegen
* add simple bf16 gemm test to benchmark
2025-04-07 11:41:04 +08:00
chenyu
43e4565148
weighted linear in external_benchmark_bert_matmuls ( #9757 )
...
include the linear to get qkv, and permute so that stride matches with the real run
2025-04-06 23:35:42 -04:00
George Hotz
28e06d2d44
minor cleanups from patternmatcher [pr] ( #9756 )
2025-04-07 11:28:14 +08:00
qazal
1ce4912770
viz profiler ui ( #9664 )
...
* localhost:8000/prof
* selector + table
* add pid
* on null selection reset filters
* table sort
* charset=utf-8
* clear the rest
* sort by duration
* render table
* format
* nothing in copy thread
* keep starts
* sort back
* less javascript
* diff
* works on firefox
2025-04-07 00:30:17 +08:00
chenyu
8a585dc5c1
benchmark script for matmuls in bert ( #9752 )
...
2 main matmuls in the bert layers. getting these to be fast makes bert fast
2025-04-06 19:34:25 +08:00
qazal
139999c6d7
map viz files + query params cleanup [pr] ( #9754 )
...
* map viz files + query params cleanup [pr]
* .width + fix
2025-04-06 16:20:00 +08:00
Francis Lata
71b8890dd6
use validation dataloader inside retinanet eval ( #9747 )
2025-04-05 16:46:55 -04:00
nimlgen
5f7c79676f
jit: prune independent copies ( #9749 )
...
* jit: prune independent copies
* linter
* check kernel cnt
2025-04-05 20:50:28 +03:00
nimlgen
c2573b247c
jit: rename optimize_weights -> replan_buffers_memory_layout ( #9751 )
2025-04-05 20:35:15 +03:00
uuuvn
493fb315b1
fix RDNA2 support ( #9700 )
...
linux amdgpu_discovery.c:amdgpu_discovery_set_ip_blocks is a ton of
switch cases with sometimes weird choices like replacing nbio 3.X with
2.3 while nbio 2.5 is somehow nbio 7.0. `import_module` currently just
tries to replace revision and minor with zeroes if there is no exact
match, but that's not enough to cover all that weirdness
2025-04-05 18:42:47 +03:00
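The fallback described in the commit body above (try an exact match, then zero out the revision, then the minor) can be sketched as follows. This is an illustration only: `resolve_version`, the tuple layout, and the `available` set are hypothetical, and the order of the fallbacks is an assumption rather than something the commit states.

```python
def resolve_version(version, available):
    """Pick the first matching (major, minor, revision) from a set of known versions.

    Hypothetical helper: tries the exact version, then revision zeroed,
    then minor and revision zeroed. Returns None when nothing matches.
    """
    major, minor, rev = version
    for candidate in ((major, minor, rev), (major, minor, 0), (major, 0, 0)):
        if candidate in available:
            return candidate
    return None


# Made-up set of versions for which a module exists.
available = {(2, 3, 0), (7, 0, 0)}

assert resolve_version((2, 3, 1), available) == (2, 3, 0)  # revision zeroed
assert resolve_version((7, 2, 1), available) == (7, 0, 0)  # minor and revision zeroed
assert resolve_version((3, 0, 0), available) is None       # zeroing never turns 3.x into 2.3
```

The last case shows why zeroing alone cannot reproduce the cross-major substitutions the commit describes: those need an explicit mapping.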
chenyu
5a04f4d4ba
revert bert hparams for green and red ( #9744 )
...
did more runs and it's not really better and not worth the change. only useful for BS=1024
2025-04-05 07:38:01 -04:00
chenyu
407ca54382
symbolic fold double where ( #9436 )
...
* symbolic fold double where
a.where(b.where(c, d), d) -> (a & b).where(c, d). a pattern in optimizer
* test case
2025-04-05 05:12:17 -04:00
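The fold in the commit above, `a.where(b.where(c, d), d) -> (a & b).where(c, d)`, can be checked on scalars. The sketch below uses a plain Python stand-in for an elementwise select; the `where` helper is hypothetical and is not the tinygrad API.

```python
from itertools import product


def where(cond, x, y):
    """Scalar stand-in for an elementwise select: x where cond is true, else y."""
    return x if cond else y


def nested(a, b, c, d):
    # a.where(b.where(c, d), d)
    return where(a, where(b, c, d), d)


def folded(a, b, c, d):
    # (a & b).where(c, d)
    return where(a and b, c, d)


# Both forms agree for every combination of the two boolean conditions.
for a, b in product([False, True], repeat=2):
    assert nested(a, b, 1.0, 2.0) == folded(a, b, 1.0, 2.0)
```

The fold is valid because `c` is selected only when both conditions hold, and every other branch yields the same `d`.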
Sieds Lykles
9c2fc695b5
cond.logical_not().where(a,b) -> cond.where(b,a) ( #9741 )
...
* Add rule for negation in where, simplifies arange patterns
* 0 becomes 0.0 again
* Only if cond is bool
* ne is never None
* Add a test
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-04 19:13:32 -04:00
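The rewrite in the commit above, `cond.logical_not().where(a,b) -> cond.where(b,a)`, swaps the branches instead of negating the condition. A scalar sketch for boolean conditions follows; the `where` helper is a hypothetical stand-in, not the tinygrad API.

```python
def where(cond, x, y):
    """Scalar stand-in for an elementwise select: x where cond is true, else y."""
    return x if cond else y


def negated(cond, a, b):
    # cond.logical_not().where(a, b)
    return where(not cond, a, b)


def swapped(cond, a, b):
    # cond.where(b, a)
    return where(cond, b, a)


# For a boolean condition the two forms pick the same branch.
for cond in (False, True):
    assert negated(cond, 1.0, 2.0) == swapped(cond, 1.0, 2.0)
```

This sketch only covers boolean conditions, matching the "Only if cond is bool" restriction listed in the commit.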