chenyu
b190d85ad7
benchmark script bert softmax ( #9759 )
2025-04-07 00:31:18 -04:00
Ignacio Sica
58785181a8
AMD bf16xf32 TC ( #9717 )
...
* dont test bf16 for emulated amd tc
* skip bf16 tc test in ci
* skip bf16 for AMD in test_tensor_cores_codegen
* add simple bf16 gemm test to benchmark
2025-04-07 11:41:04 +08:00
chenyu
43e4565148
weighted linear in external_benchmark_bert_matmuls ( #9757 )
...
include the linear to get qkv, and permute so that stride matches with the real run
2025-04-06 23:35:42 -04:00
George Hotz
28e06d2d44
minor cleanups from patternmatcher [pr] ( #9756 )
2025-04-07 11:28:14 +08:00
qazal
1ce4912770
viz profiler ui ( #9664 )
...
* localhost:8000/prof
* selector + table
* add pid
* on null selection reset filters
* table sort
* charset=utf-8
* clear the rest
* sort by duration
* render table
* format
* nothing in copy thread
* keep starts
* sort back
* less javascript
* diff
* works on firefox
2025-04-07 00:30:17 +08:00
chenyu
8a585dc5c1
benchmark script for matmuls in bert ( #9752 )
...
2 main matmuls in the bert layers. getting these to be fast makes bert fast
2025-04-06 19:34:25 +08:00
qazal
139999c6d7
map viz files + query params cleanup [pr] ( #9754 )
...
* map viz files + query params cleanup [pr]
* .width + fix
2025-04-06 16:20:00 +08:00
Francis Lata
71b8890dd6
use validation dataloader inside retinanet eval ( #9747 )
2025-04-05 16:46:55 -04:00
nimlgen
5f7c79676f
jit: prune independent copies ( #9749 )
...
* jit: prune independent copies
* linter
* check kernel cnt
2025-04-05 20:50:28 +03:00
nimlgen
c2573b247c
jit: rename optimize_weights -> replan_buffers_memory_layout ( #9751 )
2025-04-05 20:35:15 +03:00
uuuvn
493fb315b1
fix RDNA2 support ( #9700 )
...
linux amdgpu_discovery.c:amdgpu_discovery_set_ip_blocks is a ton of
switch cases with sometimes weird choices like replacing nbio 3.X with
2.3 while nbio 2.5 is somehow nbio 7.0. `import_module` currently just
tries to replace revision and minor with zeroes if there is no exact
match, but that's not enough to cover all that weirdness
2025-04-05 18:42:47 +03:00
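The fallback described in the RDNA2 commit message above (#9700) can be sketched roughly as follows. This is a minimal illustration, not the actual `import_module` code: the helper names `candidate_versions` and `resolve`, the tuple representation of versions, the order in which fallbacks are tried, and the example version sets are all assumptions made for the sketch.

```python
def candidate_versions(major, minor, rev):
    """Versions to try, most specific first (hypothetical ordering):
    exact match, then revision zeroed, then minor and revision zeroed."""
    ordered = []
    for version in [(major, minor, rev), (major, minor, 0), (major, 0, 0)]:
        if version not in ordered:
            ordered.append(version)
    return ordered


def resolve(available, version):
    """Return the first candidate present in `available`, or None."""
    for candidate in candidate_versions(*version):
        if candidate in available:
            return candidate
    return None


# Hypothetical set of versions that have a matching module.
available = {(2, 3, 0), (7, 0, 0)}

# Zeroing the revision finds a match here...
found = resolve(available, (2, 3, 1))
# ...but no amount of zeroing turns a 3.x version into 2.3, which is the
# kind of remapping the commit message says the kernel performs.
missing = resolve(available, (3, 1, 0))
```

In this sketch `found` is `(2, 3, 0)` and `missing` is `None`, which is the gap the commit message points at: a remapping between unrelated version numbers needs an explicit table, not a zeroing rule.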
chenyu
5a04f4d4ba
revert bert hparams for green and red ( #9744 )
...
did more runs and it's not really better and not worth the change. only useful for BS=1024
2025-04-05 07:38:01 -04:00
chenyu
407ca54382
symbolic fold double where ( #9436 )
...
* symbolic fold double where
a.where(b.where(c, d), d) -> (a & b).where(c, d). a pattern in optimizer
* test case
2025-04-05 05:12:17 -04:00
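The rewrite in "symbolic fold double where" (#9436) relies on the identity a.where(b.where(c, d), d) == (a & b).where(c, d). A minimal sketch that checks the identity on scalar booleans, assuming `where(cond, x, y)` means "x if cond else y" elementwise; the helper names below are illustrative and are not tinygrad APIs.

```python
from itertools import product


def where(cond, x, y):
    """Scalar stand-in for an elementwise where: x if cond else y."""
    return x if cond else y


def nested(a, b, c, d):
    """Original form: a.where(b.where(c, d), d)."""
    return where(a, where(b, c, d), d)


def folded(a, b, c, d):
    """Folded form: (a & b).where(c, d)."""
    return where(a and b, c, d)


# Exhaustive check over both conditions; "c" and "d" are placeholder values.
identity_holds = all(
    nested(a, b, "c", "d") == folded(a, b, "c", "d")
    for a, b in product([False, True], repeat=2)
)
```

The fold is valid because both branches that are not selected by `a and b` already return `d`, so the inner where adds no information beyond the conjunction.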
Sieds Lykles
9c2fc695b5
cond.logical_not().where(a,b) -> cond.where(b,a) ( #9741 )
...
* Add rule for negation in where, simplifies arange patterns
* 0 becomes 0.0 again
* Only if cond is bool
* ne is never None
* Add a test
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-04 19:13:32 -04:00
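The rule from #9741, cond.logical_not().where(a,b) -> cond.where(b,a), can be checked the same way on scalar booleans. This is a sketch with illustrative helper names, not the pattern-matcher code; it only covers boolean `cond`, which matches the "Only if cond is bool" restriction in the commit body.

```python
def where(cond, x, y):
    """Scalar stand-in for an elementwise where: x if cond else y."""
    return x if cond else y


def negated(cond, a, b):
    """Original form: cond.logical_not().where(a, b)."""
    return where(not cond, a, b)


def swapped(cond, a, b):
    """Rewritten form: cond.where(b, a)."""
    return where(cond, b, a)


# Both boolean inputs give the same result; "a" and "b" are placeholders.
rule_holds = all(
    negated(cond, "a", "b") == swapped(cond, "a", "b")
    for cond in (False, True)
)
```

Swapping the branches removes one operation from the graph, which is consistent with the commit's note that it simplifies arange patterns.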
Sieds Lykles
e9a3ac02a5
Remove ne from arange pattern ( #9743 )
2025-04-04 18:31:13 -04:00
nimlgen
86c55414d7
ops_amd: simplify gfx version ( #9742 )
...
* ops_amd: simplify gfx version
* fix
* all version compact style
* mypy
* revert this
* rename back to target
2025-04-04 22:18:11 +03:00
qazal
16d6aa15f1
record unittest name in process replay ( #9731 )
...
* record unittest name in process replay
* getitem
* filename + (optional) name
* del
* get_test_method
* not solved
* try with linecache
* test: print_loc
* format
* without linecache
* checkout master
2025-04-05 01:39:48 +08:00
qazal
354db961c6
viz refactor to prep for profiler [pr] ( #9739 )
2025-04-04 17:13:14 +08:00
chenyu
fe998798fb
linearizer failure test for OUT OF BOUNDS ACCESS ( #9738 )
2025-04-04 03:48:43 -04:00
George Hotz
8b5a523743
fix minimum length in pattern matcher ( #9736 )
2025-04-04 14:57:01 +08:00
chenyu
640ff681c3
rename bert script to 8xMI300X ( #9734 )
...
and adds a script for single MI300X
2025-04-03 23:36:24 -04:00
George Hotz
b719aa1fb0
only check once for divisible fold lengths ( #9732 )
2025-04-04 11:27:34 +08:00
George Hotz
926b0bcc57
cache folded upcast [pr] ( #9733 )
2025-04-04 11:23:19 +08:00
George Hotz
8206c7281e
move const multiply after REDUCE ( #9730 )
2025-04-04 11:07:46 +08:00
chenyu
6b3480ec70
update mi300x bert hparams ( #9716 )
...
* update mi300x bert hparams
borrowed from previous submission that also did BS=1024
* update
2025-04-03 22:30:00 -04:00
George Hotz
cac8bcf8b5
use Ops.REDUCE ( #9721 )
...
* decrease bert python time [pr]
* order copies
* Revert "order copies"
This reverts commit 3f62c8693b.
* rewrite count
* Ops.REDUCE
* acc first in the add chain
* Fix tensor core acc
* arange patterns look good
* fix multireduce gate
* reduce rewrite rule
* bump that to 15 minutes
* multiwmma isn't fusing
* gep through wmma is gep pushing
* bump that timeout too, it's all env setup
* add failing test
2025-04-04 10:14:34 +08:00
nimlgen
949459fdd6
jit: fix deallocate on unallocated buffers in free_intermediates ( #9699 )
2025-04-03 18:32:51 +03:00
qazal
52a8ecb15e
record unittest location in process replay [pr] ( #9727 )
2025-04-03 20:50:09 +08:00
geohotstan
ac713e04db
ONNX add output shape validation ( #9720 )
...
* add output shape validation and remove support for sequence_type
* nit better err msg
* add sequence_type back
* improve err msg
* Revert "improve err msg"
This reverts commit dc9eaea4bb.
* Revert "add sequence_type back"
This reverts commit 288170b2d9.
* do explicit shape equality
* small nit
2025-04-03 05:44:53 -04:00
chenyu
7dadbf3697
insert float() in bert acc ( #9726 )
...
sum of bool by default uses default_float for acc. So without float, it might overflow with a large BS and default_float=HALF.
fixed clsf_accuracy to not be inf in mi300x bert
2025-04-03 05:44:09 -04:00
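The overflow behind "insert float() in bert acc" (#9726) comes from the limited range of half precision. A minimal sketch using only the standard library's `struct` module to round values to IEEE 754 half precision; it does not model how tinygrad accumulates a sum, and the helper `to_half` and the example count of 70000 are hypothetical.

```python
import struct

HALF_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_half(x: float) -> float:
    """Round x to the nearest half-precision value, returning +/-inf when
    the value is too large to represent."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


# The largest finite half value survives the round trip.
at_limit = to_half(HALF_MAX)
# A hypothetical count of 70000 matching elements does not fit in half.
too_large = to_half(70000.0)
# Precision is also coarse well below the limit: 4097 rounds to 4096.
rounded = to_half(4097.0)
```

A count that becomes `inf` in half precision stays `inf` after dividing by the number of elements, which matches the `inf` clsf_accuracy mentioned in the commit body. Casting the boolean tensor to float before the sum avoids depending on `default_float` for the accumulator.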
chenyu
79145e3d40
cleanup truncate_bf16 [pr] ( #9725 )
...
use torch bfloat16 for groundtruth in test. also a TODO for discrepancy
2025-04-03 05:43:49 -04:00
Ignacio Sica
bc2d86195e
increase test tolerance ( #9719 )
2025-04-03 15:24:09 +08:00
chenyu
1d25844d44
Revert "disable CI red llama 3 4 gpu beam ( #9690 )" ( #9709 )
...
This reverts commit 6a5eacba8b.
2025-04-03 02:34:39 -04:00
George Hotz
49dafe6d43
add gc tests [pr] ( #9718 )
...
* add gc tests [pr]
* del
* more gc tests
* add NullGraph
2025-04-03 14:08:32 +08:00
Ignacio Sica
bc91fffc5d
fix gated store with index in python backend ( #9703 )
...
* add default gate in index
* assert store
* add TestRendererFailures
- move test_gated_store_with_alu to new TestRenderFailures class for
tests that fail on multiple renderers
- add test_renderer_failures.py run on python CI
* add test for gated index in 2d
* test TestRenderFailures
2025-04-03 12:48:28 +08:00
qazal
f2bd65ccfc
delete Ops.EMPTY and Tensor._metaop ( #9715 )
...
* delete Ops.EMPTY and Tensor._metaop [pr]
* test_creation
* arg=
* abstractions2
2025-04-03 12:29:02 +08:00
George Hotz
5c7b549eab
use functools.cache instead of lru_cache(None) [pr] ( #9714 )
...
* use functools.cache instead of lru_cache(None) [pr]
* more cache
2025-04-03 11:47:13 +08:00
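The swap in #9714 is behavior-preserving: in the standard library, `functools.cache` (Python 3.9+) is an unbounded cache equivalent to `functools.lru_cache(maxsize=None)`. A small sketch with illustrative function names:

```python
import functools

calls = {"old": 0, "new": 0}


@functools.lru_cache(None)
def square_old(x):
    """Old spelling: unbounded LRU cache."""
    calls["old"] += 1
    return x * x


@functools.cache
def square_new(x):
    """New spelling: same unbounded cache, shorter decorator."""
    calls["new"] += 1
    return x * x


# Call each function twice with the same argument; the body runs only once.
results = [square_old(3), square_old(3), square_new(3), square_new(3)]
```

Both wrappers expose `cache_info()` and `cache_clear()`, so code that inspects or resets the cache keeps working after the change.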
qazal
bbd13191f4
cleanup tensor BIND + remove outdated comments in tensor.py [pr] ( #9712 )
...
* cleanup tensor BIND + remove outdated comments in tensor.py [pr]
* from_blob whitespace
* assert
2025-04-03 11:21:53 +08:00
geohotstan
e1d7e47cca
fix ONNX IsInf unintended dtype promotion ( #9711 )
...
* add IsInf
* add corresponding test
* that float16 is kinda silly
2025-04-02 22:46:15 -04:00
qazal
11ae254dc5
construct BUFFER UOps directly when device is known [pr] ( #9710 )
...
* construct BUFFER UOps directly when device is known [pr]
* diff
2025-04-03 10:41:44 +08:00
George Hotz
1714fc3ba4
start work on speed [pr] ( #9707 )
...
* fix get_location
* fix get_location try 2
* clean up split_load_store [pr]
* SHR fixup [pr]
2025-04-03 10:39:01 +08:00
George Hotz
0f1ffc2050
hotfix: cat tests 2048 instead of 256
2025-04-03 10:37:56 +08:00
uuuvn
5bd485c027
Fix double SDMA_OP_FENCE ( #9705 )
...
Introduced in #9585, probably when I incorrectly resolved a merge conflict
while rebasing an old, mi300x-only branch. Seems to be the source of
multi gpu beam llama hangs
2025-04-03 09:43:37 +08:00
chenyu
a6fec2f5ae
dev_run for bert on mi300x ( #9706 )
2025-04-02 21:12:55 -04:00
nimlgen
d96b4983ac
amd: support rdna4 in runtime again ( #9702 )
2025-04-03 01:19:23 +07:00
Ignacio Sica
2d6d8b7355
add bf16 mfma support ( #9695 )
...
* add bf16 mfma support
* skip tc if emulated_amd and dtypes is bf16
* hotfix
2025-04-02 21:44:49 +08:00
nimlgen
a6733f519f
dsp: make relro sections contiguous ( #9701 )
2025-04-02 18:02:16 +07:00
George Hotz
ea5caefef0
gep should look at count, not vcount ( #9698 )
...
* gep should look at count, not vcount
* gep in order is a rule
* min change
* gep on void
2025-04-02 18:10:57 +08:00
George Hotz
f72a87fd0e
add proper support for Ops.IGNORE to remove store masks ( #9692 )
...
* add proper support for Ops.IGNORE to remove store masks
* remove useless NHWC
* revert that
2025-04-02 16:38:01 +08:00
chenyu
3b8d923692
remove skip LLVM in test_div_int ( #9686 )
2025-04-02 04:15:00 -04:00