George Hotz
8206c7281e
move const multiply after REDUCE ( #9730 )
2025-04-04 11:07:46 +08:00
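The title suggests a rewrite that relies on the identity sum(c * x_i) = c * sum(x_i). The sketch below only illustrates that identity with plain Python integers; it is not tinygrad code, and the function names are hypothetical.

```python
# Hypothetical illustration, not tinygrad code: a constant factor can be
# hoisted out of a sum reduction, replacing N multiplies with one.

def scale_then_reduce(xs, c):
    # multiply every element by the constant, then reduce
    return sum(c * x for x in xs)

def reduce_then_scale(xs, c):
    # reduce first, then apply the constant once to the result
    return c * sum(xs)

xs = [3, -1, 4, 1, -5, 9]
assert scale_then_reduce(xs, 7) == reduce_then_scale(xs, 7)
```

The two forms agree exactly for integers; with floating point they can differ by rounding.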
chenyu
6b3480ec70
update mi300x bert hparams ( #9716 )
...
* update mi300x bert hparams
borrowed from previous submission that also did BS=1024
* update
2025-04-03 22:30:00 -04:00
George Hotz
cac8bcf8b5
use Ops.REDUCE ( #9721 )
...
* decrease bert python time [pr]
* order copies
* Revert "order copies"
This reverts commit 3f62c8693b.
* rewrite count
* Ops.REDUCE
* acc first in the add chain
* Fix tensor core acc
* arange patterns look good
* fix multireduce gate
* reduce rewrite rule
* bump that to 15 minutes
* multiwmma isn't fusing
* gep through wmma is gep pushing
* bump that timeout too, it's all env setup
* add failing test
2025-04-04 10:14:34 +08:00
nimlgen
949459fdd6
jit: fix deallocate on unallocated buffers in free_intermediates ( #9699 )
2025-04-03 18:32:51 +03:00
qazal
52a8ecb15e
record unittest location in process replay [pr] ( #9727 )
2025-04-03 20:50:09 +08:00
geohotstan
ac713e04db
ONNX add output shape validation ( #9720 )
...
* add output shape validation and remove support for sequence_type
* nit better err msg
* add sequence_type back
* improve err msg
* Revert "improve err msg"
This reverts commit dc9eaea4bb.
* Revert "add sequence_type back"
This reverts commit 288170b2d9.
* do explicit shape equality
* small nit
2025-04-03 05:44:53 -04:00
chenyu
7dadbf3697
insert float() in bert acc ( #9726 )
...
sum of bool by default uses default_float for acc. So without float, it might overflow with a large BS and default_float=HALF.
fixed clsf_accuracy to not be inf in mi300x bert
2025-04-03 05:44:09 -04:00
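The overflow described above can be reproduced without tinygrad. This is a minimal sketch using the standard library's half-precision `struct` format; the helper `to_half` and the count of 70000 are hypothetical, chosen only because 70000 exceeds the largest finite half value (65504).

```python
import math
import struct

HALF_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def to_half(x: float) -> float:
    """Round x to half precision; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the half range; model the
        # hardware behaviour of saturating to infinity instead
        return math.copysign(math.inf, x)

# Hypothetical count of correct predictions in a large batch.
num_correct = 70000

acc_half = to_half(float(num_correct))  # half accumulator: overflows
acc_float = float(num_correct)          # wider accumulator: exact

assert math.isinf(acc_half)
assert acc_float == 70000.0
```

Any metric derived from the half-precision count is then inf as well, which matches the symptom the commit reports for clsf_accuracy.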
chenyu
79145e3d40
cleanup truncate_bf16 [pr] ( #9725 )
...
use torch bfloat16 for groundtruth in test. also a TODO for discrepancy
2025-04-03 05:43:49 -04:00
Ignacio Sica
bc2d86195e
increase test tolerance ( #9719 )
2025-04-03 15:24:09 +08:00
chenyu
1d25844d44
Revert "disable CI red llama 3 4 gpu beam ( #9690 )" ( #9709 )
...
This reverts commit 6a5eacba8b.
2025-04-03 02:34:39 -04:00
George Hotz
49dafe6d43
add gc tests [pr] ( #9718 )
...
* add gc tests [pr]
* del
* more gc tests
* add NullGraph
2025-04-03 14:08:32 +08:00
Ignacio Sica
bc91fffc5d
fix gated store with index in python backend ( #9703 )
...
* add default gate in index
* assert store
* add TestRendererFailures
- move test_gated_store_with_alu to new TestRendererFailures class for
tests that fail on multiple renderers
- add test_renderer_failures.py run on python CI
* add test for gated index in 2d
* test TestRendererFailures
2025-04-03 12:48:28 +08:00
qazal
f2bd65ccfc
delete Ops.EMPTY and Tensor._metaop ( #9715 )
...
* delete Ops.EMPTY and Tensor._metaop [pr]
* test_creation
* arg=
* abstractions2
2025-04-03 12:29:02 +08:00
George Hotz
5c7b549eab
use functools.cache instead of lru_cache(None) [pr] ( #9714 )
...
* use functools.cache instead of lru_cache(None) [pr]
* more cache
2025-04-03 11:47:13 +08:00
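For context, `functools.cache` (Python 3.9+) is a shorthand for `functools.lru_cache(maxsize=None)`. The sketch below uses two hypothetical functions to show that both decorators yield an unbounded cache with the same statistics.

```python
import functools

@functools.lru_cache(None)  # older spelling: unbounded LRU cache
def square_old(x):
    return x * x

@functools.cache            # newer spelling, same behaviour
def square_new(x):
    return x * x

# Call each function twice with the same argument: one miss, then one hit.
for fn in (square_old, square_new):
    fn(3)
    fn(3)

# Both report an unbounded cache (maxsize is None) and identical counters.
assert square_old.cache_info().maxsize is None
assert square_new.cache_info().maxsize is None
assert square_old.cache_info() == square_new.cache_info()
```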
qazal
bbd13191f4
cleanup tensor BIND + remove outdated comments in tensor.py [pr] ( #9712 )
...
* cleanup tensor BIND + remove outdated comments in tensor.py [pr]
* from_blob whitespace
* assert
2025-04-03 11:21:53 +08:00
geohotstan
e1d7e47cca
fix ONNX IsInf unintended dtype promotion ( #9711 )
...
* add IsInf
* add corresponding test
* that float16 is kinda silly
2025-04-02 22:46:15 -04:00
qazal
11ae254dc5
construct BUFFER UOps directly when device is known [pr] ( #9710 )
...
* construct BUFFER UOps directly when device is known [pr]
* diff
2025-04-03 10:41:44 +08:00
George Hotz
1714fc3ba4
start work on speed [pr] ( #9707 )
...
* fix get_location
* fix get_location try 2
* clean up split_load_store [pr]
* SHR fixup [pr]
2025-04-03 10:39:01 +08:00
George Hotz
0f1ffc2050
hotfix: cat tests 2048 instead of 256
2025-04-03 10:37:56 +08:00
uuuvn
5bd485c027
Fix double SDMA_OP_FENCE ( #9705 )
...
Introduced in #9585, probably when I incorrectly resolved a merge conflict
while rebasing an old, mi300x-only branch. Seems to be the source of the
multi-GPU beam llama hangs.
2025-04-03 09:43:37 +08:00
chenyu
a6fec2f5ae
dev_run for bert on mi300x ( #9706 )
2025-04-02 21:12:55 -04:00
nimlgen
d96b4983ac
amd: support rdna4 in runtime again ( #9702 )
2025-04-03 01:19:23 +07:00
Ignacio Sica
2d6d8b7355
add bf16 mfma support ( #9695 )
...
* add bf16 mfma support
* skip tc if emulated_amd and dtypes is bf16
* hotfix
2025-04-02 21:44:49 +08:00
nimlgen
a6733f519f
dsp: make relro sections contiguous ( #9701 )
2025-04-02 18:02:16 +07:00
George Hotz
ea5caefef0
gep should look at count, not vcount ( #9698 )
...
* gep should look at count, not vcount
* gep in order is a rule
* min change
* gep on void
2025-04-02 18:10:57 +08:00
George Hotz
f72a87fd0e
add proper support for Ops.IGNORE to remove store masks ( #9692 )
...
* add proper support for Ops.IGNORE to remove store masks
* remove useless NHWC
* revert that
2025-04-02 16:38:01 +08:00
chenyu
3b8d923692
remove skip LLVM in test_div_int ( #9686 )
2025-04-02 04:15:00 -04:00
chenyu
bc3bfcbad4
update install gpuocelot ( #9693 )
...
`-DCMAKE_POLICY_VERSION_MINIMUM=3.5`
2025-04-02 04:10:34 -04:00
George Hotz
e78e8722dc
Revert "LDS noop and spec ( #9669 )" ( #9691 )
...
This reverts commit 870b545ace.
Co-authored-by: Ignacio Sica <mignacio.sica@gmail.com>
2025-04-02 15:31:32 +08:00
George Hotz
4514fd91c1
more stuff from DSP ( #9689 )
...
* more good stuff from dsp branch
* test pkl imagenet
2025-04-02 15:27:48 +08:00
chenyu
6a5eacba8b
disable CI red llama 3 4 gpu beam ( #9690 )
...
device hangs and ci would fail
2025-04-02 03:19:09 -04:00
Ignacio Sica
876a8be97a
Debug env var breakdown ( #9663 )
...
* add debug level breakdown
* hotfix
* Update env_vars.md
2025-04-02 14:34:07 +08:00
George Hotz
6f812d3f2f
fixes from the dsp branch + 12500 lines ( #9683 )
...
* fixes from the dsp branch
* more changes
* those are gep pushing
2025-04-02 13:07:17 +08:00
chenyu
c20f112e9f
example test use z3 to verify valid simplification ( #9684 )
2025-04-02 01:05:52 -04:00
chenyu
bca0c85193
skip CI CPU test_data_parallel_resnet_train_step ( #9685 )
...
flaky
2025-04-02 01:04:54 -04:00
qazal
bb94f13e58
add RECORD_TRACEBACKS=1 option to process replay ( #9679 )
...
* add RECORD_TRACEBACKS=1 option to process replay
* stack
2025-04-02 11:58:27 +08:00
chenyu
3acc1b928a
minor div_and_mod_folding cleanup [pr] ( #9681 )
...
it's not wrong because the dtype is never used, but `x.const_like` is more readable
2025-04-01 23:51:36 -04:00
chenyu
c672716b38
improve vmin/vmax for IDIV ( #9678 )
2025-04-01 23:16:01 -04:00
chenyu
8dd88ad476
don't div_and_mod_folding for negative numerator with remainder ( #9674 )
...
can be wrong in C div since it truncates towards zero
2025-04-01 16:26:23 -04:00
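The truncation issue mentioned above shows up as soon as the numerator is negative and the division leaves a remainder. The sketch below models C integer division in Python; `c_div` and `c_mod` are hypothetical helpers, not tinygrad functions.

```python
def c_div(a: int, b: int) -> int:
    """Integer division as in C: the quotient is truncated toward zero."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

def c_mod(a: int, b: int) -> int:
    """Remainder as in C: it takes the sign of the numerator."""
    return a - b * c_div(a, b)

# Python's // and % floor toward negative infinity instead.
assert (-7 // 2, -7 % 2) == (-4, 1)
assert (c_div(-7, 2), c_mod(-7, 2)) == (-3, -1)

# With a non-negative numerator the two conventions agree.
assert (c_div(7, 2), c_mod(7, 2)) == (7 // 2, 7 % 2)
```

A folding rule derived under floor semantics can therefore give a different result in generated C code for exactly these inputs.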
chenyu
0e34f9082e
helper functions for cstyle div mod [pr] ( #9673 )
2025-04-01 08:06:56 -04:00
qazal
eee0dcc37a
merge viz back into one file ( #9672 )
...
* merge viz back into one file
* work
* rename lib to js directory
* fix diff
* less indenting
* memory graph is back
* viz_sz.py
2025-04-01 19:52:02 +08:00
Ignacio Sica
870b545ace
LDS noop and spec ( #9669 )
...
* init lds noop and lds_0 spec
* refactor lds helper test
* fix typo
* test all lds at the same time
* change comment
* comment
* start test_lds_full
* test_lds_tc
* add tc spec
2025-04-01 18:44:55 +08:00
uuuvn
609a006242
AMDComputeQueue.wreg ( #9628 )
...
* AMDComputeQueue.wreg
Used to be part of #9428. I think it's much more readable than repeating
the ~same pm4 things over and over again, especially with separate .encode
* fix indentation
2025-04-01 17:01:33 +07:00
qazal
fa373e15a3
hotfix: NULL=1 Buffer does not have _buf ( #9661 )
2025-04-01 17:43:55 +08:00
nimlgen
3e2f42c2e8
autogen: remove am headers from extra ( #9666 )
2025-04-01 14:45:30 +07:00
Ignacio Sica
cfad139189
bump assembly debug to 7 ( #9662 )
2025-04-01 11:51:33 +08:00
Ignacio Sica
ac533e89a2
remove duplicated ast print ( #9660 )
2025-04-01 10:29:24 +08:00
Ignacio Sica
846ef84cda
move uops print to debug >= 6 ( #9659 )
2025-04-01 10:29:09 +08:00
b1tg
d9af4cfc1b
AMD_LLVM: tensor cores support ( #9613 )
...
* tensor cores support
* test tensor cores codegen
* use rewrite rules
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-04-01 09:56:27 +08:00
qazal
1658eb4e63
always fit fresh viz graph into view [pr] ( #9657 )
2025-04-01 09:34:26 +08:00