deftdawg
32bbff942c
amd: add nbio 7.2.0 for some rdna2 ( #9964 )
...
* Update of #9700 (which fixes #9665), but for the Steam Deck, which was erroring on NBIO 7.2.0
* unrelated change
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2025-04-22 12:10:48 +03:00
Ignacio Sica
0e79aee706
use_tensor_cores bugfix ( #9969 )
2025-04-21 22:58:17 -03:00
chenyu
5294c32279
dev scripts for retinanet ( #9968 )
...
also BASE_DIR -> BASEDIR for consistency, and move wandb up a bit for more accurate timing
2025-04-21 17:54:56 -04:00
nimlgen
4340197132
am: download fw from web ( #9956 )
...
* am: download fw from web
* tested
* link works
* default to web
* this is default
* not used
2025-04-21 23:26:33 +03:00
nimlgen
7244ca863c
am: fix double read of sdma fw ( #9965 )
2025-04-21 23:04:34 +03:00
uuuvn
b35f94b6ec
Don't hardcode default CLOUDDEV ( #9935 )
2025-04-21 18:46:55 +01:00
Francis Lata
defa1e77f6
get the proper dataset count ( #9962 )
2025-04-21 12:11:37 -04:00
qazal
36ed3c3253
fix kernelize with VIEW children ( #9961 )
2025-04-21 23:38:46 +08:00
uuuvn
757533cbe6
Less verbose cloud multiprocessing start ( #9960 )
...
Setting the name before starting used to be required for #9935, when
CLOUDDEV was a global variable; now it is just a readability improvement
2025-04-21 16:19:54 +01:00
Francis Lata
d7e247f329
RetinaNet INITMLPERF support ( #9950 )
...
* fixes to make fake data work
* fix eval beam
* fix merge issue
2025-04-21 10:32:05 -04:00
kamilisjon
014f870733
rm ( #9959 )
...
Co-authored-by: KamilisJonkus <kamilis.jonkus@agmis.com>
2025-04-21 15:23:45 +01:00
chenyu
f68c7041c4
doc fix is_floating_point dtype.float -> dtypes.float ( #9958 )
2025-04-21 09:23:59 -04:00
akhuntsaria
2d423e6737
fix assertion message for supported device in export_model ( #9957 )
2025-04-21 09:23:44 -04:00
ttomsa
783a191925
rm mul from _masked_setitem ( #9951 )
2025-04-21 06:41:50 -04:00
nimlgen
46469f00a2
am: tiny changes in psp load ( #9952 )
2025-04-21 11:52:02 +03:00
qazal
0bee225a58
Tensor.kernelize docs ( #9946 )
...
* Tensor.kernelize docs
* syntax
* test_kernelize_bw
* Tensor.kernelize docstring
* pruning
* tiny details
* details 2
* becomes_map terminology
* more changes to becomes
2025-04-21 16:34:03 +08:00
Francis Lata
ea4cb2c715
small cleanups ( #9947 )
2025-04-20 20:33:20 -04:00
qazal
e8910540f6
Kernelize can be called multiple times on a Tensor ( #9949 )
...
* Kernelize can be called multiple times on a Tensor
* add (failing) test_kernelize_bw
2025-04-21 06:28:47 +08:00
qazal
1d90be2cff
match kernelize API in process replay ( #9948 )
2025-04-21 05:23:41 +08:00
qazal
343a5eb588
dedup assigns in grouper VIZ name function [pr] ( #9942 )
2025-04-20 21:42:25 +08:00
qazal
e20ef7196a
Tensor.kernelize ( #9845 )
...
* add kernelize
* remove that
* kernelize returns self
* update abstractions2.py
* kernelize in test_schedule
* temp: assert BUFFER_VIEW's existence
* ASSIGN must have a buffer or subbuffer target
* assert and shrink
* fix
* padded setitem
* var
* toposort once
* extra
* base_buffer
* end with BUFFER_VIEW
* setitem for disk
* test_setitem_becomes_subbuffer
* mul slice test
* torch backend fix 1
* non-deterministic
* keep subbuffer
2025-04-20 20:53:49 +08:00
qazal
dd16087f62
fold double ASSIGN to same target ( #9941 )
2025-04-20 19:06:38 +08:00
qazal
9a9aba4cd5
setitem tests (some failing) from kernelize ( #9940 )
2025-04-20 18:47:55 +08:00
chenyu
6c30948df6
hand_coded_optimizations returns list[Opt] [pr] ( #9938 )
...
new api looks like `k.apply_opts(hand_coded_optimizations(k))`
2025-04-19 20:26:59 -04:00
chenyu
720f20865b
remove required_optimizations ( #9848 )
2025-04-19 16:51:16 -04:00
qazal
218e01833d
update scheduler section for abstractions2.py [pr] ( #9927 )
2025-04-19 12:09:14 +03:00
chenyu
3fdba48fc7
update bert green and README ( #9934 )
...
submission candidate
2025-04-18 21:21:28 -04:00
George Hotz
b359125ebf
rewrite the linearizer ( #9885 )
...
* random speedups [pr]
* speeding up linearizer
* test_gemm passes
* progress
* test_gemm passes
* working
* simpler
* blockstart unneeded
* simpler
* bugfix
* work
* don't compare
* faster
* progress
* cleanups
* work
* cleanups
* working
* reorder
* name is dumb
* fix tests
* lin2 works
* clean ctx
* mostly bottom up
* passes
* same speed now
* new lin is faster
* dedup
* lines and tuples
* track that
* lin
* revert that
* tests should pass
* merge siblings
* cleaner expression
* only lin2
* finally, some speed
* simpler
* fix unmergables with blockends
2025-04-18 22:35:40 +01:00
Ignacio Sica
023b1c28a2
test_tensor_cores_padded refactor ( #9724 )
...
* set pad to 3 for amd padded tc test
* change pad for amd regardless CI
* test tc padded uops and correctness separately
* add test_tensor_cores_padded_uops test to ci
* remove redundant check for amd device
* cleanup
2025-04-18 17:05:54 -03:00
Ignacio Sica
afff82ba0f
fix ptx linearizer bug [pr] ( #9926 )
...
* fix ptx bug
* align 16
* revert align because it breaks pr
* smallest diff that fixes ptx bug
2025-04-18 13:48:43 -03:00
chenyu
617b45748f
fuse embedding for bert on red ( #9925 )
...
also updated BEAM param and use AMD driver for actual run. 535ms step
2025-04-18 07:20:25 -04:00
qazal
b58decac0c
fix diamond assigns before mapping tensors UOps to assigns ( #9855 )
...
* keep tensor_map until diamond assign fixup
* ctx
2025-04-18 14:17:43 +03:00
qazal
a37d921917
get name from SINK in process replay ( #9924 )
...
* get name from SINK in process replay
* space
2025-04-18 13:51:11 +03:00
George Hotz
aa98aff4cd
don't use ops name, just keep sink ( #9922 )
...
* don't use ops name, just keep sink
* fix test
* endif sink
2025-04-18 08:59:18 +01:00
George Hotz
8919370c76
hotfix: fix test_save_all_dtypes on METAL
2025-04-18 08:42:31 +01:00
qazal
16dfe0a902
upstream remu ( #9921 )
2025-04-18 01:57:36 +03:00
qazal
d287afe3b1
remove shapeless const check in full_shape [pr] ( #9911 )
...
* remove shapeless const check in full_shape [pr]
* those can go too
2025-04-18 00:00:26 +03:00
chenyu
fe6a482f1d
pin hypothesis version to 6.131.0 ( #9920 )
...
6.131.1 seems to cause timeout in CI
2025-04-17 16:34:10 -04:00
chenyu
f5256e0020
Kernel.apply_opts [pr] ( #9917 )
...
* Kernel.apply_opts [pr]
updated all `for opt in`. also updated a few test_linearizer tests to not implicitly depend on hand_coded_optimizations
* not you yet
2025-04-17 08:00:56 -04:00
chenyu
e2ed673c94
FUSE_ARANGE_UINT to not fuse uint ( #9915 )
...
hack to bypass rand, can FUSE_ARANGE on green for 6ms per step
2025-04-16 18:49:38 -04:00
qazal
497daa658a
hotfix: edge-labels go above the overlay ( #9910 )
2025-04-16 23:38:12 +08:00
qazal
e8e43c6dad
ensure edge labels are always on top ( #9908 )
2025-04-16 21:08:06 +08:00
qazal
5265f25088
add counter for incoming edges in viz ( #9907 )
2025-04-16 20:14:14 +08:00
Eitan Turok
2c7c205bc5
Fix dtype comparisons in vectorized transcendental + tests ( #9794 )
...
* init test
* cleanup
* init
* update
* fix
* fix python runtime for vectorized code
* awesome helper
* update
* update
* cleanup
* more cleaning
* cleanup more
* fix tests
* more cleaning
* cleanup more
* fix
* even cleaner
* failing tests is sad
* cleanup
* better name
* make tests pass
* remove vec from python runtime
* remove vec from eval_uop
* remove expected failures
* better name
2025-04-16 08:06:12 -04:00
qazal
929e5a9905
do not construct GrouperContext [pr] ( #9906 )
2025-04-16 18:26:31 +08:00
Xingyu
047c8fd70d
Add amax support to Tensor operations in Torch Backend ( #9905 )
...
* Add amax support to Tensor operations
- Implemented amax function in backend.py for tensor max operations.
- Added unit tests for amax in test.py to ensure correct functionality.
* Fix formatting in amax output function
- Adjusted spacing in the amax output lambda function in backend.py
- Improved code readability for better maintenance
2025-04-16 10:35:50 +01:00
uuuvn
d7f623dac2
Use Buffer in cloud server instead of opaques ( #9875 )
...
Not-quite-required but makes cloud graph a *lot* cleaner because unlike
raw compiled programs `GraphRunner` takes `Buffer`s like other runners.
Otherwise either of: adding a new option to not free on `__del__`,
(ab)using `external_ptr` to prevent free, or making something like a
`FakeBuffer` is required.
2025-04-16 10:17:32 +01:00
qazal
05334e0f3f
construct children from UOp.toposort [pr] ( #9882 )
...
* construct children from UOp.toposort [pr]
* only for bases
2025-04-16 16:55:59 +08:00
geohotstan
4e8f25109a
Revert "ONNX add output shape validation ( #9720 )" ( #9904 )
...
This reverts commit ac713e04db.
2025-04-16 03:15:56 -04:00
chenyu
e8024c8281
faster bert global_norm ( #9901 )
...
tinyamd 2% faster. also updated beam params, which is 2-3% faster.
update mlperf doc and steps too
2025-04-15 18:24:44 -04:00