Commit Graph

10417 Commits

Author SHA1 Message Date
uuuvn
754d789f51 Fix and enable jit tests on CLOUD (#10031) 2025-04-24 18:39:31 +03:00
qazal
0b482fb824 add RDNA3 parser to remu (#10025)
* llvm ref

* work

* all of them

* salu

* cleaner

* start

* vector ops

* done

* replace SMEM

* vopd

* sop1

* SOPC

* null stays null_src

* sopp

* SOPK

* sop2

* vop1

* vop2

* remove allow(dead_code)

* vopc
2025-04-24 21:34:07 +08:00
uuuvn
0d903c9495 Print clouddev instead of clouddev's renderer (#10023)
This is kind of a bug: currently with DEBUG>=1 it says that the remote
has a device and then prints an array of renderer props instead of a real
device name, which doesn't make sense:

```
127.0.0.1 - - [24/Apr/2025 16:50:44] "GET /properties HTTP/1.1" 200 -
remote has device ['tinygrad.renderer.cstyle', 'MetalRenderer', []]
opened device CLOUD from pid:20210
```

Now it actually prints the name of the device behind CLOUD:

```
127.0.0.1 - - [24/Apr/2025 16:56:29] "GET /properties HTTP/1.1" 200 -
remote has device METAL
opened device CLOUD from pid:20315
```
2025-04-24 09:32:08 -04:00
George Hotz
aec75f51ef fixup some slow CI tests [pr] (#10027)
* fixup some slow CI tests [pr]

* shrink test index
2025-04-24 09:20:49 -04:00
qazal
c990aac2b1 skip flaky test_transcribe_file1_OOB (#10026) 2025-04-24 21:09:43 +08:00
George Hotz
4e2ccfddc6 ci refactor to split AMD/NVIDIA [pr] (#10024)
* ci refactor to split AMD [pr]

* split

* split amd tests

* explicit 0
2025-04-24 08:59:54 -04:00
uuuvn
0c68e44d6f Cloud properties (#10021) 2025-04-24 08:17:01 -04:00
George Hotz
db00d88415 hotfix: handle bad z3 install like z3 import fail 2025-04-24 08:09:40 -04:00
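As context for the hotfix above: a broken z3 install can fail at import time with errors other than `ImportError` (e.g. an `OSError` while loading the native libz3), so the import guard has to catch broadly. A minimal sketch of the pattern (illustrative, not tinygrad's exact code):

```python
# A broken z3 install can raise more than ImportError at import time
# (e.g. an OSError loading the native library), so catch Exception broadly.
try:
  import z3  # type: ignore
  Z3_AVAILABLE = True
except Exception:
  z3 = None  # type: ignore
  Z3_AVAILABLE = False
```

Callers then gate z3-based features on `Z3_AVAILABLE` instead of assuming a clean install.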
Sieds Lykles
e75be6eafc [bounty] [pr] index validation with z3 (#9981)
* index validation with z3

* Change comment

* toposort -> toposort()

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-04-24 08:06:08 -04:00
quortus
9e49721c47 CPUGraph support for clang (#10014)
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-04-24 07:52:35 -04:00
Park Jun
c3ad7b2a84 create randperm and support pytorch backend (#10019) 2025-04-24 07:29:02 -04:00
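For context, `randperm(n)` returns a uniform random permutation of `0..n-1`. Conceptually it is a Fisher–Yates shuffle; this is a pure-Python sketch of the semantics, not tinygrad's (or the pytorch backend's) implementation:

```python
import random

def randperm(n: int, seed: int = 0) -> list[int]:
  """Fisher-Yates shuffle: a uniform random permutation of 0..n-1."""
  rng = random.Random(seed)
  perm = list(range(n))
  # swap each position with a uniformly chosen earlier-or-equal index
  for i in range(n - 1, 0, -1):
    j = rng.randint(0, i)
    perm[i], perm[j] = perm[j], perm[i]
  return perm
```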
Matthew Daiter
b545338e59 isin_Tensor_out added (#10018) 2025-04-24 07:26:51 -04:00
chenyu
a25abf55e3 retinanet only call postprocess_detections with RUNMLPERF (#10017)
during setup we only need to compile `_eval_step().numpy()`
2025-04-23 20:45:38 -04:00
nimlgen
7f53e80db9 hotfix: amd mmio on mi300 (#10016)
* hotfix: amd mmio on mi300

* fix

* ops
2025-04-24 01:08:18 +03:00
nimlgen
1c5e353249 am: use mmio iface (#10012)
* am: use mmio iface

* linters

* fixes

* fixes + cleanups

* mute

* mypy

* style
2025-04-24 00:27:04 +03:00
chenyu
65faa1d94b explicit device in mlperf scripts (#10015) 2025-04-23 17:11:52 -04:00
chenyu
a3f938dbee remove retinanet INITMLPERF from beam script (#10011)
it only controls logging; whether real data is loaded is solely controlled by RUNMLPERF
2025-04-23 14:32:54 -04:00
nimlgen
cc52b9c528 am: add entry() to PT (#10010) 2025-04-23 20:45:52 +03:00
nimlgen
c952cb965e amd: use mmio iface (#9997)
* amd: use mmio iface

* mypy

* rename
2025-04-23 20:13:00 +03:00
Francis Lata
5542aeb0e4 RetinaNet MLPerf flag updates (#10009)
* add RUNMLPERF and update INITMLPERF usage

* update scripts to use RUNMLPERF
2025-04-23 13:00:34 -04:00
George Hotz
de0504276b pop 0 is slow [pr] (#10007) 2025-04-23 17:00:59 +01:00
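As context for the commit above: `list.pop(0)` shifts every remaining element, so draining a list from the front is O(n) per pop (quadratic overall), while `collections.deque.popleft()` is O(1). A minimal illustration (not tinygrad code):

```python
from collections import deque

# list.pop(0) is O(n) per call because all remaining elements shift left;
# deque.popleft() pops from the front in O(1).
items = list(range(10))
q = deque(items)
drained = []
while q:
  drained.append(q.popleft())
assert drained == items
```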
chenyu
d3a8d5c128 print postprocess_detections time in retinanet eval (#10005)
`BS=96 BASEDIR="/raid/datasets/openimages" MODEL=retinanet python examples/mlperf/model_eval.py`

```
...
loaded dataset             @  8.64s
loaded initial data        @ 12.57s
******  619.97 ms to enqueue, 46042.13 ms to realize ( 116.22 ms fetching, 45399.58 ms postprocess_detections).     0.09 examples/sec.  0.83 TFLOPS  @ 59.23s
******  147.49 ms to enqueue, 37362.16 ms to realize ( 146.96 ms fetching, 36618.84 ms postprocess_detections).     0.11 examples/sec.  1.03 TFLOPS  @ 96.74s
******  152.85 ms to enqueue, 37244.08 ms to realize ( 120.67 ms fetching, 36235.19 ms postprocess_detections).     0.11 examples/sec.  1.04 TFLOPS  @ 134.14s
******  146.39 ms to enqueue, 37279.85 ms to realize (  65.07 ms fetching, 36233.56 ms postprocess_detections).     0.11 examples/sec.  1.04 TFLOPS  @ 171.56s
******  152.41 ms to enqueue, 37264.04 ms to realize ( 127.08 ms fetching, 36196.10 ms postprocess_detections).     0.11 examples/sec.  1.04 TFLOPS  @ 208.98s
******  151.29 ms to enqueue, 36868.08 ms to realize ( 142.73 ms fetching, 36153.07 ms postprocess_detections).     0.11 examples/sec.  1.05 TFLOPS  @ 246.00s
******  136.41 ms to enqueue, 37325.04 ms to realize (  90.29 ms fetching, 36573.38 ms postprocess_detections).     0.11 examples/sec.  1.04 TFLOPS  @ 283.46s
```
2025-04-23 11:39:56 -04:00
George Hotz
2ed3acd767 toposort is a function [pr] (#10004) 2025-04-23 16:25:03 +01:00
uuuvn
0730ff0e50 Skip test that requires lru if device's allocator isn't lru (#10003) 2025-04-23 16:12:56 +01:00
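The skip pattern above can be sketched with unittest's `skipTest`: check the allocator's class in `setUp` and skip when it isn't LRU. The class names below are hypothetical stand-ins; the real test checks the device's actual allocator:

```python
import unittest

# Hypothetical stand-ins -- the real test inspects Device.allocator.
class Allocator: pass
class LRUAllocator(Allocator): pass

class TestLRUBehavior(unittest.TestCase):
  allocator = Allocator()  # a device without an LRU allocator
  def setUp(self):
    # skip every test in this case if the allocator isn't LRU
    if not isinstance(self.allocator, LRUAllocator):
      self.skipTest("device allocator is not LRU")
  def test_buffer_reuse(self):
    self.assertTrue(True)  # placeholder for the LRU-dependent check
```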
George Hotz
954cb06957 deepwalk without recursion [pr] (#10002)
* deepwalk without recursion [pr]

* comment and remove that test
2025-04-23 15:57:50 +01:00
uuuvn
9de73ccc22 Skip test that requires python 3.12 on older versions (#10001)
`out.cast(it.dtype.fmt).tolist()` fails with `ValueError: memoryview: destination format must be a native single character format prefixed with an optional '@'`
2025-04-23 10:09:26 -04:00
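The error above comes from `memoryview.cast`, which only accepts native single-character formats; support for the half-float `'e'` format was, as far as I know, only added in Python 3.12, so casting to that dtype fails on older versions. A round-trip with formats that work everywhere:

```python
import array

buf = array.array('i', [1, 2, 3])
mv = memoryview(buf)
# cast() accepts only native single-character formats such as 'B' or 'i';
# round-trip through raw bytes and back to int32:
roundtrip = mv.cast('B').cast('i').tolist()
assert roundtrip == [1, 2, 3]
```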
George Hotz
71ecc7fa1a use a pattern matcher for upcast [pr] (#10000) 2025-04-23 14:24:23 +01:00
George Hotz
cc1087d2ec move simplify into views_to_indexed_uops (#9999)
* move simplify into views_to_indexed_uops

* cache that
2025-04-23 13:50:27 +01:00
chenyu
c39128133c retinanet green scripts (#9996)
also removed realize in data_get and used empty tensors for fake data; slightly bigger lr. https://wandb.ai/chenyuxyz/MLPerf-RetinaNet/runs/8skid0e8?nw=nwuserchenyuxyz
2025-04-23 08:28:03 -04:00
George Hotz
a4a5f2d54a faster block order [pr] (#9998)
* faster block reorder [pr]

* ahh, that's even faster
2025-04-23 13:11:30 +01:00
chenyu
61bfd23881 update mlperf-logging version (#9995) 2025-04-22 19:32:39 -04:00
pkotzbach
dbbd755cba FP8s truncate (#9937)
* truncate fp8

* fix

* maybe like that?

* fix linters

* ruff

* move from extra and add ml_types to tests

* minor changes

* str to dtypes and nan support

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
2025-04-22 19:12:49 -04:00
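As context for the FP8 truncate commit: truncating toward an fp8-style format means keeping only a few mantissa bits (e4m3 keeps 3). One simple sketch drops the low float32 mantissa bits; note this ignores the exponent-range clamping a real fp8 conversion also needs, and is not the PR's implementation:

```python
import struct

def truncate_mantissa(x: float, keep_bits: int = 3) -> float:
  """Drop float32 mantissa bits below `keep_bits` (truncation in magnitude).
  fp8 e4m3 keeps 3 mantissa bits; a real fp8 truncation must also clamp
  the exponent range, which this sketch ignores."""
  # reinterpret the float32 bits as a uint32
  (u,) = struct.unpack('<I', struct.pack('<f', x))
  # zero out the mantissa bits we don't keep (float32 has 23 mantissa bits)
  u &= ~((1 << (23 - keep_bits)) - 1)
  (y,) = struct.unpack('<f', struct.pack('<I', u))
  return y
```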
qazal
58180caad3 schedule linearize small cleanups [pr] (#9994) 2025-04-23 05:42:29 +08:00
qazal
f4ec57baff new schedule linearizer enqueues KERNEL UOps [pr] (#9993)
* new schedule linearizer enqueues kernels [pr]

* no defaultdict

* diff

* minor
2025-04-23 05:17:58 +08:00
George Hotz
d1f6701eb7 hotfix: lower amd threshold + improve block reorder test 2025-04-22 20:44:29 +01:00
nimlgen
db51133537 rename HWInterface -> FileIOInterface (#9989)
* rename HWInterface -> FileIOInterface

* ugh
2025-04-22 22:18:57 +03:00
George Hotz
c1539b0319 putting add first orders loads as expected (#9991) 2025-04-22 20:12:05 +01:00
nimlgen
bd580d8ea4 hcq: use mmio interface in nv (#9986)
* hcq: start mmio interface

* allow double cast

* revert

* faster?

* simpler, not needed more now

* dd

* types

* fix
2025-04-22 21:58:12 +03:00
George Hotz
feee6986c9 faster block reorder (#9990)
* faster block reorder [pr]

* that shouldn't change order

* key just in sorted

* ind
2025-04-22 19:18:57 +01:00
qazal
6cb2d18c03 refactor schedule linearize to defaultdict [pr] (#9984)
* refactor schedule linearize to defaultdict [pr]

* skip that

* don't need .get
2025-04-23 00:00:23 +08:00
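The refactor above is the standard `defaultdict` pattern: missing keys get a fresh container automatically, replacing `.get`/`.setdefault` bookkeeping. A minimal illustration with hypothetical kernel/buffer names (not the actual scheduler code):

```python
from collections import defaultdict

edges = [("kernel0", "buf0"), ("kernel0", "buf1"), ("kernel1", "buf1")]

# without defaultdict: setdefault creates the list on first access
consumers = {}
for k, b in edges:
  consumers.setdefault(b, []).append(k)

# with defaultdict: missing keys get a fresh list automatically
consumers2 = defaultdict(list)
for k, b in edges:
  consumers2[b].append(k)

assert dict(consumers2) == consumers
```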
chenyu
9e5e371999 make DISABLE_COMPILER_CACHE a ContextVar [pr] (#9983) 2025-04-22 10:32:54 -04:00
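tinygrad's `ContextVar` helpers are seeded from environment variables and usable in boolean/comparison contexts. A rough sketch of that pattern (names and details are assumptions, not tinygrad's exact class):

```python
import os

class ContextVar:
  """Illustrative env-var-seeded context variable (not tinygrad's exact class)."""
  def __init__(self, key: str, default: int):
    # read the value from the environment once, falling back to the default
    self.key, self.value = key, int(os.getenv(key, str(default)))
  def __bool__(self): return bool(self.value)
  def __ge__(self, other): return self.value >= other

DISABLE_COMPILER_CACHE = ContextVar("DISABLE_COMPILER_CACHE_EXAMPLE", 0)
```

Call sites can then write `if DISABLE_COMPILER_CACHE:` instead of re-reading `os.getenv` everywhere.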
qazal
bbc324f5dc remove CAST_AFTER_EXPAND (#9980) 2025-04-22 21:06:11 +08:00
George Hotz
c519b553db non recursive toposort is 2x+ faster (#9979)
* non recursive toposort is 2x+ faster

* don't change the order
2025-04-22 13:59:38 +01:00
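The non-recursive toposort above replaces recursive DFS (which also risks hitting Python's recursion limit on deep graphs) with an explicit stack. A generic sketch of the technique, not tinygrad's implementation:

```python
def toposort(node, children):
  """Iterative DFS postorder: every child appears before its parent."""
  order, visited, stack = [], set(), [(node, False)]
  while stack:
    n, processed = stack.pop()
    if processed:          # all children done: emit the node
      order.append(n)
      continue
    if n in visited: continue
    visited.add(n)
    stack.append((n, True))  # revisit after the children
    for c in children(n):
      if c not in visited: stack.append((c, False))
  return order
```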
qazal
0d9014d021 place create_ast last, type_verify in the end (once) [pr] (#9977) 2025-04-22 20:15:23 +08:00
chenyu
fb89d9a584 retinanet eval combine output on GPUS[0] (#9966)
eval 35 sec -> 20 sec. it was spending 13 seconds assembling the output tensor on the CPU backend. GPUS[0] seems to have enough memory; otherwise we can lower EVAL_BS
2025-04-22 07:43:51 -04:00
qazal
7b55846e08 prep STORE UOp creation for multi output [pr] (#9975)
* prep STORE UOp creation for multi output [pr]

* test_multioutput_ast
2025-04-22 19:34:52 +08:00
George Hotz
e358e0a0c6 move metadata set to tensor [pr] (#9976)
* move metadata set to tensor [pr]

* only track that in tensor.py
2025-04-22 12:30:35 +01:00
qazal
f6271515fe refactor UOp.st [pr] (#9973) 2025-04-22 18:46:56 +08:00
George Hotz
f5dc70c624 microbenchmarks + micro speed ups (#9972)
* microbenchmarks

* forgot the ubenchs

* clean up type verify
2025-04-22 11:30:46 +01:00
qazal
1cf4e24ca5 fix kernelize usage with pm_gradient (#9953)
* fix kernelize usage with pm_gradient

* remove that
2025-04-22 17:26:05 +08:00