6798 Commits

Author SHA1 Message Date
qazal
64ebaa72b5 schedule independent of lazy.py (#7655)
* make it compile

* allow allbufs

* _recursive_group starts to work

* forced_realize works

* _get_isolated_children almost works

* 80%

* 90%

* ocd behavior

* 100% for _get_isolated_children

* FUSE_CONV_BW=1 works

* this took long

* can be from buffer's arg too

* eventually i'll share these

* test_prefer_half_buffer

* FUSE_ARANGE=1 sorta

* start assign and cleanup

fix assign

* braindump

* diff reset

* --- day 3 ---

* make _recursive_group work

* very minimal groups

* BASE

* _get_isolated_children that actually works

* working version of FUSE_CONV_BW=1 and prefer_half

* FUSE_ARANGE=1 works

* fix assign

* one less problem
2024-11-14 17:01:59 +08:00
qazal
0914c2fec9 add TestLinearizerFailures test_failure_56 and test_failure_57 (#7682)
* add test_failure_56 and test_failure_57

* so it's only METAL=1
2024-11-14 12:00:33 +08:00
qazal
a87813f063 hotfix: early fold image to image cast store (#7681)
* hotfix: early fold image to image cast store

* count out meta ops
2024-11-14 11:35:59 +08:00
chenyu
e0ad083904 use ceildiv in shard and fix a typo (#7690) 2024-11-13 18:25:06 -05:00
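As a rough illustration of why ceiling division is a natural fit when sharding, here is a minimal sketch in plain Python. The `ceildiv` body shown is the common integer-only idiom, and `shard_sizes` is a hypothetical helper invented for this example; neither is taken from the commit itself.

```python
def ceildiv(num: int, amt: int) -> int:
    # Ceiling division using only integer arithmetic (no float rounding).
    return -(num // -amt)

def shard_sizes(n: int, devices: int) -> list[int]:
    # Hypothetical helper: give every device a chunk of ceildiv(n, devices)
    # elements, with the trailing devices taking whatever is left (maybe 0).
    size = ceildiv(n, devices)
    return [max(0, min(size, n - i * size)) for i in range(devices)]

print(shard_sizes(10, 4))  # [3, 3, 3, 1]
print(shard_sizes(4, 3))   # [2, 2, 0]
```

Rounding the chunk size up rather than down guarantees that the chunks cover all `n` elements even when `n` is not a multiple of the device count.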
chenyu
51afc3cc88 update env_vars doc on VIZ link (#7689)
the existing one throws a 404 because mkdocs does not allow traversing above the doc root (i think?), so for now just use the github link to it
2024-11-13 17:28:14 -05:00
chenyu
333f5f9f8b Tensor.bitwise_not (#7688)
implemented with xor in tensor for now to not add another op. also used it in Tensor.min to fix dtype int on -2**31
2024-11-13 16:31:52 -05:00
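The two ideas in this message can be sketched in plain Python: bitwise NOT expressed as XOR with an all-ones mask, and a min built from max that stays correct at -2**31. This is only an emulation on Python ints; `wrap_int32`, `min_via_neg`, and `min_via_not` are hypothetical names, and the assumption that the original failure came from a negation-based min is ours, not something the commit states.

```python
INT32_MIN = -2**31

def wrap_int32(x: int) -> int:
    # Emulate two's-complement int32 wraparound on an unbounded Python int.
    return (x + 2**31) % 2**32 - 2**31

def bitwise_not(x: int) -> int:
    # ~x equals x XOR all-ones; in two's complement, -1 is the all-ones mask.
    return x ^ -1

def min_via_neg(xs):
    # min(x) = -max(-x): breaks for INT32_MIN, because -INT32_MIN wraps
    # back to INT32_MIN in 32-bit arithmetic.
    return wrap_int32(-max(wrap_int32(-x) for x in xs))

def min_via_not(xs):
    # min(x) = ~max(~x): NOT reverses the ordering and never overflows.
    return bitwise_not(max(bitwise_not(x) for x in xs))

data = [INT32_MIN, 5]
print(min_via_neg(data))  # 5 (wrong)
print(min_via_not(data))  # -2147483648 (correct)
```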
chenyu
0423db8d00 simpler nll_loss (#7686) 2024-11-13 15:10:08 -05:00
chenyu
fb933b79a6 add test case for nll_loss with input > 2D (#7685)
* failed test case for nll_loss with input > 2D

* fixed

* add more
2024-11-13 14:34:07 -05:00
geohotstan
9c41c376d3 add Tensor.nll_loss (#7683)
* move nll_loss to new branch

* make nll_loss examples practical

* self *is*

* add to docs

* small
2024-11-13 13:12:13 -05:00
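For readers unfamiliar with the operation, negative log-likelihood loss can be sketched in a few lines of plain Python. This follows the standard textbook definition with mean reduction and 2D input only; it is not the signature or implementation of `Tensor.nll_loss`, and it omits options such as class weights.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one row of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def nll_loss(log_probs, targets):
    # Pick the log-probability of the target class in each row,
    # negate it, and average over the batch (mean reduction).
    picked = [-row[t] for row, t in zip(log_probs, targets)]
    return sum(picked) / len(picked)

batch = [log_softmax([0.0, 0.0]), log_softmax([2.0, 0.0, 0.0])]
print(nll_loss(batch, [1, 0]))
```

With uniform logits over two classes the per-sample loss is log 2, which makes a convenient sanity check.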
chenyu
3c6fe4b79a fix Tensor.bitwise_and and Tensor.bitwise_or to support bool (#7684) 2024-11-13 13:10:39 -05:00
chenyu
3d82f8e340 simpler rand_like (#7680) 2024-11-13 12:28:41 -05:00
Roelof van Dijk
e75a855f51 refactor: efficient syntax [pr] (#7673) 2024-11-13 11:08:48 -05:00
Roelof van Dijk
433ebecee7 refactor: double if statement [pr] (#7674) 2024-11-13 11:06:59 -05:00
James
d4e4a084a1 fix: Tensor min function for unsigned ints (#7675)
* add failing tests for uint8 `min()`

* fix unsigned data type min()

* fix test data

* fix whitespace

---------

Co-authored-by: rezaarezvan <reza@rezvan.xyz>
Co-authored-by: Jamesb <experimentallearning0@gmail.com>
2024-11-13 11:04:27 -05:00
chenyu
d1dfd598a2 assert specifying device to rand_like a multi tensor (#7678)
* assert specifying device to rand_like a multi tensor

raise RuntimeError instead of dropping it silently

* fix that
2024-11-13 10:24:40 -05:00
chenyu
51432bfbff add rand_like test case with device specified (#7663)
in single device or copied multi case, device is applied. but for sharded case the device is silently ignored now. maybe similar to rand we just don't allow tuple device in rand_like
2024-11-13 09:32:55 -05:00
Reza Rezvan
23363dee55 Add: failing tests for uint8 min() (#7669)
* add failing tests for uint8 `min()`

* mark as expected failure
2024-11-13 22:12:53 +08:00
qazal
29508504ea uop style prefer small dtype + cleanups [pr] (#7671)
* just this

* space

* typing 2
2024-11-13 21:32:34 +08:00
qazal
e84d089ef1 delete ReduceOps, only use REDUCE_AXIS (#7667) 2024-11-13 19:04:27 +08:00
qazal
217c006103 buffer access on UOp [pr] (#7665)
* add .buffer access on uop

* rename to buf_uop

* start smaller

* ptr != buffer!!
2024-11-13 17:04:19 +08:00
qazal
5da149d23c uop can have base [pr] (#7666) 2024-11-13 16:53:49 +08:00
qazal
ca99c67d78 refactors from the delete lazy diff [pr] (#7664)
* dedup parent shapetrackers [pr]

* arg -> dtype

* move to ops

* arg
2024-11-13 16:23:53 +08:00
chenyu
e6cfaaa496 metal benchmark JIT=2 -> JIT=1 (#7661) 2024-11-12 22:55:27 -05:00
chenyu
4c5f7ddf1f flux set model path in args (#7660)
in addition to default downloading through fetch, add an arg to pass model path directly
2024-11-12 22:11:40 -05:00
chenyu
08706c2ea4 more readable rand [pr] (#7659)
no walrus inside walrus
2024-11-12 19:02:27 -05:00
chenyu
1884f021e3 add conv3x3 to speed_v_theoretical (#7658)
* add conv3x3 to speed_v_theoretical

* show test duration
2024-11-12 16:41:56 -05:00
ignaciosica
54c0abcb2b cleaner code_for_op order [pr] (#7653)
* cleaner code_for_op order

* maintain unary-bin-tern order

* might as well reorder for cuda and amd
2024-11-12 15:13:56 -05:00
chenyu
962dafb467 use randn in speed_v_theoretical instead of rand (#7656)
* use randn in speed_v_theoretical instead of rand

this made green gemv 20% faster... but why?

* update threshold
2024-11-12 15:00:32 -05:00
chenyu
397a2e6eb6 no special case for int32 in truncate [pr] (#7657)
this masked an issue that idx is not data, and should never need truncate
2024-11-12 14:52:14 -05:00
chenyu
6159790ab8 add gemv to speed_v_theoretical (#7654)
* add gemv to speed_v_theoretical

getting ~300GB/s if we just count the memory of inputs and output

* better green numbers

* flip
2024-11-12 11:19:35 -05:00
qazal
e07d2d0966 skip TestBeamSearch.test_large_ast (#7652) 2024-11-12 20:52:22 +08:00
qazal
0f02573830 save lines in assign tracking [pr] (#7651) 2024-11-12 20:49:13 +08:00
qazal
fbad4900bf move groups to uop [pr] (#7640)
* override group post chase [pr]

* key reduceop on ubuf

* fix type
2024-11-12 20:09:13 +08:00
George Hotz
4f1f823021 add tiny test for randomness + remove ulong buffers (#7648)
* add tiny test for randomness

* Tensor._device_seeds is a Tuple

* no tuple, just a 2 element tensor

* no more longs

* fix tests, and maybe ocelot works now

* NV still doesn't work. cleanup rules

* test + two more rules
2024-11-12 12:45:52 +08:00
chenyu
c06a5a9c72 Tensor.linspace raises for dtype.bool (#7649)
also fixed an assert when passing str dtype to randint
2024-11-11 23:05:14 -05:00
geohotstan
5eef59d732 add Tensor.linspace (#7609)
* add linspace

* shave off tests and forgot to add to docs crap

* WHOOPS

* better tests
2024-11-12 10:29:36 +08:00
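A minimal pure-Python sketch of what a linspace computes, assuming the usual NumPy/PyTorch convention that both endpoints are included. The function below is illustrative only and does not reproduce the signature or dtype handling of `Tensor.linspace`.

```python
def linspace(start: float, stop: float, steps: int) -> list[float]:
    # Evenly spaced values from start to stop, endpoints included (assumed).
    if steps < 0:
        raise ValueError("steps must be non-negative")
    if steps == 0:
        return []
    if steps == 1:
        return [float(start)]
    step = (stop - start) / (steps - 1)
    return [start + i * step for i in range(steps)]

print(linspace(0, 1, 5))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```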
chenyu
99f29e50b2 update speed_v_theoretical numbers (#7647)
better amd after set compute profile
2024-11-11 20:05:13 -05:00
chenyu
035e39f900 remove copied is_dtype_supported from onnx [pr] (#7646) 2024-11-11 19:20:32 -05:00
Ahmed Harmouche
9c63c3d8ab These casts should only happen if these are supported (#7644) 2024-11-12 07:56:50 +08:00
chenyu
a88a15c7e8 setup perflevel in red CI (#7645)
runs v4.1 bert setup.
```
rocm-smi --setprofile compute
rocm-smi --setmclk 3
rocm-smi --setperflevel high
```
2024-11-11 18:44:55 -05:00
chenyu
773d5b60bf beam benchmark tests (#7638)
* beam benchmark tests

* lower AMD number somehow

* less flaky
2024-11-11 18:11:18 -05:00
chenyu
bfab03288d fix HALF=1 in test_speed_v_torch (#7642)
* fix HALF=1 in test_speed_v_torch

"operation cache defeats" adds 1 to all args, which were centered around 0. adding 1 makes big matmul and matvec go inf.

fixed by subtracting 1 afterwards and bumped tolerance for half input

* bigger tol for BIG=2, update CI too

* bigger tol
2024-11-11 14:29:37 -05:00
nimlgen
4d81b7952a qcom match texture/sampler descriptors to OpenCL (#7622)
* qcom ioctl compare more regs

* bug fix
2024-11-11 21:56:51 +03:00
qazal
0b66a0d688 only lookup buf_uops in fuse.py [pr] (#7641) 2024-11-11 19:14:30 +02:00
qazal
08b9f055f2 don't need outputs in fuse.py [pr] (#7639) 2024-11-11 18:35:31 +02:00
George Hotz
b4cb6b89f9 hotfix: CI mac uses python 3.11 2024-11-11 23:42:35 +08:00
George Hotz
9648372ee6 hotfix: mac uses python 3.12 2024-11-11 23:23:48 +08:00
George Hotz
aaa8059aec python 3.10 is minimum [pr] (#7636) 2024-11-11 23:05:50 +08:00
Kinvert
6a0ed46b1c adding viz to env_vars docs (#7630) 2024-11-11 21:28:27 +08:00
George Hotz
d40673505f new cloud is cloudy [pr] (#7631)
* new cloud is cloudy [pr]

* waste lines to add security

* safety, with speed and less lines

* timing and del

* lines

* cleanups

* restore CloudSession

* bump to 3.10

* quotes

* renderer security
2024-11-11 20:18:04 +08:00