chenyu
51432bfbff
add rand_like test case with device specified ( #7663 )
...
in single device or copied multi case, device is applied. but for sharded case the device is silently ignored now. maybe similar to rand we just don't allow tuple device in rand_like
2024-11-13 09:32:55 -05:00
Reza Rezvan
23363dee55
Add: failing tests for uint8 min() ( #7669 )
...
* add failing tests for uint8 `min()`
* mark as expected failure
2024-11-13 22:12:53 +08:00
qazal
29508504ea
uop style prefer small dtype + cleanups [pr] ( #7671 )
...
* just this
* space
* typing 2
2024-11-13 21:32:34 +08:00
qazal
e84d089ef1
delete ReduceOps, only use REDUCE_AXIS ( #7667 )
2024-11-13 19:04:27 +08:00
qazal
217c006103
buffer access on UOp [pr] ( #7665 )
...
* add .buffer access on uop
* rename to buf_uop
* start smaller
* ptr != buffer!!
2024-11-13 17:04:19 +08:00
qazal
5da149d23c
uop can have base [pr] ( #7666 )
2024-11-13 16:53:49 +08:00
qazal
ca99c67d78
refactors from the delete lazy diff [pr] ( #7664 )
...
* dedup parent shapetrackers [pr]
* arg -> dtype
* move to ops
* arg
2024-11-13 16:23:53 +08:00
chenyu
e6cfaaa496
metal benchmark JIT=2 -> JIT=1 ( #7661 )
2024-11-12 22:55:27 -05:00
chenyu
4c5f7ddf1f
flux set model path in args ( #7660 )
...
in addition to default downloading through fetch, add an arg to pass model path directly
2024-11-12 22:11:40 -05:00
chenyu
08706c2ea4
more readable rand [pr] ( #7659 )
...
no walrus inside walrus
2024-11-12 19:02:27 -05:00
chenyu
1884f021e3
add conv3x3 to speed_v_theoretical ( #7658 )
...
* add conv3x3 to speed_v_theoretical
* show test duration
2024-11-12 16:41:56 -05:00
ignaciosica
54c0abcb2b
cleaner code_for_op order [pr] ( #7653 )
...
* cleaner code_for_op order
* maintain unary-bin-tern order
* might as well reorder for cuda and amd
2024-11-12 15:13:56 -05:00
chenyu
962dafb467
use randn in speed_v_theoretical instead of rand ( #7656 )
...
* use randn in speed_v_theoretical instead of rand
this made green gemv 20% faster... but why?
* update threshold
2024-11-12 15:00:32 -05:00
chenyu
397a2e6eb6
no special case for int32 in truncate [pr] ( #7657 )
...
this masked an issue that idx is not data, and should never need truncate
2024-11-12 14:52:14 -05:00
chenyu
6159790ab8
add gemv to speed_v_theoretical ( #7654 )
...
* add gemv to speed_v_theoretical
getting ~300GB/s if we just count the memory of inputs and output
* better green numbers
* flip
2024-11-12 11:19:35 -05:00
qazal
e07d2d0966
skip TestBeamSearch.test_large_ast ( #7652 )
2024-11-12 20:52:22 +08:00
qazal
0f02573830
save lines in assign tracking [pr] ( #7651 )
2024-11-12 20:49:13 +08:00
qazal
fbad4900bf
move groups to uop [pr] ( #7640 )
...
* override group post chase [pr]
* key reduceop on ubuf
* fix type
2024-11-12 20:09:13 +08:00
George Hotz
4f1f823021
add tiny test for randomness + remove ulong buffers ( #7648 )
...
* add tiny test for randomness
* Tensor._device_seeds is a Tuple
* no tuple, just a 2 element tensor
* no more longs
* fix tests, and maybe ocelot works now
* NV still doesn't work. cleanup rules
* test + two more rules
2024-11-12 12:45:52 +08:00
chenyu
c06a5a9c72
Tensor.linspace raises for dtype.bool ( #7649 )
...
also fixed an assert when passing str dtype to randint
2024-11-11 23:05:14 -05:00
geohotstan
5eef59d732
add Tensor.linspace ( #7609 )
...
* add linspace
* shave off tests and forgot to add to docs crap
* WHOOPS
* better tests
2024-11-12 10:29:36 +08:00
chenyu
99f29e50b2
update speed_v_theoretical numbers ( #7647 )
...
better amd after set compute profile
2024-11-11 20:05:13 -05:00
chenyu
035e39f900
remove copied is_dtype_supported from onnx [pr] ( #7646 )
2024-11-11 19:20:32 -05:00
Ahmed Harmouche
9c63c3d8ab
These casts should only happen if these are supported ( #7644 )
2024-11-12 07:56:50 +08:00
chenyu
a88a15c7e8
setup perflevel in red CI ( #7645 )
...
runs v4.1 bert setup.
```
rocm-smi --setprofile compute
rocm-smi --setmclk 3
rocm-smi --setperflevel high
```
2024-11-11 18:44:55 -05:00
chenyu
773d5b60bf
beam benchmark tests ( #7638 )
...
* beam benchmark tests
* lower AMD number somehow
* less flaky
2024-11-11 18:11:18 -05:00
chenyu
bfab03288d
fix HALF=1 in test_speed_v_torch ( #7642 )
...
* fix HALF=1 in test_speed_v_torch
"operation cache defeats" adds 1 to all arg, which were centered around 0. adding 1 makes big matmul and matvec go inf.
fixed by subtracting 1 afterwards and bumped the tolerance for half input
* bigger tol for BIG=2, update CI too
* bigger tol
2024-11-11 14:29:37 -05:00
nimlgen
4d81b7952a
qcom match texture/sampler descriptors to OpenCL ( #7622 )
...
* qcom ioctl compare more regs
* bug fix
2024-11-11 21:56:51 +03:00
qazal
0b66a0d688
only lookup buf_uops in fuse.py [pr] ( #7641 )
2024-11-11 19:14:30 +02:00
qazal
08b9f055f2
don't need outputs in fuse.py [pr] ( #7639 )
2024-11-11 18:35:31 +02:00
George Hotz
b4cb6b89f9
hotfix: CI mac uses python 3.11
2024-11-11 23:42:35 +08:00
George Hotz
9648372ee6
hotfix: mac uses python 3.12
2024-11-11 23:23:48 +08:00
George Hotz
aaa8059aec
python 3.10 is minimum [pr] ( #7636 )
2024-11-11 23:05:50 +08:00
Kinvert
6a0ed46b1c
adding viz to env_vars docs ( #7630 )
2024-11-11 21:28:27 +08:00
George Hotz
d40673505f
new cloud is cloudy [pr] ( #7631 )
...
* new cloud is cloudy [pr]
* waste lines to add security
* safety, with speed and less lines
* timing and del
* lines
* cleanups
* restore CloudSession
* bump to 3.10
* quotes
* renderer security
2024-11-11 20:18:04 +08:00
qazal
766a680588
swizzle parents with graph rewrite ( #7625 )
...
* delete st_fixup
* refactor
* minimal diff
2024-11-11 16:50:38 +08:00
qazal
fec977b966
calling view on graph edges is fine [pr] ( #7632 )
2024-11-11 16:35:18 +08:00
George Hotz
bbc64bf305
x|(x&y) -> x ( #7629 )
...
* x|(x&y) -> x
* fix tests
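The rewrite in the title is the absorption law, which holds bit by bit for integers as well as for booleans. A minimal sketch in plain Python that checks the identity exhaustively over 8-bit values; it does not use tinygrad's rewrite machinery, and the helper name `or_of_and` is hypothetical, chosen only for illustration:

```python
from itertools import product

def or_of_and(x: int, y: int) -> int:
    """Unsimplified form of the expression: x | (x & y)."""
    return x | (x & y)

# Every bit set in (x & y) is already set in x, so the OR adds nothing.
# Check all pairs of 8-bit values to confirm the expression reduces to x.
absorption_holds = all(
    or_of_and(x, y) == x for x, y in product(range(256), repeat=2)
)
print(absorption_holds)
```

Because the identity holds for every bit independently, the same simplification is valid at any integer width.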
2024-11-11 10:00:18 +08:00
uuuvn
94a484542b
Hook memoryview via class instead of a function ( #7627 )
2024-11-11 09:07:06 +08:00
qazal
a8da84cce0
recursive swizzle with just graph_rewrite [pr] ( #7626 )
2024-11-10 20:14:21 +02:00
qazal
7275cfb9d8
cleanup swizzle upats [pr] ( #7624 )
2024-11-10 17:05:27 +02:00
qazal
092a441748
test swizzle post permute ( #7623 )
...
* test swizzle post permute
* add st_fixup assert
2024-11-10 16:18:22 +02:00
George Hotz
745316493c
hotfix: add test_simple_conv2d_bias
2024-11-10 18:36:42 +08:00
George Hotz
44c1fd5661
add optional llvm opt [pr] ( #7619 )
2024-11-10 13:26:49 +08:00
George Hotz
0a411b4f68
replace llvm with new llvm ( #7616 )
...
* replace llvm with new llvm
* fix test_linearizer
* minor fixups
* fix alloca
* don't use alloca
* fix DEFINE_ACC
* lines
* comments and lines
* a little tighter
2024-11-10 11:28:52 +08:00
qazal
b61266eb97
late fusion spec for big graph [pr] ( #7613 )
2024-11-09 23:43:11 +08:00
qazal
9d6b03d691
early assert swizzle in kernel [pr] ( #7610 )
...
* early assert swizzle in kernel [pr]
* better
* note changes
* TestIndexing 2
2024-11-09 21:54:43 +08:00
chenyu
8ca422e21a
script to compare kernel opt with BEAM ( #7604 )
...
interesting that on m1 max hcopt wins BEAM 2 about 20% of the time
2024-11-08 17:40:28 -05:00
chenyu
573f145dcf
METAL raise RuntimeError with no compiler and bad src ( #7603 )
...
fixed BEAM if src is invalid on METAL. it currently only accepts RuntimeError in `_time_program`
2024-11-08 17:09:12 -05:00
chenyu
74b4d1c1e1
rewrite idx again in real_strides after uop_given_valid ( #7600 )
...
uop_given_valid does not guarantee output to be flat. fixed one last real_strides test.
2024-11-08 14:30:32 -05:00