chenyu
f54508549f
don't search conv weight init in speed_v_theoretical ( #7943 )
2024-11-28 10:03:18 -05:00
geohotstan
cea5853cfa
add Tensor.scatter ( #7737 )
...
* working I think
* where are my onnx scatter tests??
* forward_only for now
* try if nan hack fix NV
* looks like issue is different... CUDA WHY
* oops that was wrong. Try if this fixes CUDA
* simpler multiply
* actually finish this up tmrw morning :x
* fix tests?
* improve tests
* improve test and implementation
* fix ruff
* complete but lots of expected failure...
* reviewed tests
* add onnx tests
* is this a processing op?
* add return type to indicate that it's not in-place
* final cleanups
* use or and improve tests a little
* add masked_index_select
* call it masked_setitem instead
* try
* FIXED
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-11-27 10:52:04 -05:00
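The scatter op added above follows the usual torch-style semantics, where along dim=0 the result satisfies out[index[i][j]][j] = src[i][j], and (as the commit notes) is not in-place. A minimal pure-Python sketch of those semantics for 2-D inputs; the helper name and shapes are illustrative, not tinygrad's implementation:

```python
# Hypothetical sketch of scatter along dim=0 for 2-D lists:
# out[index[i][j]][j] = src[i][j]. Returns a new result; the
# original target is left untouched, matching the non-in-place note above.
import copy

def scatter_dim0(target, index, src):
    out = copy.deepcopy(target)  # not in-place
    for i, row in enumerate(index):
        for j, dst_row in enumerate(row):
            out[dst_row][j] = src[i][j]
    return out

base = [[0, 0], [0, 0], [0, 0]]
idx = [[0, 2], [1, 0]]
src = [[1, 2], [3, 4]]
print(scatter_dim0(base, idx, src))  # [[1, 4], [3, 0], [0, 2]]
print(base)                          # unchanged: [[0, 0], [0, 0], [0, 0]]
```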
George Hotz
9d0038bccb
small changes from block linearizer [pr] ( #7888 )
...
* small changes from block linearizer [pr]
* fix test_gc
2024-11-25 15:27:04 +08:00
chenyu
5c5b1b994c
less flaky benchmarks ( #7855 )
...
JIT=2 for metal cifar with HALF, and lower tflops for nv test_gemm_4096. failures in https://github.com/tinygrad/tinygrad/actions/runs/11980239535/job/33404098428?pr=7830
2024-11-22 16:39:39 -05:00
qazal
9828277c03
view doesn't have buffer, fix the tests [pr] ( #7841 )
...
* view doesn't have buffer, fix the tests [pr]
* need assigns
2024-11-22 20:41:55 +08:00
George Hotz
e9ae2ccd09
_prg to match _buf [pr] ( #7816 )
2024-11-21 12:44:48 +08:00
George Hotz
c5d458ce02
BufferSpec and ProgramSpec [pr] ( #7814 )
...
* BufferSpec and ProgramSpec [pr]
* delete preallocate, it's unused
* Revert "delete preallocate, it's unused"
This reverts commit dcfcfaccde.
2024-11-21 12:18:05 +08:00
George Hotz
9df5a62c5e
unify to HWQueue [pr] ( #7812 )
...
* unify to HWCommandQueue [pr]
* all is HWQueue
2024-11-21 10:33:08 +08:00
chenyu
11cea00090
lower vs_theoretical conv tflops threshold for nv ( #7811 )
...
less flaky
2024-11-20 20:03:49 -05:00
George Hotz
eb0bb7dc0b
final dname to device [pr] ( #7806 )
...
* final dname to device [pr]
* oops, fix nv
2024-11-20 20:20:28 +08:00
George Hotz
bc977fec53
dname -> device [pr] ( #7804 )
...
* dname -> device [pr]
* a few more
* only one left
2024-11-20 17:57:14 +08:00
George Hotz
d71fe7faa5
rename allocator methods to not conflict [pr] ( #7788 )
...
* rename allocator methods to not conflict [pr]
* forgot those
* transfer + offset
2024-11-20 00:10:29 +08:00
qazal
1e31b5ba6b
hotfix: ctx doesn't impact process replay [pr] ( #7785 )
2024-11-19 20:17:01 +08:00
ignaciosica
597a239e28
Remove UnaryOps, BinaryOps, TernaryOps, MetaOps [pr] ( #7725 )
...
* remove unaryops
* remove ternaryops
* remove metaops
* hotfix
* remove binaryops
* hotfix: test_pattern_matcher
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-11-16 20:56:56 +08:00
qazal
e84d089ef1
delete ReduceOps, only use REDUCE_AXIS ( #7667 )
2024-11-13 19:04:27 +08:00
chenyu
1884f021e3
add conv3x3 to speed_v_theoretical ( #7658 )
...
* add conv3x3 to speed_v_theoretical
* show test duration
2024-11-12 16:41:56 -05:00
chenyu
962dafb467
use randn in speed_v_theoretical instead of rand ( #7656 )
...
* use randn in speed_v_theoretical instead of rand
this made green gemv 20% faster... but why?
* update threshold
2024-11-12 15:00:32 -05:00
chenyu
6159790ab8
add gemv to speed_v_theoretical ( #7654 )
...
* add gemv to speed_v_theoretical
getting ~300GB/s if we just count the memory of inputs and output
* better green numbers
* flip
2024-11-12 11:19:35 -05:00
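The ~300GB/s figure above comes from counting only the bytes of the matrix, the input vector, and the output vector. A sketch of that arithmetic; the 4096x4096 half-precision shape and the timing are illustrative assumptions, not the commit's actual benchmark parameters:

```python
# Hypothetical bandwidth estimate for y = A @ x, counting only the memory
# of inputs and output as the commit describes. Shapes/timing are assumed.
def gemv_gbps(n: int, m: int, bytes_per_elem: int, seconds: float) -> float:
    moved = (n * m + m + n) * bytes_per_elem  # A is n*m, x is m, y is n
    return moved / seconds / 1e9

# e.g. a 4096x4096 half-precision (2-byte) gemv finishing in 112 us:
print(round(gemv_gbps(4096, 4096, 2, 112e-6), 1))  # ~299.7 GB/s
```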
chenyu
99f29e50b2
update speed_v_theoretical numbers ( #7647 )
...
better amd after set compute profile
2024-11-11 20:05:13 -05:00
chenyu
773d5b60bf
beam benchmark tests ( #7638 )
...
* beam benchmark tests
* lower AMD number somehow
* less flaky
2024-11-11 18:11:18 -05:00
nimlgen
4d81b7952a
qcom match texture/sampler descriptors to OpenCL ( #7622 )
...
* qcom ioctl compare more regs
* bug fix
2024-11-11 21:56:51 +03:00
chenyu
8ca422e21a
script to compare kernel opt with BEAM ( #7604 )
...
interesting that on m1 max hcopt wins BEAM 2 about 20% of the time
2024-11-08 17:40:28 -05:00
Harald Schäfer
e7cbc29f48
openpilot benchmark: add cast from numpy to benchmark ( #7593 )
...
* openpilot benchmark: add cast from numpy to benchmark
* whitespace
* comment
2024-11-08 19:31:00 +08:00
George Hotz
205befa788
move is_dtype_supported to device [pr] ( #7575 )
2024-11-07 20:38:03 +08:00
Carl Basho
630a7f37cf
update tests ( #7554 )
...
Co-authored-by: John Doe <null@mail.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-11-05 11:35:15 -05:00
chenyu
207bca6cea
set PAGE_SIZE=1 and generate new dataset ( #7559 )
...
13080 rows in total. both generating and loading this are pretty broken now. filters are wrong for example
2024-11-05 11:25:01 -05:00
George Hotz
99bd4372a5
Ops.ALU is no more, the arg is just an op ( #7525 )
...
* op arg alu [pr]
* more
* more passing
* fix more tests
* more tests passing
* fix single failing test
* so much cleaner
* noop to not have process replay trigger
* fix ptx
2024-11-05 00:22:22 +08:00
George Hotz
0c19b6298b
rename ops to have unique names ( #7522 )
2024-11-04 17:09:45 +08:00
George Hotz
c8bf09b7d4
s/UOps/Ops ( #7500 )
...
* s/UOps/Ops [pr]
* fix
2024-11-03 11:26:10 +08:00
qazal
e955aa1bee
hotfix: process replay ( #7418 )
2024-10-30 22:45:40 +02:00
George Hotz
4e2895f8d2
safe changes from new dtype branch [pr] ( #7397 )
...
* safe changes from new dtype branch [pr]
* only image test on GPU
2024-10-30 17:18:48 +08:00
qazal
51c0c8d27e
cacheable small graph rewrite ( #7371 )
2024-10-29 22:28:13 +08:00
qazal
e46edc22aa
use unittest helpers in TestTensorMetadata [pr] ( #7329 )
...
* use unittest helpers in TestTensorMetadata [pr]
* fix that
* 5 args
2024-10-28 18:38:30 +08:00
qazal
8d9459f281
always run process replay with contextvars ( #7323 )
...
* always run process replay with contextvars [pr]
* not the last two
* extra
* no pr
2024-10-27 20:44:42 +02:00
nimlgen
293714610a
capture beam log runtime errors ( #7311 )
2024-10-26 13:59:45 +03:00
qazal
d482d927a8
hotfix: nobody uses [run_process_replay] [pr] ( #7264 )
2024-10-24 13:37:29 +03:00
chenyu
f890d1cbbd
remove PUSH_PERMUTES from external_test_opt ( #7232 )
...
remove old comments and update kernel count for test_convnext
2024-10-23 00:11:34 -04:00
qazal
dae908299e
full_ast_rewrite api with ScheduleItemContext ( #7223 )
2024-10-22 23:17:05 +03:00
chenyu
ea016b55d1
don't throw in fuzz_linearizer ( #7148 )
...
already broken on master and needs a fix. don't throw so it doesn't block other PRs
2024-10-18 09:28:30 -04:00
nimlgen
45db7d9045
fuzz qcom vs opencl ( #7130 )
...
* fuzz qcom vs opencl
* fix nv
* better?
* typo
* open both devs
2024-10-17 18:49:08 +03:00
George Hotz
ded1b38b84
minor dtype cleanup [pr] ( #7124 )
...
* minor dtype cleanup [pr]
* use ptr() function
2024-10-17 17:41:23 +08:00
nimlgen
39ab67e9ef
beam capture and replay in fuzz ( #7099 )
...
* beam capture and replay in fuzz
* clean a bit
2024-10-16 20:26:58 +03:00
qazal
40f33c110b
big graph var_vals as rewrite context ( #7007 )
...
* var_vals as rewrite context
* no default arg
* add st var_vals
* delete some stuff
* add the rewrite rule again
* extra
* this whole part is preschedule
* test with a second context
* redo
* i always forget tensor variable
2024-10-16 07:31:44 +03:00
qazal
390171d686
delete SAVE_SCHEDULE=1 [pr] ( #7087 )
2024-10-16 07:13:20 +03:00
George Hotz
3169cb386d
remove graph [pr] ( #7085 )
2024-10-16 11:40:07 +08:00
nimlgen
b025495e5c
fuzz nv vs cuda ( #7066 )
...
* fuzz nv vs cuda
* fixes
* smth
* um
* cmp the same
* dnrt
* correct gpfifo scan
* fix
2024-10-15 22:22:40 +03:00
qazal
09de958855
move print_diff to test/helpers ( #7071 )
2024-10-15 22:00:39 +03:00
chenyu
fbaab30fe3
add timing to fuzz_linearizer ( #7056 )
...
and applied smaller FUZZ_MAX_SIZE. this is getting quite slow in CI
2024-10-14 11:57:41 -04:00
qazal
0ef186d4be
scheduler internal api cleanups [pr] ( #7052 )
...
* delete external_benchmark_ast.py [pr]
* cleanup 2
* random
2024-10-14 15:56:10 +03:00
chenyu
bd8ecf7fd6
remove NumNode ( #7035 )
2024-10-13 16:42:19 -04:00