chenyu
9789a83064
hotfix DEBUG in speed_v_theoretical.py conv ( #8266 )
infinite loop when DEBUG is set manually: `DEBUG=2 python test/external/speed_v_theoretical.py -k conv`
```
File "/Users/chenyu/code/tinygrad/tinygrad/helpers.py", line 95, in __ge__
def __ge__(self, x): return self.value >= x
^^^^^^^^^^^^^^^
[Previous line repeated 4984 more times]
RecursionError: maximum recursion depth exceeded in comparison
```
2024-12-15 19:44:45 -05:00
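The recursion in that traceback comes from a comparison that delegates to `self.value`. A minimal sketch of the failure mode (hypothetical `Ctx` class, not tinygrad's actual `ContextVar`): if the wrapped value ever ends up being the wrapper itself, every `>=` re-enters `__ge__` until the recursion limit is hit.
```
# hypothetical stand-in for a context-variable wrapper whose comparisons
# delegate to the wrapped value, like helpers.py's __ge__ in the traceback
class Ctx:
  def __init__(self, value): self.value = value
  def __ge__(self, x): return self.value >= x

ok = Ctx(2)
print(ok >= 1)    # True: int >= int terminates normally

bad = Ctx(0)
bad.value = bad   # value accidentally set to the wrapper itself
try: bad >= 1     # __ge__ -> self.value >= 1 -> __ge__ -> ...
except RecursionError as e: print(e)  # maximum recursion depth exceeded in comparison
```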
qazal
67e66ac1ab
hotfix: schedule_uop in process replay ( #8260 )
* hotfix: schedule_uop in process replay
* notes
2024-12-15 21:24:54 +08:00
chenyu
62e19649c0
lower test_conv_3x3_256_32_32_256_256 ( #8226 )
tiny7 is slow
2024-12-13 17:15:53 -05:00
qazal
5864627abe
process replay filter warnings [pr] ( #8199 )
2024-12-13 17:43:43 +08:00
George Hotz
8a04a3a77a
rename LazyBuffer -> UOp [pr] ( #8169 )
* rename LazyBuffer -> UOp [pr]
* fix docs
2024-12-11 16:15:52 -08:00
chenyu
155f7df599
lower test_gemm_4096 expectation on green ( #8152 )
sometimes only getting 119, so lowered the expectation to 115
2024-12-10 18:05:12 -05:00
qazal
07b6d5cf63
assign early folding ( #8093 )
* assign early folding [pr]
* move to to_si
* -
* fix generate_dataset
* diff too big
* no recreation, no diff
* gzip
* new sops from tiny10
* final try
2024-12-07 17:02:55 +08:00
chenyu
564b3a3e1b
onnx Bitwise ops ( #8095 )
free stuff!
2024-12-06 16:58:09 -05:00
chenyu
d000c08f04
fix return type of Tensor.pow ( #8091 )
int to the power of int should return int, etc.; it hints that we would like to have Ops.POW
2024-12-06 13:38:29 -05:00
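A quick check of the behavior described above. The exact promotion rules are my assumption; the commit's point is that an integer base with an integer exponent should not silently become float.
```
from tinygrad import Tensor, dtypes

a = Tensor([2, 3], dtype=dtypes.int32)
print((a ** 2).dtype)    # expected after this fix: dtypes.int32, not a float dtype
print((a ** 0.5).dtype)  # a float exponent should still promote to a float dtype
```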
geohotstan
5184410fc3
combine get inputs and type_parse function in onnx [fixed] ( #8081 )
* 1 is simpler than 2
* variable name
* change error wording
* shapes for sequence type must be homogeneous
* bug fix for model benchmark
* fix comments too
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-12-06 12:34:47 -05:00
chenyu
b73d9a7d24
Revert "combine get inputs and type_parse function in onnx ( #8069 )" ( #8079 )
This reverts commit 074a67a6eb.
2024-12-06 08:04:21 -05:00
geohotstan
074a67a6eb
combine get inputs and type_parse function in onnx ( #8069 )
* 1 is simpler than 2
* variable name
* change error wording
* shapes for sequence type must be homogeneous
2024-12-06 07:42:35 -05:00
chenyu
5c6ed5dba6
lower test_conv_3x3_256_32_32_256_256 expectation ( #8060 )
failed https://github.com/tinygrad/tinygrad/actions/runs/12182799887/job/33982676812#step:9:210
2024-12-05 10:30:56 -05:00
George Hotz
20878be2af
lower test_gemv_4096_16384 expectations
2024-12-05 12:08:26 +08:00
geohotstan
5ce8090d42
simple onnx_ops cleanups ( #8003 )
* simple clean ups first
* more work
* kinda have adam
* ooo momentum worked nicely
* almost there
* wow.. is the onnx test wrong
* nicer optim stuff
* just skip that test
* small comment changes
* use naming convention from other parts of codebase
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-12-04 15:33:03 -05:00
chenyu
0693158d28
lower v_theoretical gemv on red ( #8042 )
tiny7 is still slower https://github.com/tinygrad/tinygrad/actions/runs/12166149038/job/33931736130#step:8:209
2024-12-04 13:59:40 -05:00
George Hotz
08657cb7b0
hotfix: bump expectations in speed_v_theoretical
2024-12-04 19:00:33 +08:00
George Hotz
ea65c79ba2
hotfix: don't spam BEAM debug in speed_v_theoretical
2024-12-04 18:47:16 +08:00
George Hotz
09b00b1b04
hotfix: use kernel timings instead of python timings in speed_v_theoretical
2024-12-04 18:36:17 +08:00
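The distinction matters because Python wall-clock includes dispatch and scheduling overhead. A sketch of the two measurements (not the test's actual code; assumes `GlobalCounters.time_sum_s` is populated, which happens when DEBUG >= 2 makes runs collect device timings):
```
import time
from tinygrad import Tensor
from tinygrad.helpers import GlobalCounters

a, b = Tensor.rand(4096, 4096).realize(), Tensor.rand(4096, 4096).realize()
GlobalCounters.reset()
st = time.perf_counter()
(a @ b).realize()
wall = time.perf_counter() - st
# wall includes python + dispatch overhead; time_sum_s sums device-side kernel time
print(f"python: {wall*1e3:.2f} ms, kernels: {GlobalCounters.time_sum_s*1e3:.2f} ms")
```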
uuuvn
e9c5b23ba1
Use MTLCompiler directly (v2) ( #7920 )
* Use MTLCompiler directly (v2)
* to_block_literal and REQUEST_TYPE_COMPILE
* Rewrite command encoding
* Revert to_block_literal
* Maybe that's more readable to some people?
* Typo and comment about stdlib caching
* Update ops_metal.py
* Update ops_metal.py
* Update ops_metal.py
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-12-04 16:36:48 +08:00
George Hotz
09eac42fd6
cache indexed uops in st [pr] ( #8008 )
* cache indexed uops in st [pr]
* remove arg from range
2024-12-03 21:27:07 +08:00
George Hotz
b8bf5b2787
minor uop speedups [pr] ( #8002 )
* minor uop cleaner [pr]
* free uop creation speed by removing WeakValueDictionary
* a lil faster
* disable that test
* lines
* and it doesn't print non hit patterns
2024-12-03 17:04:48 +08:00
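For the WeakValueDictionary bullet, a sketch of the interning pattern that was removed (hypothetical `Node` class, not tinygrad's `UOp`): dedup guarantees `is`-equality of equal nodes, but every construction pays for key hashing plus weakref bookkeeping, which is the creation-speed cost the commit trades away.
```
import weakref

class Node:
  _cache = weakref.WeakValueDictionary()  # key -> live instance
  def __new__(cls, op: str, args: tuple):
    key = (op, args)
    if (hit := cls._cache.get(key)) is not None: return hit  # reuse existing node
    inst = super().__new__(cls)
    inst.op, inst.args = op, args
    cls._cache[key] = inst  # intern the fresh node
    return inst

assert Node("ADD", (1, 2)) is Node("ADD", (1, 2))  # interned: same object
```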
George Hotz
0905f87b68
hotfix: print only kernel time
2024-12-03 14:25:08 +08:00
chenyu
b91fa24387
script to run regressed sd conv on metal ( #7995 )
* script to run regressed sd conv on metal
this and other similar `conv2d + add` kernels accounted for most of the speed regression
* # ruff: noqa: E501
2024-12-02 15:34:27 -05:00
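In that spirit, a rough stand-in for such a script (shapes are illustrative, not the actual stable diffusion kernel dims): time one `conv2d + add` on METAL.
```
import time
from tinygrad import Tensor, Device

Device.DEFAULT = "METAL"  # or run with METAL=1 in the environment
x = Tensor.rand(1, 256, 32, 32).realize()
w = Tensor.rand(256, 256, 3, 3).realize()
b = Tensor.rand(1, 256, 32, 32).realize()
st = time.perf_counter()
(x.conv2d(w, padding=1) + b).realize()  # the conv2d + add pattern from the note
print(f"{(time.perf_counter()-st)*1e3:.2f} ms")
```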
qazal
b797aee720
uop global buf number tracking try 2 [pr] ( #7912 )
* uop buffer init small refactor [pr]
* add early
* this way it doesn't need late
* buffer_num
* itertools.count
* count from 0
* down to 380
2024-12-02 14:45:17 +08:00
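A sketch of the numbering scheme those bullets describe (names are illustrative): a module-level `itertools.count` starting at 0 hands each new buffer a stable, monotonically increasing number, with no global registry to rebuild.
```
import itertools

buffer_num = itertools.count(0)  # count from 0, per the bullets above

class Buffer:
  def __init__(self, size: int):
    self.size, self.num = size, next(buffer_num)

bufs = [Buffer(16), Buffer(32), Buffer(64)]
print([b.num for b in bufs])  # [0, 1, 2]
```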
George Hotz
cbcc1c20eb
second try at block linearize ( #7892 )
* second try at block linearize
* weeee, works for lil matmul
* it's so beautiful
* test tiny passes
* fix bugs
* combine matching BLOCKENDS
* wrapping
* test lin failures passes
* those failures were fake
* flip sort order
* fix ptx tests
* deal with store better
* dumb ptx fix
* expect less
* reduce lines
* reduce lines
* less lines and cleaner
* no defaultdict
* tighter
* simpler block_parent_count
2024-12-02 13:43:09 +08:00
George Hotz
6c1efb9a72
hotfix: amd gemv was flaky
2024-12-02 11:08:24 +08:00
chenyu
bb23469f93
lower conv threshold on red ( #7948 )
2024-11-28 13:31:06 -05:00
chenyu
f54508549f
don't search conv weight init in speed_v_theoretical ( #7943 )
2024-11-28 10:03:18 -05:00
geohotstan
cea5853cfa
add Tensor.scatter ( #7737 )
* working I think
* where are my onnx scatter tests??
* forward_only for now
* try if nan hack fix NV
* looks like issue is different... CUDA WHY
* oops that was wrong. Try if this fixes CUDA
* simpler multiply
* actually finish this up tmrw morning :x
* fix tests?
* improve tests
* improve test and implementation
* fix ruff
* complete but lots of expected failure...
* reviewed tests
* add onnx tests
* is this a processing op?
* add return type to indicate that it's not in-place
* final cleanups
* use or and improve tests a little
* add masked_index_select
* call it masked_setitem instead
* try
* FIXED
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-11-27 10:52:04 -05:00
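A usage sketch, assuming the torch-style `scatter(dim, index, src)` signature; as the bullets note, it returns a new tensor rather than writing in place.
```
from tinygrad import Tensor

src = Tensor([[1., 2., 3.], [4., 5., 6.]])
index = Tensor([[0, 1, 2], [2, 0, 1]])
out = Tensor.zeros(3, 3).scatter(0, index, src)  # out[index[i][j]][j] = src[i][j]
print(out.numpy())
# [[1. 5. 0.]
#  [0. 2. 6.]
#  [4. 0. 3.]]
```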
George Hotz
9d0038bccb
small changes from block linearizer [pr] ( #7888 )
* small changes from block linearizer [pr]
* fix test_gc
2024-11-25 15:27:04 +08:00
chenyu
5c5b1b994c
less flaky benchmarks ( #7855 )
JIT=2 for metal cifar with HALF, and lower tflops for nv test_gemm_4096. failures in https://github.com/tinygrad/tinygrad/actions/runs/11980239535/job/33404098428?pr=7830
2024-11-22 16:39:39 -05:00
qazal
9828277c03
view doesn't have buffer, fix the tests [pr] ( #7841 )
* view doesn't have buffer, fix the tests [pr]
* need assigns
2024-11-22 20:41:55 +08:00
George Hotz
e9ae2ccd09
_prg to match _buf [pr] ( #7816 )
2024-11-21 12:44:48 +08:00
George Hotz
c5d458ce02
BufferSpec and ProgramSpec [pr] ( #7814 )
...
* BufferSpec and ProgramSpec [pr]
* delete preallocate, it's unused
* Revert "delete preallocate, it's unused"
This reverts commit dcfcfaccde.
2024-11-21 12:18:05 +08:00
George Hotz
9df5a62c5e
unify to HWQueue [pr] ( #7812 )
* unify to HWCommandQueue [pr]
* all is HWQueue
2024-11-21 10:33:08 +08:00
chenyu
11cea00090
lower vs_theoretical conv tflops threshold for nv ( #7811 )
less flaky
2024-11-20 20:03:49 -05:00
George Hotz
eb0bb7dc0b
final dname to device [pr] ( #7806 )
* final dname to device [pr]
* oops, fix nv
2024-11-20 20:20:28 +08:00
George Hotz
bc977fec53
dname -> device [pr] ( #7804 )
* dname -> device [pr]
* a few more
* only one left
2024-11-20 17:57:14 +08:00
George Hotz
d71fe7faa5
rename allocator methods to not conflict [pr] ( #7788 )
* rename allocator methods to not conflict [pr]
* forgot those
* transfer + offset
2024-11-20 00:10:29 +08:00
qazal
1e31b5ba6b
hotfix: ctx doesn't impact process replay [pr] ( #7785 )
2024-11-19 20:17:01 +08:00
ignaciosica
597a239e28
Remove UnaryOps, BinaryOps, TernaryOps, MetaOps [pr] ( #7725 )
* remove unaryops
* remove ternaryops
* remove metaops
* hotfix
* remove binaryops
* hotfix: test_pattern_matcher
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-11-16 20:56:56 +08:00
qazal
e84d089ef1
delete ReduceOps, only use REDUCE_AXIS ( #7667 )
2024-11-13 19:04:27 +08:00
chenyu
1884f021e3
add conv3x3 to speed_v_theoretical ( #7658 )
* add conv3x3 to speed_v_theoretical
* show test duration
2024-11-12 16:41:56 -05:00
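For the "theoretical" side, the FLOP count is simple arithmetic. Reading the test name `test_conv_3x3_256_32_32_256_256` as bs=256, h=w=32, cin=cout=256 is my guess at the dim order; a dense conv does 2 FLOPs (multiply + add) per tap per output element.
```
bs, h, w, cin, cout, kh, kw = 256, 32, 32, 256, 256, 3, 3
flops = 2 * bs * cout * h * w * cin * kh * kw  # assumes same-padded output
print(f"{flops/1e12:.3f} TFLOP per conv")       # 0.309 TFLOP
print(f"{flops/1e12/2.6e-3:.0f} TFLOPS")        # ~119 TFLOPS at an illustrative 2.6 ms
```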
chenyu
962dafb467
use randn in speed_v_theoretical instead of rand ( #7656 )
* use randn in speed_v_theoretical instead of rand
this made green gemv 20% faster... but why?
* update threshold
2024-11-12 15:00:32 -05:00
chenyu
6159790ab8
add gemv to speed_v_theoretical ( #7654 )
* add gemv to speed_v_theoretical
getting ~300GB/s if we just count the memory of inputs and output
* better green numbers
* flip
2024-11-12 11:19:35 -05:00
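The "~300GB/s" above is this back-of-the-envelope, using the test_gemv_4096_16384 shapes; half precision (2 bytes/element) is my assumption. A gemv is bandwidth-bound, so bytes moved per second is the honest figure of merit.
```
n, m, bytes_per = 4096, 16384, 2
mem = (n * m + m + n) * bytes_per     # matrix + input vector + output vector
print(f"{mem/1e6:.1f} MB per call")   # 134.3 MB
print(f"{mem/0.45e-3/1e9:.0f} GB/s")  # ~298 GB/s at an illustrative 0.45 ms
```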
chenyu
99f29e50b2
update speed_v_theoretical numbers ( #7647 )
better amd after set compute profile
2024-11-11 20:05:13 -05:00
chenyu
773d5b60bf
beam benchmark tests ( #7638 )
* beam benchmark tests
* lower AMD number somehow
* less flaky
2024-11-11 18:11:18 -05:00
nimlgen
4d81b7952a
qcom match texture/sampler descriptors to OpenCL ( #7622 )
* qcom ioctl compare more regs
* bug fix
2024-11-11 21:56:51 +03:00
chenyu
8ca422e21a
script to compare kernel opt with BEAM ( #7604 )
interesting that on m1 max hcopt beats BEAM 2 about 20% of the time
2024-11-08 17:40:28 -05:00
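A sketch of the comparison idea (not the script itself): time the same kernel with hand-coded opts (BEAM=0, the default) and with a BEAM=2 search. `Context` and `BEAM` are real tinygrad helpers; in practice you would run each setting in a fresh process so compile caches don't interact.
```
import time
from tinygrad import Tensor
from tinygrad.helpers import Context

def bench(beam: int) -> float:
  with Context(BEAM=beam):
    a, b = Tensor.rand(1024, 1024).realize(), Tensor.rand(1024, 1024).realize()
    (a @ b).realize()              # first run compiles (and beam-searches)
    st = time.perf_counter()
    (a @ b).realize()              # second run times the chosen kernel
    return time.perf_counter() - st

print(f"hcopt: {bench(0)*1e3:.2f} ms, BEAM=2: {bench(2)*1e3:.2f} ms")
```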