Roelof van Dijk
56b7fadc2f
perf: skip type verify with -O ( #6319 )
2024-08-29 13:47:51 -07:00
qazal
7a08b881ed
st_fixup explicit UOp init [run_process_replay] ( #6320 )
2024-08-29 23:21:10 +03:00
qazal
539654fbe1
graph_rewrite complexity tests [run_process_replay] ( #6317 )
2024-08-29 22:39:08 +03:00
qazal
07942ef361
Proposal: Better UOps.SWIZZLE ( #6309 )
...
* better UOps.SWIZZLE
* test_swizzle_rewrite
* add it to docs
* show a diff
* a lil more verbose
* two teeny notes
* hotfix: sink
2024-08-29 15:39:48 +03:00
qazal
8c50ef8b7c
start uop docs ( #6291 )
...
* start uop docs
* only need show_labels
* sink comes first
* hotfix: invalid
* touchups
* 2 space indent works
* limit some buffer uops
* better BARRIER doc, Op -> UOp when it makes sense.
* make KernelInfo optional
* more work
relative links don't work
* this can be local in multi reduce+pads
* add UOps.SHAPETRACKER details
* UOps.CONST both types
* nit: local buffer isn't device Buffer, habit
* nit2: dtype -> DType
2024-08-29 15:22:39 +03:00
qazal
dd4e5f1c8d
process replay rewrite ( #6284 )
...
* process replay rewrite
p2
* start some unittests + exceptions and exits
* shebang
* remove extra kernel init
2024-08-29 15:08:27 +03:00
pedro
7de4eac8f7
add support and tests for nearest modes in interpolate, adapt uint8 bilinear to torch implementation ( #6308 )
...
* add `nearest` mode to interpolate
matching pytorch `nearest`, which is known to be buggy
+ relevant TestOps
* add `nearest-exact` mode to interpolate
matching pytorch `nearest-exact`
+ relevant TestOps
* fix uint8 bilinear interpolation
by matching custom torch implementation
* implement uint8 lerp with torch interpolation trick
without converting it to float
2024-08-28 21:59:51 -07:00
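For readers unfamiliar with the two modes named in this commit, here is a rough 1-D sketch of how `nearest` and `nearest-exact` index selection commonly differ. It is not taken from the commit: the function names are hypothetical, and the integer formulas are a paraphrase of floor(dst * in/out) and floor((dst + 0.5) * in/out), written with integers to avoid float rounding.

```python
from typing import Callable, List

def nearest_index(dst: int, in_size: int, out_size: int) -> int:
    # Legacy-style "nearest": floor(dst * in_size / out_size), no half-pixel offset.
    return min((dst * in_size) // out_size, in_size - 1)

def nearest_exact_index(dst: int, in_size: int, out_size: int) -> int:
    # "nearest-exact": floor((dst + 0.5) * in_size / out_size), i.e. sample at pixel centers.
    return min(((2 * dst + 1) * in_size) // (2 * out_size), in_size - 1)

def resize_1d(xs: List[int], out_size: int, index_fn: Callable[[int, int, int], int]) -> List[int]:
    # Hypothetical helper: gather one source element per output position.
    return [xs[index_fn(i, len(xs), out_size)] for i in range(out_size)]

src = [10, 20, 30, 40]
print(resize_1d(src, 6, nearest_index))        # [10, 10, 20, 30, 30, 40]
print(resize_1d(src, 6, nearest_exact_index))  # [10, 20, 20, 30, 40, 40]
```

The two modes pick different source elements for the same output position, which is why each needs its own tests.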
George Hotz
638b4843da
fix for metal ICB issue on M1/M2 [run_process_replay] ( #6313 )
...
* this is a working fix
* better comment
* repro
2024-08-28 21:31:14 -07:00
wozeparrot
cb61cfce24
feat: example and extra tweaks ( #6310 )
2024-08-28 19:26:11 -07:00
wozeparrot
ea5b7910b7
AMD support gfx103x ( #5926 )
2024-08-28 14:17:08 -07:00
gswangg
94a72d44d2
update CI tests in extra with UOp AST ( #6290 )
2024-08-28 22:26:50 +03:00
Tobias Fischer
3517aa89d9
sdxl batched inference fixes ( #6293 )
2024-08-28 07:44:58 -04:00
Roelof van Dijk
85591bd1ae
no need for functools here ( #6303 )
2024-08-28 01:19:57 -07:00
nimlgen
b1e5343133
nv better error msg for p2p failure ( #6301 )
...
* nv better error msg for p2p failure
* linter
* from
* mypy
2024-08-28 01:40:45 +03:00
nimlgen
ac303146ca
nv sure qmd addr less than 40bits ( #6288 )
2024-08-27 20:47:38 +03:00
George Hotz
5ed6c6ef3e
hotfix: 220V 15A -> 220V 20A
2024-08-27 10:20:43 -07:00
qazal
ec34d9ee36
start benchmarking ast graph rewrite ( #6297 )
...
* ast_rewrite to ctx var
* add external_benchmark_ast
* refactor to asts
* track lazybuffers
* more work
* record checkpoint
* cleanup
2024-08-27 18:18:44 +03:00
qazal
552fbd5527
update llm.c with UOp ast [run_process_replay] ( #6296 )
2024-08-27 15:04:54 +03:00
Tobias Fischer
211bfb6d8a
fixed batched clip computation ( #6292 )
2024-08-26 20:48:15 -04:00
ignaciosica
3918f6eea0
refactor amd render_kernel ( #6223 )
...
* refactor amd render_kernel
* fix spacing
* add half alias back
* use itemsize * 8 instead of fixed values
* reverting because it broke, as 32 was no longer the default
* remove comment
* remove nested tuples
* hotfix: prefix.append
* hotfix2: is not None
* more diff cleanups
* hotfix 4: spacing changes must not be in the same diff
* revert wmma dtype rendering
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2024-08-27 00:28:36 +08:00
ignaciosica
3132449086
refactor _make_{cuda/clang}_dtype into render_vector_prefix ( #6287 )
2024-08-26 09:14:44 -07:00
Max-We
ab2714423b
Add einsum tests ( #6286 )
...
Co-authored-by: Maximilian Weichart <maximilian.weichart@icloud.com>
2024-08-26 09:09:25 -07:00
chenyu
b76f0c875e
lazy const fold idiv 1 ( #6285 )
2024-08-26 10:29:59 -04:00
chenyu
af7c04ff57
Tensor.__floordiv__ ( #6283 )
...
support Tensor.__floordiv__ and friends
2024-08-26 09:43:40 -04:00
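"And friends" presumably refers to the reflected and in-place variants of the operator. The sketch below only illustrates Python's operator protocol on a hypothetical `Boxed` wrapper; it is not tinygrad's `Tensor` code, and it uses Python's integer `//`, so it says nothing about how the commit rounds negative values.

```python
class Boxed:
    """Hypothetical integer wrapper, used only to show the three dunder methods."""
    def __init__(self, value: int):
        self.value = value

    @staticmethod
    def _unwrap(other) -> int:
        # Accept either another Boxed or a plain int.
        return other.value if isinstance(other, Boxed) else other

    def __floordiv__(self, other):   # Boxed // x
        return Boxed(self.value // self._unwrap(other))

    def __rfloordiv__(self, other):  # x // Boxed, tried when x does not handle it
        return Boxed(self._unwrap(other) // self.value)

    def __ifloordiv__(self, other):  # Boxed //= x, updates in place
        self.value //= self._unwrap(other)
        return self

a = Boxed(7) // 2
b = 7 // Boxed(2)
c = Boxed(9)
c //= 4
print(a.value, b.value, c.value)  # 3 3 2
```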
qazal
d2f8eeed2e
make [compare_schedule] the default [run_process_replay] ( #6273 )
...
* make [compare_schedule] the default
* capture ctx
* logging
* set capture to false
2024-08-26 21:40:03 +08:00
qazal
067aeaeb2f
single arange fusion with graph rewrite ( #6160 )
2024-08-26 18:18:16 +08:00
qazal
b4381e9777
uop output_st is Optional [run_process_replay] ( #6282 )
2024-08-26 17:58:55 +08:00
qazal
1c0456af89
add UOps.SWIZZLE ( #6271 )
...
* add UOps.SWIZZLE
* flip swizzle init
* generic st_fixup
2024-08-26 16:08:51 +08:00
CaltropHungerton
002f60b4c3
fix intel wmma flop counting, add flop counting tests for different tensor cores ( #6192 )
...
* fix wmma flop counting on intel, add count tests
* half
* add half gemm
* Update test.yml
* one test
* Update test_uops_stats.py
* Update test_uops_stats.py
* Update test_uops_stats.py
* smaller matrix, use unittest skipUnless decorator
2024-08-25 18:37:05 -07:00
Tobias Fischer
331b0f5477
new clip gather ( #6277 )
2024-08-25 19:27:24 -04:00
qazal
f0cc8ca5f2
generic st_fixup in scheduler graph rewrite [compare_schedule] ( #6278 )
2024-08-25 11:02:17 +03:00
qazal
70015bd89c
move permute_reduces to uop movementops [run_process_replay] ( #6272 )
2024-08-25 10:25:51 +03:00
chenyu
b86907c6c7
UOp.const(x.dtype, y) -> x.const(y) [run_process_replay] ( #6276 )
2024-08-24 21:39:50 -04:00
chenyu
00282afa41
identity element of binary ops ( #6275 )
...
helper for the number reduce acc is inited to (0 for ADD, 1 for MUL and -inf for MAX)
2024-08-24 18:10:19 -04:00
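The commit body above describes the helper's contract: the accumulator of a reduce starts at the identity element of its binary op. A minimal sketch of that idea follows, with a hypothetical `Op` enum and function names standing in for the real ones, which this log does not show.

```python
import math
from enum import Enum, auto
from typing import Iterable

class Op(Enum):
    # Hypothetical stand-in for the ops named in the commit message.
    ADD = auto()
    MUL = auto()
    MAX = auto()

def identity_element(op: Op) -> float:
    # 0 for ADD, 1 for MUL, -inf for MAX, as listed in the commit body.
    return {Op.ADD: 0.0, Op.MUL: 1.0, Op.MAX: -math.inf}[op]

def reduce_with(op: Op, xs: Iterable[float]) -> float:
    # Start the accumulator at the identity so it never changes the result.
    fn = {Op.ADD: lambda a, b: a + b, Op.MUL: lambda a, b: a * b, Op.MAX: max}[op]
    acc = identity_element(op)
    for x in xs:
        acc = fn(acc, x)
    return acc

print(reduce_with(Op.ADD, [1, 2, 3]))  # 6.0
print(reduce_with(Op.MUL, [2, 3]))     # 6.0
print(reduce_with(Op.MAX, [-5, -2]))   # -2
```

Starting MUL at 0 would zero out every product, and starting MAX at 0 would give the wrong answer for all-negative inputs, which is why the initial value depends on the op.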
qazal
ee245b48a9
refactor reduceop swizzling (prep for UOps.SWIZZLE) [compare_schedule] ( #6269 )
2024-08-24 18:17:19 +03:00
gswangg
3cf507ae7f
remove extra.ops and LazyOp support from Kernel ( #6267 )
...
* remove extra.ops and BufferOps
* remove extra.ops and LazyOp support in Kernel
2024-08-24 16:44:38 +03:00
qazal
ccb05d8baa
fixup neg tests [run_process_replay] ( #6268 )
2024-08-24 16:35:43 +03:00
gswangg
ea76b93814
migrate test_linearizer_dumb.py to UOp AST ( #6241 )
...
* add imports and update test_unmerged_ifs to UOp AST
* test_max_simplify_and_cancel
* test_expander_new_srcs
* test_llama_embedding
* test_unaligns_idxs
* test_unrolled_float4_align
* test_upcasted_stores_out_of_order
* remove LazyOp
* remove extra/ops and replace ReduceOps.SUM with BinaryOps.ADD
2024-08-24 16:27:29 +03:00
gswangg
e44653e25a
migrate test_linearizer_failures.py to UOp AST ( #6240 )
...
* add imports and update test_failure_1 to UOp AST
* update test_failure_2 with UOp AST
* update test_failure_3
* test_failure_5
* test_failure_6
* test_failure_7
* test_failure_8
* test_failure_9
* test_failure_10
* test_failure_11
* test_failure_12
* test_failure_12_multireduce
* uncomment skip and migrate test_failure_13
* test_failure_14
* test_failure_15
* test_failure_16
* test_failure_17
* test_failure_18
* test_failure_19
* test_failure_20
* test_failure_21
* test_failure_22
* test_failure_23
* test_failure_24
* test_failure_25
* test_failure_26
* test_failure_27
* test_failure_28
* test_failure_29
* test_failure_30
* test_failure_31
* test_failure_32
* test_failure_33
* test_failure_34
* test_failure_36
* test_failure_37
* test_failure_38
* test_update_39
* test_failure_40
* test_failure_41
* test_failure_42
* test_failure_43
* test_failure_44
* test_failure_45
* test_failure_46
* test_failure_47
* test_failure_48
* test_failure_49
* test_failure_50
* remove LazyOp
* reskip test_failure_22
* remove extra/ops
* replace ReduceOps with BinaryOps
* fixup that import
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2024-08-24 16:26:58 +03:00
qazal
1b4ad982e5
share REDUCE_ALU in multi and schedule [run_process_replay] ( #6266 )
2024-08-24 16:16:38 +03:00
gswangg
1dc6040877
migrate test_search.py to UOp AST ( #6245 )
...
* add imports and update test_kernel_count with UOp AST
* test_filter_global_buffer
* remove LazyOp
* remove extra.ops and ReduceOps
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-08-24 16:13:53 +03:00
qazal
ae23540d6e
refresh process replay schedule ref in reset.py ( #6265 )
2024-08-24 16:12:51 +03:00
gswangg
7be5eede71
migrate test_linearizer_overflows.py to UOp AST ( #6244 )
...
* add imports, remove ConstBuffer, and update test_overflow_1 with UOp AST
* test_overflow_2
* test_overflow_3
* test_overflow_4
* test_overflow_5
* test_overflow_6
* test_overflow_7
* TestLinearizerOverflowAlt::test_overflow_1
* TestLinearizerOverflowAlt::test_overflow_2
* remove LazyOp
* remove extra.ops
* remove ReduceOps
2024-08-24 16:10:29 +03:00
chenyu
943ab97d24
fix Tensor.prod for multitensor ( #6264 )
2024-08-24 08:52:24 -04:00
qazal
bcb2f1caa3
init REDUCE_AXIS with BinaryOps ( #6256 )
...
* REDUCE_AXIS arg with BinaryOps
* more work in kernel.py
fixup sops.gz
* fix TestGraphRewriteEfficiency
2024-08-24 11:28:41 +03:00
chenyu
da5cf11859
fix acc init value for MUL ( #6263 )
2024-08-23 23:19:44 -04:00
wozeparrot
a7bf20c7cd
feat: updated tinybox docs ( #6261 )
...
* feat: updated tinybox docs
* fix: grammar
2024-08-23 18:27:46 -07:00
George Hotz
26498b322e
add BEAM to external_benchmark_schedule.py
2024-08-23 18:10:46 -07:00
George Hotz
53a73038e3
hotfix: TestGraphRewriteEfficiency.test_create_many_uops
2024-08-23 15:51:57 -07:00
George Hotz
7c3ba3fa8a
improve match stats + custom early reject [run_process_replay] ( #6260 )
...
* improve match stats [run_process_replay]
* custom_early_reject
2024-08-23 15:28:57 -07:00