10417 Commits

Author SHA1 Message Date
Roelof van Dijk
56b7fadc2f perf: skip type verify with -O (#6319) 2024-08-29 13:47:51 -07:00
qazal
7a08b881ed st_fixup explicit UOp init [run_process_replay] (#6320) 2024-08-29 23:21:10 +03:00
qazal
539654fbe1 graph_rewrite complexity tests [run_process_replay] (#6317) 2024-08-29 22:39:08 +03:00
qazal
07942ef361 Proposal: Better UOps.SWIZZLE (#6309)
* better UOps.SWIZZLE

* test_swizzle_rewrite

* add it to docs

* show a diff

* a lil more verbose

* two teeny notes

* hotfix: sink
2024-08-29 15:39:48 +03:00
qazal
8c50ef8b7c start uop docs (#6291)
* start uop docs

* only need show_labels

* sink comes first

* hotfix: invalid

* touchups

* 2 space indent works

* limit some buffer uops

* better BARRIER doc, Op -> UOp when it makes sense.

* make KernelInfo optional

* more work

relative links don't work

* this can be local in multi reduce+pads

* add UOps.SHAPETRACKER details

* UOps.CONST both types

* nit: local buffer isn't device Buffer, habit

* nit2: dtype -> DType
2024-08-29 15:22:39 +03:00
qazal
dd4e5f1c8d process replay rewrite (#6284)
* process replay rewrite

p2

* start some unittests + exceptions and exits

* shebang

* remove extra kernel init
2024-08-29 15:08:27 +03:00
pedro
7de4eac8f7 add support and tests for nearest modes in interpolate, adapt uint8 bilinear to torch implementation (#6308)
* add `nearest` mode to interpolate

matching pytorch `nearest`, which is known to be buggy

+ relevant TestOps

* add `nearest-exact` mode to interpolate

matching pytorch `nearest-exact`

+ relevant TestOps

* fix uint8 bilinear interpolation

by matching custom torch implementation

* implement uint8 lerp with torch interpolation trick

without converting it to float
2024-08-28 21:59:51 -07:00
George Hotz
638b4843da fix for metal ICB issue on M1/M2 [run_process_replay] (#6313)
* this is a working fix

* better comment

* repro
2024-08-28 21:31:14 -07:00
wozeparrot
cb61cfce24 feat: example and extra tweaks (#6310) 2024-08-28 19:26:11 -07:00
wozeparrot
ea5b7910b7 AMD support gfx103x (#5926) 2024-08-28 14:17:08 -07:00
gswangg
94a72d44d2 update CI tests in extra with UOp AST (#6290) 2024-08-28 22:26:50 +03:00
Tobias Fischer
3517aa89d9 sdxl batched inference fixes (#6293) 2024-08-28 07:44:58 -04:00
Roelof van Dijk
85591bd1ae no need for functools here (#6303) 2024-08-28 01:19:57 -07:00
nimlgen
b1e5343133 nv better error msg for p2p failure (#6301)
* nv better error msg for p2p failure

* linter

* from

* mypy
2024-08-28 01:40:45 +03:00
nimlgen
ac303146ca nv sure qmd addr less than 40bits (#6288) 2024-08-27 20:47:38 +03:00
George Hotz
5ed6c6ef3e hotfix: 220V 15A -> 220V 20A 2024-08-27 10:20:43 -07:00
qazal
ec34d9ee36 start benchmarking ast graph rewrite (#6297)
* ast_rewrite to ctx var

* add external_benchmark_ast

* refactor to asts

* track lazybuffers

* more work

* record checkpoint

* cleanup
2024-08-27 18:18:44 +03:00
qazal
552fbd5527 update llm.c with UOp ast [run_process_replay] (#6296) 2024-08-27 15:04:54 +03:00
Tobias Fischer
211bfb6d8a fixed batched clip computation (#6292) 2024-08-26 20:48:15 -04:00
ignaciosica
3918f6eea0 refactor amd render_kernel (#6223)
* refactor amd render_kernel

* fix spacing

* add half alias back

* use itemsize * 8 instead of fixed values

* reverting because it broke, as 32 was no longer the default

* remove comment

* remove nested tuples

* hotfix: prefix.append

* hotfix2: is not None

* more diff cleanups

* hotfix 4: spacing changes must not be in the same diff

* revert wmma dtype rendering

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-08-27 00:28:36 +08:00
ignaciosica
3132449086 refactor _make_{cuda/clang}_dtype into render_vector_prefix (#6287) 2024-08-26 09:14:44 -07:00
Max-We
ab2714423b Add einsum tests (#6286)
Co-authored-by: Maximilian Weichart <maximilian.weichart@icloud.com>
2024-08-26 09:09:25 -07:00
chenyu
b76f0c875e lazy const fold idiv 1 (#6285) 2024-08-26 10:29:59 -04:00
chenyu
af7c04ff57 Tensor.__floordiv__ (#6283)
support Tensor.__floordiv__ and friends
2024-08-26 09:43:40 -04:00
qazal
d2f8eeed2e make [compare_schedule] the default [run_process_replay] (#6273)
* make [compare_schedule] the default

* capture ctx

* logging

* set capture to false
2024-08-26 21:40:03 +08:00
qazal
067aeaeb2f single arange fusion with graph rewrite (#6160) 2024-08-26 18:18:16 +08:00
qazal
b4381e9777 uop output_st is Optional [run_process_replay] (#6282) 2024-08-26 17:58:55 +08:00
qazal
1c0456af89 add UOps.SWIZZLE (#6271)
* add UOps.SWIZZLE

* flip swizzle init

* generic st_fixup
2024-08-26 16:08:51 +08:00
CaltropHungerton
002f60b4c3 fix intel wmma flop counting, add flop counting tests for different tensor cores (#6192)
* fix wmma flop counting on intel, add count tests

* half

* add half gemm

* Update test.yml

* one test

* Update test_uops_stats.py

* Update test_uops_stats.py

* Update test_uops_stats.py

* smaller matrix, use unittest skipUnless decorator
2024-08-25 18:37:05 -07:00
Tobias Fischer
331b0f5477 new clip gather (#6277) 2024-08-25 19:27:24 -04:00
qazal
f0cc8ca5f2 generic st_fixup in scheduler graph rewrite [compare_schedule] (#6278) 2024-08-25 11:02:17 +03:00
qazal
70015bd89c move permute_reduces to uop movementops [run_process_replay] (#6272) 2024-08-25 10:25:51 +03:00
chenyu
b86907c6c7 UOp.const(x.dtype, y) -> x.const(y) [run_process_replay] (#6276) 2024-08-24 21:39:50 -04:00
chenyu
00282afa41 identity element of binary ops (#6275)
helper for the number reduce acc is inited to (0 for ADD, 1 for MUL and -inf for MAX)
2024-08-24 18:10:19 -04:00
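
Note on 00282afa41: the identity element is the value that leaves any operand unchanged under the op, which is why a reduce accumulator can safely start from it. The sketch below illustrates the idea in plain Python; the `IDENTITY` table and `reduce_with_identity` are hypothetical names for illustration only and are not tinygrad's actual helper or API.

```python
import math
from functools import reduce

# Hypothetical lookup table (not tinygrad's real helper): the identity
# element for each reduce op, as listed in the commit message.
IDENTITY = {"ADD": 0, "MUL": 1, "MAX": -math.inf}

# Plain-Python stand-ins for the binary ops.
OPS = {
    "ADD": lambda a, b: a + b,
    "MUL": lambda a, b: a * b,
    "MAX": max,
}

def reduce_with_identity(op, values):
    """Fold values with op, starting the accumulator at op's identity element."""
    return reduce(OPS[op], values, IDENTITY[op])
```

Starting from the identity means an empty reduction returns a well-defined value, and a non-empty one is unaffected by the initial accumulator; initializing a MUL accumulator to 0 instead would force every product to 0.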
qazal
ee245b48a9 refactor reduceop swizzling (prep for UOps.SWIZZLE) [compare_schedule] (#6269) 2024-08-24 18:17:19 +03:00
gswangg
3cf507ae7f remove extra.ops and LazyOp support from Kernel (#6267)
* remove extra.ops and BufferOps

* remove extra.ops and LazyOp support in Kernel
2024-08-24 16:44:38 +03:00
qazal
ccb05d8baa fixup neg tests [run_process_replay] (#6268) 2024-08-24 16:35:43 +03:00
gswangg
ea76b93814 migrate test_linearizer_dumb.py to UOp AST (#6241)
* add imports and update test_unmerged_ifs to UOp AST

* test_max_simplify_and_cancel

* test_expander_new_srcs

* test_llama_embedding

* test_unaligns_idxs

* test_unrolled_float4_align

* test_upcasted_stores_out_of_order

* remove LazyOp

* remove extra/ops and replace ReduceOps.SUM with BinaryOps.ADD
2024-08-24 16:27:29 +03:00
gswangg
e44653e25a migrate test_linearizer_failures.py to UOp AST (#6240)
* add imports and update test_failure_1 to UOp AST

* update test_failure_2 with UOp AST

* update test_failure_3

* test_failure_5

* test_failure_6

* test_failure_7

* test_failure_8

* test_failure_9

* test_failure_10

* test_failure_11

* test_failure_12

* test_failure_12_multireduce

* uncomment skip and migrate test_failure_13

* test_failure_14

* test_failure_15

* test_failure_16

* test_failure_17

* test_failure_18

* test_failure_19

* test_failure_20

* test_failure_21

* test_failure_22

* test_failure_23

* test_failure_24

* test_failure_25

* test_failure_26

* test_failure_27

* test_failure_28

* test_failure_29

* test_failure_30

* test_failure_31

* test_failure_32

* test_failure_33

* test_failure_34

* test_failure_36

* test_failure_37

* test_failure_38

* test_update_39

* test_failure_40

* test_failure_41

* test_failure_42

* test_failure_43

* test_failure_44

* test_failure_45

* test_failure_46

* test_failure_47

* test_failure_48

* test_failure_49

* test_failure_50

* remove LazyOp

* reskip test_failure_22

* remove extra/ops

* replace ReduceOps with BinaryOps

* fixup that import

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-08-24 16:26:58 +03:00
qazal
1b4ad982e5 share REDUCE_ALU in multi and schedule [run_process_replay] (#6266) 2024-08-24 16:16:38 +03:00
gswangg
1dc6040877 migrate test_search.py to UOp AST (#6245)
* add imports and update test_kernel_count with UOp AST

* test_filter_global_buffer

* remove LazyOp

* remove extra.ops and ReduceOps

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-08-24 16:13:53 +03:00
qazal
ae23540d6e refresh process replay schedule ref in reset.py (#6265) 2024-08-24 16:12:51 +03:00
gswangg
7be5eede71 migrate test_linearizer_overflows.py to UOp AST (#6244)
* add imports, remove ConstBuffer, and update test_overflow_1 with UOp AST

* test_overflow_2

* test_overflow_3

* test_overflow_4

* test_overflow_5

* test_overflow_6

* test_overflow_7

* TestLinearizerOverflowAlt::test_overflow_1

* TestLinearizerOverflowAlt::test_overflow_2

* remove LazyOp

* remove extra.ops

* remove ReduceOps
2024-08-24 16:10:29 +03:00
chenyu
943ab97d24 fix Tensor.prod for multitensor (#6264) 2024-08-24 08:52:24 -04:00
qazal
bcb2f1caa3 init REDUCE_AXIS with BinaryOps (#6256)
* REDUCE_AXIS arg with BinaryOps

* more work in kernel.py
fixup sops.gz

* fix TestGraphRewriteEfficiency
2024-08-24 11:28:41 +03:00
chenyu
da5cf11859 fix acc init value for MUL (#6263) 2024-08-23 23:19:44 -04:00
wozeparrot
a7bf20c7cd feat: updated tinybox docs (#6261)
* feat: updated tinybox docs

* fix: grammar
2024-08-23 18:27:46 -07:00
George Hotz
26498b322e add BEAM to external_benchmark_schedule.py 2024-08-23 18:10:46 -07:00
George Hotz
53a73038e3 hotfix: TestGraphRewriteEfficiency.test_create_many_uops 2024-08-23 15:51:57 -07:00
George Hotz
7c3ba3fa8a improve match stats + custom early reject [run_process_replay] (#6260)
* improve match stats [run_process_replay]

* custom_early_reject
2024-08-23 15:28:57 -07:00