Commit Graph

2190 Commits

Author · SHA1 · Message · Date
chenyu
71a64d8252 UOps.MUL bound when one is negative (#5781)
* UOps.MUL bound when one is negative

also one more distribute_mul rule

* don't always expand
2024-07-28 19:02:47 -04:00
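For context, the bound rule this commit tightens is plain interval arithmetic: when one factor can be negative, the extremes of the product come from the endpoint products rather than from pairing mins with mins. A minimal standalone sketch (not tinygrad's actual UOp code):

```python
# Standalone sketch of product bounds over intervals; a negative factor
# flips which endpoint products become the extremes.
def mul_bounds(amin, amax, bmin, bmax):
    candidates = [amin * bmin, amin * bmax, amax * bmin, amax * bmax]
    return min(candidates), max(candidates)

print(mul_bounds(0, 3, -2, -2))  # (-6, 0): x in [0, 3] multiplied by -2
```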
qazal
b775db6b60 high-level benchmark timing diff (#5776)
* high level timings

benchmark times

fix defs

* use the name map

* skip last task
2024-07-28 23:42:57 +03:00
chenyu
600a39771d fix Tensor.arange if (stop-start) and step have different signs (#5775) 2024-07-28 14:34:10 -04:00
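The expected semantics here presumably match Python's built-in range: if (stop - start) and step point in different directions, the result is empty. A hedged usage sketch (the tinygrad output is an expectation, not captured from this commit):

```python
from tinygrad import Tensor

print(list(range(5, 0, 1)))            # [] -- reference semantics
print(Tensor.arange(5, 0, 1).shape)    # expected: (0,), not a bogus range
```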
David González Martínez
d0fd84e617 feat: allow passing gradient to .backward() to compute vjp (#5771)
* feat: allow passing gradient to .backward() to compute vjp

* fix

* refactor

* fix trailing whitespace
2024-07-28 11:13:18 -07:00
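A minimal sketch of what the new call style enables, assuming .backward() accepts an explicit gradient for non-scalar outputs (analogous to PyTorch's vector-Jacobian product); the printed values are the expected result, not output captured from this commit:

```python
from tinygrad import Tensor

x = Tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * x                      # non-scalar output; Jacobian is diag(2x)
v = Tensor([1.0, 1.0, 1.0])    # the "vector" in the vjp
y.backward(v)                  # previously .backward() required a scalar output
print(x.grad.numpy())          # expected: [2. 4. 6.]
```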
qazal
e0e7293b0a make process replay unique in retries [run_process_replay] (#5773) 2024-07-28 20:44:15 +03:00
qazal
95dda8dadf more unmatching vectorize/gep asserts [run_process_replay] (#5760)
* merge vectorize/gep rules [run_process_replay]

* assert dtypes

* src=

* float2=(float4.x,float4.y)
2024-07-28 15:08:54 +08:00
chenyu
bfbd7c5461 more generic UOp mul mod folding (#5765) 2024-07-27 20:20:35 -04:00
chenyu
80c6475757 update test_uop_symbolic to test UOp min and max (#5764)
covers #5750, #5748, #5741
2024-07-27 19:53:21 -04:00
nimlgen
ed1d784077 test profiler timer sync across devs (#5751)
* test profiler timer sync across devs

* more correct

* typo
2024-07-27 16:47:37 +03:00
qazal
3e49d86c01 process replay diffs 3 things now (#5731)
* github api infra

* process replay is 3 parts now

* parse benchmarks

* add gh_token

* complete diff

* move process replay tests

* last successful run

* add tempdir

* skip master
2024-07-27 12:52:20 +03:00
qazal
57b4a8e98d assert process replay asserts (#5737)
* assert process replay asserts

* one ci job is fine

* test: Revert "separate process replay main loop (#5734)"

This reverts commit 94d578396f.

* mac sed needs that

* Revert "test: Revert "separate process replay main loop (#5734)""

This reverts commit e4ad7684d5.

* disable process replay capture

* save time

* amd is tiny

* send to /dev/null
2024-07-27 12:07:50 +03:00
George Hotz
f8972ace38 test flops (and allow wide ALU in UOps) [run_process_replay] (#5749)
* flops test in external_test_speed_theoretical.py

* test speed theo

* min SZMAX

* allow wide ALU for things that support it

* needed for mypy
2024-07-26 21:07:28 -07:00
George Hotz
2fde2d2914 hotfix: external_test_speed_theoretical works on 24GB 2024-07-26 18:41:52 -07:00
George Hotz
829262a5ee add external_test_speed_theoretical 2024-07-26 17:45:22 -07:00
kormann
a5ede535ef NOp field name [run_process_replay] (#5742)
* rm def name

* add field name
2024-07-26 18:45:59 -04:00
George Hotz
c50e374bb6 multiple locals + get_kernel_modifier + fix valid (#5739)
* multiple locals + get_kernel_modifier + fix valid

* fix test pattern matcher
2024-07-26 15:10:10 -07:00
chenyu
dc7483ee6f UOp simple div folding (#5740)
made UOp.divides return Optional[quotient] and used it for simple div folding
2024-07-26 17:14:32 -04:00
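The shape of that contract, sketched standalone with plain ints (tinygrad's UOp.divides works on symbolic expressions, so this only illustrates the Optional[quotient] return):

```python
from typing import Optional

def divides(x: int, c: int) -> Optional[int]:
    # return the exact quotient when c divides x, else None so callers skip the fold
    return x // c if c != 0 and x % c == 0 else None

assert divides(12, 4) == 3
assert divides(12, 5) is None
```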
chenyu
671259417f reuse UOp __repr__ for NOp (#5738) 2024-07-26 16:59:55 -04:00
kormann
b0c1dba299 named UOp class "NOP" [run_process_replay] (#5728)
* NOP

* fix const + simplify compile

* rm VAR for NOOP

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-07-26 13:25:53 -07:00
George Hotz
4df46eac67 clean up tensor cores [run_process_replay] (#5736)
* clean up tensor cores [run_process_replay]

* remove tuple(wmma_sz), self.opts.device

* remove tls, leave DEVICE
2024-07-26 13:21:23 -07:00
qazal
94d578396f separate process replay main loop (#5734)
* separate process replay main loop

* [run_process_replay]

* add kernel_changed

* test with [run_process_replay]

* revert temp [run_process_replay]
2024-07-26 21:43:08 +03:00
chenyu
a4e9ebc68a update test_uop_symbolic (#5733)
enabled more passing tests
2024-07-26 13:46:09 -04:00
chenyu
2cc55a3095 UOp simple mul add div fold (#5726) 2024-07-25 22:00:30 -04:00
chenyu
5521b6d437 UOp simple mul-add-lt fold (#5721) 2024-07-25 20:49:38 -04:00
qazal
1b53207b4f revert isolated dags scheduling (#5724) 2024-07-25 19:45:12 -04:00
chenyu
845b0d1c9d UOp more generic div folding (#5722)
old: `x // c` can fold if `0 <= x.vmin <= x.vmax < c`
new: `x // c` can fold if `0 < c and x.vmin // c == x.vmax // c`
2024-07-25 17:49:14 -04:00
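A worked check of why the new condition is strictly more general: with x bounded to [vmin, vmax] and c > 0, x // c is a constant whenever both bounds land in the same quotient bucket, even if vmax >= c or vmin < 0.

```python
def can_fold_old(vmin, vmax, c): return 0 <= vmin <= vmax < c
def can_fold_new(vmin, vmax, c): return 0 < c and vmin // c == vmax // c

# x in [4, 7], c = 4: the old rule rejects (vmax >= c), the new rule folds x//4 to 1
print(can_fold_old(4, 7, 4), can_fold_new(4, 7, 4))  # False True
```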
chenyu
a82815262c more test_pattern_matcher fixups (#5714) 2024-07-25 14:12:21 -04:00
chenyu
05e02ddfb3 fixup test_pattern_matcher (#5712) 2024-07-25 13:48:52 -04:00
qazal
9ceb3a3d1f beautiful_mnist -4.3% kernels (#5709)
* add is_complete

* partially delete forced_realized

* p2

* start

* refactor to can_group

* remove steps

* _get_inputs is nicer

* fix the cache

* cache is dict now

* rename to group
2024-07-25 20:30:49 +03:00
kormann
1e2eac755d Fix repr upat (#5705)
* test

* fix

* x fix

* simpler

* rm extra space
2024-07-25 12:05:48 -04:00
qazal
1c992de257 hotfix: compare_schedule defaults to false (#5707) 2024-07-25 17:08:28 +03:00
qazal
489cda827a more scheduler process replay tooling (#5706)
* more scheduler process replay tooling

* refactor to compare_schedule
2024-07-25 15:47:18 +03:00
qazal
4e070a2c89 start work on indexing fusion (#5590)
* start base

* the views add up

base reduceop st:
ShapeTracker(views=(View(shape=(60000, 1), strides=(1, 0), offset=0, mask=None, contiguous=True),))

top st:

ShapeTracker(views=(View(shape=(512, 6000, 1, 28, 28, 10), strides=(0, 1, 0, 0, 0, 6000), offset=0, mask=None, contiguous=False), View(shape=(512, 6000, 1, 28, 28, 10), strides=(47040000, 784, 0, 28, 1, 4704000), offset=0, mask=None, contiguous=False)))

merged buf.st+st:
ShapeTracker(views=(View(shape=(512, 6000, 1, 28, 28, 10), strides=(0, 1, 0, 0, 0, 6000), offset=0, mask=None, contiguous=False), View(shape=(512, 6000, 1, 28, 28, 10), strides=(47040000, 784, 0, 28, 1, 4704000), offset=0, mask=None, contiguous=False)))

* p1

* some cleanups

* more cleanups

* one kernel

* more

* late fuse arange

* less lines

* more work

* fix st strides 1

* update test_schedule, start argmax

* test_tiny_argmax

* add FUSE_ARANGE

* more cleanup

* add utils

* reduce merging

* fix axis and fold if needed

* more fusion

* need to figure this out

* now fixing all of these

* todos+save a line

* ready for p1
2024-07-25 13:23:38 +03:00
nimlgen
08f47d7dc3 more info on failure 41 (#5704) 2024-07-25 12:14:28 +03:00
nimlgen
69d4f474d8 amd resnet pf (#5703) 2024-07-25 11:21:22 +03:00
chenyu
46e1151c02 UOp more generic mul -> mod folding (#5698) 2024-07-24 21:41:25 -04:00
chenyu
66a9c372af UOp mod reduction (#5697) 2024-07-24 20:36:00 -04:00
chenyu
8648fb2636 UOp vmin/vmax on ADD (#5689) 2024-07-24 19:09:42 -04:00
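For reference, ADD bound propagation is the simplest interval rule: the bounds of a sum are the sums of the bounds (a standalone sketch, not the UOp implementation):

```python
def add_bounds(amin, amax, bmin, bmax):
    return amin + bmin, amax + bmax

print(add_bounds(0, 3, -2, 5))  # (-2, 8)
```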
chenyu
85710e86cb UOps div folding (#5690)
#5689, with just div folding and new test cases
2024-07-24 14:21:44 -04:00
chenyu
a7a77dfd83 UOp mul lt fold (#5677) 2024-07-24 02:49:25 -04:00
chenyu
4e85761d40 UOp mod folding (#5668) 2024-07-24 00:10:47 -04:00
George Hotz
053550c3f3 remove MERGE opt, cleanup wmma upcast (#5669)
* remove MERGE opt, cleanup wmma upcast

* upcast first

* fix broken vectorize folding rule
2024-07-23 20:43:42 -07:00
chenyu
3060e0be4f add vmin vmax of SPECIAL (#5670)
* add vmin vmax of SPECIAL

folded stuff like (-1 < gidx0)

* flaky
2024-07-23 22:55:54 -04:00
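The idea, sketched with a hypothetical helper (not tinygrad's pattern matcher): a SPECIAL index such as gidx0 with size 4 is bounded to [0, 3], so a comparison whose ranges cannot overlap folds to a constant.

```python
def always_lt(lhs_max, rhs_min):
    # the comparison folds to true when the left bound sits entirely below the right bound
    return lhs_max < rhs_min

gidx0_min, gidx0_max = 0, 3      # SPECIAL("gidx0", 4) covers 0..3
print(always_lt(-1, gidx0_min))  # True: (-1 < gidx0) folds to 1
```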
George Hotz
fa14f7b4fd switch contract arg to match expand arg [run_process_replay] (#5667)
* switch contract arg to match expand arg [run_process_replay]

* support multiaxis contract too, it's easy

* cancel contract/expand
2024-07-23 18:08:33 -07:00
George Hotz
a85493bdbe multiaxis contract test 2024-07-23 15:09:15 -07:00
George Hotz
e3f00ac77d Fix cuda tc emu test (#5663)
* fix acc folding for NV tensor cores

* fix correctness of reduce_before_expand

* fix test emulated CUDA tensor cores

* test_gemm_fp16 on some devices
2024-07-23 15:04:25 -07:00
chenyu
16c27ae400 update UOp.SPECIAL arg spec [run_process_replay] (#5661)
* update UOp.SPECIAL arg spec [run_process_replay]

from `(0, "gid0", 4)` to just `("gid0", 4)`; closer to a Variable

* fix ptx
2024-07-23 16:58:12 -04:00
chenyu
01fe00e055 skip test_failure_39 in CI (#5660)
took more than 2 minutes in CI metal; it's basically the same as test_failure_37 but 20X bigger
2024-07-23 14:47:05 -04:00
chenyu
199b3bf02b simple UOp lt/ge folding (#5657)
works if lhs is a DEFINE_VAR.
folds trivial x < -math.inf now, need to change SPECIAL to use DEFINE_VAR to fold more
2024-07-23 14:11:05 -04:00
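A hedged sketch of the trivial case the message mentions: nothing is below -inf, so (x < -math.inf) folds to a constant false whatever x's bounds are (hypothetical helper, not the actual rewrite rule):

```python
import math

def fold_lt_const(x_vmin, x_vmax, c):
    if x_vmax < c: return True    # comparison always holds
    if x_vmin >= c: return False  # comparison never holds
    return None                   # bounds too wide to fold

print(fold_lt_const(-math.inf, math.inf, -math.inf))  # False
```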
qazal
b0fc5a4c6f start scheduler process replay (#5656) 2024-07-23 20:02:51 +03:00