4618 Commits

Author SHA1 Message Date
chenyu
2cc55a3095 UOp simple mul add div fold (#5726) 2024-07-25 22:00:30 -04:00
chenyu
5521b6d437 UOp simple mul-add-lt fold (#5721) 2024-07-25 20:49:38 -04:00
qazal
1b53207b4f revert isolated dags scheduling (#5724) 2024-07-25 19:45:12 -04:00
chenyu
845b0d1c9d UOp more generic div folding (#5722)
old: `x // c` can fold if `0 <= x.vmin <= x.vmax < c`
new: `x // c` can fold if `0 < c and x.vmin // c == x.vmax // c`
2024-07-25 17:49:14 -04:00
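The old and new conditions in the message above can be compared with a small sketch. It is not the tinygrad implementation: it assumes plain Python integers, Python floor-division semantics, and an inclusive range `[vmin, vmax]` for `x`. The helper names `fold_div_old` and `fold_div_new` are hypothetical.

```python
from typing import Optional

def fold_div_old(vmin: int, vmax: int, c: int) -> Optional[int]:
    """Old rule: `x // c` folds (to 0) only if 0 <= vmin <= vmax < c."""
    if 0 <= vmin <= vmax < c:
        return 0
    return None

def fold_div_new(vmin: int, vmax: int, c: int) -> Optional[int]:
    """New rule: `x // c` folds if c > 0 and both bounds land on the same quotient.

    For c > 0, floor division is non-decreasing in x, so equal quotients at the
    two bounds imply the same quotient for every x in [vmin, vmax].
    """
    if 0 < c and vmin // c == vmax // c:
        return vmin // c
    return None

def check_rules(limit: int = 10) -> bool:
    """Brute-force check on small ranges (an illustration, not a proof)."""
    for c in range(1, 6):
        for vmin in range(-limit, limit + 1):
            for vmax in range(vmin, limit + 1):
                new = fold_div_new(vmin, vmax, c)
                if new is not None:
                    # every x in the range must divide to the folded constant
                    if any(x // c != new for x in range(vmin, vmax + 1)):
                        return False
                old = fold_div_old(vmin, vmax, c)
                # whenever the old rule folds, the new rule folds to the same value
                if old is not None and new != old:
                    return False
    return True

print(fold_div_old(8, 11, 4), fold_div_new(8, 11, 4))
print(check_rules())
```

Under these assumptions the new rule covers every case the old one did, and also ranges such as `[8, 11]` with `c = 4`, which the old rule leaves unfolded.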
chenyu
a82815262c more test_pattern_matcher fixups (#5714) 2024-07-25 14:12:21 -04:00
chenyu
05e02ddfb3 fixup test_pattern_matcher (#5712) 2024-07-25 13:48:52 -04:00
qazal
9ceb3a3d1f beautiful_mnist -4.3% kernels (#5709)
* add is_complete

* partially delete forced_realized

* p2

* start

* refactor to can_group

* remove steps

* _get_inputs is nicer

* fix the cache

* cache is dict now

* rename to group
2024-07-25 20:30:49 +03:00
kormann
1e2eac755d Fix repr upat (#5705)
* test

* fix

* x fix

* simpler

* rm extra space
2024-07-25 12:05:48 -04:00
qazal
1c992de257 hotfix: compare_schedule defaults to false (#5707) 2024-07-25 17:08:28 +03:00
qazal
489cda827a more scheduler process replay tooling (#5706)
* more scheduler process replay tooling

* refactor to compare_schedule
2024-07-25 15:47:18 +03:00
qazal
4e070a2c89 start work on indexing fusion (#5590)
* start base

* the views add up

base reduceop st:
ShapeTracker(views=(View(shape=(60000, 1), strides=(1, 0), offset=0, mask=None, contiguous=True),))

top st:

ShapeTracker(views=(View(shape=(512, 6000, 1, 28, 28, 10), strides=(0, 1, 0, 0, 0, 6000), offset=0, mask=None, contiguous=False), View(shape=(512, 6000, 1, 28, 28, 10), strides=(47040000, 784, 0, 28, 1, 4704000), offset=0, mask=None, contiguous=False)))

merged buf.st+st:
ShapeTracker(views=(View(shape=(512, 6000, 1, 28, 28, 10), strides=(0, 1, 0, 0, 0, 6000), offset=0, mask=None, contiguous=False), View(shape=(512, 6000, 1, 28, 28, 10), strides=(47040000, 784, 0, 28, 1, 4704000), offset=0, mask=None, contiguous=False)))

* p1

* some cleanups

* more cleanups

* one kernel

* more

* late fuse arange

* less lines

* more work

* fix st strides 1

* update test_schedule, start argmax

* test_tiny_argmax

* add FUSE_ARANGE

* more cleanup

* add utils

* reduce merging

* fix axis and fold if needed

* more fusion

* need to figure this out

* now fixing all of these

* todos+save a line

* ready for p1
2024-07-25 13:23:38 +03:00
nimlgen
08f47d7dc3 more info on failure 41 (#5704) 2024-07-25 12:14:28 +03:00
nimlgen
69d4f474d8 amd resnet pf (#5703) 2024-07-25 11:21:22 +03:00
chenyu
46e1151c02 UOp more generic mul -> mod folding (#5698) 2024-07-24 21:41:25 -04:00
chenyu
66a9c372af UOp mod reduction (#5697) 2024-07-24 20:36:00 -04:00
chenyu
8648fb2636 UOp vmin/vmax on ADD (#5689) 2024-07-24 19:09:42 -04:00
chenyu
85710e86cb UOps div folding (#5690)
#5689, with just div folding and new test cases
2024-07-24 14:21:44 -04:00
chenyu
a7a77dfd83 UOp mul lt fold (#5677) 2024-07-24 02:49:25 -04:00
chenyu
4e85761d40 UOp mod folding (#5668) 2024-07-24 00:10:47 -04:00
George Hotz
053550c3f3 remove MERGE opt, cleanup wmma upcast (#5669)
* remove MERGE opt, cleanup wmma upcast

* upcast first

* fix broken vectorize folding rule
2024-07-23 20:43:42 -07:00
chenyu
3060e0be4f add vmin vmax of SPECIAL (#5670)
* add vmin vmax of SPECIAL

folded stuff like (-1 < gidx0)

* flaky
2024-07-23 22:55:54 -04:00
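A rough sketch of why bounds on a SPECIAL let a comparison like `(-1 < gidx0)` fold. It is not tinygrad code: it assumes a SPECIAL with arg `(name, size)` takes integer values in `[0, size - 1]`, and the helpers `special_bounds` and `fold_lt` are hypothetical names working on plain `(vmin, vmax)` tuples.

```python
from typing import Optional, Tuple

Bounds = Tuple[int, int]  # inclusive (vmin, vmax)

def special_bounds(arg: Tuple[str, int]) -> Bounds:
    """Assumed range of a SPECIAL such as ("gidx0", 4): 0 .. size - 1."""
    _name, size = arg
    return (0, size - 1)

def const_bounds(value: int) -> Bounds:
    """A constant has identical lower and upper bounds."""
    return (value, value)

def fold_lt(a: Bounds, b: Bounds) -> Optional[bool]:
    """Fold `a < b` when the ranges decide it; return None if undecided."""
    a_min, a_max = a
    b_min, b_max = b
    if a_max < b_min:
        return True   # every a is below every b
    if a_min >= b_max:
        return False  # no a can be below any b
    return None

gidx0 = special_bounds(("gidx0", 4))
print(fold_lt(const_bounds(-1), gidx0))  # -1 < gidx0
print(fold_lt(const_bounds(2), gidx0))   # 2 < gidx0
```

With the assumed range, `-1 < gidx0` is always true and can be replaced by a constant, while `2 < gidx0` depends on the index and stays as a comparison.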
George Hotz
fa14f7b4fd switch contract arg to match expand arg [run_process_replay] (#5667)
* switch contract arg to match expand arg [run_process_replay]

* support multiaxis contract too, it's easy

* cancel contract/expand
2024-07-23 18:08:33 -07:00
George Hotz
a85493bdbe multiaxis contract test 2024-07-23 15:09:15 -07:00
George Hotz
e3f00ac77d Fix cuda tc emu test (#5663)
* fix acc folding for NV tensor cores

* fix correctness of reduce_before_expand

* fix test emulated CUDA tensor cores

* test_gemm_fp16 on some devices
2024-07-23 15:04:25 -07:00
chenyu
16c27ae400 update UOp.SPECIAL arg spec [run_process_replay] (#5661)
* update UOp.SPECIAL arg spec [run_process_replay]

from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable

* fix ptx
2024-07-23 16:58:12 -04:00
chenyu
01fe00e055 skip test_failure_39 in CI (#5660)
took more than 2 minutes in ci metal, it's basically the same as test_failure_37 but 20X bigger
2024-07-23 14:47:05 -04:00
chenyu
199b3bf02b simple UOp lt/ge folding (#5657)
works if lhs is a DEFINE_VAR.
folds trivial x < -math.inf now, need to change SPECIAL to use DEFINE_VAR to fold more
2024-07-23 14:11:05 -04:00
qazal
b0fc5a4c6f start scheduler process replay (#5656) 2024-07-23 20:02:51 +03:00
chenyu
e210c87b4a uop mod-mod simplification (#5650) 2024-07-23 12:33:55 -04:00
nimlgen
1384f08cd4 hcq profile tests (#5654)
* profile tests

* fixes

* remove linter
2024-07-23 18:40:33 +03:00
qazal
5f394fc9c6 more work toward non-blocking process replay (#5653)
* non-blocking process replay

* more actionable

* test it

* revert the test

* %s/logging.warn/logging.warning
2024-07-23 14:26:31 +03:00
qazal
7cb67e6fb2 merge gated stores spec (#5652)
* test_unmerged_ifs should merge ifs

* test_tiny_gate_store

* test_merge_ifs_alt

* assert assert asserts
2024-07-23 18:53:27 +08:00
George Hotz
7c4b177e3a add tests for uops stats (#5649)
* add tests for uops stats

* no locals skip is fine

* eh
2024-07-22 21:57:03 -07:00
chenyu
4f83da626e uop symbolic simple mul mod (#5648) 2024-07-22 23:17:41 -04:00
chenyu
f2d2afdaa4 dumb linearizer example that max is not simplified (#5644)
* dumb linearizer example that max is not simplified

this might just get fix once basic mod simplification is done

* need local
2024-07-22 18:37:26 -04:00
chenyu
24505199fb UOp.const(x.dtype, y) -> x.const(y) [run_process_replay] (#5642) 2024-07-22 17:09:40 -04:00
chenyu
97b116bb1d UOp mul div simplification (#5637)
* UOp mul div simplification

* != 0 is fine
2024-07-22 16:14:12 -04:00
nimlgen
26fc4610a0 amd more accurate cache managment (#5631)
* amd more accurate cache managment

* fix amd

* add memory_barrier + copies tests

* tranfer test as well

* linter
2024-07-22 19:07:01 +03:00
Vyacheslav Pachkov
edc58e6b6e hcq: remove duplicate allocation of kernel args by abstracting (#5633) 2024-07-22 18:29:41 +03:00
George Hotz
dc21e63bd2 test: put conv in one reduce (#4441)
* test: put conv in one reduce

* put reduce at the end

* more expand

* generic, and that expand was breaking things

* ratio

* don't undo the expand

* arg 1

* strides

* warning, for resnet

* warning removed

* disable cast

* handle cast

* op

* err, that's right

* fixup

* fix that

* a test to play with

* add double_reduces

* working up to final reshape

* fold the last reshape

* moved to schedule

* fix axis

* ci, need to bring arange back

* FUSE_CONV_BW maybe

* valid in 3.9

* test_expand_reduce_is_folded_on_different_axes

* add FUSE_CONV_BW=1

* test_fold_batchnorm_backward

* test_sgd_4convs_fuse

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-07-22 12:16:13 +03:00
George Hotz
386fb5e7f8 folding without UNMUL (#5628)
* folding without UNMUL

* fix failures, index_collapse

* import ReduceOps

* test_arange_4096 isn't folding
2024-07-21 20:14:44 -07:00
George Hotz
7f5282b2f5 tests if the linearizer is generating dumb code (#5611)
* tests if the linearizer is generating dumb code

* push consts to the end

* sort adds

* sorted add and mul

* this better

* simple expand/contract

* no math contract/expand
2024-07-20 20:36:32 -07:00
George Hotz
b399ccd6ef BEAM bugfix, kernels dedup now (#5617)
* BEAM bugfix, kernels dedup now

* getenv is default
2024-07-20 19:43:50 -07:00
chenyu
92e7e65712 one more test case for symbolic mod mul (#5615) 2024-07-20 17:23:06 -04:00
qazal
3ab5fe4e1b test argmax multireduce failure (#5609) 2024-07-20 21:33:03 +08:00
chenyu
b991097d41 move UPat and PatternMatcher from uopgraph.py to uops.py (#5597)
* move UPat and PatternMatcher from uopgraph.py to uops.py

towards instant UOps rewrite on UOp.alu

[run_process_replay]

* fix imports
2024-07-19 19:28:24 -04:00
George Hotz
2e617ca59e lowerer img index (#5592) 2024-07-19 14:22:02 -07:00
nimlgen
b1782e3fef hcq refactor signal into class (#5575)
* hcq refactor signal into class

* fix amd

* amd do not use amd_signal_t

* cleanup

* signal setter

* fix linter

* docs

* more docs + types

* fix types
2024-07-19 23:23:05 +03:00
George Hotz
d0ab20a5e5 careful memory counting (with tests to specify behavior) (#5587) 2024-07-19 11:37:34 -07:00
chenyu
37dd233650 always reverse global dim (#5586)
* always reverse global dim

* one more test
2024-07-19 13:58:05 -04:00