Commit Graph

5245 Commits

Author SHA1 Message Date
George Hotz
fa14f7b4fd switch contract arg to match expand arg [run_process_replay] (#5667)
* switch contract arg to match expand arg [run_process_replay]

* support multiaxis contract too, it's easy

* cancel contract/expand
2024-07-23 18:08:33 -07:00
chenyu
ea99efe815 remove UOps lt pattern of booleans (#5666)
covered by the generic lt fold pattern
2024-07-23 20:11:21 -04:00
chenyu
e196640d71 more generic lt folding (#5665) 2024-07-23 19:50:59 -04:00
chenyu
7c8fe0fe47 skip interpolate tests for PYTHON=1 (#5664) 2024-07-23 18:47:15 -04:00
George Hotz
a85493bdbe multiaxis contract test 2024-07-23 15:09:15 -07:00
George Hotz
e3f00ac77d Fix cuda tc emu test (#5663)
* fix acc folding for NV tensor cores

* fix correctness of reduce_before_expand

* fix test emulated CUDA tensor cores

* test_gemm_fp16 on some devices
2024-07-23 15:04:25 -07:00
chenyu
c34f9db0f7 remove ptx PTXRenderer.gdim gid lid [run_process_replay] (#5662)
gdim is not used, gid and lid do not need to be attributes
2024-07-23 17:33:20 -04:00
chenyu
16c27ae400 update UOp.SPECIAL arg spec [run_process_replay] (#5661)
* update UOp.SPECIAL arg spec [run_process_replay]

from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable

* fix ptx
2024-07-23 16:58:12 -04:00
George Hotz
4d47968580 fix acc folding for NV tensor cores (#5658)
* fix acc folding for NV tensor cores

* fix correctness of reduce_before_expand
2024-07-23 13:03:02 -07:00
chenyu
01fe00e055 skip test_failure_39 in CI (#5660)
took more than 2 minutes in ci metal, it's basically the same as test_failure_37 but 20X bigger
2024-07-23 14:47:05 -04:00
chenyu
fdc72ba102 reorder UOps.DEFINE_VAR in runtime [run_process_replay] (#5659)
prep rewrite SPECIAL using DEFINE_VAR
2024-07-23 14:32:10 -04:00
chenyu
199b3bf02b simple UOp lt/ge folding (#5657)
works if lhs is a DEFINE_VAR.
folds trivial x < -math.inf now, need to change SPECIAL to use DEFINE_VAR to fold more
2024-07-23 14:11:05 -04:00
qazal
b0fc5a4c6f start scheduler process replay (#5656) 2024-07-23 20:02:51 +03:00
chenyu
e210c87b4a uop mod-mod simplification (#5650) 2024-07-23 12:33:55 -04:00
nimlgen
1384f08cd4 hcq profile tests (#5654)
* profile tests

* fixes

* remove linter
2024-07-23 18:40:33 +03:00
qazal
5f394fc9c6 more work toward non-blocking process replay (#5653)
* non-blocking process replay

* more actionable

* test it

* revert the test

* %s/logging.warn/logging.warning
2024-07-23 14:26:31 +03:00
nimlgen
a93982ef42 hcq move out program call to base class (#5638)
* hcq move out program call to base class

* fix
2024-07-23 14:25:38 +03:00
qazal
7cb67e6fb2 merge gated stores spec (#5652)
* test_unmerged_ifs should merge ifs

* test_tiny_gate_store

* test_merge_ifs_alt

* assert assert asserts
2024-07-23 18:53:27 +08:00
nimlgen
4dcca0a6d4 amd tiny cleanups (#5651) 2024-07-23 13:06:23 +03:00
George Hotz
7c4b177e3a add tests for uops stats (#5649)
* add tests for uops stats

* no locals skip is fine

* eh
2024-07-22 21:57:03 -07:00
chenyu
4f83da626e uop symbolic simple mul mod (#5648) 2024-07-22 23:17:41 -04:00
George Hotz
4042bc2399 hotfix: put that space back in DEBUG=2 2024-07-22 20:11:15 -07:00
George Hotz
2a436fa5c6 memory estimate of cache also (#5646)
* print cache/mem ratio

* lds update

* min mem and lds

* cleanup
2024-07-22 19:56:36 -07:00
chenyu
efc7bf37a2 reuse UOp.sparents in UOps.vars [run_process_replay] (#5647)
also simplified a few set.union
2024-07-22 22:50:19 -04:00
chenyu
f2d2afdaa4 dumb linearizer example that max is not simplified (#5644)
* dumb linearizer example that max is not simplified

this might just get fixed once basic mod simplification is done

* need local
2024-07-22 18:37:26 -04:00
chenyu
fe17ea5c88 typo in ops_amd invalidate_caches (#5643)
led to it silently not being called
2024-07-22 18:37:11 -04:00
George Hotz
ed2ee52b8b fix arange 4096 with more folding rules (#5641) 2024-07-22 14:21:11 -07:00
chenyu
24505199fb UOp.const(x.dtype, y) -> x.const(y) [run_process_replay] (#5642) 2024-07-22 17:09:40 -04:00
chenyu
97b116bb1d UOp mul div simplification (#5637)
* UOp mul div simplification

* != 0 is fine
2024-07-22 16:14:12 -04:00
nimlgen
ee633c1988 hcq move out synchronize to base class (#5634) 2024-07-22 20:36:04 +03:00
nimlgen
26fc4610a0 amd more accurate cache management (#5631)
* amd more accurate cache management

* fix amd

* add memory_barrier + copies tests

* transfer test as well

* linter
2024-07-22 19:07:01 +03:00
qazal
fe6f9b2048 more actionable verify_lazyop assert (#5635) 2024-07-23 00:06:11 +08:00
Vyacheslav Pachkov
edc58e6b6e hcq: remove duplicate allocation of kernel args by abstracting (#5633) 2024-07-22 18:29:41 +03:00
nimlgen
08a9c0ae5e hcq cache invalidation for beam (#5630)
* nv full cache invalidation

* the same command on amd

* linter

* fix amd

* nv no hardcoded consts

* beam default
2024-07-22 18:13:17 +03:00
qazal
c64e9591e3 replace gates in uopgraph [run_process_replay] (#5632)
* test

* replace gates in uopgraph [run_process_replay]

* rewrite all gates

* hmm, process replay passes?

* Revert "rewrite all gates"

This reverts commit 2425a443f3.

* yea that makes sense

* remove unused

* replace source should be up there
2024-07-22 15:56:55 +03:00
George Hotz
dc21e63bd2 test: put conv in one reduce (#4441)
* test: put conv in one reduce

* put reduce at the end

* more expand

* generic, and that expand was breaking things

* ratio

* don't undo the expand

* arg 1

* strides

* warning, for resnet

* warning removed

* disable cast

* handle cast

* op

* err, that's right

* fixup

* fix that

* a test to play with

* add double_reduces

* working up to final reshape

* fold the last reshape

* moved to schedule

* fix axis

* ci, need to bring arange back

* FUSE_CONV_BW maybe

* valid in 3.9

* test_expand_reduce_is_folded_on_different_axes

* add FUSE_CONV_BW=1

* test_fold_batchnorm_backward

* test_sgd_4convs_fuse

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-07-22 12:16:13 +03:00
George Hotz
386fb5e7f8 folding without UNMUL (#5628)
* folding without UNMUL

* fix failures, index_collapse

* import ReduceOps

* test_arange_4096 isn't folding
2024-07-21 20:14:44 -07:00
Vyacheslav Pachkov
583829ab44 helpers: remove duplicate data64 helpers in amd/nv (#5627) 2024-07-21 16:50:59 -07:00
George Hotz
6c6d74d922 parallel mcts (#5626)
* start work on parallel mcts

* compile was linearizing twice

* typing + more early stopping

* fix compiler error
2024-07-21 14:53:23 -07:00
chenyu
c56c9c7519 move ufix inside UOp [run_process_replay] (#5621)
dtype is always the dtype of the caller
2024-07-21 17:30:37 -04:00
George Hotz
ef179087a4 mcts exit condition wasn't right, also use it with BEAM>=100 (#5619)
* mcts exit condition wasn't right, also use it with BEAM>=100

* mcts touchups

* clean up sample
2024-07-21 10:16:47 -07:00
chenyu
a823759dc5 simpler pattern matcher rules [run_process_replay] (#5620) 2024-07-21 04:05:01 -04:00
George Hotz
0f67ef4674 mcts graph and dedup support (#5618)
* mcts graph and dedup support

* usable graph

* mcts colors

* C=4 seems better

* C=3 even better

* sample_tree

* backprop is external function

* late expand to match algo
2024-07-20 23:29:14 -07:00
George Hotz
7f5282b2f5 tests if the linearizer is generating dumb code (#5611)
* tests if the linearizer is generating dumb code

* push consts to the end

* sort adds

* sorted add and mul

* this better

* simple expand/contract

* no math contract/expand
2024-07-20 20:36:32 -07:00
chenyu
eddc5bcfd7 MCTS tweaks (#5616)
MCTS 500 is competitive with BEAM=8 on resnet on M1 Max.
- increment trial times even on compile errors and runtime errors.
- use best time of children as the node value.
2024-07-20 19:45:59 -07:00
George Hotz
b399ccd6ef BEAM bugfix, kernels dedup now (#5617)
* BEAM bugfix, kernels dedup now

* getenv is default
2024-07-20 19:43:50 -07:00
chenyu
92e7e65712 one more test case for symbolic mod mul (#5615) 2024-07-20 17:23:06 -04:00
chenyu
d71308ed68 copy mlperf 4.0 to mlperf 4.1 (#5614) 2024-07-20 16:12:00 -04:00
George Hotz
1113e47f96 print best in MCTS + light up the winner in hcopt 2024-07-20 09:39:36 -07:00
nimlgen
0de5812032 hcq move map to allocator (#5610)
* hcq move map to allocator

* fix
2024-07-20 19:02:45 +03:00