5269 Commits

Author SHA1 Message Date
qazal
e2e70bd90b bring unbind back in Variable const (#5687)
* bring unbind back in Variable const

* this shows my experience with symbolic
2024-07-24 18:37:00 -04:00
nimlgen
b026312a31 nv ptx print log (#5691) 2024-07-24 21:40:58 +03:00
George Hotz
bf24be4c8c triton gets 163 TFLOPS on 4090 2024-07-24 18:32:29 +00:00
chenyu
85710e86cb UOps div folding (#5690)
#5689, with just div folding and new test cases
2024-07-24 14:21:44 -04:00
chenyu
fb1b51811b unify UOp min/max default [run_process_replay] (#5688)
* unify UOp min/max default [run_process_replay]

* fix that
2024-07-24 13:05:26 -04:00
George Hotz
33d44f00ae first fold, then expand (#5673)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-24 09:43:09 -07:00
qazal
b7b4c7844f shorter BufferOps.LOAD creation (#5685) 2024-07-24 18:53:07 +03:00
qazal
365e7afd4d make fusion deterministic (#5684)
* make fusion deterministic

* not this one yet

* line saving
2024-07-24 18:37:31 +03:00
nimlgen
2ea54176e2 docs: add more info on HCQProgram (#5683)
* docs: add more info on HCQProgram

* linter

* linter2

* one more type
2024-07-24 17:20:18 +03:00
nimlgen
baface413a nv better nvdisasm fail message (#5682)
* nv better nvdisasm message

* cuda
2024-07-24 16:19:26 +03:00
qazal
37347528bf shorter BufferOps.CONST creation (#5681) 2024-07-24 19:33:04 +08:00
qazal
6dcdff3bfd share fusion behavior for r3 kernels (#5680)
* use groups

* this is the next one

* should check the whole graph
2024-07-24 19:07:10 +08:00
qazal
3ffb1059a0 scheduling infra for isolated dags (#5679)
* refactor to get_isolated_children

* move assign
2024-07-24 17:14:26 +08:00
chenyu
e6e2d86fcf replace RANGE max fold with generic max fold (#5676) 2024-07-24 03:15:39 -04:00
chenyu
a7a77dfd83 UOp mul lt fold (#5677) 2024-07-24 02:49:25 -04:00
chenyu
67b036bdfd generic UOp max folding (#5675) 2024-07-24 01:30:32 -04:00
chenyu
d1d81b359f UOp compute min and max in one call [run_process_replay] (#5674)
easier to handle cases like *-1 that flip the bounds
2024-07-24 00:51:23 -04:00
chenyu
4e85761d40 UOp mod folding (#5668) 2024-07-24 00:10:47 -04:00
George Hotz
f46ba37f8f increase amount of float2/float4 folding (#5672) 2024-07-23 20:52:56 -07:00
George Hotz
053550c3f3 remove MERGE opt, cleanup wmma upcast (#5669)
* remove MERGE opt, cleanup wmma upcast

* upcast first

* fix broken vectorize folding rule
2024-07-23 20:43:42 -07:00
George Hotz
918eebb1b1 simple TC change [run_process_replay] (#5671)
* Revert "Revert "this fails too""

This reverts commit 5de43e7073.

* fix dont_expand_args
2024-07-23 20:11:31 -07:00
chenyu
3060e0be4f add vmin vmax of SPECIAL (#5670)
* add vmin vmax of SPECIAL

folded stuff like (-1 < gidx0)

* flaky
2024-07-23 22:55:54 -04:00
George Hotz
5de43e7073 Revert "this fails too"
This reverts commit df20c4602a.
2024-07-24 02:20:53 +00:00
George Hotz
df20c4602a this fails too 2024-07-24 02:19:55 +00:00
George Hotz
fa14f7b4fd switch contract arg to match expand arg [run_process_replay] (#5667)
* switch contract arg to match expand arg [run_process_replay]

* support multiaxis contract too, it's easy

* cancel contract/expand
2024-07-23 18:08:33 -07:00
chenyu
ea99efe815 remove UOps lt pattern of booleans (#5666)
covered by the generic lt fold pattern
2024-07-23 20:11:21 -04:00
chenyu
e196640d71 more generic lt folding (#5665) 2024-07-23 19:50:59 -04:00
chenyu
7c8fe0fe47 skip interpolate tests for PYTHON=1 (#5664) 2024-07-23 18:47:15 -04:00
George Hotz
a85493bdbe multiaxis contract test 2024-07-23 15:09:15 -07:00
George Hotz
e3f00ac77d Fix cuda tc emu test (#5663)
* fix acc folding for NV tensor cores

* fix correctness of reduce_before_expand

* fix test emulated CUDA tensor cores

* test_gemm_fp16 on some devices
2024-07-23 15:04:25 -07:00
chenyu
c34f9db0f7 remove ptx PTXRenderer.gdim gid lid [run_process_replay] (#5662)
gdim is not used, gid and lid do not need to be attributes
2024-07-23 17:33:20 -04:00
chenyu
16c27ae400 update UOp.SPECIAL arg spec [run_process_replay] (#5661)
* update UOp.SPECIAL arg spec [run_process_replay]

from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable

* fix ptx
2024-07-23 16:58:12 -04:00
George Hotz
4d47968580 fix acc folding for NV tensor cores (#5658)
* fix acc folding for NV tensor cores

* fix correctness of reduce_before_expand
2024-07-23 13:03:02 -07:00
chenyu
01fe00e055 skip test_failure_39 in CI (#5660)
took more than 2 minutes in ci metal, it's basically the same as test_failure_37 but 20X bigger
2024-07-23 14:47:05 -04:00
chenyu
fdc72ba102 reorder UOps.DEFINE_VAR in runtime [run_process_replay] (#5659)
prep rewrite SPECIAL using DEFINE_VAR
2024-07-23 14:32:10 -04:00
chenyu
199b3bf02b simple UOp lt/ge folding (#5657)
works if lhs is a DEFINE_VAR.
folds trivial x < -math.inf now, need to change SPECIAL to use DEFINE_VAR to fold more
2024-07-23 14:11:05 -04:00
qazal
b0fc5a4c6f start scheduler process replay (#5656) 2024-07-23 20:02:51 +03:00
chenyu
e210c87b4a uop mod-mod simplification (#5650) 2024-07-23 12:33:55 -04:00
nimlgen
1384f08cd4 hcq profile tests (#5654)
* profile tests

* fixes

* remove linter
2024-07-23 18:40:33 +03:00
qazal
5f394fc9c6 more work toward non-blocking process replay (#5653)
* non-blocking process replay

* more actionable

* test it

* revert the test

* %s/logging.warn/logging.warning
2024-07-23 14:26:31 +03:00
nimlgen
a93982ef42 hcq move out program call to base class (#5638)
* hcq move out program call to base class

* fix
2024-07-23 14:25:38 +03:00
qazal
7cb67e6fb2 merge gated stores spec (#5652)
* test_unmerged_ifs should merge ifs

* test_tiny_gate_store

* test_merge_ifs_alt

* assert assert asserts
2024-07-23 18:53:27 +08:00
nimlgen
4dcca0a6d4 amd tiny cleanups (#5651) 2024-07-23 13:06:23 +03:00
George Hotz
7c4b177e3a add tests for uops stats (#5649)
* add tests for uops stats

* no locals skip is fine

* eh
2024-07-22 21:57:03 -07:00
chenyu
4f83da626e uop symbolic simple mul mod (#5648) 2024-07-22 23:17:41 -04:00
George Hotz
4042bc2399 hotfix: put that space back in DEBUG=2 2024-07-22 20:11:15 -07:00
George Hotz
2a436fa5c6 memory estimate of cache also (#5646)
* print cache/mem ratio

* lds update

* min mem and lds

* cleanup
2024-07-22 19:56:36 -07:00
chenyu
efc7bf37a2 reuse UOp.sparents in UOps.vars [run_process_replay] (#5647)
also simplified a few set.union
2024-07-22 22:50:19 -04:00
chenyu
f2d2afdaa4 dumb linearizer example that max is not simplified (#5644)
* dumb linearizer example that max is not simplified

this might just get fixed once basic mod simplification is done

* need local
2024-07-22 18:37:26 -04:00
chenyu
fe17ea5c88 typo in ops_amd invalidate_caches (#5643)
led to it silently not being called
2024-07-22 18:37:11 -04:00