qazal
|
e2e70bd90b
|
bring unbind back in Variable const (#5687)
* bring unbind back in Variable const
* this shows my experience with symbolic
|
2024-07-24 18:37:00 -04:00 |
|
nimlgen
|
b026312a31
|
nv ptx print log (#5691)
|
2024-07-24 21:40:58 +03:00 |
|
George Hotz
|
bf24be4c8c
|
triton gets 163 TFLOPS on 4090
|
2024-07-24 18:32:29 +00:00 |
|
chenyu
|
85710e86cb
|
UOps div folding (#5690)
#5689, with just div folding and new test cases
|
2024-07-24 14:21:44 -04:00 |
|
chenyu
|
fb1b51811b
|
unify UOp min/max default [run_process_replay] (#5688)
* unify UOp min/max default [run_process_replay]
* fix that
|
2024-07-24 13:05:26 -04:00 |
|
George Hotz
|
33d44f00ae
|
first fold, then expand (#5673)
Co-authored-by: chenyu <chenyu@fastmail.com>
|
2024-07-24 09:43:09 -07:00 |
|
qazal
|
b7b4c7844f
|
shorter BufferOps.LOAD creation (#5685)
|
2024-07-24 18:53:07 +03:00 |
|
qazal
|
365e7afd4d
|
make fusion deterministic (#5684)
* make fusion deterministic
* not this one yet
* line saving
|
2024-07-24 18:37:31 +03:00 |
|
nimlgen
|
2ea54176e2
|
docs: add more info on HCQProgram (#5683)
* docs: add more info on HCQProgram
* linter
* linter2
* one more type
|
2024-07-24 17:20:18 +03:00 |
|
nimlgen
|
baface413a
|
nv better nvdisasm fail message (#5682)
* nv better nvdisasm message
* cuda
|
2024-07-24 16:19:26 +03:00 |
|
qazal
|
37347528bf
|
shorter BufferOps.CONST creation (#5681)
|
2024-07-24 19:33:04 +08:00 |
|
qazal
|
6dcdff3bfd
|
share fusion behavior for r3 kernels (#5680)
* use groups
* this is the next one
* should check the whole graph
|
2024-07-24 19:07:10 +08:00 |
|
qazal
|
3ffb1059a0
|
scheduling infra for isolated dags (#5679)
* refactor to get_isolated_children
* move assign
|
2024-07-24 17:14:26 +08:00 |
|
chenyu
|
e6e2d86fcf
|
replace RANGE max fold with generic max fold (#5676)
|
2024-07-24 03:15:39 -04:00 |
|
chenyu
|
a7a77dfd83
|
UOp mul lt fold (#5677)
|
2024-07-24 02:49:25 -04:00 |
|
chenyu
|
67b036bdfd
|
generic UOp max folding (#5675)
|
2024-07-24 01:30:32 -04:00 |
|
chenyu
|
d1d81b359f
|
UOp compute min and max in one call [run_process_replay] (#5674)
easier to handle cases like *-1 that flip the bounds
|
2024-07-24 00:51:23 -04:00 |
|
chenyu
|
4e85761d40
|
UOp mod folding (#5668)
|
2024-07-24 00:10:47 -04:00 |
|
George Hotz
|
f46ba37f8f
|
increase amount of float2/float4 folding (#5672)
|
2024-07-23 20:52:56 -07:00 |
|
George Hotz
|
053550c3f3
|
remove MERGE opt, cleanup wmma upcast (#5669)
* remove MERGE opt, cleanup wmma upcast
* upcast first
* fix broken vectorize folding rule
|
2024-07-23 20:43:42 -07:00 |
|
George Hotz
|
918eebb1b1
|
simple TC change [run_process_replay] (#5671)
* Revert "Revert "this fails too""
This reverts commit 5de43e7073.
* fix dont_expand_args
|
2024-07-23 20:11:31 -07:00 |
|
chenyu
|
3060e0be4f
|
add vmin vmax of SPECIAL (#5670)
* add vmin vmax of SPECIAL
folded stuff like (-1 < gidx0)
* flaky
|
2024-07-23 22:55:54 -04:00 |
|
George Hotz
|
5de43e7073
|
Revert "this fails too"
This reverts commit df20c4602a.
|
2024-07-24 02:20:53 +00:00 |
|
George Hotz
|
df20c4602a
|
this fails too
|
2024-07-24 02:19:55 +00:00 |
|
George Hotz
|
fa14f7b4fd
|
switch contract arg to match expand arg [run_process_replay] (#5667)
* switch contract arg to match expand arg [run_process_replay]
* support multiaxis contract too, it's easy
* cancel contract/expand
|
2024-07-23 18:08:33 -07:00 |
|
chenyu
|
ea99efe815
|
remove UOps lt pattern of booleans (#5666)
covered by the generic lt fold pattern
|
2024-07-23 20:11:21 -04:00 |
|
chenyu
|
e196640d71
|
more generic lt folding (#5665)
|
2024-07-23 19:50:59 -04:00 |
|
chenyu
|
7c8fe0fe47
|
skip interpolate tests for PYTHON=1 (#5664)
|
2024-07-23 18:47:15 -04:00 |
|
George Hotz
|
a85493bdbe
|
multiaxis contract test
|
2024-07-23 15:09:15 -07:00 |
|
George Hotz
|
e3f00ac77d
|
Fix cuda tc emu test (#5663)
* fix acc folding for NV tensor cores
* fix correctness of reduce_before_expand
* fix test emulated CUDA tensor cores
* test_gemm_fp16 on some devices
|
2024-07-23 15:04:25 -07:00 |
|
chenyu
|
c34f9db0f7
|
remove ptx PTXRenderer.gdim gid lid [run_process_replay] (#5662)
gdim is not used, gid and lid do not need to be attributes
|
2024-07-23 17:33:20 -04:00 |
|
chenyu
|
16c27ae400
|
update UOp.SPECIAL arg spec [run_process_replay] (#5661)
* update UOp.SPECIAL arg spec [run_process_replay]
from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable
* fix ptx
|
2024-07-23 16:58:12 -04:00 |
|
George Hotz
|
4d47968580
|
fix acc folding for NV tensor cores (#5658)
* fix acc folding for NV tensor cores
* fix correctness of reduce_before_expand
|
2024-07-23 13:03:02 -07:00 |
|
chenyu
|
01fe00e055
|
skip test_failure_39 in CI (#5660)
took more than 2 minutes in ci metal, it's basically the same as test_failure_37 but 20X bigger
|
2024-07-23 14:47:05 -04:00 |
|
chenyu
|
fdc72ba102
|
reorder UOps.DEFINE_VAR in runtime [run_process_replay] (#5659)
prep rewrite SPECIAL using DEFINE_VAR
|
2024-07-23 14:32:10 -04:00 |
|
chenyu
|
199b3bf02b
|
simple UOp lt/ge folding (#5657)
works if lhs is a DEFINE_VAR.
folds trivial x < -math.inf now, need to change SPECIAL to use DEFINE_VAR to fold more
|
2024-07-23 14:11:05 -04:00 |
|
qazal
|
b0fc5a4c6f
|
start scheduler process replay (#5656)
|
2024-07-23 20:02:51 +03:00 |
|
chenyu
|
e210c87b4a
|
uop mod-mod simplification (#5650)
|
2024-07-23 12:33:55 -04:00 |
|
nimlgen
|
1384f08cd4
|
hcq profile tests (#5654)
* profile tests
* fixes
* remove linter
|
2024-07-23 18:40:33 +03:00 |
|
qazal
|
5f394fc9c6
|
more work toward non-blocking process replay (#5653)
* non-blocking process replay
* more actionable
* test it
* revert the test
* %s/logging.warn/logging.warning
|
2024-07-23 14:26:31 +03:00 |
|
nimlgen
|
a93982ef42
|
hcq move out program call to base class (#5638)
* hcq move out program call to base class
* fix
|
2024-07-23 14:25:38 +03:00 |
|
qazal
|
7cb67e6fb2
|
merge gated stores spec (#5652)
* test_unmerged_ifs should merge ifs
* test_tiny_gate_store
* test_merge_ifs_alt
* assert assert asserts
|
2024-07-23 18:53:27 +08:00 |
|
nimlgen
|
4dcca0a6d4
|
amd tiny cleanups (#5651)
|
2024-07-23 13:06:23 +03:00 |
|
George Hotz
|
7c4b177e3a
|
add tests for uops stats (#5649)
* add tests for uops stats
* no locals skip is fine
* eh
|
2024-07-22 21:57:03 -07:00 |
|
chenyu
|
4f83da626e
|
uop symbolic simple mul mod (#5648)
|
2024-07-22 23:17:41 -04:00 |
|
George Hotz
|
4042bc2399
|
hotfix: put that space back in DEBUG=2
|
2024-07-22 20:11:15 -07:00 |
|
George Hotz
|
2a436fa5c6
|
memory estimate of cache also (#5646)
* print cache/mem ratio
* lds update
* min mem and lds
* cleanup
|
2024-07-22 19:56:36 -07:00 |
|
chenyu
|
efc7bf37a2
|
reuse UOp.sparents in UOps.vars [run_process_replay] (#5647)
also simplified a few set.union
|
2024-07-22 22:50:19 -04:00 |
|
chenyu
|
f2d2afdaa4
|
dumb linearizer example that max is not simplified (#5644)
* dumb linearizer example that max is not simplified
this might just get fixed once basic mod simplification is done
* need local
|
2024-07-22 18:37:26 -04:00 |
|
chenyu
|
fe17ea5c88
|
typo in ops_amd invalidate_caches (#5643)
led to it silently not being called
|
2024-07-22 18:37:11 -04:00 |
|
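Several entries in this log (#5674, #5670, #5665, #5657) concern computing a value's minimum and maximum together and using those bounds to fold comparisons. The sketch below illustrates that general idea with plain interval arithmetic. It is not tinygrad's implementation: `Bounds`, `mul_const`, and `fold_lt` are hypothetical names introduced here, and reading `("gid0", 4)` as the inclusive range 0..3 is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bounds:
    """Hypothetical inclusive integer range [lo, hi] for a symbolic value."""
    lo: int
    hi: int


def mul_const(b: Bounds, c: int) -> Bounds:
    # Compute both ends in one call: a negative multiplier flips the
    # bounds, so take min/max of the two products instead of assuming order.
    x, y = b.lo * c, b.hi * c
    return Bounds(min(x, y), max(x, y))


def fold_lt(a: Bounds, b: Bounds) -> Optional[bool]:
    # Decide `a < b` from bounds alone; None means it cannot be folded.
    if a.hi < b.lo:
        return True
    if a.lo >= b.hi:
        return False
    return None


# Assumption: a SPECIAL with arg ("gid0", 4) ranges over 0..3.
gidx0 = Bounds(0, 3)
minus_one = Bounds(-1, -1)

print(mul_const(gidx0, -1))         # bounds flip under *-1
print(fold_lt(minus_one, gidx0))    # (-1 < gidx0) is always true
print(fold_lt(gidx0, Bounds(2, 2))) # undecided, stays unfolded
```

Returning both ends from a single function keeps the sign handling in one place, which matches the motivation given in #5674 for cases like `*-1`.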