qazal
489cda827a
more scheduler process replay tooling ( #5706 )
...
* more scheduler process replay tooling
* refactor to compare_schedule
2024-07-25 15:47:18 +03:00
qazal
4e070a2c89
start work on indexing fusion ( #5590 )
...
* start base
* the views add up
base reduceop st:
ShapeTracker(views=(View(shape=(60000, 1), strides=(1, 0), offset=0, mask=None, contiguous=True),))
top st:
ShapeTracker(views=(View(shape=(512, 6000, 1, 28, 28, 10), strides=(0, 1, 0, 0, 0, 6000), offset=0, mask=None, contiguous=False), View(shape=(512, 6000, 1, 28, 28, 10), strides=(47040000, 784, 0, 28, 1, 4704000), offset=0, mask=None, contiguous=False)))
merged buf.st+st:
ShapeTracker(views=(View(shape=(512, 6000, 1, 28, 28, 10), strides=(0, 1, 0, 0, 0, 6000), offset=0, mask=None, contiguous=False), View(shape=(512, 6000, 1, 28, 28, 10), strides=(47040000, 784, 0, 28, 1, 4704000), offset=0, mask=None, contiguous=False)))
* p1
* some cleanups
* more cleanups
* one kernel
* more
* late fuse arange
* less lines
* more work
* fix st strides 1
* update test_schedule, start argmax
* test_tiny_argmax
* add FUSE_ARANGE
* more cleanup
* add utils
* reduce merging
* fix axis and fold if needed
* more fusion
* need to figure this out
* now fixing all of these
* todos+save a line
* ready for p1
2024-07-25 13:23:38 +03:00
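The ShapeTracker dump in the commit body above shows views defined by a shape, strides, and an offset. A minimal sketch of how such a stride-based view maps a multi-dimensional index to a flat buffer offset (plain Python, not tinygrad's actual View class; the field names mirror the dump):

```python
# A view maps a multi-dimensional index to a flat buffer offset by a dot
# product with the strides, plus a constant offset.
def flat_offset(idx, strides, offset=0):
    return offset + sum(i * s for i, s in zip(idx, strides))

# the base reduceop st above: shape=(60000, 1), strides=(1, 0)
assert flat_offset((123, 0), (1, 0)) == 123
# a zero stride broadcasts a dimension: moving along it doesn't move the offset,
# which is why views with 0-strides (like the top st above) are not contiguous
assert flat_offset((123, 5), (1, 0)) == flat_offset((123, 0), (1, 0))
```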
nimlgen
08f47d7dc3
more info on failure 41 ( #5704 )
2024-07-25 12:14:28 +03:00
nimlgen
69d4f474d8
amd resnet pf ( #5703 )
2024-07-25 11:21:22 +03:00
nimlgen
1038482a66
enable hip tc ( #5702 )
2024-07-25 11:12:11 +03:00
qazal
5b38ff8679
shorter llvm and ptx rendering [run_process_replay] ( #5686 )
...
* src_dtype
* that's an upat
* the assert in vectorize is in type_verify
* uops asserts vectorizing a vectorize
* assert this
* for internal casts it's fine
2024-07-25 10:42:25 +03:00
chenyu
46e1151c02
UOp more generic mul -> mod folding ( #5698 )
2024-07-24 21:41:25 -04:00
chenyu
66a9c372af
UOp mod reduction ( #5697 )
2024-07-24 20:36:00 -04:00
George Hotz
489a5b99a5
hotfix: triton_nv_matmul touchups
2024-07-24 23:24:29 +00:00
chenyu
8648fb2636
UOp vmin/vmax on ADD ( #5689 )
2024-07-24 19:09:42 -04:00
qazal
e2e70bd90b
bring unbind back in Variable const ( #5687 )
...
* bring unbind back in Variable const
* this shows my experience with symbolic
2024-07-24 18:37:00 -04:00
nimlgen
b026312a31
nv ptx print log ( #5691 )
2024-07-24 21:40:58 +03:00
George Hotz
bf24be4c8c
triton gets 163 TFLOPS on 4090
2024-07-24 18:32:29 +00:00
chenyu
85710e86cb
UOps div folding ( #5690 )
...
#5689 , with just div folding and new test cases
2024-07-24 14:21:44 -04:00
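The div folding above rests on a simple affine identity for non-negative indices. A minimal sketch of that identity (an illustration only, not tinygrad's actual UOp pattern matcher; `fold_div` is a hypothetical helper):

```python
def fold_div(mul_const, add_const, div_const):
    """Fold (x*mul_const + add_const) // div_const into (coeff, const) when
    it reduces to a simpler affine expression; returns None if it can't."""
    # (x*(k*d) + r) // d == x*k  when 0 <= r < d and x >= 0
    if mul_const % div_const == 0 and 0 <= add_const < div_const:
        return (mul_const // div_const, 0)
    return None

# (x*8 + 3) // 4 folds to x*2 for non-negative x
assert fold_div(8, 3, 4) == (2, 0)
# sanity check against brute force
for x in range(100):
    assert (x * 8 + 3) // 4 == 2 * x
```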
chenyu
fb1b51811b
unify UOp min/max default [run_process_replay] ( #5688 )
...
* unify UOp min/max default [run_process_replay]
* fix that
2024-07-24 13:05:26 -04:00
George Hotz
33d44f00ae
first fold, then expand ( #5673 )
...
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-24 09:43:09 -07:00
qazal
b7b4c7844f
shorter BufferOps.LOAD creation ( #5685 )
2024-07-24 18:53:07 +03:00
qazal
365e7afd4d
make fusion deterministic ( #5684 )
...
* make fusion deterministic
* not this one yet
* line saving
2024-07-24 18:37:31 +03:00
nimlgen
2ea54176e2
docs: add more info on HCQProgram ( #5683 )
...
* docs: add more info on HCQProgram
* linter
* linter2
* one more type
2024-07-24 17:20:18 +03:00
nimlgen
baface413a
nv better nvdisasm fail message ( #5682 )
...
* nv better nvdisasm message
* cuda
2024-07-24 16:19:26 +03:00
qazal
37347528bf
shorter BufferOps.CONST creation ( #5681 )
2024-07-24 19:33:04 +08:00
qazal
6dcdff3bfd
share fusion behavior for r3 kernels ( #5680 )
...
* use groups
* this is the next one
* should check the whole graph
2024-07-24 19:07:10 +08:00
qazal
3ffb1059a0
scheduling infra for isolated dags ( #5679 )
...
* refactor to get_isolated_children
* move assign
2024-07-24 17:14:26 +08:00
chenyu
e6e2d86fcf
replace RANGE max fold with generic max fold ( #5676 )
2024-07-24 03:15:39 -04:00
chenyu
a7a77dfd83
UOp mul lt fold ( #5677 )
2024-07-24 02:49:25 -04:00
chenyu
67b036bdfd
generic UOp max folding ( #5675 )
2024-07-24 01:30:32 -04:00
chenyu
d1d81b359f
UOp compute min and max in one call [run_process_replay] ( #5674 )
...
easier to handle cases like *-1 that flip the bounds
2024-07-24 00:51:23 -04:00
chenyu
4e85761d40
UOp mod folding ( #5668 )
2024-07-24 00:10:47 -04:00
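The mod folding here is the companion identity to div folding: terms whose coefficient is a multiple of the modulus vanish. A sketch of the arithmetic (again plain Python, not tinygrad's rewrite rules; `fold_mod` is a hypothetical helper, assuming x >= 0):

```python
def fold_mod(mul_const, add_const, mod_const):
    """Fold (x*mul_const + add_const) % mod_const to a constant when the
    multiplied term is a multiple of the modulus; returns None otherwise."""
    # (x*(k*m) + r) % m == r % m  for x >= 0, r >= 0
    if mul_const % mod_const == 0 and add_const >= 0:
        return add_const % mod_const
    return None

# (x*12 + 5) % 4 folds to 1: the 12x term is a multiple of 4
assert fold_mod(12, 5, 4) == 1
for x in range(50):
    assert (12 * x + 5) % 4 == 1
```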
George Hotz
f46ba37f8f
increase amount of float2/float4 folding ( #5672 )
2024-07-23 20:52:56 -07:00
George Hotz
053550c3f3
remove MERGE opt, cleanup wmma upcast ( #5669 )
...
* remove MERGE opt, cleanup wmma upcast
* upcast first
* fix broken vectorize folding rule
2024-07-23 20:43:42 -07:00
George Hotz
918eebb1b1
simple TC change [run_process_replay] ( #5671 )
...
* Revert "Revert "this fails too""
This reverts commit 5de43e7073 .
* fix dont_expand_args
2024-07-23 20:11:31 -07:00
chenyu
3060e0be4f
add vmin vmax of SPECIAL ( #5670 )
...
* add vmin vmax of SPECIAL
folded stuff like (-1 < gidx0)
* flaky
2024-07-23 22:55:54 -04:00
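The folding described above works because a SPECIAL (like a gidx) has a known value range, so a comparison can be decided for the whole range at once. A self-contained sketch of that bound tracking (the `Special` class and `fold_lt` here are stand-ins, not tinygrad's actual UOps.SPECIAL or its vmin/vmax machinery):

```python
class Special:
    def __init__(self, name, n):
        # a SPECIAL like gidx0 ranges over [0, n), so vmin=0 and vmax=n-1
        self.name, self.vmin, self.vmax = name, 0, n - 1

def fold_lt(const, special):
    # const < special folds when the whole value range decides the comparison
    if const < special.vmin:
        return True   # e.g. (-1 < gidx0) is always true
    if const >= special.vmax:
        return False  # const is at or above the largest possible value
    return None       # can't fold: depends on the runtime value

gidx0 = Special("gidx0", 16)
assert fold_lt(-1, gidx0) is True    # the (-1 < gidx0) case from the commit
assert fold_lt(100, gidx0) is False
assert fold_lt(5, gidx0) is None
```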
George Hotz
5de43e7073
Revert "this fails too"
...
This reverts commit df20c4602a .
2024-07-24 02:20:53 +00:00
George Hotz
df20c4602a
this fails too
2024-07-24 02:19:55 +00:00
George Hotz
fa14f7b4fd
switch contract arg to match expand arg [run_process_replay] ( #5667 )
...
* switch contract arg to match expand arg [run_process_replay]
* support multiaxis contract too, it's easy
* cancel contract/expand
2024-07-23 18:08:33 -07:00
chenyu
ea99efe815
remove UOps lt pattern of booleans ( #5666 )
...
covered by the generic lt fold pattern
2024-07-23 20:11:21 -04:00
chenyu
e196640d71
more generic lt folding ( #5665 )
2024-07-23 19:50:59 -04:00
chenyu
7c8fe0fe47
skip interpolate tests for PYTHON=1 ( #5664 )
2024-07-23 18:47:15 -04:00
George Hotz
a85493bdbe
multiaxis contract test
2024-07-23 15:09:15 -07:00
George Hotz
e3f00ac77d
Fix cuda tc emu test ( #5663 )
...
* fix acc folding for NV tensor cores
* fix correctness of reduce_before_expand
* fix test emulated CUDA tensor cores
* test_gemm_fp16 on some devices
2024-07-23 15:04:25 -07:00
chenyu
c34f9db0f7
remove ptx PTXRenderer.gdim gid lid [run_process_replay] ( #5662 )
...
gdim is not used, gid and lid do not need to be attributes
2024-07-23 17:33:20 -04:00
chenyu
16c27ae400
update UOp.SPECIAL arg spec [run_process_replay] ( #5661 )
...
* update UOp.SPECIAL arg spec [run_process_replay]
from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable
* fix ptx
2024-07-23 16:58:12 -04:00
George Hotz
4d47968580
fix acc folding for NV tensor cores ( #5658 )
...
* fix acc folding for NV tensor cores
* fix correctness of reduce_before_expand
2024-07-23 13:03:02 -07:00
chenyu
01fe00e055
skip test_failure_39 in CI ( #5660 )
...
took more than 2 minutes in CI metal; it's basically the same as test_failure_37 but 20X bigger
2024-07-23 14:47:05 -04:00
chenyu
fdc72ba102
reorder UOps.DEFINE_VAR in runtime [run_process_replay] ( #5659 )
...
prep rewrite SPECIAL using DEFINE_VAR
2024-07-23 14:32:10 -04:00
chenyu
199b3bf02b
simple UOp lt/ge folding ( #5657 )
...
works if lhs is a DEFINE_VAR.
folds trivial x < -math.inf now, need to change SPECIAL to use DEFINE_VAR to fold more
2024-07-23 14:11:05 -04:00
qazal
b0fc5a4c6f
start scheduler process replay ( #5656 )
2024-07-23 20:02:51 +03:00
chenyu
e210c87b4a
uop mod-mod simplification ( #5650 )
2024-07-23 12:33:55 -04:00
nimlgen
1384f08cd4
hcq profile tests ( #5654 )
...
* profile tests
* fixes
* remove linter
2024-07-23 18:40:33 +03:00
qazal
5f394fc9c6
more work toward non-blocking process replay ( #5653 )
...
* non-blocking process replay
* more actionable
* test it
* revert the test
* %s/logging.warn/logging.warning
2024-07-23 14:26:31 +03:00