George Hotz
829262a5ee
add external_test_speed_theoretical
2024-07-26 17:45:22 -07:00
chenyu
5f168e7499
remove the optimization in AndNode.substitute ( #5747 )
...
was used in the old linearizer but is no longer needed. still need substitute because some fuzz tests call sym_infer on AndNode
2024-07-26 20:08:07 -04:00
kormann
c50e354936
NOp clean up any_len passing [run_process_replay] ( #5743 )
...
* clean allow_any_len
* min
2024-07-26 17:00:31 -07:00
George Hotz
db1d093b29
reenable LLaMA-3 8B BEAM on NV ( #5746 )
2024-07-26 16:56:41 -07:00
chenyu
c6b2d96474
minor uop uopgraph cleanups ( #5745 )
2024-07-26 19:23:48 -04:00
chenyu
3686b6726a
move GraphException to jit.py ( #5744 )
...
same place where GraphRunner is defined
2024-07-26 19:01:12 -04:00
kormann
a5ede535ef
NOp field name [run_process_replay] ( #5742 )
...
* rm def name
* add field name
2024-07-26 18:45:59 -04:00
chenyu
0d7d4dd731
UOp._min_max for MUL and MOD ( #5741 )
2024-07-26 18:38:10 -04:00
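For context, the usual interval rules here (a minimal sketch with assumed helper names, not tinygrad's `UOp._min_max`): a product's extremes are always among the four endpoint products, and a mod by a positive constant `c` on a nonnegative range is bounded by `[0, c-1]`, collapsing to the identity when the whole range already sits below `c`:

```python
def mul_min_max(a_min: int, a_max: int, b_min: int, b_max: int) -> tuple[int, int]:
    # the extremes of a product lie among the endpoint products
    products = [a_min*b_min, a_min*b_max, a_max*b_min, a_max*b_max]
    return min(products), max(products)

def mod_min_max(x_min: int, x_max: int, c: int) -> tuple[int, int]:
    # nonnegative operand, positive constant divisor assumed
    assert x_min >= 0 and c > 0
    if x_max < c: return x_min, x_max  # whole range below c: mod is the identity
    return 0, c - 1

assert mul_min_max(-2, 3, 4, 5) == (-10, 15)
assert mod_min_max(1, 3, 10) == (1, 3)
```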
George Hotz
c50e374bb6
multiple locals + get_kernel_modifier + fix valid ( #5739 )
...
* multiple locals + get_kernel_modifier + fix valid
* fix test pattern matcher
2024-07-26 15:10:10 -07:00
nimlgen
f6c0e17a2c
optimize symbolic-related updates in graphs ( #5727 )
...
* try
* faster
* cleaner
* better?
* better?
* cleaner
* fixes
* unused
* mypy
* fix clang
* remove comment
* better var names
* rename
* fix cuda
* rename
2024-07-27 00:57:59 +03:00
chenyu
dc7483ee6f
UOp simple div folding ( #5740 )
...
made UOp.divides return an Optional quotient and used it for simple div folding
2024-07-26 17:14:32 -04:00
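The idea behind `divides` as described (a toy sketch over a hypothetical tuple-based expression tree, not the real `UOp` API): return the quotient expression when every term divides exactly, else `None`, so a rewrite rule can replace `x // c` with the returned quotient directly:

```python
from typing import Optional, Union

Expr = Union[int, str, tuple]  # int const, named var, or ("add"|"mul", lhs, rhs)

def divides(expr: Expr, c: int) -> Optional[Expr]:
    # return expr // c when the division is exact, else None
    if isinstance(expr, int):
        return expr // c if expr % c == 0 else None
    if isinstance(expr, str):  # a bare variable is not known to be divisible
        return None
    op, lhs, rhs = expr
    if op == "add":
        l, r = divides(lhs, c), divides(rhs, c)
        return ("add", l, r) if l is not None and r is not None else None
    if op == "mul" and isinstance(rhs, int) and rhs % c == 0:
        return ("mul", lhs, rhs // c)
    return None

# simple div folding: (x*8 + 4) // 4  ->  x*2 + 1
assert divides(("add", ("mul", "x", 8), 4), 4) == ("add", ("mul", "x", 2), 1)
```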
chenyu
671259417f
reuse UOp __repr__ for NOp ( #5738 )
2024-07-26 16:59:55 -04:00
kormann
b0c1dba299
named UOp class "NOP" [run_process_replay] ( #5728 )
...
* NOP
* fix const + simplify compile
* rm VAR for NOOP
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-07-26 13:25:53 -07:00
George Hotz
4df46eac67
clean up tensor cores [run_process_replay] ( #5736 )
...
* clean up tensor cores [run_process_replay]
* remove tuple(wmma_sz), self.opts.device
* remove tls, leave DEVICE
2024-07-26 13:21:23 -07:00
qazal
94d578396f
separate process replay main loop ( #5734 )
...
* separate process replay main loop
* [run_process_replay]
* add kernel_changed
* test with [run_process_replay]
* revert temp [run_process_replay]
2024-07-26 21:43:08 +03:00
chenyu
9838c1a6ff
update import style in runtime ( #5735 )
2024-07-26 14:00:23 -04:00
chenyu
a4e9ebc68a
update test_uop_symbolic ( #5733 )
...
enabled more passing tests
2024-07-26 13:46:09 -04:00
George Hotz
5c688560bc
move CUDA/HIP compilers to their own files [run_process_replay] ( #5732 )
2024-07-26 10:00:15 -07:00
chenyu
2cc55a3095
UOp simple mul add div fold ( #5726 )
2024-07-25 22:00:30 -04:00
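A worked check of what such a fold typically means (assumed form, illustrative only): with `c > 0` and `0 <= y < c`, `(x*c + y) // c` folds to `x`, since the remainder term is dropped by floor division:

```python
# verify the assumed fold with c = 4 over a small range of x and y
for x in range(-4, 5):
    for y in range(4):  # 0 <= y < c
        assert (x*4 + y) // 4 == x
```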
chenyu
78f75aa80d
remove redundant symbolic mod rule [run_process_replay] ( #5725 )
2024-07-25 21:21:02 -04:00
chenyu
5521b6d437
UOp simple mul-add-lt fold ( #5721 )
2024-07-25 20:49:38 -04:00
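Likewise for the lt variant, a sanity check of the assumed rule: with `c > 0` and `0 <= y < c`, `x*c + y < c*d` folds to `x < d`, because the `+y` term can never carry past a multiple of `c`:

```python
# verify the assumed fold with c = 4, d = 2 over a small range
for x in range(-4, 5):
    for y in range(4):  # 0 <= y < c
        assert (x*4 + y < 4*2) == (x < 2)
```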
qazal
1b53207b4f
revert isolated dags scheduling ( #5724 )
2024-07-25 19:45:12 -04:00
chenyu
845b0d1c9d
UOp more generic div folding ( #5722 )
...
old: `x // c` can fold if `0 <= x.vmin <= x.vmax < c`
new: `x // c` can fold if `0 < c and x.vmin // c == x.vmax // c`
2024-07-25 17:49:14 -04:00
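A minimal check of the two conditions as stated above (`can_fold_old` and `can_fold_new` are hypothetical helpers written for illustration, not tinygrad code):

```python
def can_fold_old(vmin: int, vmax: int, c: int) -> bool:
    # old rule: x // c folds only when the whole range sits in [0, c)
    return 0 <= vmin <= vmax < c

def can_fold_new(vmin: int, vmax: int, c: int) -> bool:
    # new rule: x // c folds whenever every value in the range shares one quotient
    return 0 < c and vmin // c == vmax // c

# x in [4, 7], c = 4: the old rule rejects (vmax >= c), the new rule folds x // 4 -> 1
assert not can_fold_old(4, 7, 4) and can_fold_new(4, 7, 4)
```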
nimlgen
fb8148077e
hcq do not update the same signal ( #5719 )
...
* hcq do not update the same signal
* import them
2024-07-26 00:24:45 +03:00
nimlgen
6ec9ea9ddd
hcq update_exec with optional params ( #5708 )
2024-07-26 00:04:57 +03:00
George Hotz
8b34ee2f52
remove global_size and local_size from Kernel class [run_process_replay] ( #5720 )
...
* remove global_size and local_size from Kernel class [run_process_replay]
* sizes from the prg
2024-07-25 13:55:08 -07:00
George Hotz
142b7fb22f
faster beam [run_process_replay] ( #5718 )
2024-07-25 11:58:41 -07:00
chenyu
eff7c5fd2c
halve kernel counts in metal Fuzz Test linearizer ( #5716 )
...
the test time has increased to 3 minutes
2024-07-25 14:35:11 -04:00
George Hotz
e877ed9688
cleaner uop expand [run_process_replay] ( #5715 )
...
* cleaner uop expand [run_process_replay]
* comments
2024-07-25 11:29:53 -07:00
chenyu
a82815262c
more test_pattern_matcher fixups ( #5714 )
2024-07-25 14:12:21 -04:00
George Hotz
b8b5411845
move Function to Developer section of docs
2024-07-25 11:05:23 -07:00
qazal
f02124ffa0
rename to realize_reduceop ( #5713 )
...
* rename to realize_reduceop
* shorter comment
2024-07-25 20:57:33 +03:00
chenyu
05e02ddfb3
fixup test_pattern_matcher ( #5712 )
2024-07-25 13:48:52 -04:00
qazal
9ceb3a3d1f
beautiful_mnist -4.3% kernels ( #5709 )
...
* add is_complete
* partially delete forced_realized
* p2
* start
* refactor to can_group
* remove steps
* _get_inputs is nicer
* fix the cache
* cache is dict now
* rename to group
2024-07-25 20:30:49 +03:00
kormann
92eefab4b0
method alu ( #5711 )
...
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-25 13:25:38 -04:00
qazal
76877df518
map groupable children ( #5710 )
...
* map groupable children
* remove setitem
2024-07-25 19:27:48 +03:00
kormann
1e2eac755d
Fix repr upat ( #5705 )
...
* test
* fix
* x fix
* simpler
* rm extra space
2024-07-25 12:05:48 -04:00
qazal
1c992de257
hotfix: compare_schedule defaults to false ( #5707 )
2024-07-25 17:08:28 +03:00
qazal
489cda827a
more scheduler process replay tooling ( #5706 )
...
* more scheduler process replay tooling
* refactor to compare_schedule
2024-07-25 15:47:18 +03:00
qazal
4e070a2c89
start work on indexing fusion ( #5590 )
...
* start base
* the views add up
base reduceop st:
ShapeTracker(views=(View(shape=(60000, 1), strides=(1, 0), offset=0, mask=None, contiguous=True),))
top st:
ShapeTracker(views=(View(shape=(512, 6000, 1, 28, 28, 10), strides=(0, 1, 0, 0, 0, 6000), offset=0, mask=None, contiguous=False), View(shape=(512, 6000, 1, 28, 28, 10), strides=(47040000, 784, 0, 28, 1, 4704000), offset=0, mask=None, contiguous=False)))
merged buf.st+st:
ShapeTracker(views=(View(shape=(512, 6000, 1, 28, 28, 10), strides=(0, 1, 0, 0, 0, 6000), offset=0, mask=None, contiguous=False), View(shape=(512, 6000, 1, 28, 28, 10), strides=(47040000, 784, 0, 28, 1, 4704000), offset=0, mask=None, contiguous=False)))
* p1
* some cleanups
* more cleanups
* one kernel
* more
* late fuse arange
* less lines
* more work
* fix st strides 1
* update test_schedule, start argmax
* test_tiny_argmax
* add FUSE_ARANGE
* more cleanup
* add utils
* reduce merging
* fix axis and fold if needed
* more fusion
* need to figure this out
* now fixing all of these
* todos+save a line
* ready for p1
2024-07-25 13:23:38 +03:00
nimlgen
08f47d7dc3
more info on failure 41 ( #5704 )
2024-07-25 12:14:28 +03:00
nimlgen
69d4f474d8
amd resnet pf ( #5703 )
2024-07-25 11:21:22 +03:00
nimlgen
1038482a66
enable hip tc ( #5702 )
2024-07-25 11:12:11 +03:00
qazal
5b38ff8679
shorter llvm and ptx rendering [run_process_replay] ( #5686 )
...
* src_dtype
* that's a UPat
* the assert in vectorize is in type_verify
* uops asserts vectorizing a vectorize
* assert this
* for internal casts it's fine
2024-07-25 10:42:25 +03:00
chenyu
46e1151c02
UOp more generic mul -> mod folding ( #5698 )
2024-07-24 21:41:25 -04:00
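The identity behind this family of folds (assumed form, shown with Python's floor-mod semantics): a multiplier can be reduced modulo the divisor, so `(x*a) % b` becomes `(x*(a % b)) % b`, which folds to 0 outright when `a % b == 0`:

```python
# check the assumed identity over a small range of x
for x in range(-8, 9):
    assert (x*10) % 4 == (x*(10 % 4)) % 4  # multiplier reduced mod the divisor
    assert (x*12) % 4 == 0                 # divisor divides the multiplier: folds to 0
```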
chenyu
66a9c372af
UOp mod reduction ( #5697 )
2024-07-24 20:36:00 -04:00
George Hotz
489a5b99a5
hotfix: triton_nv_matmul touchups
2024-07-24 23:24:29 +00:00
chenyu
8648fb2636
UOp vmin/vmax on ADD ( #5689 )
2024-07-24 19:09:42 -04:00
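The ADD case is plain interval arithmetic (a minimal sketch with assumed names, not tinygrad's implementation): the bounds of a sum are the sums of the bounds:

```python
def add_min_max(a_min: int, a_max: int, b_min: int, b_max: int) -> tuple[int, int]:
    # the bounds of a sum are the sums of the bounds
    return a_min + b_min, a_max + b_max

# x in [0, 3], y in [2, 5]  ->  x + y in [2, 8]
assert add_min_max(0, 3, 2, 5) == (2, 8)
```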
qazal
e2e70bd90b
bring unbind back in Variable const ( #5687 )
...
* bring unbind back in Variable const
* this shows my experience with symbolic
2024-07-24 18:37:00 -04:00
nimlgen
b026312a31
nv ptx print log ( #5691 )
2024-07-24 21:40:58 +03:00