chenyu
2cc55a3095
UOp simple mul add div fold ( #5726 )
2024-07-25 22:00:30 -04:00
chenyu
5521b6d437
UOp simple mul-add-lt fold ( #5721 )
2024-07-25 20:49:38 -04:00
qazal
1b53207b4f
revert isolated dags scheduling ( #5724 )
2024-07-25 19:45:12 -04:00
chenyu
845b0d1c9d
UOp more generic div folding ( #5722 )
...
old: `x // c` can fold if `0 <= x.vmin <= x.vmax < c`
new: `x // c` can fold if `0 < c and x.vmin // c == x.vmax // c`
2024-07-25 17:49:14 -04:00
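The old and new conditions in the commit body above can be compared with a small sketch. The function names `fold_div_old` and `fold_div_new` are hypothetical and are not taken from the repository. The sketch assumes integer bounds and Python floor-division semantics for `//`, as written in the commit message.

```python
from typing import Optional

def fold_div_old(vmin: int, vmax: int, c: int) -> Optional[int]:
    # Old rule: fold x // c only when the whole range of x lies in [0, c),
    # in which case the quotient is always 0.
    return 0 if 0 <= vmin <= vmax < c else None

def fold_div_new(vmin: int, vmax: int, c: int) -> Optional[int]:
    # New rule: fold whenever both ends of the range land on the same quotient.
    # Floor division is monotonic for c > 0, so every x in between agrees.
    if 0 < c and vmin // c == vmax // c:
        return vmin // c
    return None

print(fold_div_old(8, 11, 4))   # None: range is not inside [0, 4)
print(fold_div_new(8, 11, 4))   # 2: 8 // 4 == 11 // 4 == 2
print(fold_div_new(-4, -1, 4))  # -1: negative ranges can fold too
print(fold_div_new(3, 4, 4))    # None: 3 // 4 == 0 but 4 // 4 == 1
```

Under these assumptions the new rule covers every case the old rule did (both fold to 0 when the range lies in `[0, c)`) and additionally handles ranges that do not start at zero.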
chenyu
a82815262c
more test_pattern_matcher fixups ( #5714 )
2024-07-25 14:12:21 -04:00
chenyu
05e02ddfb3
fixup test_pattern_matcher ( #5712 )
2024-07-25 13:48:52 -04:00
qazal
9ceb3a3d1f
beautiful_mnist -4.3% kernels ( #5709 )
...
* add is_complete
* partially delete forced_realized
* p2
* start
* refactor to can_group
* remove steps
* _get_inputs is nicer
* fix the cache
* cache is dict now
* rename to group
2024-07-25 20:30:49 +03:00
kormann
1e2eac755d
Fix repr upat ( #5705 )
...
* test
* fix
* x fix
* simpler
* rm extra space
2024-07-25 12:05:48 -04:00
qazal
1c992de257
hotfix: compare_schedule defaults to false ( #5707 )
2024-07-25 17:08:28 +03:00
qazal
489cda827a
more scheduler process replay tooling ( #5706 )
...
* more scheduler process replay tooling
* refactor to compare_schedule
2024-07-25 15:47:18 +03:00
qazal
4e070a2c89
start work on indexing fusion ( #5590 )
...
* start base
* the views add up
base reduceop st:
ShapeTracker(views=(View(shape=(60000, 1), strides=(1, 0), offset=0, mask=None, contiguous=True),))
top st:
ShapeTracker(views=(View(shape=(512, 6000, 1, 28, 28, 10), strides=(0, 1, 0, 0, 0, 6000), offset=0, mask=None, contiguous=False), View(shape=(512, 6000, 1, 28, 28, 10), strides=(47040000, 784, 0, 28, 1, 4704000), offset=0, mask=None, contiguous=False)))
merged buf.st+st:
ShapeTracker(views=(View(shape=(512, 6000, 1, 28, 28, 10), strides=(0, 1, 0, 0, 0, 6000), offset=0, mask=None, contiguous=False), View(shape=(512, 6000, 1, 28, 28, 10), strides=(47040000, 784, 0, 28, 1, 4704000), offset=0, mask=None, contiguous=False)))
* p1
* some cleanups
* more cleanups
* one kernel
* more
* late fuse arange
* less lines
* more work
* fix st strides 1
* update test_schedule, start argmax
* test_tiny_argmax
* add FUSE_ARANGE
* more cleanup
* add utils
* reduce merging
* fix axis and fold if needed
* more fusion
* need to figure this out
* now fixing all of these
* todos+save a line
* ready for p1
2024-07-25 13:23:38 +03:00
nimlgen
08f47d7dc3
more info on failure 41 ( #5704 )
2024-07-25 12:14:28 +03:00
nimlgen
69d4f474d8
amd resnet pf ( #5703 )
2024-07-25 11:21:22 +03:00
chenyu
46e1151c02
UOp more generic mul -> mod folding ( #5698 )
2024-07-24 21:41:25 -04:00
chenyu
66a9c372af
UOp mod reduction ( #5697 )
2024-07-24 20:36:00 -04:00
chenyu
8648fb2636
UOp vmin/vmax on ADD ( #5689 )
2024-07-24 19:09:42 -04:00
chenyu
85710e86cb
UOps div folding ( #5690 )
...
#5689 , with just div folding and new test cases
2024-07-24 14:21:44 -04:00
chenyu
a7a77dfd83
UOp mul lt fold ( #5677 )
2024-07-24 02:49:25 -04:00
chenyu
4e85761d40
UOp mod folding ( #5668 )
2024-07-24 00:10:47 -04:00
George Hotz
053550c3f3
remove MERGE opt, cleanup wmma upcast ( #5669 )
...
* remove MERGE opt, cleanup wmma upcast
* upcast first
* fix broken vectorize folding rule
2024-07-23 20:43:42 -07:00
chenyu
3060e0be4f
add vmin vmax of SPECIAL ( #5670 )
...
* add vmin vmax of SPECIAL
folded stuff like (-1 < gidx0)
* flaky
2024-07-23 22:55:54 -04:00
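A rough illustration of how bounds on a SPECIAL can let a comparison such as `(-1 < gidx0)` fold to a constant. This is a sketch, not the repository's code: `special_range` and `fold_lt` are hypothetical helpers, and the sketch assumes a SPECIAL whose arg is a `(name, size)` pair takes integer values from 0 to size - 1.

```python
from typing import Optional, Tuple

Range = Tuple[int, int]  # inclusive (vmin, vmax)

def special_range(arg: Tuple[str, int]) -> Range:
    # Assumption: a SPECIAL with arg (name, size) ranges over 0..size-1.
    _name, size = arg
    return (0, size - 1)

def fold_lt(lhs: Range, rhs: Range) -> Optional[bool]:
    # lhs < rhs is always true if the largest lhs is below the smallest rhs.
    if lhs[1] < rhs[0]:
        return True
    # lhs < rhs is always false if the smallest lhs is at or above the largest rhs.
    if lhs[0] >= rhs[1]:
        return False
    return None  # the ranges overlap, so the comparison cannot be folded

gidx0 = special_range(("gidx0", 4))
print(fold_lt((-1, -1), gidx0))  # True: -1 < gidx0 holds for every gidx0
print(fold_lt(gidx0, (4, 4)))    # True: gidx0 < 4 always holds
print(fold_lt(gidx0, (2, 2)))    # None: depends on the actual index
```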
George Hotz
fa14f7b4fd
switch contract arg to match expand arg [run_process_replay] ( #5667 )
...
* switch contract arg to match expand arg [run_process_replay]
* support multiaxis contract too, it's easy
* cancel contract/expand
2024-07-23 18:08:33 -07:00
George Hotz
a85493bdbe
multiaxis contract test
2024-07-23 15:09:15 -07:00
George Hotz
e3f00ac77d
Fix cuda tc emu test ( #5663 )
...
* fix acc folding for NV tensor cores
* fix correctness of reduce_before_expand
* fix test emulated CUDA tensor cores
* test_gemm_fp16 on some devices
2024-07-23 15:04:25 -07:00
chenyu
16c27ae400
update UOp.SPECIAL arg spec [run_process_replay] ( #5661 )
...
* update UOp.SPECIAL arg spec [run_process_replay]
from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable
* fix ptx
2024-07-23 16:58:12 -04:00
chenyu
01fe00e055
skip test_failure_39 in CI ( #5660 )
...
took more than 2 minutes in ci metal, it's basically the same as test_failure_37 but 20X bigger
2024-07-23 14:47:05 -04:00
chenyu
199b3bf02b
simple UOp lt/ge folding ( #5657 )
...
works if lhs is a DEFINE_VAR.
folds trivial x < -math.inf now, need to change SPECIAL to use DEFINE_VAR to fold more
2024-07-23 14:11:05 -04:00
qazal
b0fc5a4c6f
start scheduler process replay ( #5656 )
2024-07-23 20:02:51 +03:00
chenyu
e210c87b4a
uop mod-mod simplification ( #5650 )
2024-07-23 12:33:55 -04:00
nimlgen
1384f08cd4
hcq profile tests ( #5654 )
...
* profile tests
* fixes
* remove linter
2024-07-23 18:40:33 +03:00
qazal
5f394fc9c6
more work toward non-blocking process replay ( #5653 )
...
* non-blocking process replay
* more actionable
* test it
* revert the test
* %s/logging.warn/logging.warning
2024-07-23 14:26:31 +03:00
qazal
7cb67e6fb2
merge gated stores spec ( #5652 )
...
* test_unmerged_ifs should merge ifs
* test_tiny_gate_store
* test_merge_ifs_alt
* assert assert asserts
2024-07-23 18:53:27 +08:00
George Hotz
7c4b177e3a
add tests for uops stats ( #5649 )
...
* add tests for uops stats
* no locals skip is fine
* eh
2024-07-22 21:57:03 -07:00
chenyu
4f83da626e
uop symbolic simple mul mod ( #5648 )
2024-07-22 23:17:41 -04:00
chenyu
f2d2afdaa4
dumb linearizer example that max is not simplified ( #5644 )
...
* dumb linearizer example that max is not simplified
this might just get fixed once basic mod simplification is done
* need local
2024-07-22 18:37:26 -04:00
chenyu
24505199fb
UOp.const(x.dtype, y) -> x.const(y) [run_process_replay] ( #5642 )
2024-07-22 17:09:40 -04:00
chenyu
97b116bb1d
UOp mul div simplification ( #5637 )
...
* UOp mul div simplification
* != 0 is fine
2024-07-22 16:14:12 -04:00
nimlgen
26fc4610a0
amd more accurate cache management ( #5631 )
...
* amd more accurate cache management
* fix amd
* add memory_barrier + copies tests
* transfer test as well
* linter
2024-07-22 19:07:01 +03:00
Vyacheslav Pachkov
edc58e6b6e
hcq: remove duplicate allocation of kernel args by abstracting ( #5633 )
2024-07-22 18:29:41 +03:00
George Hotz
dc21e63bd2
test: put conv in one reduce ( #4441 )
...
* test: put conv in one reduce
* put reduce at the end
* more expand
* generic, and that expand was breaking things
* ratio
* don't undo the expand
* arg 1
* strides
* warning, for resnet
* warning removed
* disable cast
* handle cast
* op
* err, that's right
* fixup
* fix that
* a test to play with
* add double_reduces
* working up to final reshape
* fold the last reshape
* moved to schedule
* fix axis
* ci, need to bring arange back
* FUSE_CONV_BW maybe
* valid in 3.9
* test_expand_reduce_is_folded_on_different_axes
* add FUSE_CONV_BW=1
* test_fold_batchnorm_backward
* test_sgd_4convs_fuse
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2024-07-22 12:16:13 +03:00
George Hotz
386fb5e7f8
folding without UNMUL ( #5628 )
...
* folding without UNMUL
* fix failures, index_collapse
* import ReduceOps
* test_arange_4096 isn't folding
2024-07-21 20:14:44 -07:00
George Hotz
7f5282b2f5
tests if the linearizer is generating dumb code ( #5611 )
...
* tests if the linearizer is generating dumb code
* push consts to the end
* sort adds
* sorted add and mul
* this better
* simple expand/contract
* no math contract/expand
2024-07-20 20:36:32 -07:00
George Hotz
b399ccd6ef
BEAM bugfix, kernels dedup now ( #5617 )
...
* BEAM bugfix, kernels dedup now
* getenv is default
2024-07-20 19:43:50 -07:00
chenyu
92e7e65712
one more test case for symbolic mod mul ( #5615 )
2024-07-20 17:23:06 -04:00
qazal
3ab5fe4e1b
test argmax multireduce failure ( #5609 )
2024-07-20 21:33:03 +08:00
chenyu
b991097d41
move UPat and PatternMatcher from uopgraph.py to uops.py ( #5597 )
...
* move UPat and PatternMatcher from uopgraph.py to uops.py
towards instant UOps rewrite on UOp.alu
[run_process_replay]
* fix imports
2024-07-19 19:28:24 -04:00
George Hotz
2e617ca59e
lowerer img index ( #5592 )
2024-07-19 14:22:02 -07:00
nimlgen
b1782e3fef
hcq refactor signal into class ( #5575 )
...
* hcq refactor signal into class
* fix amd
* amd do not use amd_signal_t
* cleanup
* signal setter
* fix linter
* docs
* more docs + types
* fix types
2024-07-19 23:23:05 +03:00
George Hotz
d0ab20a5e5
careful memory counting (with tests to specify behavior) ( #5587 )
2024-07-19 11:37:34 -07:00
chenyu
37dd233650
always reverse global dim ( #5586 )
...
* always reverse global dim
* one more test
2024-07-19 13:58:05 -04:00