Commit Graph

5720 Commits

Author SHA1 Message Date
George Hotz
26498b322e add BEAM to external_benchmark_schedule.py 2024-08-23 18:10:46 -07:00
George Hotz
53a73038e3 hotfix: TestGraphRewriteEfficiency.test_create_many_uops 2024-08-23 15:51:57 -07:00
George Hotz
7c3ba3fa8a improve match stats + custom early reject [run_process_replay] (#6260)
* improve match stats [run_process_replay]

* custom_early_reject
2024-08-23 15:28:57 -07:00
George Hotz
0b0a8829fb allowed_len early stop [run_process_replay] (#6257)
* vectorize single rule [run_process_replay]

* allowed_len gate

* i mean, i guess i like the rule

* cleaner way to write that, and faster
2024-08-23 13:31:07 -07:00
George Hotz
a18744188f more early reject [run_process_replay] (#6254)
* simple matcher in alu [run_process_replay]

* never mind, i don't like simple matcher

* allowed_len == 0 is okay sometimes

* more generic matcher
2024-08-23 12:16:44 -07:00
qazal
0d4887e9df use UOps.WMMA everywhere (#6255)
* add UOps.WMMA_AXIS

* delete ReduceOps.WMMA from ops
2024-08-23 15:03:26 -04:00
chenyu
66d0b14a20 simpler CMPLT UOp _min_max [run_process_replay] (#6251) 2024-08-23 10:36:16 -04:00
chenyu
590c0922b6 Tensor.prod (#6250)
* Tensor.prod

a new reduce op!

* onnx ReduceProd
2024-08-23 10:06:32 -04:00
qazal
78d6bd8b41 start graph rewrite in the scheduler (#6248)
* start graph rewrite in the scheduler

* test: enable it

* test timings

* only fails in multi reduce

* more isolated tests
2024-08-23 13:15:55 +03:00
chenyu
75700edf73 minor bitcast touchup (#6246)
`not A == B` -> `A != B`
2024-08-22 20:25:28 -04:00
chenyu
4d40de867b remove redundant c1-(x+c2) rule [run_process_replay] (#6243) 2024-08-22 16:45:49 -04:00
George Hotz
238896ca02 looking into graph rewrite speed (#6239)
* looking into graph rewrite speed

* track, replace is slow

* if all same, no permutations [run_process_replay]

* types so compile works

* no implied comprehension

* TRACK_MATCH_STATS=2
2024-08-22 13:17:55 -07:00
chenyu
f62c4b3b5f remove redundant -(x*c) pattern [run_process_replay] (#6242)
covered by `x*c0*c1`
2024-08-22 16:11:02 -04:00
chenyu
e745e16441 remove UnaryOps.NEG (#6238)
* Remove UnaryOps.NEG

generated new dataset with
```
time JIT=2 PYTHONPATH=. ./extra/optimization/generate_dataset.sh
gzip /tmp/sops
mv /tmp/sops.gz extra/datasets/
```

* fix that
2024-08-22 14:21:39 -04:00
nimlgen
6c4ddd6260 hcq skip tests when no multidev (#6235)
* hcq skip tests when no multidev

* linter

* a bit higher timeout
2024-08-22 18:27:16 +03:00
chenyu
08539f08b0 fix UOp repr with Variable in arg (#6236) 2024-08-22 11:06:33 -04:00
chenyu
3fc8203475 remove NEG from handwritten ast in tests (#6234)
* remove NEG from handwritten ast in tests

* test_linearizer_failures
2024-08-22 09:06:59 -04:00
chenyu
1c5ef5b793 format test_linearizer_failure (#6231)
made it easier to remove NEG
2024-08-21 21:10:56 -04:00
George Hotz
5cdec79469 simpler expand without dont_expand_args [run_process_replay] (#6230)
* simpler expand without dont_expand_args [run_process_replay]

* Revert "simpler expand without dont_expand_args [run_process_replay]"

This reverts commit 81693024c097c31e601f1a199a631e9eda0d9638.

* exclude_args

* why does that fix it

* correct fix

* _swizzle_args should be fast

* add comment

* zip is tuples
2024-08-21 17:48:45 -07:00
nimlgen
78c94abe9c raise time limit for ci in test_profile_multidev_transfer (#6227) 2024-08-21 22:42:03 +03:00
gswangg
c74b318458 migrate test_linearizer.py to UOp AST, pt. 2 (#6228) 2024-08-21 22:16:11 +03:00
George Hotz
c3168952f0 wip: tracking pattern matcher [run_process_replay] (#6225)
* wip: tracking pattern matcher

* better

* proper dedup

* timing

* early reject

* mergeable match stats

* TrackedPattenMatcher

* fix TrackedPattenMatcher

* cleanups

* clean that too

* remove early_reject

* Revert "remove early_reject"

This reverts commit dc2aef14b8f5da58f5ec9566daf252513cac394c.

* total

* sort by time

* match_stats cleanup
2024-08-21 11:57:26 -07:00
chenyu
a666450e4d UOp pattern x + x -> x * 2 (#6224)
* UOp pattern x + x -> x * 2

now there's no NEG, with this it covers all kinds of a*x+b*x

* can remove x-x
2024-08-21 12:06:19 -04:00
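
The `x + x -> x * 2` rule in the commit above can be illustrated with a toy rewriter. This is a minimal sketch over a made-up tuple expression format; `rewrite` and the `(op, lhs, rhs)` encoding are assumptions for illustration and are not tinygrad's UOp or PatternMatcher API.

```python
def rewrite(expr):
    """Rewrite every `x + x` subexpression into `x * 2`, bottom-up.

    An expression is either a leaf (a name or a constant) or a tuple
    (op, lhs, rhs). This encoding is hypothetical, not tinygrad's UOp.
    """
    if not isinstance(expr, tuple):
        return expr  # leaves are returned unchanged
    op, lhs, rhs = expr
    # Rewrite the children first so nested matches are found too.
    lhs, rhs = rewrite(lhs), rewrite(rhs)
    if op == "+" and lhs == rhs:
        return ("*", lhs, 2)
    return (op, lhs, rhs)


simple = rewrite(("+", "x", "x"))
nested = rewrite(("+", ("+", "x", "x"), "y"))
untouched = rewrite(("+", "x", "y"))
```

Rewriting children before the parent means a match inside a larger expression is still folded, while sums of two different operands are left alone.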
chenyu
c9a9631818 no UnaryOps.NEG in generated UOp patterns (#6209)
* no UnaryOps.NEG in generated UOp patterns

removed pattern `x * (-1) -> -x` and `x != True`

* those are fine because NEG became CMPNE and True

* fix sd validation L2 norm
2024-08-21 11:08:22 -04:00
qazal
3b8cc5a3e0 more multireduce tests prep for neg removal [run_process_replay] (#6220) 2024-08-21 12:45:24 +03:00
qazal
86c036f0d3 reorder uops.py [run_process_replay] (#6219)
* reorder uops.py [run_process_replay]

* nop spacing
2024-08-21 11:39:55 +03:00
qazal
f03e5a4b3b test_multireduce const has a shape (#6218) 2024-08-21 11:02:45 +03:00
George Hotz
911bf7216c remove unused match rules [run_process_replay] (#6217) 2024-08-21 00:16:04 -07:00
George Hotz
2c42e9c2c6 faster rewrite, no folder in expand/reduce [run_process_replay] (#6216)
* faster rewrite, no folder in expand/reduce [run_process_replay]

* is removing the expander there okay

* parens

* don't reconstruct exact match uop

* fast do_reduce

* expand pyint

* most of the parents gains with less lines
2024-08-20 23:36:58 -07:00
George Hotz
16f420f7a7 split full_graph_rewrite and linearize_uop [run_process_replay] (#6215)
* split full_graph_rewrite and linearize_uop

* fix tests

* graph rewrite in test uops

* add types
2024-08-20 20:12:33 -07:00
George Hotz
9faf205601 CIFAR trainer + various bugfixes / improvements (#6146)
* move cifar into datasets

* support for pathlib Tensors, tar_extract, and fetch gunzip

* too early for Device.DEFAULT

* simpler hlb_cifar + .to(None) is default

* new compiler failure, start beautiful_cifar

* beautiful cifar runs but is broken

* jit train step

* cleaner

* std_mean, not mean_std

* more correct

* fast indexing

* don't print that

* torch load broken

* add eval

* nicer bar

* decorators are the way to do this

* bounds check the load

* a few ops

* batchnorm bugfix, if track_running_stats is False, use online estimate

* full timing

* fix fusion

* unneeded realize

* master tensor
2024-08-20 16:58:46 -07:00
George Hotz
296368f0dd Revert "delete arg from cast [run_process_replay] (#6202)" (#6214)
This reverts commit ec52a09393.
2024-08-20 16:45:30 -07:00
nimlgen
89c4cffd86 nv fix size in SET_SEMAPHORE_A (#6213) 2024-08-21 01:47:10 +03:00
qazal
ec52a09393 delete arg from cast [run_process_replay] (#6202) 2024-08-20 14:06:16 -07:00
Francis Lam
7376b67e36 extra/gemm/triton_nv_matmul: fix Program arguments (#6212)
remove op_estimate
2024-08-20 14:05:38 -07:00
madt2709
4bb98d8882 Fix track_running_stats in batchnorm (#6200)
* Fix track_running_stats in batchnorm

* Fix linter

* Update test_fold_conv_batchnorm_notrain to keep allowed at 1

* Add test_fold_conv_batchnorm_notrain_no_running_stats

* Save 1 line
2024-08-20 14:01:22 -07:00
George Hotz
d9c62a33c3 add cifar to datasets.py (#6210) 2024-08-20 11:42:49 -07:00
George Hotz
a5d79688db fix indexing out of bounds (#6208)
* fix indexing out of bounds

* 5 ops per access is fine
2024-08-20 11:34:56 -07:00
chenyu
4451bcaf95 update test_arange test_llama_embedding_opt (#6207)
non CI uses larger embedding, still same orders of magnitude
2024-08-20 13:58:43 -04:00
ignaciosica
e4bb63c1be Refactor amd kernel prefix (#6205)
* refactor amd kernel_prefix

* restore removed comment

* nit
2024-08-20 10:37:36 -07:00
qazal
074cf780dd add option to only benchmark schedule [run_process_replay] (#6204) 2024-08-20 16:51:27 +03:00
Francis Lata
8fd8b970b0 update URL to eval cases from recent MLPerf file movements (#6201) 2024-08-20 08:43:13 -04:00
gswangg
0e6f057eae migrate test_linearizer.py to UOP AST (pt. 1) (#6150)
* migrate test_multioutput to UOP AST

* inline buf declarations

* migrate test_multireduce to UOp AST

* update test_mid_dim_multireduce to UOp AST

* update test_triple_multireduce with UOp AST

* make global definitions more concise

* update test_double_reduce_multireduce with UOp AST

* update test_multireduce_with_parallel with UOp AST

* update test_multiout_multireduce to UOp AST

* make gidx style consistent across updated tests

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-08-20 10:02:20 +03:00
chenyu
10330a41c7 add CMPNE tests in test_uops (#6196)
fixed the output_dtype for CMPNE and match the tests for CMPLT
2024-08-19 19:41:21 -04:00
chenyu
21d6739237 remove UnaryOps.NEG from lazy.py (#6193)
* remove UnaryOps.NEG from lazy.py

* neg is no longer unary
2024-08-19 18:41:28 -04:00
chenyu
4d1b5781b5 remove UnaryOps.NEG from function.py (#6187)
* remove function.Neg

prep to remove UnaryOps.NEG

* replace all NEG in function.py
2024-08-19 17:39:15 -04:00
nimlgen
bc44e6501b _gpu_alloc -> allocator.alloc (#6189)
* _gpu_alloc -> allocator.alloc

* not needed this import

* pylint
2024-08-19 23:34:22 +03:00
chenyu
96d502d8b7 update function.Max backward (#6190)
instead of `(1-(x!=max))`, use `(x!=max)!=True`.

prep to remove UnaryOps.NEG, also this can be instruction fused later more easily
2024-08-19 16:06:14 -04:00
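
The two mask expressions named in the commit above select the same elements. The following is a plain-Python sketch of that equivalence on a list of numbers; `mask_arith` and `mask_cmp` are hypothetical helper names, and the code does not use tinygrad tensors.

```python
def mask_arith(values, maximum):
    # Old form: 1 - (x != max), with the boolean treated as 0 or 1.
    return [1 - int(v != maximum) for v in values]


def mask_cmp(values, maximum):
    # New form: (x != max) != True, which needs no subtraction or negation.
    return [int((v != maximum) != True) for v in values]


xs = [3, 7, 7, 1]
m = max(xs)
arith = mask_arith(xs, m)
cmp_only = mask_cmp(xs, m)
```

Both forms yield 1 where the element equals the maximum and 0 elsewhere; the second uses only comparisons.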
Gabe Caldwell
bdd6325f31 default num_classes value for one_hot (#6182)
* num_classes=-1

If num_classes set to -1, the number of classes will be inferred as one greater than the largest class value in the input tensor.

* num_classes desc

comment to explain num_classes default and what that means.

* replacing ' with `
2024-08-19 12:07:14 -07:00
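
The `num_classes=-1` default described in the commit above can be sketched in plain Python. This `one_hot` is a hypothetical stand-in operating on lists of non-negative integer labels, not the tinygrad method itself.

```python
def one_hot(labels, num_classes=-1):
    """One-hot encode a list of non-negative integer class labels.

    With num_classes == -1, the class count is inferred as one greater
    than the largest label, as the commit message describes.
    """
    if num_classes == -1:
        num_classes = max(labels) + 1
    return [[1 if c == label else 0 for c in range(num_classes)]
            for label in labels]


inferred = one_hot([0, 2, 1])
explicit = one_hot([0, 2], num_classes=5)
```

Passing `num_classes` explicitly still matters when the input might not contain the largest class, since inference would then produce rows that are too short.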
Alessandro Benetti
9328248610 support for std_mean and cross_entropy (#6181)
* support for std_mean and cross_entropy (#3)

* Cross entropy and std mean support

* remove extra examples
2024-08-19 12:06:44 -07:00