Commit Graph

10633 Commits

Author SHA1 Message Date
chenyu
5eb8001514 minor cleanup in jit (#4989)
found a non-deterministic bug in jit with multiple variables. but first cleanup some variable names.
[run_process_replay]
2024-06-15 23:43:17 -04:00
chenyu
44dfa37c70 use threefry in stable diffusion benchmark (#4988)
also updated default steps to 10. easier to tell the image is following the prompt.
2024-06-15 20:25:29 -04:00
chenyu
20b50d8d64 doc: manual_seed (#4987)
there was a docstring, just not linked to the doc page. also updated the example to show re-seed instead of an internal variable
2024-06-15 19:57:26 -04:00
wozeparrot
ce1ed374c9 more tinychat fixes (#4971) 2024-06-15 16:29:39 -07:00
chenyu
50bc14d186 re-enable test that loads torch pkl format (#4986) 2024-06-15 14:11:30 -04:00
qazal
ff8e9eefc3 hotfix: don't use ASSERT_COMPILE for benchmarks process replay (#4981)
* use replay_codegen [run_process_replay]

* disable for now [run_process_replay]
2024-06-15 16:57:47 +03:00
uuuvn
92f49efd06 Trigger process replay from pull request title [run_process_replay] (#4980)
* Trigger process replay from pull request title

* idk how this thing works btw

* test if it will work

* try 2

* Revert "idk how this thing works btw"

This reverts commit 580da51b07.

* Revert "try 2"

This reverts commit 7ff1e86d5d.

* test if it works

* meh

* Reapply "idk how this thing works btw"

This reverts commit dd33ad7c14.

* revert
2024-06-15 16:21:00 +03:00
uuuvn
033fb53f9e Incomplete/buggy rule breaks process replay on #4976 (#4978)
* Incomplete/buggy rule breaks process replay on #4976

* test passes

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-06-15 15:18:35 +03:00
qazal
d91f0ee85b add regression test for the neg folding pattern (#4979) 2024-06-15 15:08:28 +03:00
nimlgen
dfadf82e10 hcq optimize enqueue time (#4973)
* hcq optimize enqueue time

* linter
2024-06-15 10:47:25 +03:00
chenyu
5f7dd74655 docs: update wording for unflatten (#4974)
it was using `Expands`, the same in torch doc, but we also have expand so it's confusing
2024-06-14 23:12:41 -04:00
Cyril Roumégous
efbf4fca05 perf: graph_rewrite line reduction and make it a little bit faster [run_process_replay] (#4958) 2024-06-14 16:37:27 -07:00
wozeparrot
8209cd3c55 easier llama3 + fetch subdir (#4938) 2024-06-14 13:47:27 -07:00
chenyu
64cda3c481 raise TypeError calling len() on a 0-d tensor (#4970)
matched numpy and torch
2024-06-14 16:34:27 -04:00
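The behavior described in the commit above can be sketched in plain Python. `ShapedStub` and `try_len` are hypothetical names used only for illustration; this is not tinygrad's `Tensor`, just a minimal object whose `__len__` raises `TypeError` when the shape is empty (0-d) and otherwise returns the size of the first dimension.

```python
class ShapedStub:
  """Hypothetical stand-in for a tensor-like object; not tinygrad's Tensor."""
  def __init__(self, shape):
    self.shape = tuple(shape)

  def __len__(self):
    # a 0-d object has an empty shape, so there is no first dimension to report
    if not self.shape:
      raise TypeError("len() of a 0-d tensor")
    return self.shape[0]

def try_len(obj):
  # helper that turns the exception into a string so both cases can be printed
  try:
    return len(obj)
  except TypeError as e:
    return f"TypeError: {e}"

print(try_len(ShapedStub((3, 4))))  # length of the first dimension
print(try_len(ShapedStub(())))      # the 0-d case raises TypeError
```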
chenyu
67e8df4969 remove numpy from dtype (#4969)
replaced all dtype.np with _to_np_dtype defined in tensor.py.

after this, the only numpy usages are (1) Tensor(np.ndarray), (2) construct .numpy() output, (3) numpy random buffer
2024-06-14 15:38:45 -04:00
wozeparrot
62dc36d371 autogen _try_dlopen (#4949) 2024-06-14 12:12:18 -07:00
qazal
3e297d8216 delete Linearizer.const [run_process_replay] (#4967) 2024-06-14 21:51:37 +03:00
chenyu
118c9fe468 Tensor._fromcpu -> Tensor._fromnp (#4966)
and moved to constructor with np.ndarray
2024-06-14 14:33:43 -04:00
wozeparrot
2a974ff257 fix: no readablestream await of, too new (#4965) 2024-06-14 11:22:19 -07:00
nimlgen
9436cd4551 hcq add memory_barrier (#4963)
* hcq add memory_barrier

* fix nv
2024-06-14 21:02:55 +03:00
chenyu
dae1c8abe2 create Tensor from bytes without numpy (#4964) 2024-06-14 13:37:27 -04:00
chenyu
5eee974b2a construct Tensor from python list/tuple directly (#4947)
* construct Tensor from python list/tuple directly

no numpy. annoying that half memoryview is 3.12 feature...

* simpler, and test

* flat already

* simpler

* cute

* 10% faster

* 5%
2024-06-14 11:36:05 -04:00
geohotstan
90332eb529 Getitem pin None dimension (#4960)
* fix

* remove torch out of bounds test

* 1 more test case
2024-06-14 10:48:59 -04:00
qazal
2eeddf1a46 IF ends with STORE, RANGE ends with PHI [run_process_replay] (#4953) 2024-06-14 16:00:32 +03:00
George Hotz
d5a92b9b83 sort the axis in reduce op [run_process_replay] (#4956) 2024-06-14 05:16:05 -07:00
George Hotz
14189bca68 graph_dedup function [run_process_replay] (#4955) 2024-06-14 04:24:37 -07:00
George Hotz
63a8add2c2 move uops add logic to linearize (#4952)
* move logic to linearize

* idk how this should work

* empty
2024-06-14 03:52:37 -07:00
qazal
7e32b8c930 refactor generic UOps.END* insertion (#4951)
* merge loops children

* rename to scope_children

* refactor ends

* merge with ends [run_process_replay]
2024-06-14 13:42:41 +03:00
George Hotz
9823752397 make uops.add private (#4950)
* make uops.add private

* modernize all tests
2024-06-14 03:23:25 -07:00
Jhenner Tigreros
dc9e9e4363 Convert BinaryOps.DIV to UnaryOps.RECIP and BinaryOps.IDIV (#4887)
* Create UnaryOps.RECIP and BinaryOps.IDIV and changing uses of BinaryOps.DIV

* Delete unused import

* Add cstyle renderer

* Fix formatting text

* Fix test error due to bad implementation of renderer

* Add PTX support

* Add RECIP to LLVMIR

* Remove BinaryOps.DIV from symbolic test

* Change some test and fix C floor division

* Change references to DIV for the RECIP or IDIV

* Add mimic idiv for symbolic test

* Restore floor

* Mimic idiv

* cast to int

* Fix some test and renderer

* Remove DIV for render nodes

* Resolve issue with div

* Add TestRenderer

* Fix test

* fix error

* Fix PAD test

* Fix div implementation

* Remove DIV

* Add upcast to rshift, due to use of MUL and RECIP on DIV

* Fix linter

* Remove complete BinaryOps.DIV

* Fix lint

* Fix some test

* Revert mul modification

* Fix tests

* Fix CLANG for uops

* Revert IDIV function

* Minor fix

* modify pattern matching rule to support nan

* Fix UNSAFE_PADS_OPS to add UnaryOps.RECIP

* Remove const folding for IDIV and fix PTX

* Complete remove IDIV from extra

* Remove test_div from TestFloatUOps due to test on recip

* Fix linearizer

* fix

* Fix test_22

* Fix llvm

* Apply trunc function for llvmlit

* use floor instead of trunc

* Use correct type

* Generate new fuzz db

* Fix rshift, do not cast to float to support idiv

* Return upcast=false to rshift

* Add to unsafepad BinaryOps.IDIV

* Remove RECIP override for CUDA

* add atol / rtol for the test

* Remove cast to int on IDIV

* Regenerate sops

* delete sops.gz

* regenerate

* regenerate

* regenerate

* Reduce margins

* pass atol and rtol as parameters for _test_metrics

* regenerated dataset

* Regenerate

* Remove duplicated

* Revert changes on extra

* Remove changes extra and NOQA for test

* Remove E501

* Remove and change line

* Remove E501

* Fix atan2

* Revert import and E501

* Remove E501

* Add hrcp to half ops

* Remove 1 of hrcp

* Remove last DIV and add type check on uops for IDIV

* Fix new tests

* Fix tests and custom function

* Regenerate dataset

* Regenerate dataset

* Revert dataset

* Change generate dataset script

* Remove line

* Change IDIV, type checker validate if x,y and z are int

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-06-14 02:43:46 -07:00
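A rough sketch of the idea in the commit above: float division can be expressed as a multiply by a reciprocal, while integer division stays a separate op. The function names below are hypothetical and are not tinygrad's ops or renderer code; only the float path is shown, because the log does not state the rounding mode chosen for IDIV.

```python
import math

def recip(x: float) -> float:
  # stand-in for a unary reciprocal op (hypothetical name)
  return 1.0 / x

def div_via_recip(a: float, b: float) -> float:
  # float division rewritten as a * recip(b)
  return a * recip(b)

# the rewritten form can differ from a / b in the last bit,
# so compare with a tolerance rather than with ==
a, b = 10.0, 3.0
print(math.isclose(div_via_recip(a, b), a / b, rel_tol=1e-12))
```

The possible last-bit difference is one reason a tolerance-based comparison (as in the "add atol / rtol for the test" step) is a natural fit for tests of the rewritten form.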
SnakeOnex
f87ba6016a tqdm total=0 fix (#4939)
* fixes

* fixes

* removed auto loop closing

* one line shorter
2024-06-14 02:31:59 -07:00
nimlgen
225f792330 amd hdp flush regs are on seg2 (#4925) 2024-06-14 01:42:23 +03:00
nimlgen
4bfd1904f6 nv do not modify prg's qmd (#4948) 2024-06-14 01:15:40 +03:00
chenyu
845c10bc28 add Node to _broadcasted type annotation (#4946) 2024-06-13 14:10:56 -04:00
chenyu
287d3c3b84 support list, tuple input in dtypes.from_py (#4945)
* support list, tuple input in dtypes.from_py

and used it to infer dtype from python list and tuple in Tensor constructor.

* fix tests
2024-06-13 13:38:06 -04:00
chenyu
7aecea4f56 support creating Tensor from python tuple (#4944)
added a small fuzzer to test data with mixed tuple and list of numbers matched with numpy
2024-06-13 12:18:37 -04:00
chenyu
74586bc339 fix getitem with leading None (#4943)
i think all None handling can be unified and remove the calc_dim in advanced indexing
2024-06-13 11:23:40 -04:00
George Hotz
e63701fbd4 RDNA3 assembly support (#3637)
* amazing that i can use comgr for this

* compile empty kernel

* cleanups

* tiny_add compiles

* ugh

* more work

* put that in extra
2024-06-13 09:09:24 +02:00
nimlgen
fd071ba27e amd mockgpu correct timer resolution (#4942)
* amd mockgpu correct timer resolution

* test it
2024-06-13 10:07:34 +03:00
chenyu
fae08c4d48 fix Tensor.tril / Tensor.triu with boolean input (#4941)
`where(self, 0)` incorrectly upcasted the output. `where(self, False)` is correct but looks unnatural, so added a cast at the end. Pattern matcher can fold the cast into where branches
2024-06-12 20:16:13 -04:00
chenyu
cc90b3ef9f simpler Tensor.gather (#4940)
get rid of some confusing transpose and permute, and the if condition on dim. Saved a kernel for each dim != 0 case in test by removing the dangling transpose at the end
2024-06-12 19:42:40 -04:00
George Hotz
fa00ef66fd Update README.md 2024-06-13 00:29:19 +02:00
chenyu
eb0f5b5660 failed test case for getitem with leading Nones (#4936)
* failed test case for getitem with leading Nones

torch matched numpy so tinygrad is incorrect.
another repro
```
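import numpy as np
import torch
from tinygrad import Tensor

# numpy reference
t = np.arange(12).reshape((3, 4))
print(t[None, None, np.array([1, 2])])

# torch matches numpy
t = torch.arange(12).reshape((3, 4))
print(t[None, None, torch.tensor([1, 2])].numpy())

# tinygrad (the failing case)
t = Tensor.arange(12).reshape(3, 4)
print(t[None, None, Tensor([1, 2])].numpy())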
```

* # noqa
2024-06-12 16:19:42 -04:00
Elias Wahl
d2e3c391e8 Residual in MLM loss + Change default steps (#4935)
* Residual in mlm loss

* Reduce default steps to 160K * 24

* oops

* comment
2024-06-12 16:09:18 -04:00
chenyu
a21ea165bc skip linearizer test_failure_22 on llvm (#4937)
getting flaky recently
2024-06-12 16:03:38 -04:00
chenyu
27903c5ed5 minor minor Tensor.__getitem__ cleanup (#4934)
more consistent variable names and update comments before next minor cleanup that touches logic
[run_process_replay]
2024-06-12 15:08:18 -04:00
chenyu
5e6336edda minor Tensor.gather cleanup (#4933)
`permarg[i]` is just `i`, and break the big return into two lines.
[run_process_replay]
2024-06-12 13:57:28 -04:00
Timmy
720c700a8a Multireduce-Kernels: Linearizer Changes and Tests (#4259)
* basic tests

* cleanup

* pylint

* ruff

* use define acc as a proxy for rendered reductions

* use define acc as a proxy for rendered reductions

* recursive reduceop rendering via ast_parse

* linters + cleanup

* fixing late buf loading

* plus linters

* removing extra line

* linters

* does this break ci?

* added tests and if add end change

* typo in add_ends

* linters

* removing comments

* allow endifs to be inserted before the end of the graph

* find add ENDIF before next BARRIER

* removing tests with manual ENDIF + linters

* specifically the next barrier after the store of the local result

* Revert "specifically the next barrier after the store of the local result"

This reverts commit b288a5c3ce.

* keeping up to date

* linters + merge changes

* cleaning up old bad decisions

* linters and opts

* merged linearizer tests

* fixing merge issues

* removing the big ugly uop test (functionality tested end-to-end by test_linearizer additions)

* small diff fixes

* updating linearizer to work without uops.add( ... cachable)

* linters

* comment in multireduce tests

* skipping tests without locals

* full tests

* linters

* load_cache[key] fix for multiple accs

* linters

* assert only one reduceop

* fix loop_scope test to actually cause an issue

* self.load_cache[key] key for DEFINE_ACC changed to use a string to make sure each acc is unique

* updated tests

* fixing merge

* removing debug prints

* complete merge fix

* linters

* diff cleanup

* adding tests in

* give each reduce its own local buffer

* gpu=1 changes

* store and load locals with upcasting

* modifying test?

* make multireduce_netsted_local_upcast test match single reduce shapes

* removing todo

* cleaning up the diff

* unroll test

* unroll and upcast tests

* fix gpu

* seq and self.load_cache[key] cleaning

* linters

* padto works

* merge fixes

* fixes

* add skips for amd

* linters + seq

* cleaning & more tests

* softmax tests

* linters

* [run_process_replay]

* add new tests back

This reverts commit 19dec22e01.

* more hardcoded -1s

* fix ptx

* Fix name for loop in ptx

* cleaning up the diff

* cleaning up the uops diff

* nv ci is too slow

---------

Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: Szymon Ożóg <58388001+SzymonOzog@users.noreply.github.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-06-12 13:29:43 -04:00
Nicklas Boman
6e86472cd6 fix typing for test to run in py38 (#4930) 2024-06-12 13:22:30 -04:00
chenyu
1326f29e24 fix Tensor.gather shape checking criteria (#4932)
it's fine if `self.shape[d] >= index.shape[d]` for all `d != dim`, not for all `d`
2024-06-12 13:10:14 -04:00
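The criterion in the commit above can be sketched as a standalone check. `gather_shapes_ok` is a hypothetical helper, not tinygrad's implementation; it assumes `dim` is non-negative and that both shapes must have the same number of dimensions.

```python
def gather_shapes_ok(self_shape, index_shape, dim):
  # assumption: both shapes need the same number of dimensions
  if len(self_shape) != len(index_shape):
    return False
  # require self.shape[d] >= index.shape[d] for every d except the gather dim
  return all(s >= i for d, (s, i) in enumerate(zip(self_shape, index_shape)) if d != dim)

print(gather_shapes_ok((3, 4), (5, 4), 0))  # index is larger only along dim
print(gather_shapes_ok((3, 4), (3, 5), 0))  # index is larger along a non-gather dim
```

Skipping `d == dim` is the point of the fix: the index may be longer than the source along the gathered dimension, but not along any other.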