Commit Graph

2019 Commits

Author SHA1 Message Date
George Hotz
c13da83f12 tests from lowerer branch (#5339)
* tests from lowerer branch

* Update test_image_dtype.py

* Update test_image_dtype.py

* Update test_image_dtype.py
2024-07-08 21:23:19 -07:00
chenyu
4ceab5d2b1 fix PTX match rule for gated LOAD (#5338)
* test padto sum with bool tensor and bool acc dtype

make sure bool tensor acc with gate is handled correctly

* broken in PTX

* fix ptx
2024-07-08 22:25:03 -04:00
chenyu
a80f2df1bd fix some PTX tests (#5337)
fix broken PTX tests in test_linearizer and test_uops. there were tests that were skipped and broken because they only ran with CUDA=1, and we run PTX with NV=1 now
2024-07-08 21:33:05 -04:00
wozeparrot
9150a6be7a tensor metadata (#5271) 2024-07-08 17:45:40 -07:00
chenyu
0f0940225a fix Tensor.all and Tensor.any for PTX (#5335)
supported boolean acc and boolean phi, and rewrote boolean max to uint8 max
2024-07-08 18:15:04 -04:00
kormann
2349d837fb Fix scope order in graph toposort [run_process_replay] (#5330)
* fix

* test

* nothing
2024-07-08 11:46:15 -07:00
Timmy
bb7746985f multireduce scheduler tests (#5141)
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-07-08 20:28:55 +03:00
chenyu
6856f915d6 Tensor.any and Tensor.all (#5320)
does not work in PTX yet due to how boolean tensors are handled
2024-07-07 14:36:00 -04:00
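A minimal usage sketch of the new reductions, assuming they behave like tinygrad's other NumPy-style Tensor reduce methods (True if any/all elements are truthy); the values are illustrative, not taken from the PR:

```python
from tinygrad import Tensor

t = Tensor([[1.0, 0.0], [2.0, 3.0]])
print(t.any().item())        # True: at least one element is nonzero
print(t.all().item())        # False: one element is zero
print((t > 0).any().item())  # boolean tensors reduce the same way
```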
chenyu
2029cb7047 support passing None to Tensor.clip (#5319)
passing None means no upper bound or no lower bound
2024-07-07 13:04:22 -04:00
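A short sketch of the new behavior; the input values below are illustrative, not taken from the PR:

```python
from tinygrad import Tensor

t = Tensor([-2.0, 0.5, 3.0])
print(t.clip(0.0, 1.0).numpy())   # both bounds:    [ 0.   0.5  1. ]
print(t.clip(None, 1.0).numpy())  # no lower bound: [-2.   0.5  1. ]
print(t.clip(0.0, None).numpy())  # no upper bound: [ 0.   0.5  3. ]
```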
chenyu
c1e330f302 Tensor.int and Tensor.bool (#5317) 2024-07-07 11:52:58 -04:00
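A hedged sketch of the new casts, assuming they are thin wrappers around the existing cast machinery like Tensor.float and Tensor.half:

```python
from tinygrad import Tensor

t = Tensor([0.0, 1.5, -2.0])
print(t.int().dtype, t.int().numpy())    # cast to the default int dtype, fractional parts dropped
print(t.bool().dtype, t.bool().numpy())  # nonzero -> True, zero -> False
```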
qazal
ae10e936e7 UOps.VECTORIZE cleanups [run_process_replay] (#5314)
* still render_cast

* one extra line ok

* these are all just vectorize

* save space

* behavior change can go in a different diff
2024-07-07 10:49:08 +03:00
greg-niemeyer
77b2ce9fc9 Add UOps.VECTORIZE [run_process_replay] (#5289)
* Add UOps.VECTORIZE to core

* Update vectorized cast tests

* Addresses code review comments

- Removes VECTORIZE from LLVMRenderer
- Add line breaks to unduly long lines
- Add noop CAST rule back
- Update asserts and add render_vectorize in
  CStyleLanguage renderer

* Add missing const folding rule for VECTORIZE

Also adds corresponding test

* Fixes test_const_vectorize_fold and add assert

- Use sane types with VECTORIZE in test_const_vectorize_fold
- Add assert that sanity checks the types for VECTORIZE

* Rename test_cast_vectorized_fold

Renames test_cast_vectorized_fold to test_noop_vectorize_fold
because the test targets a very specific rule and there are
other tests for VECTORIZE.

* Revert unrelated changes

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
2024-07-07 09:59:57 +03:00
qazal
8a99514462 generalize the uops toposort spec to ptx (#5309)
* generalize spec to ptx

* redundant assert

* extra print
2024-07-07 00:06:30 +03:00
chenyu
ca0ef1700b use precise::sin in metal (#5307) 2024-07-06 12:47:27 -04:00
qazal
d813617742 prescheduling refactor (#5300)
* p1

* refactor tuple
2024-07-06 12:04:03 +03:00
qazal
c1e166c08a fix dtype mismatch for bool ops in multi (#5299) 2024-07-06 11:36:40 +03:00
chenyu
fc03fc025e enable sin on METAL in test_dtype_alu (#5298) 2024-07-05 14:52:09 -04:00
qazal
b369e75ed0 refactor schedule creation (#5297) 2024-07-05 21:14:38 +03:00
qazal
5292d37db6 LoadOps.VIEW in the scheduler spec (#5296)
* refactor to allow_buffer_view

* tests

* fix multi
2024-07-05 19:43:50 +03:00
hikettei
1ab7a4cff0 Handling Multiple UnaryOps.BITCAST in Function for Proper Kernel Fusion [run_process_replay] (#5172)
* [Patch] added an option not to ignore view replacing when doing bitcast

* added the testcase

* [Add] reproduced in the unittest that bitcast cannot be fused into a single kernel

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-07-05 19:16:44 +03:00
qazal
1cefbb33ab uop graph tests + type_verify cleanup (#5292)
* test_cast_alu_fold

* test_double_cast_fold + these should assert
2024-07-05 13:00:01 +03:00
chenyu
f1ff65e763 remove "no-nans-fp-math"="true" for LLVM (#5282)
fixed isnan for llvm (still an issue with < nan)
2024-07-03 17:52:50 -04:00
chenyu
3929a9dc94 fix UOp.cmp_tuple for ALU (#5280)
* fix UOp.cmp_tuple for ALU

for ALU, use self.arg instead of self.op to compare

* skip that?
2024-07-03 14:59:05 -04:00
qazal
a9d6a6c339 verify_lazyop with multi reduce (#5276)
* outsource the assert to the implicit movement op check

* tests
2024-07-03 20:15:42 +03:00
chenyu
622b7bd556 simpler TinyJit inside TinyJit detection (#5219)
* simpler TinyJit inside TinyJit detection

suggested in 73395b998b (commitcomment-143660402)

* cannot repro...

* clear the way out

* finally clear
2024-07-03 12:28:53 -04:00
chenyu
b2c3a28a5e nn.RMSNorm (#5272)
the norm itself does not add enough value to be a Tensor method, but we would want Tensor.normalize
2024-07-02 21:39:01 -04:00
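A minimal usage sketch, assuming the module takes the feature dimension (plus an eps) and normalizes over the last axis like other RMSNorm implementations:

```python
from tinygrad import Tensor, nn

norm = nn.RMSNorm(8)        # assumed signature: RMSNorm(dim, eps=1e-6)
x = Tensor.randn(2, 4, 8)   # (batch, seq, dim)
y = norm(x)                 # roughly x / sqrt(mean(x**2, axis=-1) + eps) * weight
print(y.shape)              # (2, 4, 8)
```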
chenyu
9a2a82a77f test stable diffusion unet in ci (#5268)
unet is parameterized now, so we can test a smaller one in CI
2024-07-02 21:37:52 -04:00
George Hotz
e53b164e1a small changes from lowerer (#5266) 2024-07-02 15:03:54 -07:00
nimlgen
7be776f9af add _alloc_signal/_free_signal to hcq (#5264)
* add _alloc_signal/_free_signal api

* oops, revert this

* linter
2024-07-02 23:35:39 +03:00
Tobias Fischer
9a25ee0b9a fixed unet call params (#5262) 2024-07-02 12:40:27 -04:00
Tobias Fischer
8c9c1cf62f Pulled CLIP and UNet into Separate Files (#5253)
* pulled clip and unet into separate files

* reference cleanup, lru cache fix

* better pool indexing
2024-07-01 22:33:01 -04:00
nimlgen
57e89645cd hcq spec test (#5226)
* start hcq spec test

* more test

* fixes

* run on amd as well

* test amdgpu exec

* fix amd

* amd mockgpu support sdma timestamp
2024-07-01 17:36:37 +03:00
George Hotz
3df47bc21e OpenELM + repeat_interleave (#5234)
* start writing openelm

* progress...hit bug

* repeat_interleave support

* gqa

* add rotary embedding

* spp

* i think it runs correctly

* broken

* output is good now

* cleanups

* no io_uring on android
2024-06-30 15:18:39 -07:00
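A brief sketch of the new op, assumed to follow the torch-style signature repeat_interleave(repeats, dim=None) that the OpenELM GQA path needs:

```python
from tinygrad import Tensor

t = Tensor([[1, 2], [3, 4]])
print(t.repeat_interleave(2).numpy())         # flattened: [1 1 2 2 3 3 4 4]
print(t.repeat_interleave(2, dim=0).numpy())  # rows repeated: [[1 2] [1 2] [3 4] [3 4]]
```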
chenyu
649641a2f2 fix tqdm with generator without __len__ (#5238)
it should be treated as total = 0 (just show the iteration count).
also removed a duplicated ": " in fetch and fixed unit scale with total = 0
2024-06-30 12:20:59 -04:00
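A small sketch of the fixed case, assuming the target is tinygrad's own tqdm in tinygrad.helpers (the module the other tqdm commits touch) and that it accepts a desc keyword like the upstream library:

```python
from tinygrad.helpers import tqdm

def items():        # a generator has no __len__, so no total can be inferred
  for i in range(100):
    yield i

# with no __len__, the bar should fall back to total = 0 and just show the iteration count
for _ in tqdm(items(), desc="processing"):
  pass
```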
chenyu
fd53b6d901 tqdm supports fractional blocks (#5233)
enabled the progress bar match in the test; it matches perfectly now
2024-06-29 22:30:18 -04:00
chenyu
ae10ae4722 simplify tqdm scale math (#5231)
expand the log of log stuff
2024-06-29 21:17:40 -04:00
hikettei
ad1ca7da64 [Feature] Added BinaryOps.AND/BinaryOps.OR (#5223)
* [Feature] Added BinaryOps.AND/BinaryOps.OR

* Add: __rand__, __ror__
2024-06-29 17:20:25 -07:00
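A minimal sketch of the new elementwise ops, assuming they are exposed through the usual operator overloads for bool/int tensors:

```python
from tinygrad import Tensor

a = Tensor([True, True, False, False])
b = Tensor([True, False, True, False])
print((a & b).numpy())     # elementwise AND -> [ True False False False]
print((a | b).numpy())     # elementwise OR  -> [ True  True  True False]
print((True | b).numpy())  # __ror__ handles a plain Python value on the left
```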
chenyu
b2ea610df8 fix tqdm unit_scale and support hours in time (#5227)
* fix tqdm unit_scale and support hours in time

previously it only supported MM:SS.
added more chars to unit scales, stripped trailing "." and " " in formatting, and added more tests

* simpler
2024-06-29 14:48:51 -04:00
qazal
f374fb77af assert bool dtype for valid [run_process_replay] (#5214)
* valid is always bool

* prevent NumNode to begin with

* part 2

* test: disable pattern matchers, asserts should pass

* test: store without cast

* test: if (0)

* cleanup time

* only pattern match bool literal

* better for upstream debug
2024-06-29 21:20:32 +03:00
qazal
3f4eeb8b54 late UOps.IF generation [run_process_replay] [no_assert] (#5027)
* find all places

* test gates

* test

* gate based on depths

* add ctx

* that cache was so wrong

* delete useless things

* dont double write if

* self.if_cond

* move UOps.IF to gated store

* test_padto_where_multioutput

* test_padto_group

* minor cleanup

* hmm this actually works?

* need a good barrier

* merge 2

* delete ctx

* p1

* maybe p2

* p3

* minor fixup

* fixup 2

* smart thing from the Lowerer branch

* refactoring

* refactoring 2

* maybe before graph_rewrite

* slightly more acceptable Linearizer diff

* more correct

* [run_process_replay] [no_assert]
2024-06-29 12:22:14 -04:00
chenyu
42d1f92fc1 simpler tqdm (#5221)
can do more, but many cases are not tested
2024-06-29 07:41:46 -04:00
George Hotz
80ac21200b hotfix: linearizer test fixup 2024-06-28 10:52:25 -07:00
kormann
6c456b6d66 remove uopgraph dedup + slight speedup (#5199)
* rm dedup

* rm dedup

* tests

* reduce diff

* oups

* reduce diff

* rm UOp.tuple
2024-06-28 09:26:32 -07:00
chenyu
73395b998b better error msg for TinyJit inside TinyJit (#5202)
it's possible to support TinyJit inside TinyJit, but there are edge cases, like two TinyJit functions sharing another TinyJit function, so just give a more precise error for now
2024-06-27 18:09:19 -04:00
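A sketch of the pattern the new error message guards against; the functions are hypothetical, not taken from the PR:

```python
from tinygrad import Tensor, TinyJit

@TinyJit
def inner(x: Tensor) -> Tensor:
  return (x * 2).realize()

@TinyJit
def outer(x: Tensor) -> Tensor:
  return (inner(x) + 1).realize()  # a jitted call inside another jit's capture

# calling outer enough times to trigger capture is expected to raise the
# more precise error instead of silently producing a broken JIT cache
```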
George Hotz
345bcc2099 move graph_dedup out of class [run_process_replay] (#5197) 2024-06-27 12:04:00 -07:00
George Hotz
d094a6828f single pass rewrite (#5159)
* single pass rewrite

* claude cleanups

* claude cleanups

* skip those tests

* restrict that to ints

* comment

* asserts i don't expect to fail do fail

* simplest...rewrite...ever

* simplest...rewrite...ever

* add that rule back

* tests pass?

* only collapse reduce loops

* second SHL/SHR arg must be 4 bytes

* fix verify

* no SHL/SHR in ptx

* put that back

* skip them in PTX...bad tests
2024-06-27 11:36:05 -07:00
chenyu
ad91962dcf CACHECOLLECTING -> CAPTURING and don't capture clear_l2 (#5190)
fixed first time BEAM slowness
2024-06-27 12:32:28 -04:00
Roelof van Dijk
9704c7d4d4 ruff rule if-exp-instead-of-or-operator (FURB110) (#5178)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-06-27 08:22:19 -07:00
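For reference, the pattern FURB110 rewrites (hypothetical names):

```python
explicit, default = None, 42

value = explicit if explicit else default  # before: if-expression repeating the tested value
value = explicit or default                # after: the or-operator form the rule suggests
print(value)                               # 42 in both cases
```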
chenyu
5b8fda3c65 fix: JIT=0 means no JIT (#5188) 2024-06-27 10:31:37 -04:00
Roelof van Dijk
975b811ad9 names shadowing builtins (#5179)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-06-27 08:15:01 -04:00
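A generic illustration of the kind of rename this cleanup makes; the function and names are hypothetical, not the actual code touched:

```python
# before: parameter names shadow the `input` and `type` builtins inside the body
def fmt_cast(input, type):
  return f"({type})({input})"

# after: renamed parameters keep the builtins reachable and read less ambiguously
def fmt_cast(value, dtype_name):
  return f"({dtype_name})({value})"
```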