chenyu
fc03fc025e
enable sin on METAL in test_dtype_alu ( #5298 )
2024-07-05 14:52:09 -04:00
qazal
b369e75ed0
refactor schedule creation ( #5297 )
2024-07-05 21:14:38 +03:00
qazal
5292d37db6
LoadOps.VIEW in the scheduler spec ( #5296 )
...
* refactor to allow_buffer_view
* tests
* fix multi
2024-07-05 19:43:50 +03:00
hikettei
1ab7a4cff0
Handling Multiple UnaryOps.BITCAST in Function for Proper Kernel Fusion [run_process_replay] ( #5172 )
...
* [Patch] added an option not to ignore view replacing when doing bitcast
* added the testcase
* [Add] reproduced bitcast cannot be fused into a single kernel in the unittest
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-07-05 19:16:44 +03:00
qazal
1cefbb33ab
uop graph tests + type_verify cleanup ( #5292 )
...
* test_cast_alu_fold
* test_double_cast_fold + these should assert
2024-07-05 13:00:01 +03:00
chenyu
f1ff65e763
remove "no-nans-fp-math"="true" for LLVM ( #5282 )
...
fixed isnan for llvm (still have issue with < nan)
2024-07-03 17:52:50 -04:00
chenyu
3929a9dc94
fix UOp.cmp_tuple for ALU ( #5280 )
...
* fix UOp.cmp_tuple for ALU
for ALU, use self.arg instead of self.op to compare
* skip that?
2024-07-03 14:59:05 -04:00
qazal
a9d6a6c339
verify_lazyop with multi reduce ( #5276 )
...
* outsource the assert to the implicit movement op check
* tests
2024-07-03 20:15:42 +03:00
chenyu
622b7bd556
simpler TinyJit inside TinyJit detection ( #5219 )
...
* simpler TinyJit inside TinyJit detection
suggested in 73395b998b (commitcomment-143660402)
* cannot repro...
* clear the way out
* finally clear
2024-07-03 12:28:53 -04:00
chenyu
b2c3a28a5e
nn.RMSNorm ( #5272 )
...
the norm itself has no significant value to add to Tensor method, but we would want Tensor.normalize
2024-07-02 21:39:01 -04:00
chenyu
9a2a82a77f
test stable diffusion unet in ci ( #5268 )
...
unet is parameterized now, so we can test a smaller one in ci
2024-07-02 21:37:52 -04:00
George Hotz
e53b164e1a
small changes from lowerer ( #5266 )
2024-07-02 15:03:54 -07:00
nimlgen
7be776f9af
add _alloc_signal/_free_signal to hcq ( #5264 )
...
* add _alloc_signal/_free_signal api
* oops, revert this
* linter
2024-07-02 23:35:39 +03:00
Tobias Fischer
9a25ee0b9a
fixed unet call params ( #5262 )
2024-07-02 12:40:27 -04:00
Tobias Fischer
8c9c1cf62f
Pulled CLIP and UNet into Separate Files ( #5253 )
...
* pulled clip and unet into separate files
* reference cleanup, lru cache fix
* better pool indexing
2024-07-01 22:33:01 -04:00
nimlgen
57e89645cd
hcq spec test ( #5226 )
...
* start hcq spec test
* more test
* fixes
* run on amd as well
* test amdgpu exec
* fix amd
* amd mockgpu support sdma timestamp
2024-07-01 17:36:37 +03:00
George Hotz
3df47bc21e
OpenELM + repeat_interleave ( #5234 )
...
* start writing openelm
* progress...hit bug
* repeat_interleave support
* gqa
* add rotary embedding
* spp
* i think it runs correctly
* broken
* output is good now
* cleanups
* no io_uring on android
2024-06-30 15:18:39 -07:00
chenyu
649641a2f2
fix tqdm with generator without __len__ ( #5238 )
...
it should be treated as total = 0 (just show iteration count).
also removed duplicated ": " in fetch and fixed unit scale with total = 0
2024-06-30 12:20:59 -04:00
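The behavior described in #5238 (an iterable without `__len__` is treated as total = 0 and only the iteration count is shown) can be sketched as follows. This is a minimal illustration, not the code from the commit; `infer_total` and `progress_label` are hypothetical names, and the label format is an assumption.

```python
from typing import Iterable


def infer_total(iterable: Iterable) -> int:
    """Hypothetical helper: use len() when available, otherwise fall back to 0."""
    # Generators and other lazy iterables do not define __len__.
    return len(iterable) if hasattr(iterable, "__len__") else 0


def progress_label(count: int, total: int) -> str:
    """Hypothetical formatter: with total == 0, show only the iteration count."""
    if total > 0:
        return f"{count}/{total}"
    return f"{count}it"


if __name__ == "__main__":
    gen = (i for i in range(3))           # no __len__
    print(infer_total([1, 2, 3]))         # sized iterable
    print(infer_total(gen))               # unsized iterable
    print(progress_label(2, infer_total(gen)))
```

Falling back to 0 rather than raising keeps the progress bar usable for streams whose length is unknown up front.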
chenyu
fd53b6d901
tqdm supports fractional blocks ( #5233 )
...
enabled progress bar match in test, it matches perfectly now
2024-06-29 22:30:18 -04:00
chenyu
ae10ae4722
simplify tqdm scale math ( #5231 )
...
expand the log of log stuff
2024-06-29 21:17:40 -04:00
hikettei
ad1ca7da64
[Feature] Added BinaryOps.AND/BinaryOps.OR ( #5223 )
...
* [Feature] Added BinaryOps.AND/BinaryOps.OR
* Add: __rand__, __ror__
2024-06-29 17:20:25 -07:00
chenyu
b2ea610df8
fix tqdm unit_scale and support hours in time ( #5227 )
...
* fix tqdm unit_scale and support hours in time
previously it only supported MM:SS.
more chars to unitscales, strip trailing "." and " " in formatting, and more tests
* simpler
2024-06-29 14:48:51 -04:00
qazal
f374fb77af
assert bool dtype for valid [run_process_replay] ( #5214 )
...
* valid is always bool
* prevent NumNode to begin with
* part 2
* test: disable pattern matchers, asserts should pass
* test: store without cast
* test: if (0)
* cleanup time
* only pattern match bool literal
* better for upstream debug
2024-06-29 21:20:32 +03:00
qazal
3f4eeb8b54
late UOps.IF generation [run_process_replay] [no_assert] ( #5027 )
...
* find all places
* test gates
* test
* gate based on depths
* add ctx
* that cache was so wrong
* delete useless things
* dont double write if
* self.if_cond
* move UOps.IF to gated store
* test_padto_where_multioutput
* test_padto_group
* minor cleanup
* hmm this actually works?
* need a good barrier
* merge 2
* delete ctx
* p1
* maybe p2
* p3
* minor fixup
* fixup 2
* smart thing from the Lowerer branch
* refactoring
* refactoring 2
* maybe before graph_rewrite
* slightly more acceptable Linearizer diff
* more correct
* [run_process_replay] [no_assert]
2024-06-29 12:22:14 -04:00
chenyu
42d1f92fc1
simpler tqdm ( #5221 )
...
can do more, but many cases are not tested
2024-06-29 07:41:46 -04:00
George Hotz
80ac21200b
hotfix: linearizer test fixup
2024-06-28 10:52:25 -07:00
kormann
6c456b6d66
remove uopgraph dedup + slight speedup ( #5199 )
...
* rm dedup
* rm dedup
* tests
* reduce diff
* oups
* reduce diff
* rm UOp.tuple
2024-06-28 09:26:32 -07:00
chenyu
73395b998b
better error msg for TinyJit inside TinyJit ( #5202 )
...
it's possible to support TinyJit inside TinyJit, but there are edge cases, like two TinyJit functions sharing another TinyJit function, so just give a more precise error for now
2024-06-27 18:09:19 -04:00
George Hotz
345bcc2099
move graph_dedup out of class [run_process_replay] ( #5197 )
2024-06-27 12:04:00 -07:00
George Hotz
d094a6828f
single pass rewrite ( #5159 )
...
* single pass rewrite
* claude cleanups
* claude cleanups
* skip those tests
* restrict that to ints
* comment
* asserts i don't expect to fail do fail
* simplest...rewrite...ever
* simplest...rewrite...ever
* add that rule back
* tests pass?
* only collapse reduce loops
* second SHL/SHR arg must be 4 bytes
* fix verify
* no SHL/SHR in ptx
* put that back
* skip them in PTX...bad tests
2024-06-27 11:36:05 -07:00
chenyu
ad91962dcf
CACHECOLLECTING -> CAPTURING and don't capture clear_l2 ( #5190 )
...
fixed first time BEAM slowness
2024-06-27 12:32:28 -04:00
Roelof van Dijk
9704c7d4d4
ruff rule if-exp-instead-of-or-operator (FURB110) ( #5178 )
...
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-06-27 08:22:19 -07:00
chenyu
5b8fda3c65
fix: JIT=0 means no JIT ( #5188 )
2024-06-27 10:31:37 -04:00
Roelof van Dijk
975b811ad9
names shadowing builtins ( #5179 )
...
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-06-27 08:15:01 -04:00
Roelof van Dijk
f88f71d73a
ruff: unnecessary-comprehension ( #5174 )
...
* enable ruff C416 unnecessary-comprehension
* already a list
2024-06-27 07:45:29 -04:00
kormann
3a04e518ec
print_tree UPat +fix ( #5132 )
...
* fix and extend print_tree
* typing
* typing
* fix upat
* fix none
* ws
* rm prefix
* mv luop dag
* typo
* test print_tree
2024-06-26 15:02:19 -07:00
nimlgen
16405b973a
fix hcq sync ( #5062 )
...
* fix hcq sync
* rewrite
* linter + comment
* fix profiler
* no default dict
* correct sync of unjitted transfer
* fix test
2024-06-26 17:50:37 +03:00
nimlgen
fd27f19e92
graph tests ( #5153 )
...
* graph tests
* add test
* cleanup
2024-06-26 16:31:20 +03:00
qazal
6ca7b13ed1
limit pickled objects [run_process_replay] ( #5154 )
...
* limit pickled objects
* delete uop from the list
* debug metal
* need self.opts for TC
* dont need device
* [run_process_replay]
* minor
2024-06-26 13:51:32 +03:00
David Hou
666a9c1448
don't view origin buffer when sharding ( #5122 )
...
* make buffer view optional with a flag
* do not view when sharding to save memory
2024-06-25 20:19:09 -07:00
George Hotz
c98ca23cb9
test pickle variable ( #5150 )
...
* test pickle variable
* fix process replay
2024-06-25 19:49:21 -07:00
George Hotz
63ba2d05d1
uops dfs cleanup ( #5147 )
...
* uops dfs cleanup
* Update uops.py
2024-06-25 18:51:42 -07:00
Jhenner Tigreros
fa78755f19
Add new patterns to unfold division ( #5139 )
...
* Add new patterns to unfold division
* Create regression test and fix pattern
2024-06-25 18:07:47 -07:00
qazal
c4fdb9c725
second iteration on verify_lazyop ( #5140 )
2024-06-25 09:44:32 +03:00
qazal
981afb114f
safely fold NEG in lazy.py ( #5135 )
...
* safe
* add test
2024-06-24 19:40:37 -04:00
chenyu
7948b05738
fix uneven shard with shrink and pad args on sharded axis ( #5131 )
...
it's incorrect to assume all first (len(device)-1) shards would have the same size. e.g. size 2 shard 4 -> (1, 1, 0, 0)
2024-06-24 16:55:50 -04:00
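The uneven-shard case from #5131 can be reproduced with a small sketch. It assumes a ceil-division split (each device takes up to `ceil(size / n_devices)` elements until none remain), which is consistent with the `(1, 1, 0, 0)` example in the commit message but is not taken from the commit itself; `shard_sizes` is a hypothetical name.

```python
def shard_sizes(size: int, n_devices: int) -> tuple[int, ...]:
    """Hypothetical helper: per-device shard sizes under a ceil-division split."""
    chunk = -(-size // n_devices)  # ceil(size / n_devices) using integer math
    # Each device gets a full chunk while elements remain, then a partial one, then 0.
    return tuple(max(0, min(chunk, size - i * chunk)) for i in range(n_devices))


if __name__ == "__main__":
    print(shard_sizes(2, 4))  # the case from the commit message
    print(shard_sizes(5, 4))  # the first three shards also differ in size here
```

Under this assumption, any code that derives shrink or pad arguments from "every shard but the last has the same size" would compute the wrong offsets for sizes such as 2 or 5 over 4 devices.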
qazal
18e70deec3
verify_lazyop ( #5124 )
...
* start verify_lazyop
* bfs order
* assert
* assert shapetrackers 2
* refactor
* more iteration
* skips
* that ast was wrong too
2024-06-24 13:45:35 -07:00
chenyu
4a7d403777
cleanup test_multitensor ( #5118 )
...
renamed d_zero, d0, d1, d2, ... to d0, d1, d2, d3 and reused some multi device tuples
2024-06-23 20:54:22 -04:00
chenyu
c0ba5e0dfb
multi copy_to_device return the copy on same device if possible ( #5117 )
...
previously it always returned the copy from the first device
2024-06-23 20:25:56 -04:00
Francis Lam
b563cd52ed
linearizer: change globals to merge into left axis/gridDims.x first ( #5033 )
...
* linearizer: change order of collapse to be left-most
also fixes Variable max size to be correct and add docs for the off
parameter
* fix multiple global dim oversizes
* add passing variable test and reorganize tests
* use assert RuntimeError for failing test
2024-06-23 18:53:15 -04:00