chenyu
fc03fc025e
enable sin on METAL in test_dtype_alu ( #5298 )
2024-07-05 14:52:09 -04:00
qazal
b369e75ed0
refactor schedule creation ( #5297 )
2024-07-05 21:14:38 +03:00
qazal
5292d37db6
LoadOps.VIEW in the scheduler spec ( #5296 )
...
* refactor to allow_buffer_view
* tests
* fix multi
2024-07-05 19:43:50 +03:00
hikettei
1ab7a4cff0
Handling Multiple UnaryOps.BITCAST in Function for Proper Kernel Fusion [run_process_replay] ( #5172 )
...
* [Patch] added an option not to ignore view replacing when doing bitcast
* added the testcase
* [Add] reproduced, in the unit test, the case where the bitcasts cannot be fused into a single kernel
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-07-05 19:16:44 +03:00
chenyu
43c3f73fbc
handcode_bert_opt.py ( #5295 )
...
similar to handcode_resnet50_opt.py: a single file to check BERT kernels without the dataset.
2024-07-05 11:01:20 -04:00
nimlgen
d7835a705c
hotfix: fix metal with vars ( #5294 )
...
* hotfix: fix metal with vars
* one more place
2024-07-05 16:53:40 +03:00
nimlgen
8a548b0b6e
metal support offset ( #5293 )
2024-07-05 16:13:05 +03:00
qazal
1cefbb33ab
uop graph tests + type_verify cleanup ( #5292 )
...
* test_cast_alu_fold
* test_double_cast_fold + these should assert
2024-07-05 13:00:01 +03:00
qazal
341c4a29d1
hotfix: use dtype.scalar() for rendering cast [run_process_replay] [no_assert] ( #5290 )
2024-07-05 11:29:35 +03:00
chenyu
87d27c45ec
minor _broadcast cleanup ( #5286 )
...
`any(x==0 for x in y)` is equivalent to `0 in y`.
also use `get_args(ConstType)` instead of the hard-coded `float, int, bool`
2024-07-04 14:25:24 -04:00
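Both cleanups rely on standard Python behavior. A minimal sketch, assuming `ConstType` is a `Union[float, int, bool]` alias (the alias below is illustrative, not copied from the source):

```python
from typing import Union, get_args

# Illustrative alias; assumed to mirror the ConstType referenced in the commit.
ConstType = Union[float, int, bool]

shape = (3, 0, 4)

# A generator with any(...) and the `in` operator give the same answer for a tuple.
has_zero_any = any(x == 0 for x in shape)
has_zero_in = 0 in shape

# get_args unpacks the members of the Union instead of hard-coding them.
const_types = get_args(ConstType)
```

`get_args` keeps the Union members in declaration order, so the result can be passed straight to `isinstance`.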
SnakeOnex
8c03816ae9
fix README example ( #5284 )
...
* fixed README example
* README test
* changed `py` -> `python` markdown code-fence flags in README
2024-07-04 11:15:07 -04:00
nimlgen
2778b6046c
new memory scheduler ( #5278 )
...
* new memory schedule algo
* works
* fix
* fix
* linter
* tiny fixes
* do not optimize copy buffers
* more comments
* tiny cleanups
2024-07-04 18:06:04 +03:00
nimlgen
84b3e3bb6f
hcq exec no embedded signal ( #5142 )
2024-07-04 13:29:21 +03:00
Tobias Fischer
0c3a35e5c2
Stable Diffusion v2 Inference ( #5283 )
...
* model implementation
* clip fix, more qol options
2024-07-03 22:47:10 -04:00
chenyu
e5ba385f03
remove first contiguous in multi from_sharded ( #5121 )
...
the second contiguous already guarantees the lbs are contiguous going into MultiLazyBuffer, so the first contiguous isn't needed
2024-07-03 19:42:56 -04:00
chenyu
f1ff65e763
remove "no-nans-fp-math"="true" for LLVM ( #5282 )
...
fixed isnan for LLVM (comparisons of the form `< nan` still have issues)
2024-07-03 17:52:50 -04:00
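For context, one common way to compute `isnan` is the self-comparison `x != x`, which under IEEE 754 is true only for NaN. A flag like LLVM's `no-nans-fp-math` lets the compiler assume NaNs never occur, so it may fold such checks away. This is one way the flag can break `isnan`. A minimal Python sketch of the IEEE behavior (not tinygrad's LLVM renderer):

```python
import math

def isnan_self_compare(x: float) -> bool:
  # Under IEEE 754, NaN is the only value that compares unequal to itself.
  return x != x

nan = float("nan")
results = {
  "isnan(nan)": isnan_self_compare(nan),
  "isnan(1.0)": isnan_self_compare(1.0),
  # Ordered comparisons involving NaN are always false, in either direction.
  "1.0 < nan": 1.0 < nan,
  "nan < 1.0": nan < 1.0,
}
```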
chenyu
3929a9dc94
fix UOp.cmp_tuple for ALU ( #5280 )
...
* fix UOp.cmp_tuple for ALU
for ALU, use self.arg instead of self.op to compare
* skip that?
2024-07-03 14:59:05 -04:00
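Comparing on `self.op` is not enough for ALU nodes. Every ALU uop shares the same `op`, and the actual operation lives in `arg`. A minimal sketch with a hypothetical `MiniUOp` class and hypothetical enums (not tinygrad's `UOp`, whose real `cmp_tuple` includes other fields):

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Tuple

class Ops(Enum):      # hypothetical stand-in for uop kinds
  CONST = auto()
  ALU = auto()

class BinOps(Enum):   # hypothetical stand-in for BinaryOps
  ADD = auto()
  MUL = auto()

@dataclass(frozen=True)
class MiniUOp:
  op: Ops
  arg: object = None

def key_by_op(u: MiniUOp) -> Tuple[int, ...]:
  # Every ALU node has op == Ops.ALU, so ADD and MUL tie here.
  return (u.op.value,)

def key_by_arg(u: MiniUOp) -> Tuple[int, ...]:
  # For ALU nodes, fold the arg (ADD vs MUL) into the key so they order deterministically.
  return (u.op.value, u.arg.value if u.op is Ops.ALU else 0)

add, mul = MiniUOp(Ops.ALU, BinOps.ADD), MiniUOp(Ops.ALU, BinOps.MUL)
```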
qazal
a9d6a6c339
verify_lazyop with multi reduce ( #5276 )
...
* outsource the assert to the implicit movement op check
* tests
2024-07-03 20:15:42 +03:00
George Hotz
16e3b8b013
uops work from lowerer [run_process_replay] ( #5279 )
2024-07-03 09:40:00 -07:00
chenyu
622b7bd556
simpler TinyJit inside TinyJit detection ( #5219 )
...
* simpler TinyJit inside TinyJit detection
suggested in 73395b998b (commitcomment-143660402)
* cannot repro...
* clear the way out
* finally clear
2024-07-03 12:28:53 -04:00
gip
04ef0fd328
fix: message when applegpu tools missing ( #5236 )
2024-07-03 09:07:09 -07:00
reddyn12
d3e244d8b7
prev speed improvements ( #5252 )
...
Co-authored-by: reddyn <nikidsniper@gmail.com>
2024-07-03 09:06:01 -07:00
nimlgen
21d41f06a2
nv follows HCQCompatAllocRes protocol ( #5275 )
...
* nv follows HCQCompatAllocRes protocol
* fix amd
2024-07-03 11:34:10 +03:00
Vyacheslav Pachkov
d3e4e21759
add return type for HCQCompatAllocator _alloc ( #5267 )
...
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-07-03 10:25:44 +03:00
chenyu
191463a919
add timing to SDXL ( #5273 )
2024-07-02 23:29:54 -04:00
chenyu
b2c3a28a5e
nn.RMSNorm ( #5272 )
...
the norm itself doesn't add enough value to be a Tensor method, but we would want Tensor.normalize
2024-07-02 21:39:01 -04:00
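RMSNorm rescales each vector by its root mean square. Unlike LayerNorm, it does no mean subtraction. A dependency-free sketch over a single vector, assuming the usual formulation with an epsilon and an optional per-feature weight. The `eps` default here is an assumption, not necessarily the one `nn.RMSNorm` uses:

```python
import math
from typing import List, Optional

def rms_norm(x: List[float], weight: Optional[List[float]] = None, eps: float = 1e-6) -> List[float]:
  # Root mean square of the vector (no mean subtraction, unlike LayerNorm).
  rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
  y = [v / rms for v in x]
  # Optional learnable per-feature scale.
  return y if weight is None else [yi * wi for yi, wi in zip(y, weight)]

out = rms_norm([3.0, 4.0], eps=0.0)
```

With `eps=0` and no weight, the output has a mean square of exactly 1 (up to float rounding).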
chenyu
9a2a82a77f
test stable diffusion unet in ci ( #5268 )
...
the UNet is parameterized now, so a smaller one can be tested in CI
2024-07-02 21:37:52 -04:00
chenyu
ce52b10f6f
add a flag DISABLE_LOOP_COLLAPSE ( #5270 )
...
a workaround if the user encounters an UNMUL error
2024-07-02 20:01:11 -04:00
George Hotz
e53b164e1a
small changes from lowerer ( #5266 )
2024-07-02 15:03:54 -07:00
nimlgen
7be776f9af
add _alloc_signal/_free_signal to hcq ( #5264 )
...
* add _alloc_signal/_free_signal api
* oops, revert this
* linter
2024-07-02 23:35:39 +03:00
Tobias Fischer
9a25ee0b9a
fixed unet call params ( #5262 )
2024-07-02 12:40:27 -04:00
qazal
59bc837ad1
refactor gated load rendering [run_process_replay] ( #5259 )
...
* refactor gated load rendering [run_process_replay]
* hotfix: extra line
* remove llvm diff
2024-07-02 15:13:10 +03:00
nimlgen
e050603b4b
nv close fds after mapping ( #5246 )
2024-07-02 13:57:46 +03:00
qazal
d3cfb6c2e3
refactor UOps.LOAD barrier [run_process_replay] ( #5258 )
2024-07-02 13:48:47 +03:00
qazal
a1044e6063
iterate over scoped uops once [run_process_replay] ( #5255 )
2024-07-02 09:21:09 +03:00
wozeparrot
dfbee4f0f5
feat: add blobfile to testing ( #5254 )
2024-07-01 19:33:58 -07:00
Tobias Fischer
8c9c1cf62f
Pulled CLIP and UNet into Separate Files ( #5253 )
...
* pulled clip and unet into separate files
* reference cleanup, lru cache fix
* better pool indexing
2024-07-01 22:33:01 -04:00
chenyu
5808c37302
hotfix disable flaky llama3 beam benchmark on green ( #5249 )
2024-07-01 15:00:47 -04:00
chenyu
b9122ecdaf
revert stable diffusion validation with threefry ( #5248 )
...
* Revert "use threefry in stable diffusion benchmark (#4988)"
This reverts commit 44dfa37c70.
* sdxl and validation fix
* relax threshold
2024-07-01 14:43:47 -04:00
nimlgen
57e89645cd
hcq spec test ( #5226 )
...
* start hcq spec test
* more test
* fixes
* run on amd as well
* test amdgpu exec
* fix amd
* amd mockgpu support sdma timestamp
2024-07-01 17:36:37 +03:00
Carson Powers
d7839fdc5f
Add x!=0 -> (bool)x pattern [run_process_replay] [no_assert] ( #5237 )
...
* x!=0 -> (bool)x pattern
* bool != bool pattern
* redundant upat
2024-06-30 17:48:45 -07:00
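The rewrite is valid because, for an integer `x`, `x != 0` and a cast to bool agree on every input. A toy sketch of such a rule over tuple-encoded expressions. The `('CMPNE', x, 0)` and `('CAST_BOOL', x)` encoding is hypothetical and is not tinygrad's `UPat` syntax:

```python
from typing import Dict, Tuple, Union

Expr = Union[int, str, Tuple]

def rewrite(e: Expr) -> Expr:
  # Rule: ('CMPNE', x, 0) -> ('CAST_BOOL', x), applied recursively.
  if isinstance(e, tuple) and len(e) == 3 and e[0] == "CMPNE" and e[2] == 0:
    return ("CAST_BOOL", rewrite(e[1]))
  if isinstance(e, tuple):
    return tuple(rewrite(s) if isinstance(s, tuple) else s for s in e)
  return e

def evaluate(e: Expr, env: Dict[str, int]):
  # Tiny interpreter used to check that the rewrite preserves meaning.
  if isinstance(e, str): return env[e]
  if isinstance(e, int): return e
  op, *args = e
  vals = [evaluate(a, env) for a in args]
  if op == "CMPNE": return vals[0] != vals[1]
  if op == "CAST_BOOL": return bool(vals[0])
  raise ValueError(f"unknown op {op}")

expr = ("CMPNE", "x", 0)
```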
George Hotz
14980f79dd
hotfix: unbreak llama
2024-06-30 15:27:54 -07:00
George Hotz
146eb3a811
hotfix: add repeat_interleave docs
2024-06-30 15:25:18 -07:00
George Hotz
3df47bc21e
OpenELM + repeat_interleave ( #5234 )
...
* start writing openelm
* progress...hit bug
* repeat_interleave support
* gqa
* add rotary embedding
* spp
* i think it runs correctly
* broken
* output is good now
* cleanups
* no io_uring on android
2024-06-30 15:18:39 -07:00
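`repeat_interleave` repeats each element consecutively. Grouped-query attention (GQA) needs exactly this to share each key/value head across several query heads. A pure-Python sketch over lists, assuming integer `repeats` and semantics like PyTorch's `torch.repeat_interleave`. `expand_kv_heads` is a hypothetical helper, not part of the OpenELM code:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def repeat_interleave(xs: Sequence[T], repeats: int) -> List[T]:
  # Each element is repeated `repeats` times in place: [a, b] -> [a, a, b, b].
  return [x for x in xs for _ in range(repeats)]

def expand_kv_heads(kv_heads: Sequence[T], n_query_heads: int) -> List[T]:
  # GQA: every KV head is shared by n_query_heads // len(kv_heads) consecutive query heads.
  assert n_query_heads % len(kv_heads) == 0, "query heads must be a multiple of KV heads"
  return repeat_interleave(kv_heads, n_query_heads // len(kv_heads))

kv = ["kv0", "kv1"]
expanded = expand_kv_heads(kv, 6)
```

Plain tiling (`kv * 3`) would alternate `kv0, kv1, kv0, ...` and pair query heads with the wrong KV heads. The interleaved order keeps each group of query heads contiguous.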
nimlgen
7b7b751513
simple hip backend for debugging ( #5201 )
...
* hip backend
* fix mypy
* shorter
* fixes
* tiny changes
2024-06-30 23:00:11 +03:00
chenyu
88763eb9ff
fix stable_diffusion with fp16 ( #5239 )
2024-06-30 12:59:31 -04:00
chenyu
649641a2f2
fix tqdm with generator without __len__ ( #5238 )
...
it should be treated as total = 0 (just show the iteration count).
also removed the duplicated ": " in fetch and fixed unit scale when total = 0
2024-06-30 12:20:59 -04:00
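A minimal sketch of the `total` fallback, using a hypothetical `infer_total` helper (not tinygrad's tqdm code). It uses `len()` when the iterable defines `__len__`, and otherwise falls back to 0 so the bar only shows an iteration count:

```python
from typing import Iterable, Optional

def infer_total(iterable: Iterable, total: Optional[int] = None) -> int:
  # An explicit total wins.
  if total is not None:
    return total
  # Generators have no __len__, so treat them as total = 0 (count only).
  return len(iterable) if hasattr(iterable, "__len__") else 0
```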
chenyu
fd53b6d901
tqdm supports fractional blocks ( #5233 )
...
enabled the progress bar match in the test; it matches perfectly now
2024-06-29 22:30:18 -04:00
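Fractional blocks give the bar eighth-of-a-character resolution by using the Unicode eighth-block characters (U+2588 to U+258F) for the last partially filled cell. A minimal sketch with a hypothetical `render_bar` helper, not tinygrad's implementation:

```python
# Partial blocks from empty to full, in eighths: " ", U+258F (1/8), ..., U+2589 (7/8), U+2588 (full).
BLOCKS = " " + "".join(chr(c) for c in range(0x258F, 0x2587, -1))

def render_bar(frac: float, width: int) -> str:
  frac = min(max(frac, 0.0), 1.0)   # clamp progress to [0, 1]
  eighths = int(frac * width * 8)   # filled width measured in eighths of a cell
  full, rem = divmod(eighths, 8)    # whole cells and leftover eighths
  bar = BLOCKS[-1] * full
  if full < width:
    bar += BLOCKS[rem]              # one partial cell (a space if rem == 0)
  return bar.ljust(width)           # pad the rest with spaces
```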
chenyu
ae10ae4722
simplify tqdm scale math ( #5231 )
...
expand the log of log stuff
2024-06-29 21:17:40 -04:00
hikettei
ad1ca7da64
[Feature] Added BinaryOps.AND/BinaryOps.OR ( #5223 )
...
* [Feature] Added BinaryOps.AND/BinaryOps.OR
* Add: __rand__, __ror__
2024-06-29 17:20:25 -07:00
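`__rand__` and `__ror__` are Python's reflected operators. They make expressions like `True & t` work when the left operand is a plain Python value, because `bool.__and__` returns `NotImplemented` for an unknown type and Python then tries the right operand's reflected method. A minimal sketch with a hypothetical `BoolVec` class (not tinygrad's `Tensor`):

```python
from typing import List, Union

class BoolVec:
  def __init__(self, data: List[bool]):
    self.data = list(data)

  def _coerce(self, other: Union["BoolVec", bool]) -> List[bool]:
    # Broadcast a Python bool to the vector's length.
    return other.data if isinstance(other, BoolVec) else [bool(other)] * len(self.data)

  def __and__(self, other): return BoolVec([a and b for a, b in zip(self.data, self._coerce(other))])
  def __or__(self, other): return BoolVec([a or b for a, b in zip(self.data, self._coerce(other))])
  # Reflected versions, used for `True & v` / `False | v`. AND and OR are commutative.
  def __rand__(self, other): return self & other
  def __ror__(self, other): return self | other
```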