* don't mutate the uop/lazybuffer, just the Buffer [pr]
* fix red test
* try different fix
* that
* that's the right fix
* test for fixed behavior
* bump to 3.12
* minor uop cleaner [pr]
* free uop creation speed by removing WeakValueDictionary
* a lil faster
* disable that test
* lines
* and it doesn't print non hit patterns
* alu(c?t0:f0, c?t1:f1) -> c?alu(t0,t1):alu(f0,f1)
only do this if at least one branch is const, so the total alu count won't increase
* tests and interesting TODO cases
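
A minimal sketch of the `alu(c?t0:f0, c?t1:f1) -> c?alu(t0,t1):alu(f0,f1)` rewrite described above, on a toy expression tree. The `Const`/`Var`/`Where`/`Alu` classes and `push_alu_into_where` are hypothetical names for illustration, not the actual UOp/UPat code. It assumes "at least one branch is const" means both operands of the true branch or of the false branch are constants, so that branch folds away and the ALU count stays at one.

```python
import operator
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Const:
    value: int

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Where:  # cond ? t : f
    cond: "Expr"
    t: "Expr"
    f: "Expr"

@dataclass(frozen=True)
class Alu:
    op: str
    lhs: "Expr"
    rhs: "Expr"

Expr = Union[Const, Var, Where, Alu]
OPS = {"add": operator.add, "mul": operator.mul}

def fold(e):
    # Constant-fold an Alu node when both operands are Const.
    if isinstance(e, Alu) and isinstance(e.lhs, Const) and isinstance(e.rhs, Const):
        return Const(OPS[e.op](e.lhs.value, e.rhs.value))
    return e

def push_alu_into_where(e):
    # Only applies to alu(where, where) with the same condition on both sides.
    if not (isinstance(e, Alu) and isinstance(e.lhs, Where) and isinstance(e.rhs, Where)):
        return e
    a, b = e.lhs, e.rhs
    if a.cond != b.cond:
        return e
    # Assumed guard: one branch must fold to a constant, otherwise the
    # rewrite would turn one ALU op into two.
    t_const = isinstance(a.t, Const) and isinstance(b.t, Const)
    f_const = isinstance(a.f, Const) and isinstance(b.f, Const)
    if not (t_const or f_const):
        return e
    return Where(a.cond, fold(Alu(e.op, a.t, b.t)), fold(Alu(e.op, a.f, b.f)))

def evaluate(e, env):
    # Reference interpreter used to check that the rewrite preserves values.
    if isinstance(e, Const): return e.value
    if isinstance(e, Var): return env[e.name]
    if isinstance(e, Where):
        return evaluate(e.t, env) if evaluate(e.cond, env) else evaluate(e.f, env)
    return OPS[e.op](evaluate(e.lhs, env), evaluate(e.rhs, env))
```

For example, `(c ? 1 : x) + (c ? 2 : y)` becomes `c ? 3 : (x + y)`, while `(c ? x : y) + (c ? z : w)` is left alone because neither branch is const.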
* script to run regressed sd conv on metal
this and other similar `conv2d + add` kernels contributed to most of the speed regression
* # ruff: noqa: E501
* second try at block linearize
* weeee, works for lil matmul
* it's so beautiful
* test tiny passes
* fix bugs
* combine matching BLOCKENDS
* wrapping
* test lin failures passes
* those failures were fake
* flip sort order
* fix ptx tests
* deal with store better
* dumb ptx fix
* expect less
* reduce lines
* reduce lines
* less lines and cleaner
* no defaultdict
* tighter
* simpler block_parent_count
* update test for gated store
* put gated store rewrite to uopgraph, rm from ptx
* update test
* remove gated st rewrite in llvm
* lint
---------
Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.mail>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
the test was useless because it was looking at the jit graph counts. wrap with JIT=2 for now.
if it's stable we could consider making kernel count strict, which helps changes like #7940
* newest newer than new refactor of getitem
* hmmm
* hmmmmmmmmmmmmmmmmm
* bro.
* ???
* small improvements
* cleaner, but why u gotta do this to me mypy
* fix, but still dunno about mypy
* even better
* try again? Passes locally
* use match
* fix mypy
* better
* broooooo check this out
* fix mypy
* bug fix
* fixed
* polish
* split into another branch
* polish
* try this
* Revert "try this"
This reverts commit 84f711b13e.
* try
* Revert "try"
This reverts commit 89c7a7649b.
* idk anymore
* it is what it is
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* Don't take const in gcd and change the "nothing_changed" condition
The biggest difference is probably that I forgot to check whether the gcd
changed when nothing else changed
The TODO was fixed by not using the const in the gcd, and then taking it
out
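
One plausible reading of "not using the const in the gcd", sketched for division folding. This is an assumption about the idea, not the actual symbolic code, and `fold_div` is a hypothetical helper. It relies on the identity `(g*E + c) // (g*d) == (E + c // g) // d` for integers with `g, d > 0`, so the gcd only needs to divide the variable coefficients and the divisor, and the constant is floor-divided afterwards.

```python
from functools import reduce
from math import gcd

def fold_div(coeffs, const, divisor):
    """Simplify (sum(c_i * x_i) + const) // divisor.

    The gcd is taken over the coefficients and the divisor only. The
    constant is deliberately left out, then floor-divided by that gcd.
    Returns (new_coeffs, new_const, new_divisor). Assumes divisor > 0.
    """
    g = reduce(gcd, coeffs, divisor)
    return [c // g for c in coeffs], const // g, divisor // g

def eval_div(coeffs, const, divisor, xs):
    # Reference evaluation with Python floor division.
    return (sum(c * x for c, x in zip(coeffs, xs)) + const) // divisor
```

With `(6*x + 4*y + 3) // 4` this gives `(3*x + 2*y + 1) // 2`. Including the constant 3 in the gcd would have produced a gcd of 1 and no simplification at all.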
* Fix more tests
* working I think
* where are my onnx scatter tests??
* forward_only for now
* try if nan hack fix NV
* looks like issue is different... CUDA WHY
* oops that was wrong. Try if this fixes CUDA
* simpler multiply
* actually finish this up tmrw morning :x
* fix tests?
* improve tests
* improve test and implementation
* fix ruff
* complete but lots of expected failure...
* reviewed tests
* add onnx tests
* is this a processing op?
* add return type to indicate that it's not in-place
* final cleanups
* use or and improve tests a little
* add masked_index_select
* call it masked_setitem instead
* try
* FIXED
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* Start from andredaprato:webgpu-clean
* Fix infs
* inf wgsl function is not needed
* Emulated ulong for threefry, more tests passing
* Randomness tests passing
* Update model export to support new changes in webgpu, efficientnet export works again
* Simplify shift emulation in wgsl
* Delete test file
* Fix bigger than u32 u32 literal
* Why was skip copies added here?
* Python3.12 for webgpu tests
* Fix model export syntax error
* Get test ops passing with some skips
* Fix lint
* Much simpler shift
* Run more tests
* Timestamp queries are not supported in CI, so skip search tests
* All fancy indexing passing
* r is ctx
* Run more dtype tests by using is_dtype_supported
* Cleanup ulong shift rendering
* UPat -> Pat, UOps -> Ops
* Pat -> UPat
* Refactor render_ushift if-else
* Pattern to avoid ulong mul
* Remove vals_dtype
* is_nan trick + rewrite, test_isnan passing
* Rewrite a * select(1, nan, gate) -> select(a, nan, gate)
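
A small sketch of why the `a * select(1, nan, gate) -> select(a, nan, gate)` rewrite is value-preserving, together with the usual `x != x` NaN check. It assumes `select` uses the WGSL argument order `select(f, t, cond)`. The Python helpers are stand-ins for illustration, not the renderer's code.

```python
import math

def select(f, t, cond):
    # WGSL-style select: returns t when cond is true, otherwise f.
    return t if cond else f

def is_nan(x):
    # NaN is the only float value that is not equal to itself.
    return x != x

def before(a, gate):
    # Original form: multiply by 1 (keep a) or by nan (poison a).
    return a * select(1.0, math.nan, gate)

def after(a, gate):
    # Rewritten form: no multiply, pick a or nan directly.
    return select(a, math.nan, gate)

def same(x, y):
    # Equality that treats two NaNs as equal.
    return (is_nan(x) and is_nan(y)) or x == y
```

Both forms give nan when `gate` is true and `a` otherwise, since `a * 1.0 == a` and `a * nan` is nan for every `a`, including 0 and inf.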
* No arg, just op
* Support char, uchar, short, ushort
* Run test_index_mnist now that we have uint8
* Fix pylint
* Save 3 lines by using base Compiler
* No more long emulation
* Remove fixup_binops
* No more external_local_bufx wgsl specific cstyle modif, use base extra_pm
* Simpler, faster copyin/out
* Skip some new tests that use long
* Fix typo
* copyout touchup
* Save lines by using render_cast
* WebGL is not supported in core, delete it from is_dtype_supported
* More narrow test skips for some unary tests
* TernaryOps, UnaryOps -> Ops
* TinyGrad supports WebGPU
* StableDiffusion demo: f16tof32 gpu is a lib, update UI
* Packed load/store, no more scale_size, no core tinygrad changes
* Rename copyin, copyout
* Device -> dev
* Fix lint
* Pattern matcher rule for packed load/store
* Refactor
* Shorter packed load/store
* this should fix lint
* Fix mypy
* SD compile script working
* New SD webgpu UI
* New default prompt
* New SD weights
* Fix title when webgpu not available
* Run symbolic tests, simplify is_nan, use round_up
* Show step time on UI
* Bump minimum wgpu version to v0.19
* Fix latent
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* Remove unnecessary if statement
In all paths where something_changed was set to True, remainder is
appended so the list can't be empty
* Working version of improved mod folding
* Fix offset calculation
Passing fuzz_symbolic.py to 130_000 so far
Added an extra test
* Cleaner offset calculation