* Start from andredaprato:webgpu-clean
* Fix infs
* inf wgsl function is not needed
* Emulated ulong for threefry, more tests passing
* Randomness tests passing
* Update model export to support new changes in webgpu, efficientnet export works again
* Simplify shift emulation in wgsl
* Delete test file
* Fix bigger than u32 u32 literal
* Why was skip copies added here?
* Python3.12 for webgpu tests
* Fix model export syntax error
* Get test ops passing with some skips
* Fix lint
* Much simpler shift
* Run more tests
* Timestamp queries are not supported in CI, so skip search tests
* All fancy indexing passing
* r is ctx
* Run more dtype tests by using is_dtype_supported
* Cleanup ulong shift rendering
* UPat -> Pat, UOps -> Ops
* Pat -> UPat
* Refactor render_ushift if-else
* Pattern to avoid ulong mul
* Remove vals_dtype
* is_nan trick + rewrite, test_isnan passing
* Rewrite a * select(1, nan, gate) -> select(a, nan, gate)
* No arg, just op
* Support char, uchar, short, ushort
* Run test_index_mnis now that we have uint8
* Fix pyling
* Save 3 lines by using base Compiler
* No more long emulation
* Remove fixup_binops
* No more external_local_bufx wgsl specific cstyle modif, use base extra_pm
* Simpler, faster copyin/out
* Skip some new tests that use long
* Fix typo
* copyout touchup
* Save lines by using render_cast
* WebGL is not supported in core, delete it from is_dtype_supported
* More narrow test skips for some unary tests
* TernaryOps, UnaryOps -> Ops
* TinyGrad supports WebGPU
* StableDiffusion demo: f16tof32 gpu is a lib, update UI
* Packed load/store, no more scale_size, no core tinygrad changes
* Rename copyin, copyout
* Device -> dev
* Fix lint
* Pattern matcher rule for packed load/store
* Refactor
* Shorter packed load/store
* this should fix lint
* Fix mypy
* SD compile script working
* New SD webgpu UI
* New default prompt
* New SD weights
* Fix title when webgpu not available
* Run symbolic tests, simplify is_nan, use round_up
* Show step time on UI
* Bump minimum wgpu version to v0.19
* Fix latent
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* new cloud is cloudy [pr]
* waste lines to add security
* safety, with speed and less lines
* timing and del
* lines
* cleanups
* restore CloudSession
* bump to 3.10
* quotes
* renderer security
* Revert "Revert "move up migrate + new gated fold (#7403)" (#7406)"
This reverts commit ea5654a9bc.
* test padded in emulation too
* bring back early folding
* add pickle support for pattern matchers [run_process_replay]
* cleaner and all
* no closures
* fix tests
* revert that
* final
* cleaner
* python 3.8 fix
* add round trip back
* this
* waste lines on this. that's the final line count
* max print better
* more targetted fix
* regrettably add 3.8 support
* real minimum cstyle change
* make it match
* bring back DEFINE_GLOBAL store marking writable
* bump line count to 9800
* closer
* precompute don't render
* cast/bitcast too
* smem_align
* vectorize
* more pr match
* remove that test
* less PR diff
* almost working with relu, even hackable... but acc size is wrong, fix needed
* upcast based on threads, change thread size to 4x4
* revert wrongfully commented assert
* fix tc load indexing
* modify for size 8
* fix bug for size 8
* Revert "fix bug for size 8"
This reverts commit cdb3f5df85.
* Revert "modify for size 8"
This reverts commit 3ef0904bd9.
* good kernel with changes in lowerer
* revert "good kernel with changes in lowerer"
This reverts commit 975e2b5a4e.
* good kernel for relu!
* refactor lowerer changes
* add amx context var to helper
* clean up amx flag
* improve lowerer changes readability
* improve check for amx
* revert lowerer if
* add float4 type rendering for clang
* add amx definitions
* enable indexing for clang if amx
* working amx example, wrong because of dims
* almost works for float 16, need to spot using double load in amx
* cleaner render_kernel
* revert chages in simple_matmul and delete env
* add new var upcast_offset to get_optimized_ast
* change axis for axes
* invert if in rendering phi
* fix some bugs
* fix linearizer tests
* fix vec/get pat for amx
* remove clang tc if amx is disabled
* add ops_python support
* refactor into one complementary function in ops_python
* add job for EMUALTE_AMX
* improve checking for AMX in UPCAST and TC extra ops
* fix lint issue
* commit before refactor into autocontained AMX
* start refactor by removing special rendering for AMX
* all ready for amx handcoded kernel
* working poc, most straightforward amx support
* avoid local opts for tc if amx
* fix merge bugs
* skip test for clang
* skip tc hand-coded opts if amx
* remove hardcoded ops_python values
* remove hardcoded sizes for amx kernel
* fix ops_python bug where dim was hard-coded
* change contract for vectorize
* working without changes in lowerer
* revert changes in gep rendering
* fix ops_python
* modify comment
* skip test if clang for different type accumulation
* move rename and bug for seperate pr
* fix wrong path for test
* addmm not implemented in torch for cpu
* change struct for vector; equally slow but cleaner
* revert modified test
* simply wmma rendering
* minor change
* noqa:501
* add length 16 for AMX
* fix vectorized half issue
* fix error
* remove comment
* change set for dedup
* split test of tensor_core_extra_ops so that cases that dont require locals run for AMX
* add amx reference
* load acc into amx registers
* fix dtype rendering and remove noqa
* moved tests change into another pr
* add real AMX job for CI and fix bug
* fix ops_python bug
* fix test class
* remove real AMX tests and fix uops_stats test
* remove wrong test
* acc folding
* hotfix: bug
* fix float4 tests for amx
* hack for fixing flops counting
* hotfix: mypy
* add flop counts test for amx
* improve test_float4_multidim_amx
* improve test_float4_multidim_amx
* improve test_float4_multidim_unaligned_load_amx
* nits tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>