* work
* no itertools + top down pass
* clean viz
* python can do that
* webgpu
* gbarrier of gbarrier is gbarrier
* device can be tuple
* bug in toposort
* failing test for gated toposort
* contiguous of gbarrier is gbarrier
* check for binops
* Revert "check for binops"
This reverts commit 53e3cdf720.
* viz + match on gbarrier, self exists by default
* alt
* green now
* cleanup
* propagate use_tensor_cores
* add use_tensor_core to arg in test and search
* bugfix
* get TC val from ContextVar in search
* revert minor space change
* add tc emulation test to ci and benchmark
* revert
* revert whitespace change
* remove test for ptx
* add comment and remove llvm test run
* hotfix amd and amd_llvm
* bf16 not supported in ci
* hotfix amd_llvm is not a device
* remove default
* dont gate on ci and amd_llvm
* minor cleanup
* skip bf16 tc test for amd_llvm
* fix helper_tc_allclose
* cleanup
* hotfix
* cleanup
* cleanup
* check real buffer and add cast for bf16
* cleanup
* fix padded for ops_python
* avoid assert on amd emulated tc
* swap dimensions
* revert, should have nothing to do with padded
* revert fix, should not go in this pr
* remove skip
* init
* add expected failure to correctly track progress
* hotfix
* skip for amd_llvm as well
* add skip
* add pr number
* move comment to amd test
* change reason
* set pad to 3 for amd padded tc test
* change pad for amd regardless of CI
* test tc padded uops and correctness separately
* add test_tensor_cores_padded_uops test to ci
* remove redundant check for amd device
* cleanup
* Kernel.apply_opts [pr]
updated all `for opt in`. also updated a few test_linearizer tests to not implicitly depend on hand_coded_optimization
* not you yet
* dont test bf16 for emulated amd tc
* skip bf16 tc test in ci
* skip bf16 for AMD in test_tensor_cores_codegen
* add simple bf16 gemm test to benchmark
* init lds noop and lds_0 spec
* refactor lds helper test
* fix typo
* test all lds at the same time
* change comment
* comment
* start test_lds_full
* test_lds_tc
* add tc spec
* Solve dims too large errors on webgpu
* Simplify divisor find
* Test square root divisor
* Fix lint
* Refactor into group_dims and split_dims
* Refactor
* Fix lint
* Add back max check in _group_dims
* Prefer grouping over split
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* add some docs about speed [pr]
* better torch gemm
* enable locals on llvm/clang
* disable locals for beam speed on LLVM/CLANG
* 0x20 alignment in llvm allows ymm use
* Switch to dawn, all tests passing locally
* Use dawn-python
* Skip failing test
* Skip midcast and fix timestamp on metal ci
* Autogen webgpu
* Try fetch dawn lib again
* /usr/lib
* Without lib prefix
* Test autogen diff
* Delete webgpu support, move everything to ops_webgpu
* mypy fix
* Simplify, refactor
* Line savings
* No ResultContainer
* Type annotation for result
* Some more simplifications
* Why was this explicit sync used at all?
* Refactor: delete functions that are only used once
* Create shader module inline
* Clear unit tests cache, maybe that solves it
* That wasn't it
* Try deleting cache to pass failing weight compare
* weights_only=False for pytorch 2.6
* Simplify ctype array creation
* Remove nanosecond precision timestamps
* Simplify error handling
* Refactor, add back type annotations
* Deleted custom submit function, refactor
* read_buffer simplify
* Fix use after free, refactor
* Simplify supported_features
* Runtime docs
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* new lines that exist in codegen/ops
* update tests
* update sops.gz (13071 -> 13070 asts)
* fix viz too
* remove that TODO
* diff pruning
* mask assert + device
* work
* diff pruning
* re: fix viz too
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* rename Opt amt to arg
* ignore_beam_cache for test_tiny
* move ignore_beam_cache to test_tiny
* move to separate pr
* revert space change
---------
Co-authored-by: chenyu <chenyu@fastmail.com>