* print inputs to get_program in process replay [pr]
* colors
* keep dataclass default escapes
* Revert "keep dataclass default escapes"
This reverts commit c6db7e8a7a.
* note for ast_repr
* add that back
* initial commit
* add qr + expand svd to full matrix
* add odd number support
* add linalg tests
* qr supports dims of arbitrary size
* add qr tests
* svd supports dims of arbitrary size
* small cleanup
* improvements over svd batch handling
* improve linalg tests
* make u_pad match q shape
* add nonfull matrix tests
* little less verbose nonfull svd test
* added dtypes on svd + return vt instead of v
* lint
* more lint
* lint + set seed
* small fix
* small lint
* lint
* add int casting to indices and shapes
* remove int from shape tuple in svd
* small cleanup
* add return types
* reuse inverse_permute
* refactoring
* whitespace
* remove regularization term to prevent bad outputs on ill conditioned matrices
* remove seed
* refactor
* lint
* refactor
* spacing
* remove clone
* line reduction
* smarter heuristic for iterations_per_round
* add big test
* lint
* turns out no constant needed?
* wrap tests
* some small matrices need the constant
* remove realize
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* WebGPU on Windows
* Fix dawn-python install
* New test
* pydeps
* Minor fix
* Only install dawn-python on windows webgpu
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* fix process replay diff in PYTHON device [pr]
The PYTHON backend pickles and encodes UOps, so the encoded binary can't
be directly diffed in process replay.
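A minimal sketch of the idea, not the actual process replay code: assuming
the payload is a base64-encoded pickle of a list of ops, decode both sides
back into Python objects and diff their reprs. The `encode` and
`readable_diff` helpers and the tuple-shaped ops are hypothetical stand-ins
for illustration.

```python
import base64
import difflib
import pickle

def encode(ops):
    # Assumed encoding: pickle the ops, then base64 the bytes into a string.
    return base64.b64encode(pickle.dumps(ops)).decode()

def readable_diff(old_src, new_src):
    # Decode both payloads back into Python objects before comparing.
    # Only unpickle payloads you produced yourself; pickle is not safe on untrusted input.
    old = pickle.loads(base64.b64decode(old_src))
    new = pickle.loads(base64.b64decode(new_src))
    # Diff one repr per op so the output is line-oriented and human-readable.
    old_lines = [repr(op) for op in old]
    new_lines = [repr(op) for op in new]
    return list(difflib.unified_diff(old_lines, new_lines, lineterm=""))

# Hypothetical ops standing in for real UOps.
old_src = encode([("LOAD", "a"), ("ADD", 1), ("STORE", "b")])
new_src = encode([("LOAD", "a"), ("MUL", 2), ("STORE", "b")])
diff = readable_diff(old_src, new_src)
print("\n".join(diff))
```

Diffing the encoded strings directly would flag nearly every character,
while the decoded form isolates the one op that changed.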
* note
* try
* ruff check --fix
* no skip test
* hmmmmmmm I don't get this D:
* run CI again
* why is PYTHON device faster than CPU?
* run ci again and fix lint
* actually doesn't PYTHON device make sense here?
* see cpu speed again
* Revert "see cpu speed again"
This reverts commit 1e366f2256.
* trigger CI
* pretty good
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* add mem_layout
* ui
* cleanup
* work
* debugLine work and expander
* tooltip style
* real expand device
* wheel does one thing
* diff
* shows llama oom
* add y axis
* mypy chill
* work
* unittests for the memory layout
* kernel.py no longer permutes reduce axis [pr]
* delete tests that handcode uops
* regen of sops is broken...
* put import back
* just remove that
* disable those tests
* fix extract_dataset + tests
* add CI
* sops.gz itself is same as master
* yml + gzip -c + ge
* don't commit that
* bump limit to 1000
* axis=7
* test_tiny
* refactor count_float4 to take uops as input instead of kernel
* remove some calls to linearize in test_linearizer
* remove some more calls
* remove one more call
* squash commits
* temp fix for const tensor
* actually realizing float16 can only happen in raw_data
* .float -> cast(float) to rerun CI
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* proposal: add option to override opts in the get_program API
* update test_linearizer_rewrite
* state in uops
* update process_replay and names
* empty isn't none
* fix process replay
* minor cleanup on test_tensor_core_opts tests
Tests now notify when skipped
Previously, they were skipped silently if the backend didn't have half
precision and accumulation
Also cleaned up atol and rtol setup
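A minimal sketch of the skip pattern, assuming plain `unittest`; the
`BACKEND_HAS_HALF` flag and the `TestTensorCoreSketch` class are hypothetical
and are not the real tests. Calling `skipTest` with a reason records the skip
in the test result instead of letting the test return early and count as a
pass.

```python
import unittest

# Hypothetical capability flag; a real test would query the backend instead.
BACKEND_HAS_HALF = False

class TestTensorCoreSketch(unittest.TestCase):
    def test_half_accumulation(self):
        if not BACKEND_HAS_HALF:
            # Reported as skipped, with the reason, rather than silently passing.
            self.skipTest("backend lacks half precision")
        self.assertTrue(True)

# Run the case programmatically so the recorded skip can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTensorCoreSketch)
result = unittest.TestResult()
suite.run(result)
for test, reason in result.skipped:
    print(f"skipped {test.id()}: {reason}")
```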
* refactor test_tensor_core_opts_group
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>