* fixed xmx demo
* i think i'm invoking the DPAS but it's slow
* compiler build arg to stop register spilling, indicated where to fix flop counter
* don't mind this
* do NOT mind me
* do not mind me
* do not view
* i will add bf16 later
* in process of figuring out tc fields
* we figured out the fields!!!
* added check for cl device vendor, added seperate IntelRenderer
* remove tc thread_local_aliases
* cleaning debris before draft pr
* edits for linter
* deduping and checking device extensions
* i will find more line reductions in other places
* before merge upstream
* double grf size in compiler to fix register spilling (bandaid), device checking changes
* tc python emulation
* fixed emulation
* tests for emulated intel tensor core
* TC=0, 1 working on upstream, fixed perf
* test
* debris
* check for specialized cl device when we canonicalize device
* bf16 support, tc=3 test added
* address tests
* revert half2 loads on intel tc, cleanup
* linter
* fold_expanded revert
* lint, whitespace fix
* cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too
* make line shorter, no need for noqa E501
* removed device intel
* fix python emulation
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* github api infra
* process replay is 3 parts now
* parse benchmarks
* add gh_token
* complete diff
* move process replay tests
* last successful run
* add tempdir
* skip master
* assert process replay asserts
* one ci job is fine
* test: Revert "separate process replay main loop (#5734)"
This reverts commit 94d578396f.
* mac sed needs that
* Revert "test: Revert "separate process replay main loop (#5734)""
This reverts commit e4ad7684d5.
* disable process replay capture
* save time
* amd is tiny
* send to /dev/null
* remove check_process_replay
* that can go to the top
* add assert back
* [run_process_replay]
* checkout code [run_process_replay]
* temp [run_process_replay]
* revert temp [run_process_replay]
* ahh this is why [run_process_replay]
* revert temp [run_process_replay]
* replays
* what's in there
* can it be up there
* sha is enough
* insert sha as the key
* fix str
* update reset utils
* that nested try/except was terrible
* github_context can go
* more transcend math tests in ci
test large input to trig functions that hit different reduction algo, and test TRANSCENDENTAL=2 for all backend
* no CUDACPU
* try that