* cvar dtype:DType|tuple[DType, ...]|None=None
* fmt
* add a test
* list typeguard as a dep for CI
* extra step to install mypy
* fix venv
* ci fixes
* mv typeguard to testing install group
* simpler TYPED=1 test
* add typeguard to lint group
* bump
* thou hast implemented functions
* hacked in domain support
* some clean ups
* hack quantize_onnx_test too
* add helper lol, why onnx tests why
* better dispatcher, but need tests and better naming
* flaky ci
* change some names
* small clean ups
* make it easier to clean up tests once ORT supports 1.18.0
* nits
* fix bug where Softmax_1 was registered in onnx_ops
* need a default value
* resolve_const is better name
* fix OnnxRunner.to
* use proper domain names
* truncate fp8
* fix
* maybe like that?
* fix linters
* ruff
* move from extra and add ml_types to tests
* minor changes
* str to dtypes and nan support
---------
Co-authored-by: pkotzbach <pawkotz@gmail.com>
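The typeguard commits above wire opt-in runtime type checking into CI. A minimal sketch of the idea, assuming a hypothetical `cvar` helper with the dtype annotation from the first commit (tinygrad's real `DType` lives in `tinygrad.dtype`):
```
import os
from typeguard import typechecked  # the testing/lint dep added above

class DType: ...  # stand-in for tinygrad.dtype.DType

@typechecked
def cvar(name: str, dtype: DType | tuple[DType, ...] | None = None) -> str:
  # typeguard checks the annotation at call time and raises on mismatch
  return name

if os.getenv("TYPED", "0") == "1":  # the TYPED=1 gate from the commits above
  cvar("x", dtype=DType())    # ok
  cvar("x", dtype="float32")  # raises typeguard.TypeCheckError
```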
* add system json for mi300x mlperf
```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v5.0/tinycorp/systems/tinybox_8xMI300X.json training 4.1.0
INFO - System description checker passed for tinybox 8xMI300X
```
also removed the rocm from tinybox_red since we are not using it
* update mlperf-logging version
* fast idiv with tests and fuzzer
* Add todo comment
* Add env variable to toggle fast_idiv
* Move env check
* Add fuzz fast_idiv to ci
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
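fast_idiv replaces integer division by a constant with a multiply and a shift. The real rewrite lives in tinygrad's UOp graph behind the env toggle above; this standalone sketch just shows the magic-constant trick and the kind of fuzzing the CI step does:
```
import random

def magic(d: int, bits: int = 32) -> tuple[int, int]:
  # find (m, s) with x // d == (x * m) >> s for all 0 <= x < 2**bits
  for s in range(2 * bits + 1):
    m = ((1 << s) + d - 1) // d           # ceil(2**s / d)
    e = m * d - (1 << s)                  # rounding error introduced by m
    if e * ((1 << bits) - 1) < (1 << s):  # error never reaches the next quotient
      return m, s
  raise ValueError(f"no magic constants for d={d}")

# fuzz against Python's exact floor division
for _ in range(100_000):
  d = random.randrange(1, 1 << 16)
  m, s = magic(d)
  x = random.randrange(0, 1 << 32)
  assert x // d == (x * m) >> s, (x, d, m, s)
```
On real hardware the `x * m` product needs double-width arithmetic (a mulhi), which is where most of the implementation subtlety lives.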
* boom
* fix webgpu
* use exact variable names in test so that AI can read it more easily
* add tag for a specific test name, e.g. testing a specific dtype
* fix ruff
* astype everything
* dtype in array creation
* just arange
* is 67% considered fixed?
* move test up
* small cleanups
* share function
* add qgemm as well
* add qgemm too
* make sure qgemm comes out as int
* take out qgemm for now
* fixed test
* add correct qgemm
* addressing feedback here too, early naive fix for now
* simplify bias and c to be minimalistic enough to test correctness
* refactored qlinearops
* maybe these asserts aren't the best...
* fix test
* updated tests to cover new ops
* try to add to CI
* move test_onnx_ops into testextra/
* more attention tests
* qlinear_add atol=1
* attention still not fullllllly correct
* it is what it is
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
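The QLinear*/QGemm ops above all follow the dequantize → float op → requantize recipe from the ONNX spec. A rough numpy reference for QLinearMatMul, assuming uint8 tensors (rounding can disagree with ORT by one step, hence the atol=1 above):
```
import numpy as np

def qlinear_matmul(a, a_scale, a_zp, b, b_scale, b_zp, y_scale, y_zp):
  af = (a.astype(np.int32) - a_zp) * a_scale  # dequantize inputs
  bf = (b.astype(np.int32) - b_zp) * b_scale
  y = af @ bf                                 # float matmul
  q = np.round(y / y_scale) + y_zp            # requantize to the output scale
  return np.clip(q, 0, 255).astype(np.uint8)  # saturate, assuming uint8
```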
* Switch to dawn, all tests passing locally
* Use dawn-python
* Skip failing test
* Skip midcast and fix timestamp on metal ci
* Autogen webgpu
* Try fetch dawn lib again
* /usr/lib
* Without lib prefix
* Test autogen diff
* Delete webgpu support, move everything to ops_webgpu
* mypy fix
* Simplify, refactor
* Line savings
* No ResultContainer
* Type annotation for result
* Some more simplifications
* Why was this explicit sync used at all?
* Refactor: delete functions that are only used once
* Create shader module inline
* Clear unit tests cache, maybe that solves it
* That wasn't it
* Try deleting cache to pass failing weight compare
* weights_only=False for pytorch 2.6
* Simplify ctype array creation
* Remove nanosecond precision timestamps
* Simplify error handling
* Refactor, add back type annotations
* Deleted custom submit function, refactor
* read_buffer simplify
* Fix use after free, refactor
* Simplify supported_features
* Runtime docs
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
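The `weights_only=False` commit works around PyTorch 2.6 flipping the default of `torch.load(..., weights_only=...)` to True, which rejects checkpoints containing pickled non-tensor objects (the path below is a placeholder):
```
import torch

# PyTorch 2.6 defaults weights_only=True; opt back into full unpickling
# only for checkpoints you trust, since it runs arbitrary pickle code
state = torch.load("ckpt.pt", weights_only=False)
```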
* LLVM JIT
* Autogen LLVM
* Update autogen
* Move things around
* even more non-determinism
* windows
* more autogen weirdness
* more windows stuff
* blind windows development try 2
* more blind windows development
* even more blind windows development
* maybe i should just set up a windows vm...
* why can't everyone just use sysv abi?
* cleanup debugging stuff
* unused import
* icache flushing isn't required on x86
* merge jit_nt and jit_unix
* more
* Temporary hack to not segfault
* better error
* bad conflict resolution
* Attempt to simplify support/llvm.py
* More refactoring
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
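tinygrad's LLVM JIT calls libLLVM through autogenerated ctypes bindings; purely as an illustration of the same idea, here is a minimal MCJIT sketch using llvmlite instead (the IR and function name are invented):
```
import ctypes
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

ir = """
define i64 @add(i64 %a, i64 %b) {
entry:
  %s = add i64 %a, %b
  ret i64 %s
}
"""
mod = llvm.parse_assembly(ir)
mod.verify()
tm = llvm.Target.from_default_triple().create_target_machine()
engine = llvm.create_mcjit_compiler(llvm.parse_assembly(""), tm)  # backing module
engine.add_module(mod)
engine.finalize_object()  # emits machine code into executable memory

add = ctypes.CFUNCTYPE(ctypes.c_int64, ctypes.c_int64, ctypes.c_int64)(
  engine.get_function_address("add"))
print(add(2, 3))  # 5
```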
* connect to gpu
* rlc init?
* gfx comp start init
* early init is hardcoded, some progress with fw
* gart
* progress, next mqd
* ring setup, still does not execute anything
* ugh write correct reg
* pci2: vm
* pci2: start psp
* vm seems to work
* pci2: gfx start
* pci2: fix psp ring resp
* pci2: try ring
* pci2: mes and some fixes
* pci2: some progress
* pci2: progress
* pci2: mm
* pci2: discovery
* pci2: correct apertures
* pci2: b
* pci2: i
* pci2: l
* pci2: o
* pci2: cmu
* pci2: mes_kiq works
* pci2: mes
* pci2: kcq does not work(
* pci2: unhalt gfx
* ops_am
* minor
* check if amdgpu is there, or we will crash
* bring back graph, it just works
* less prints
* do not init mes (not used)
* remove unused files
* ops_am: start move into core
* ops_am: works
* clocks, but still slower
* faster + no mes_kiq
* vm frags + remove mes
* cleanup fw
* gmc tiny cleanup
* move to ops_amd
* comment out what we don't really need
* driverless
* close in speed
* am clean most of ips
* gmc to ips
* cleaner
* new vm walker
* comment old one
* remove unused autogens
* last write ups
* remove psp hardcoded values
* more
* add logs
* ih
* p2p and sdma
* vfio hal and interrupts
* smth
* amd dev iface
* minor after rebase
* bind for sdma
* Revert "bind for sdma"
This reverts commit a90766514d.
* tmp
* debug new mm
* ugh, fixed allreduce hangs
* p1
* works
* no pci.py
* cleaner a bit
* smth
* tiny cleanups
* cleaner a bit
* pciiface
* linter
* linter 2
* linter 3
* linter
* pylint
* reverted unrelated changes
* unrelated
* cmp tool
* ugh wrong fw
* clockgating
* unrelated
* alloc smaller chunks
* this
* opt sigs
* collect stat
* ops
* upd
* proclogs
* proclogs2
* vfio
* ruff
* linter pylint
* oops
* mypy p1
* mem fix
* mypy p2
* mypy p3
* mypy p4
* correct
* minor
* more tests
* linter in tests
* pci_regs header
* minor write up
* setup
* do not require libs
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
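The AM backend above drives the GPU without the amdgpu kernel driver by programming it over PCI directly. The Linux-standard part of that, mapping a BAR through sysfs, looks roughly like this (BDF and offsets are placeholders; needs root and a device not bound to amdgpu):
```
import mmap, os

BDF = "0000:03:00.0"  # placeholder bus:device.function
path = f"/sys/bus/pci/devices/{BDF}/resource0"  # BAR0 register aperture

fd = os.open(path, os.O_RDWR | os.O_SYNC)
bar0 = mmap.mmap(fd, os.fstat(fd).st_size, mmap.MAP_SHARED,
                 mmap.PROT_READ | mmap.PROT_WRITE)

def rreg(off: int) -> int:             # 32-bit MMIO read
  return int.from_bytes(bar0[off:off+4], "little")

def wreg(off: int, val: int) -> None:  # 32-bit MMIO write
  bar0[off:off+4] = val.to_bytes(4, "little")
```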
* start work on new gradient
* more correct
* working tests
* more tests
* work
* add (failing) gradient test
* add view and reduce gradient
* test_add works, many failing test_ops
* add max and reduce max
* add max and reduce max
* 129 failing
* 108 failed
* better view drawing
* 101 failed
* i got 99 failures
* 94 failures
* it's tons of terrible code, but only 50 tests fail
* only 19 failures
* same 19 but shorter
* minimal doesn't matter
* shorter
* lil simpler
* simpler
* simpler
* simpler
* 13 test failures
* nine tests fail
* all ops tests pass
* add contiguous gradient + fix sched tests
* faster by removing toposort calls
* missed one
* add jax to testing
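The gradient rewrite computes derivatives by walking the compute graph in reverse topological order and accumulating chain-rule contributions. A toy sketch of that idea (not tinygrad's UOp-based implementation), cross-checked with `jax.grad` since jax was just added to the testing deps:
```
import jax
import jax.numpy as jnp

# toy reverse-mode autodiff over a tiny expression graph
class V:
  def __init__(self, val, parents=()):
    self.val, self.parents, self.grad = val, parents, 0.0
  def __add__(self, o): return V(self.val + o.val, [(self, 1.0), (o, 1.0)])
  def __mul__(self, o): return V(self.val * o.val, [(self, o.val), (o, self.val)])

def backward(out):
  # build reverse topological order once, then accumulate chain-rule terms
  topo, seen = [], set()
  def visit(v):
    if id(v) in seen: return
    seen.add(id(v))
    for p, _ in v.parents: visit(p)
    topo.append(v)
  visit(out)
  out.grad = 1.0
  for v in reversed(topo):
    for p, local in v.parents: p.grad += v.grad * local

x, y = V(2.0), V(3.0)
z = x * y + x
backward(z)

# cross-check the same expression with jax
gx, gy = jax.grad(lambda x, y: x * y + x, argnums=(0, 1))(2.0, 3.0)
assert jnp.allclose(gx, x.grad) and jnp.allclose(gy, y.grad)
print(x.grad, y.grad)  # 4.0 2.0
```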