* add f16/f32 mfma support for MI300
- add 16x16 mfma shape support for f16 with f32 acc
- add ops_python mfma emulation
- add arch to AMDRenderer
* minor cleanup
* minor cleanup
* add mfma emulation task to ci
* add back todo
* hotfix: comment
* add tc=3 job to ci
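For reference, a minimal numpy sketch of the math a single 16x16x16 MFMA tile performs with f16 inputs and an f32 accumulator, which is what the ops_python emulation above has to reproduce; it ignores the per-lane register layout of the real instruction and is not the emulator's actual code:

```python
import numpy as np

def mfma_16x16x16_f16_f32(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
    # one 16x16x16 tile: D = A @ B + C, f16 inputs accumulated in f32
    assert a.shape == (16, 16) and b.shape == (16, 16) and c.shape == (16, 16)
    return a.astype(np.float32) @ b.astype(np.float32) + c.astype(np.float32)
```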
* sqtt
* docs
* multi-device
* ProfileSQTTEvent
* exec update
* 256mb default
* don't let people hang their gpus
* bitfields from autogen
* asic info from mesa
* more bitfields from autogen
* SQTT_ITRACE_SE_MASK
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
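The bitfield commits above pack SQTT register fields such as SQTT_ITRACE_SE_MASK from autogenerated headers. A generic sketch of that kind of field insertion is below; the shift/width values and example are placeholders, not the real register layout:

```python
def pack_field(reg: int, value: int, shift: int, width: int) -> int:
    # insert `value` into bits [shift, shift+width) of `reg`, masking off overflow
    mask = ((1 << width) - 1) << shift
    return (reg & ~mask) | ((value << shift) & mask)

# placeholder example: enable instruction trace on shader engines 0 and 1
reg = pack_field(0, 0b11, shift=0, width=4)
```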
* add torch inplace tests
* first set of tests passing
* wrap all inplace funcs, add more tests
* fixes and wrap more functions
* fix all uint8 tests to avoid slow tests
* fix the one test
* another test, another fix
* and one more, works for ddp now
* something on contiguous, cleanup
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
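One common way to wrap an in-place torch function is to route it through the out-of-place version and copy the result back into the input tensor, which is roughly what these tests exercise; the helper below is a hypothetical sketch, not the backend's actual wrapper:

```python
import torch

def wrap_inplace(out_of_place):
    # hypothetical: implement foo_() as self.copy_(foo(self, ...)), returning self
    def inplace(self: torch.Tensor, *args, **kwargs) -> torch.Tensor:
        self.copy_(out_of_place(self, *args, **kwargs))
        return self
    return inplace

add_ = wrap_inplace(torch.add)  # e.g. an add_ built from torch.add
```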
* enable loading >2 GiB buffer from disk on macOS
* handle None case raised by mypy
* add test
* revert fix to repro bug in CI
* tell CI to run a unit test for macOS
* reapply fix
* yml changes
* torch backend remove meta decomps and add test
* torch backend bump timeout for tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
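Assuming the macOS limitation here is the usual one (a single read(2) of more than INT_MAX bytes fails with EINVAL), a chunked read like the sketch below works around it; the helper name and chunk size are illustrative, not the actual fix:

```python
import os

CHUNK = 1 << 30  # 1 GiB per syscall, comfortably under the ~2 GiB cap

def read_exact(fd: int, size: int) -> bytes:
    # read `size` bytes in chunks so no single read(2) exceeds the macOS limit
    out = bytearray()
    while len(out) < size:
        chunk = os.read(fd, min(CHUNK, size - len(out)))
        if not chunk: break  # EOF
        out.extend(chunk)
    return bytes(out)
```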
* fixes from chatgpt for torch backend
* shrink support
* add stride support
* comment cleanup
* a few more
* work
* import the stream hack
* llvm multi auto
* rig up torch's testing framework [pr]
* support more movement ops
* dec on expand
* fix tests
* work
* fix tests
* a few more
* decomps + opt hook
* installed pytest
* boom
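One standard way to express a strided slice using only pad/reshape/shrink-style movement ops, relevant to the "add stride support" commit above though not necessarily how it is implemented there, is shown in this numpy illustration:

```python
import numpy as np

def every_kth(x: np.ndarray, k: int) -> np.ndarray:
    # emulate x[::k] using only pad, reshape, and a shrink-style unit slice
    padded = np.pad(x, (0, (-x.shape[0]) % k))  # pad length up to a multiple of k
    return padded.reshape(-1, k)[:, 0]

assert np.array_equal(every_kth(np.arange(10), 3), np.arange(10)[::3])
```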
* fix webgpu
* use exact variable names in tests so that AI can read them more easily
* add a tag for a specific test name, e.g. testing a specific dtype
* fix ruff
* astype everything
* dtype in array creation
* just arange
* is 67% considered fixed?
* move test up
* small cleanups
* share function
* add qgemm as well
* add qgemm too
* make sure qgemm comes out as int
* take out qgemm for now
* fixed test
* add correct qgemm
* addressing feedback here too, early naive fix for now
* simplify bias and c to be minimalistic enough to test correctness
* refactored qlinearops
* maybe these asserts aren't the best..
* fix test
* updated tests to cover new ops
* try to add to CI
* move test_onnx_ops into testextra/
* more attention tests
* qlinear_add atol=1
* attention still not fully correct
* it is what it is
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
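As a reference for the qgemm/QLinear work above: the ONNX QLinearMatMul op dequantizes its inputs, multiplies in float, and requantizes with saturation. A naive numpy version is sketched below (exact rounding/saturation follow the ONNX spec, and small rounding differences are why the tests above settle for atol=1):

```python
import numpy as np

def qlinear_matmul(a, a_scale, a_zp, b, b_scale, b_zp, y_scale, y_zp, out_dtype=np.uint8):
    # naive reference: dequantize inputs, matmul in float, requantize and saturate
    af = (a.astype(np.int32) - int(a_zp)) * np.float32(a_scale)
    bf = (b.astype(np.int32) - int(b_zp)) * np.float32(b_scale)
    y = np.rint(af @ bf / np.float32(y_scale)) + int(y_zp)
    info = np.iinfo(out_dtype)
    return np.clip(y, info.min, info.max).astype(out_dtype)
```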
* Solve dims too large errors on webgpu
* Simplify divisor find
* Test square root divisor
* Fix lint
* Refactor into group_dims and split_dims
* Refactor
* Fix lint
* Add back max check in _group_dims
* Prefer grouping over split
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
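The WebGPU fix above groups and splits launch dimensions so that none exceeds the device limit; the sketch below covers only the split half, searching for a divisor near the square root so both factors stay small (hypothetical helper, not the actual group_dims/split_dims code):

```python
def split_dim(n: int, limit: int) -> tuple[int, int]:
    # split a dimension n > limit into (a, b) with a * b == n and both under limit,
    # searching downward from sqrt(n) so the two factors are as balanced as possible
    if n <= limit: return (n, 1)
    for d in range(int(n ** 0.5), 1, -1):
        if n % d == 0 and n // d <= limit:
            return (n // d, d)
    raise ValueError(f"no exact split of {n} fits under {limit}")
```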
* add patches
* add osx test in ci
* macos specific uvm, gpfifo mask
* only do that for now
* Revert "add patches"
This reverts commit 80d3112a57.
* use fork for now
* workflow only one worker
* merge osxtests with tests
* Revert "merge osxtests with tests"
This reverts commit 3461c8f46c.
* macos pagesize 16384
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
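The "macos pagesize 16384" commit reflects that Apple-silicon macOS uses 16 KiB pages rather than 4 KiB, so allocation sizes and mapping offsets have to be rounded to the host page size, roughly like this (illustrative helper only):

```python
import mmap

PAGE = mmap.PAGESIZE  # 16384 on Apple-silicon macOS, typically 4096 elsewhere

def round_up_to_page(size: int) -> int:
    # round an allocation size up to a whole number of host pages
    return (size + PAGE - 1) & ~(PAGE - 1)
```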
- Ensure the selected backend environment variable is persisted to the next step via $GITHUB_ENV.
- It doesn't actually persist on Windows unless the shell is explicitly set to bash.
- Add an assertion to ensure the selected backend is actually used.
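For context, GitHub Actions only persists environment variables across steps when they are appended as KEY=VALUE lines to the file named by $GITHUB_ENV; a plain export inside a step does not survive. A minimal Python illustration of that mechanism (the variable name/value are placeholders, and the workflow itself does this from a bash step as noted above):

```python
import os

# append KEY=VALUE to the $GITHUB_ENV file so later workflow steps see the variable
with open(os.environ["GITHUB_ENV"], "a") as f:
    f.write("BACKEND=CPU\n")  # placeholder name and value
```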
* WebGPU f16 support
* Don't enable f16 yet
* dtype tests passing after bitcast fix
* Maybe all WebGPU green?
* Require shader-f16 in examples
* Minor wgsl touchup
* 1 line shorter
* Simpler
* Add transcendental support
* log2 NaN location mismatch on Vulkan
* NaN skips
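The bitcast fix above concerns reinterpreting f16 values as 16-bit integers without conversion; a minimal numpy round-trip of that operation (not the WGSL code itself) looks like:

```python
import numpy as np

vals = np.array([1.0, -2.5, 65504.0], dtype=np.float16)  # 65504 is the largest finite f16
bits = vals.view(np.uint16)                    # bitcast: same bytes, reinterpreted as u16
assert np.array_equal(bits.view(np.float16), vals)  # lossless round-trip
```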