* fast idiv with tests and fuzzer
* Add todo comment
* Add env variable to toggle fast_idiv
* Move env check
* Add fuzz fast_idiv to ci
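Note: a minimal sketch of the general multiply-and-shift technique that "fast idiv" usually refers to, for unsigned values only. It is an illustration under that assumption, not the code from this change; the names `magic_for_divisor`, `fast_idiv`, and `fuzz` are hypothetical, and the env variable toggle is not modeled.

```python
import random

def magic_for_divisor(d: int, nbits: int = 32) -> tuple[int, int]:
    """Return (m, s) with (x * m) >> s == x // d for every 0 <= x < 2**nbits."""
    if d <= 0:
        raise ValueError("divisor must be positive")
    # s = nbits + ceil(log2(d)); (d - 1).bit_length() equals ceil(log2(d)) for d >= 1
    s = nbits + (d - 1).bit_length()
    # m = ceil(2**s / d), computed with integer arithmetic only
    m = -(-(1 << s) // d)
    return m, s

def fast_idiv(x: int, m: int, s: int) -> int:
    """Divide by the constant encoded in (m, s) using one multiply and one shift."""
    return (x * m) >> s

def fuzz(trials: int = 10_000, nbits: int = 32, seed: int = 0) -> None:
    """Compare fast_idiv against Python's // on random unsigned inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        d = rng.randrange(1, 1 << nbits)
        x = rng.randrange(0, 1 << nbits)
        m, s = magic_for_divisor(d, nbits)
        assert fast_idiv(x, m, s) == x // d, (x, d)
```

In this sketch the multiplier `m` can need `nbits + 1` bits, which Python's arbitrary-precision ints absorb; a fixed-width backend would have to handle that case separately.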
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* don't test bf16 for emulated amd tc
* skip bf16 tc test in ci
* skip bf16 for AMD in test_tensor_cores_codegen
* add simple bf16 gemm test to benchmark
* add default gate in index
* assert store
* add TestRenderFailures
- move test_gated_store_with_alu to new TestRenderFailures class for
  tests that fail on multiple renderers
- run test_renderer_failures.py in python CI
* add test for gated index in 2d
* test TestRenderFailures
* fix some tests in test_ops for torch backend (171 failing)
* fix more tests (135 failures)
* fix tests (126 failing)
* handle transposed convs (109 tests failing)
* fix slice
* fix lshift & rshift and more tests (87 tests failing)
* revert accidental change
* remove unnecessary changes (82 failures)
* fix backward for avg_pool2d (78 failures)
* fix replication backward pass
* fix reflection pad backward pass (71 failures)
* cummax with indices, aten.mv and move out methods (67 failures)
* extract avg_pool2d and avg_pool3d to separate functions (62 failures)
* revert changes for cat_out
* rewrite avg_pool and pad without repetition
* remove duplicates from decomps
* slice rewrite and add slice_backward (59 failures)
* add dtype fixup from https://github.com/tinygrad/tinygrad/pull/9297
* fix linter error and remove Tensor.pad (48 failures)
* add select_backward and index_put (40 failures)
* fix some more tests (36 failures)
* fix more tests (12 failures)
* some cleanups and fix couple more tests (10 failures)
* cleaner way to write upsample
* some more upsample cleanups
* use lambda for upsample
* add autowrapper for upsample forward
* cumsum and max_dim without aten functions
* revert _log_softmax
* fix more tests (1 failure)
* make linter happy
* move import to appropriate func
* make linter happy
* add codes for noqa
* some more refactors
* remove comment
* remove dependency on aten function for conv backward
* some more refactors
* add returns
* revert a change from merge
* some cleanups
* remove whitespace
* remove ruff change
* revert upsample
* add masked_fill_.Tensor and scatter.src_out
* add todo
* fix test_biased_conv2d
* fix test_var_one_in_axis & test_std_one_in_axis but break test_biased_conv2d :(
* revert torch_debug
* skip test_gather_failure for the tiny backend
* make padding registration more concise
* add nonzero
* remove scatter_add since we already have the out
* fix scatter
* remove some repetition
* make upsample backward registrations more concise
* remove select.int
* use Tensor.cumsum
* realize conv2d outputs before backward to fix test_biased_conv2d
* add a todo for realize (1 failure)
* add new_empty and new_empty_strided
* make test_pad_circular_mode forward only and remove redundant stuff
* fix linter errors
* remove expect failure
* just tb
* slice is a view_op
* contiguous only when lazydata.is_realized
* fix backward for test_pad_circular_mode
* revert torch.nn.functional.pad override
* add transpose.int and make constant_pad_nd contiguous
* slice_backward has no kwargs
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* add f16/f32 mfma support for MI300
- add 16x16 mfma shape support for f16 with f32 acc
- add ops_python mfma emulation
- add arch to AMDRenderer
* minor cleanup
* add mfma emulation task to ci
* add back todo
* hotfix: comment
* add tc=3 job to ci
* sqtt
* docs
* multi-device
* ProfileSQTTEvent
* exec update
* 256mb default
* don't let people hang their gpus
* bitfields from autogen
* asic info from mesa
* more bitfields from autogen
* SQTT_ITRACE_SE_MASK
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* add torch inplace tests
* first set of tests passing
* wrap all inplace funcs, add more tests
* fixes and wrap more functions
* fix all uint8 tests to avoid slow tests
* fix the one test
* another test, another fix
* and one more, works for ddp now
* something on contiguous, cleanup
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
* enable loading >2 GiB buffer from disk on macOS
* handle None case raised by mypy
* add test
* revert fix to repro bug in CI
* tell CI to run a unit test for macOS
* reapply fix
* yml changes
* torch backend remove meta decomps and add test
* torch backend bump timeout for tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* fixes from chargpt for torch backend
* shrink support
* add stride support
* comment cleanup
* a few more
* work
* import the stream hack
* llvm multi auto
* rig up torch's testing framework [pr]
* support more movement ops
* dec on expand
* fix tests
* work
* fix tests
* a few more
* decomps + opt hook
* installed pytest
* boom
* fix webgpu
* use exact variable names in test so that AI can read them more easily
* add tag for specific test names, e.g. to test a specific dtype
* fix ruff
* astype everything
* dtype in array creation
* just arange
* is 67% considered fixed?
* move test up
* small cleanups
* share function
* add qgemm as well
* add qgemm too
* make sure qgemm comes out as int
* take out qgemm for now
* fixed test
* add correct qgemm
* addressing feedback here too, early naive fix for now
* simplify bias and c to be minimalistic enough to test correctness
* refactored qlinearops
* maybe these asserts aren't the best..
* fix test
* updated tests to cover new ops
* try to add to CI
* move test_onnx_ops into testextra/
* more attention tests
* qlinear_add atol=1
* attention still not fullllllly correct
* it is what it is
---------
Co-authored-by: chenyu <chenyu@fastmail.com>