* fix and test loading num_batches_tracked
* add failing reverse case
* try reshape state dict if mismatch
* reshape for () and (1,) (see the sketch after this block)
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
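The commits above are about `num_batches_tracked`, which some checkpoints store with shape `()` and others with shape `(1,)`. A minimal sketch of the reshape-on-mismatch idea (plain numpy, not tinygrad's actual `load_state_dict`):

```python
# A minimal sketch (plain numpy, not tinygrad's load_state_dict) of the
# reshape-on-mismatch idea: num_batches_tracked is stored as shape () by some
# checkpoints and as (1,) by others, so same-size shape mismatches are
# reshaped instead of rejected.
import numpy as np

def load_param(model_param: np.ndarray, ckpt_param: np.ndarray) -> np.ndarray:
  if ckpt_param.shape != model_param.shape:
    # only tolerate mismatches that keep the element count, e.g. () vs (1,)
    assert ckpt_param.size == model_param.size, \
      f"shape mismatch: {ckpt_param.shape} vs {model_param.shape}"
    ckpt_param = ckpt_param.reshape(model_param.shape)
  return ckpt_param

print(load_param(np.array(0), np.array([7])).shape)  # (1,) -> ()
print(load_param(np.array([0]), np.array(7)).shape)  # () -> (1,), the reverse case
```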
* ** rangeify, try 3
* bring that over
* bufferize, don't use contig tag
* work
* ish
* fix rangeify
* flash attention is back
* fix rangeify tests
* stuff passes
* fix test_log_softmax
* more stuff passes
* progress children
* new endrange solution
* progress
* progress counter
* basic assign
* contigs only
* symbolic in schedule
* unbind_kernel
* late children
* ops fixed
* beautiful mnist is close
* that seems to work
* mnist works
* improve names
* fix bmnist
* no pcontig
* testing backward
* work
* clone movement ops
* new_range helper
* MBLOCK/MERGE
* ops tests pass
* revert mblock stuff
* cleanups...but it breaks ops
* remove reindex
* hack for relu
* disable the hacks
* more hacks
* upd
* mostly works with cleanups disabled
* ndr
* ops tests pass
* terrible hacks for indexing to work
* context mismatch
* pcontig
* split pcontig v contig
* z3 trunc
* null
* no fuse in rangeify
* ops test passes
* lnorm
* fix assign
* nd rangeify
* both should work
* tests for rangeify
* cleanups
* stores pass the pointer through
* disable pcontig for now
* PARTIAL_CONTIG is a flag
* move device tests to test/device
* test speedups
* test device
* linalg to unit
* upd
* so pytest just works
* more divide and skip
* speed
* test devectorize
* add pillow
* refactor: extract conv layer test logic
* tuple is unnecessary
* integrate _test_conv logic into all conv tests
* fix linter, forgot dilation
* undo winograd extraction
it adds too many if statements for a single case
* update rmsnorm to match torch implementation (see the sketch after this block)
* run all tests
* formatting
* formatting
* oneline
* default to 1e-6
* restore old test
* formatting
* don't save elementwise_affine
* your message
* ignore webgpu
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
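The rmsnorm commits in this block (torch-matching semantics, eps defaulting to 1e-6, `elementwise_affine` not saved as state) amount to roughly `y = x / sqrt(mean(x^2) + eps) * weight`. A hedged numpy sketch, not the literal tinygrad source:

```python
# A hedged sketch of the RMSNorm the commits above describe (torch-style
# semantics, eps defaulting to 1e-6, elementwise_affine kept as a plain
# constructor flag rather than saved state); not the literal tinygrad code.
import numpy as np

class RMSNorm:
  def __init__(self, dim: int, eps: float = 1e-6, elementwise_affine: bool = True):
    self.eps = eps
    # when affine, a learnable per-feature scale; otherwise no parameter at
    # all, so elementwise_affine itself never ends up in the state dict
    self.weight = np.ones(dim, dtype=np.float32) if elementwise_affine else None

  def __call__(self, x: np.ndarray) -> np.ndarray:
    # normalize by the root-mean-square over the last axis, eps inside the sqrt
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + self.eps)
    y = x / rms
    return y * self.weight if self.weight is not None else y

x = np.random.randn(2, 8).astype(np.float32)
print(RMSNorm(8)(x).shape)  # (2, 8)
```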
* Switch to dawn, all tests passing locally
* Use dawn-python
* Skip failing test
* Skip midcast and fix timestamp on metal ci
* Autogen webgpu
* Try fetch dawn lib again
* /usr/lib
* Without lib prefix
* Test autogen diff
* Delete webgpu support, move everything to ops_webgpu
* mypy fix
* Simplify, refactor
* Line savings
* No ResultContainer
* Type annotation for result
* Some more simplifications
* Why was this explicit sync used at all?
* Refactor: delete functions that are only used once
* Create shader module inline
* Clear unit tests cache, maybe that solves it
* That wasn't it
* Try deleting cache to pass failing weight compare
* weights_only=False for pytorch 2.6 (see the note after this block)
* Simplify ctype array creation
* Remove nanosecond precision timestamps
* Simplify error handling
* Refactor, add back type annotations
* Deleted custom submit function, refactor
* read_buffer simplify
* Fix use after free, refactor
* Simplify supported_features
* Runtime docs
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
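On the `weights_only=False` commit above: starting with PyTorch 2.6, `torch.load` defaults to `weights_only=True` and refuses checkpoints that pickle arbitrary Python objects, so such loads need an explicit opt-out (`model.pt` below is a placeholder path):

```python
# PyTorch 2.6 flipped torch.load's default to weights_only=True; checkpoints
# containing pickled non-tensor objects now need the explicit opt-out.
# "model.pt" is a placeholder path, not a file from this repo.
import torch

state = torch.load("model.pt", map_location="cpu", weights_only=False)
```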
Current behavior is weird... when the model is sharded and the state_dict is not, load shards the state_dict and the model's shard axis does not change.
But if the model and the state_dict are sharded on different axes, the model's shard axis becomes the state_dict's axis after load.
It should either always use the model's shard axis or always use the state_dict's shard axis.
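A toy sketch of that inconsistency (plain Python, not tinygrad's loader); `shard_axis=None` stands in for an unsharded tensor:

```python
# Toy model of the behavior described above: which shard axis the loaded
# tensor ends up on depends on whether the incoming state_dict tensor was
# sharded at all. This is an illustration, not tinygrad's load_state_dict.
from dataclasses import dataclass

@dataclass
class FakeTensor:
  shard_axis: int | None  # None = not sharded

def load(model_t: FakeTensor, state_t: FakeTensor) -> FakeTensor:
  if state_t.shard_axis is None:
    # unsharded state_dict: it gets sharded to match, model axis is kept
    return FakeTensor(model_t.shard_axis)
  # sharded state_dict: the state_dict axis silently wins
  return FakeTensor(state_t.shard_axis)

print(load(FakeTensor(0), FakeTensor(None)).shard_axis)  # 0: model axis kept
print(load(FakeTensor(0), FakeTensor(1)).shard_axis)     # 1: model axis overridden
```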
* add support for padding='same' in nn.conv
* express concisely
* simplify loop
* test same padding with dilation and conv1d
* fix bad indentation
* make loop one liner
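The usage these commits enable, hedged against tinygrad's `nn.Conv2d` signature at the time:

```python
# Hedged usage example of the feature added above: padding='same' keeps the
# spatial size at stride 1, including with dilation, and works for Conv1d too.
from tinygrad import Tensor, nn

conv = nn.Conv2d(3, 16, kernel_size=3, dilation=2, padding='same')
x = Tensor.randn(1, 3, 32, 32)
print(conv(x).shape)  # (1, 16, 32, 32) -- spatial dims preserved
```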
* Fix track_running_stats in batchnorm
* Fix linter
* Update test_fold_conv_batchnorm_notrain to keep allowed at 1
* Add test_fold_conv_batchnorm_notrain_no_running_stats
* Save 1 line
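A hedged illustration of what the batchnorm fix is about: with `track_running_stats=False` there are no running statistics, so the layer must normalize with batch statistics even outside training:

```python
# With track_running_stats=False there are no running_mean/running_var
# buffers, so normalization uses the current batch's statistics even in
# inference mode. Hedged sketch of the expected behavior, not the test code.
from tinygrad import Tensor, nn

bn = nn.BatchNorm2d(16, track_running_stats=False)
Tensor.training = False            # inference mode
x = Tensor.randn(8, 16, 4, 4)
y = bn(x)                          # normalized with this batch's mean/var
print(y.shape)                     # (8, 16, 4, 4)
```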
* most of the work from the uops2 branch
* schedule
* realize
* kernel
* lowerer
* search
* green
* merge uops with ops
* Revert "merge uops with ops"
This reverts commit 1408a59f12.
* fix benchmark
* remove extra dedup