* update rmsnorm to match torch implementation
* run all tests
* formatting
* formatting
* oneline
* default to 1e-6
* restore old test
* formatting
* don't save elementwise_affine
* your message
* ignore webgpu
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
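A minimal sketch of what the torch-matching RMSNorm above looks like, assuming tinygrad's Tensor API; the 1e-6 eps default and not saving elementwise_affine come from the commits, the rest is illustrative:

```python
from tinygrad import Tensor

class RMSNorm:
  def __init__(self, dim:int, eps:float=1e-6, elementwise_affine:bool=True):
    self.eps = eps
    # elementwise_affine itself is not stored as state; only weight survives
    self.weight = Tensor.ones(dim) if elementwise_affine else None

  def __call__(self, x:Tensor) -> Tensor:
    # torch formula: y = x / sqrt(mean(x^2) + eps), then optional affine scale
    xn = x * (x.square().mean(axis=-1, keepdim=True) + self.eps).rsqrt()
    return xn if self.weight is None else xn * self.weight
```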
* Switch to dawn, all tests passing locally
* Use dawn-python
* Skip failing test
* Skip midcast and fix timestamp on metal ci
* Autogen webgpu
* Try fetch dawn lib again
* /usr/lib
* Without lib prefix
* Test autogen diff
* Delete webgpu support, move everything to ops_webgpu
* mypy fix
* Simplify, refactor
* Line savings
* No ResultContainer
* Type annotation for result
* Some more simplifications
* Why was this explicit sync used at all?
* Refactor: delete functions that are only used once
* Create shader module inline
* Clear unit tests cache, maybe that solves it
* That wasn't it
* Try deleting cache to pass failing weight compare
* weights_only=False for pytorch 2.6
* Simplify ctype array creation
* Remove nanosecond precision timestamps
* Simplify error handling
* Refactor, add back type annotations
* Deleted custom submit function, refactor
* read_buffer simplify
* Fix use after free, refactor
* Simplify supported_features
* Runtime docs
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
current behavior is inconsistent: when the model is sharded and the state_dict is not, load shards the state_dict and the model shard axis does not change.
but if the model and the state_dict are sharded differently, the model shard axis becomes the state_dict axis after load.
it should either always use the model shard axis or always use the state_dict shard axis (see the sketch below).
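A hypothetical repro of that inconsistency, assuming tinygrad's Tensor.shard / load_state_dict API; the device names and axes are made up for illustration:

```python
from tinygrad import Tensor
from tinygrad.nn.state import load_state_dict

GPUS = ("CUDA:0", "CUDA:1")

class Model:
  def __init__(self): self.w = Tensor.ones(4, 4)

model = Model()
model.w.shard_(GPUS, axis=0)  # model sharded on axis 0

# unsharded state_dict: load shards it to match, model keeps axis 0
load_state_dict(model, {"w": Tensor.zeros(4, 4)})

# state_dict sharded on a different axis: model ends up on axis 1 after load
load_state_dict(model, {"w": Tensor.zeros(4, 4).shard(GPUS, axis=1)})
```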
* add support for padding='same' in nn.conv
* express concisely
* simplify loop
* test same padding with dilation and conv1d
* fix bad indentation
* make loop a one-liner
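A short usage sketch of the padding='same' support added above (output shapes assume stride 1, where 'same' preserves the spatial size):

```python
from tinygrad import Tensor, nn

x = Tensor.randn(1, 3, 32, 32)
conv = nn.Conv2d(3, 16, kernel_size=3, padding='same')
print(conv(x).shape)    # (1, 16, 32, 32): spatial dims preserved

# per the test above, 'same' also accounts for dilation
conv_d = nn.Conv2d(3, 16, kernel_size=3, dilation=2, padding='same')
print(conv_d(x).shape)  # (1, 16, 32, 32)
```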
* Fix track_running_stats in batchnorm
* Fix linter
* Update test_fold_conv_batchnorm_notrain to keep allowed at 1
* Add test_fold_conv_batchnorm_notrain_no_running_stats
* Save 1 line
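For context, a behavior sketch of track_running_stats, mirroring torch's semantics (illustrative, not the repo's test code): with track_running_stats=False the layer keeps no running_mean/running_var (and no num_batches_tracked), so batch statistics are used even in eval.

```python
from tinygrad import Tensor, nn

bn = nn.BatchNorm2d(8, track_running_stats=False)
x = Tensor.randn(2, 8, 4, 4)

Tensor.training = False
y = bn(x)  # no running stats exist, so batch statistics are used even in eval
```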
* most of the work from the uops2 branch
* schedule
* realize
* kernel
* lowerer
* search
* green
* merge uops with ops
* Revert "merge uops with ops"
This reverts commit 1408a59f12.
* fix benchmark
* remove extra dedup
* revert the .detach() in layernorm
it's only correct in LayerNorm, where the input is the data, and not in GroupNorm and InstanceNorm, which reuse layernorm.
Added backward tests for weight, bias and input for these norms (see the sketch after this entry).
* bigger atol for llvm
* relax backward more
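The revert matters because in GroupNorm and InstanceNorm gradients must flow through the normalization statistics; a sketch of the kind of backward test added here, checking a norm's input gradient against torch (tolerances illustrative, per the atol bumps above):

```python
import numpy as np, torch
from tinygrad import Tensor, nn

x_np = np.random.randn(2, 4, 8, 8).astype(np.float32)
x_tg = Tensor(x_np, requires_grad=True)
x_pt = torch.tensor(x_np, requires_grad=True)

nn.GroupNorm(2, 4)(x_tg).sum().backward()
torch.nn.GroupNorm(2, 4)(x_pt).sum().backward()

# a detached mean/var would make this input gradient silently wrong
np.testing.assert_allclose(x_tg.grad.numpy(), x_pt.grad.numpy(), atol=1e-6)
```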
* mockgpu nv
* works
* comment that out
* fix merge
* setup gpuocelot
* install packages
* don't run all of them
* passes
* fix ci
* almost
* should pass
* linter
* linter 2
* try this?
* ugh, not supported
* ci
* remove ticket from description
* better descs
* Embedding is in one kernel
* embedding is one kernel
* rm extra line
* newline
* bert test counts state vars?
* add a test?
* move items around
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
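The one-kernel embedding above is commonly expressed as a one-hot matmul that the scheduler can fuse into a single kernel; a conceptual sketch, not the repo's exact code:

```python
from tinygrad import Tensor

vocab_size, embed_dim = 100, 16
weight = Tensor.glorot_uniform(vocab_size, embed_dim)
idx = Tensor([[1, 5, 7]])  # (batch, seq)

# select rows by contracting a one-hot mask with the table: one matmul, one kernel
one_hot = (Tensor.arange(vocab_size) == idx.unsqueeze(-1)).float()  # (batch, seq, vocab)
out = one_hot @ weight  # (batch, seq, embed_dim)
```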
* UnsyncedBatchNorm with synced trainable weights for hlb cifar
* multitensor reshape tests
* test mlb assign change axis
* E501
* argfix axis
* don't import batchnorm from hlb_cifar in test_multitensor
* pass num_devices to UnsyncedBatchNorm in test, allow UnsyncedBatchNorm to be used with LB
* add backprop test for UnsyncedBatchNorm
* break out MLB assign and reshape changes
* manually shard running mean and running var
* don't shard unless syncbn=0
* replace nn.BatchNorm2d with UnsyncedBatchNorm
* don't increment num_batches_tracked if not tracking running stats
* update tests
* oops
* Revert "oops"
This reverts commit 5e8a67a535.
* Revert "update tests"
This reverts commit 7ebf65d89a.
* Revert "don't increment num_batches_tracked if not tracking running stats"
This reverts commit 78de0ea9ee.
* Revert "replace nn.BatchNorm2d with UnsyncedBatchNorm"
This reverts commit d03da53da7.
* don't increment num_batches_tracked if not tracking running stats
* oops
* test_batchnorm_axis
* compare against torch
* types
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
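A layout sketch of the UnsyncedBatchNorm idea from this PR: trainable weights replicated (synced) across devices, running stats kept per device; the shapes and shard calls are illustrative, assuming tinygrad's multi-device API:

```python
from tinygrad import Tensor

GPUS, num_devices, sz = ("CUDA:0", "CUDA:1"), 2, 64

# trainable weights stay synced: replicated on every device
weight = Tensor.ones(sz).shard(GPUS, axis=None)
bias = Tensor.zeros(sz).shard(GPUS, axis=None)

# running stats are unsynced: one row per device, sharded along that axis
running_mean = Tensor.zeros(num_devices, sz).shard(GPUS, axis=0)
running_var = Tensor.ones(num_devices, sz).shard(GPUS, axis=0)
```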
* lazy rewrite, try 2
* min fix tests
* pass contig test
* put broken pads back
* move that to realize
* no contig child fixes array packing
* so wrong
* now that's correct
* base children
* fix bind issues
* disable to_image_idx
* fix tests
* that failure shouldn't break other tests
* more fixes
* fix torch
* skip failing tests in CI
* 1e-7
* half is broken
* 1e-6 margin of error