cast implicitly resets the shapetracker and makes it contiguous (for disk tensors), which fails on the Interpreted backend if the inputs contain a non-contiguous st.
* remove AndNode.__floordiv__
AndNode produces a Node whose min/max is bounded by [0, 1], so `//` on top of that is almost always 0.
we don't really use that either
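A quick sanity check of the bound argument above, as a minimal sketch in plain Python. The helper name `floordiv_values` is hypothetical, and this does not use tinygrad's Node classes; it only enumerates integers in a range.

```python
# Hypothetical helper (not tinygrad code): list every result that floor
# division can produce for an integer value known to lie in [lo, hi].
def floordiv_values(lo, hi, divisor):
    return sorted({v // divisor for v in range(lo, hi + 1)})

# A value bounded by [0, 1] divided by anything >= 2 is always 0.
print(floordiv_values(0, 1, 2))
# Dividing by 1 is the exception: the value passes through unchanged.
print(floordiv_values(0, 1, 1))
```

Only the divisor 1 (and negative divisors) give anything other than 0, which is why the result is "almost always" 0.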
* keep the test
* lazy rewrite, try 2
* min fix tests
* pass contig test
* put broken pads back
* move that to realize
* no contig child fixes array packing
* so wrong
* now that's correct
* base children
* fix bind issues
* disable to_image_idx
* fix tests
* that failure shouldn't break other tests
* more fixes
* fix torch
* skip failing tests in CI
* 1e-7
* half is broken
* 1e-6 margin of error
* invert (broken)
* decent invert
* shapetracker invert works
* plus is meh, invert is good
* support invert mask
* a few more invert tests
* shapetracker math invert test
* the universe is flat as a 2D tensor
* try this
* TESTS
* fewer lines in test
* don't change all_int since other places use it
* add tests and del noqa by making non-aesthetic spacing LOOOOOL
* some reordering
* fixed empty list and add tests
* more tests
* add list bool tensors
* clearer with least lines added
* added bool
* oops
* more tests
* improved tests
* oops
* add bf16 test support
this model takes me almost a minute to download though:
https://huggingface.co/TinyPixel/Llama-2-7B-bf16-sharded/resolve/main/pytorch_model-00001-of-00014.bin?download=true: 100%, 981M/981M [00:40<00:00, 24.2MB/s]
* ensure we first load if it is bitcast to avoid taking the address of an rvalue
* tiny bf16 in the cloud
skip GPU
* should skip torch
lint
* Revert "ensure we first load if it is bitcast to avoid taking the address of an rvalue"
This reverts commit b86a28ab84.
* break the kernel
* skip LLVM and GPU in CI
* skip CUDA
* handle reshape of contiguous subparts with explicit mask
* remove the add/remove ones logic in reshape
* accommodate ones in accumulate logic
* make multiply commutative
* fix linting
* make mypy happy
* add test for commutative mul
* merge dimensions in shape_strides for 1 range masks
* add offsets for merging
* fix linting
* add back explicit 1 reshapes
* fix mypy errors
* fix accumulate by including state
* include non-zero stride dimension in acc
* small cleanup
* more compact to_shape_strides
* more logical cleanup
* compress more
* compress reshape mask
* adding some comments
* small bug fix
* improve test coverage
* remove explicit add remove ones
* small bug in test
* enable test_reshape_splitting_combining
* small fix
* 10 fewer lines in to_shape_strides
* shorten reshape mask
* some more cleanup
* more cleanup
* introduce some symbols for compactness
* more symbols
* even cleaner
* lessen symbols, it became less readable
* remove merge_views from view.reshape
* change to_shape_strides to _merge_dims
* improve readability
* fix corner case
* cleanup
* better handling of 1 <= Variable('i',1,10) & new_dim = Variable('i',1,10)
* rewrite _reshape_mask for readability
* fix white space
* add comment
* nice shorthands for readability
* add proof in docs
* small nit
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are rly broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device