* WIP: `tensor.squeeze` function
* Added `test_except` param to `helper_test_op` to avoid false positives
* Extracted new method `helper_test_exception` for testing exceptions
* Made `squeeze` not throw IndexError when ndim == 0 and dim <= 0 to match PyTorch
* initial commit
* 81 passing
* 105 passing tests
* 148 passing
* CI tests
* install dep on ci
* try opencl pkgs
* try using vulkan
* down to only 6 failing
* refactor
* cleaning up
* another test skipped due to buffer limit
* linter
* segfault
* indent fix
* another segfault found
* small touchups
* Fix max and maxpool tests
* Add constant folding
* Add javascript export script
* better asserts in codegen
* manual upcasting
* reverted token type change
* skip safetensor test due to unsupported type
* Fix efficientnet and all other model tests
* Remove np copy
* fixed indent and missing import
* manually destroy the buffer
* revert back to length
* linter errors
* removed extra val
* skip broken tests
* skipping more tests
* Make the page pretty
* Save model weights as safetensor
* Fix imagenet to c test
* Fix second imagenet to c bug
* Async and parallel kernel compilation
* workgroup support
* reversed local size
* fixed non local bug
* correct local groups
* ci experiment
* removed typo
* Fix define local by using shared memory
* Refactor
* try running on mac
* match metal tests
* add more workers
* scope down tests
* trying windows runner
* fixed windows env
* see how many it can do
* merged master
* refactor
* missed refactor
* increase test suite coverage
* missing import
* whitespace in test_efficientnet.py
* getting there
* fixed reset
* fixed bufs
* switched to cstyle
* cleanup
* min/max rename
* one more linter issue
* fixed demo
* linter
* testing ci chrome
* add unsafe webgpu arg
* add build step
* remove WEBGPU from cmd line
* use module
* try forcing directx
* trying forced metal backend
* temp disable conv2d for CI
* disable conv_transpose2d
---------
Co-authored-by: 0x4d - Martin Loretz <20306567+martinloretzzz@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* fixed division by zero for fast operations
* made et closer to 0
* replace POW llop with SQRT
* updated mlops to swap SQRT and POW llops
* updated hlops to swap POW and SQRT
* added sqrt llop to cpu runtime
* added sqrt llop to cstyle codegen
* added POW llop to llvm ir codegen
* added SQRT llop to torch runtime
* moved pow from mlops to hlops
* found a better way to do reverse pow
* fixed indentation
* added SQRT llop to triton
* update docs to match new llops
* removed POW operator from assembly codegen
* added sqrt and rsqrt to pow hlop
* rewrote pow function in tensor.py
* Adjust tolerance
* Adjust for adamw
* Reduce for Adam too
* removed accidental leftover code
* removed all of accidental code
* added rsqrt test
* removed pow from mlops again
it was added back when resolving merge conflicts
---------
Co-authored-by: Jacky Lee <jla524@sfu.ca>
* new upcast works
* float4 try
* fix unaligned float4
* disallow unaligned access
* upcast dim
* maybe good now
* fix gpu half
* vstore_half4
* fix deep image bugs
* improve symbolic to fix issues
* fix symbolic
* cl test
* this maybe
* gcd of 1 is 1
* real fix for old python
* improve fuzzer
* realize hotspots
* no str check
* minor changes
* make this an assert
* faster and more readable
* nicer self.buffers
* tests for weak op + LAZYCACHE=0
* MaskRCNN weights loading
* backbone maybe works
* backbone works, but resnet body atol 1e-3
* RPN Call, but very wrong output
* fixed topk
* RPN maybe works, not sure about nms
* Fix cursed modules
* add back editorconfig
* Full call, wrong output
* Full call works
* fix mask
* use NMS from retinanet
* Removing extra funcs
* refactor
* readable
* Add example to run model
* remove filter
* Fix split, batched inference is worse
* Fix image sizes
* Matching reference
* merge master
* add filter on top detections
* cuda backend fixed
* add model eval and spec
* convert images to rgb
* fix eval
* simplify examples code
* remove extra code
* meshgrid using tinygrad
* removing numpy
* roi align, floor, ceil
* remove numpy from level_mapper
* remove numpy from pooler
* Revert "Merge branch 'master' of github.com:kunwar31/tinygrad into mrcnn-inference"
This reverts commit 4b95a3cb49, reversing
changes made to 98f2b1fa2e.
* roi align gather
* fix master merge
* revert to old floor, ceil as ints present in domain
* use log2 op
* fix indexes
* weird bug with ints and gpu
* weird bug with ints and gpu
* refactors, add env var for gather
* floor with contiguous, where
* refactor topk, sort
* remove staticmethod
* refactor stride
* remove log2 mlop
* realize -> contiguous
* refactor forward
* remove num_classes, stride_in_1x1 from state
* refactor forward
* refactoring
* flake8
* removing numpy in anchor gen, use numpy for gather, nonzero, optimize topk
* keep using tinygrad for smaller gathers
* fix empty tensors
* comms
* move from tensor.py
* resnet test passing
* add coco dataset back
* fix spaces
* add test for log2
* no need to create Tensors
* no need to create Tensors
---------
Co-authored-by: Kunwar Raj Singh <kunwar31@pop-os.localdomain>
* global -> group
* allow None for local_size in custom function
* lil local
* comment on shape
* fix cuda
* smart local cast
* better local heuristic
* fix ptx, and work_dim cleanup
* fix metal
* fix ops test
* fix openpilot jit
* no more optlocal
* might fix metal tests
* try metal now
* see generated metal code
* test free removal. REVERT THIS
* mergeable
* Revert "Revert "ops rdna""
This reverts commit 0400315078.
* Revert "Revert "writing 2""
This reverts commit 325a3bf2cf.
* no dump
* 2x 2
* simple asm
* local size
* sub
* lil work
* support args != 3
* assembler work
* generate that
* ptx assembler
* begin index renderer
* max
* ptx loops
* gemms work
* valid works
* asm working a bit more
* close
* passing all ops tests
* ptx is a codegen only, not a backend
* ptx
* float16 support
* rdna goes here
* install types
* make amd disassemble
* ansilen for pretty print
* fix ptx log2/exp2
* assemblyinstruction
* new asm
* working gemm
* fix cmp
* more passing
* mod
* ptx works again
* rdna3 add works
* log exp
* sin is sin 2pi
* fix types
* progress
* loops work
* rdna xyz
* better addressing
* cleanups
* handle exception in early process
* div support
* rdna float4
* locals work
* fix neg index
* cast
* smaller diff
* yaml
* import only if selected
* fromimport
* types
* this all needs rewriting
* a few more
* add cumsum with n-dim inputs, over arbitrary axis + relevant tests
* increased rtol for cumsum test
* move test_cumsum into test_ops
* skip arange test for images as relies on cumsum
* Fix typo
* rewrite cumsum to work with images
* add and reorganize test_slice_* tests
* refactor Tensor.__getitem__()
* preliminary tests for 1) 0D tensors and 2) varargs for Tensor.zeros and Tensor.ones
* always compare shapes of the numpy arrays obtained from tinygrad and torch tensors
* add more tests for 0D support
* remove test_tensor.test_slicing(). All slicing tests at test/test_ops.py
* add zero-dim support
* make test_end2end.py consistent with 0dim support
* add test for tensor with zero in shape
* don't simplify ones if shape is ()
* skip tests that need zero-size tensor support.
- zero-size tensor support not related to 0dim tensors.
* add tests for __getitem__() supporting strides >= 1
* refactor __getitem__: support for strides >= 1
* minor refactors and add comments to __getitem__
* add tests for slices with negative steps
* add support for slices with negative strides
* Don't collapse dimensions during batched matmul (fix #799)
* Avoid reshaping tensor to the same shape
* Skip batched matrix multiply when IMAGE is set
* make maximum split grad
* added test for maximum split grad when equal
* minor expr simplification
* (2-eq)/2 only once
* update test bc one more sum output child stays
* activation ops
* type hints + more testing
* formatting correction + parameter testing
* fixes to shape testing
* hardtanh to use clip + removed type hints
* assign val fix