* fold expands that precede a reduce if the reduction is on the same axis as the expansion
* add deterministic test for SIMPLIFY_SUM_RESHAPE_EXPAND_SUM optimization
* add a test case to make sure we don't fold reduce-expand-reduce on different axes
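A minimal numpy sketch of the fold (illustrative only, not the optimizer's actual code): summing over the very axis that was just expanded is the same as multiplying by the expansion factor, which is why reducing on a different axis must not be folded.

```python
import numpy as np

x = np.arange(6.0).reshape(2, 3)
n = 4
# "expand" x along a new trailing axis of size n ...
expanded = np.broadcast_to(x[:, :, None], (2, 3, n))
# ... then reduce (sum) over that same axis: the pair folds to a multiply
assert np.allclose(expanded.sum(axis=2), x * n)
# summing over axis 0 or 1 instead would mix real data, so no fold applies
```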
* fixed division by zero for fast operations
* made et closer to 0
* replace POW llop with SQRT
* updated mlops to swap SQRT and POW llops
* updated hlops to swap POW and SQRT
* added sqrt llop to cpu runtime
* added sqrt llop to cstyle codegen
* added POW llop to llvm ir codegen
* added SQRT llop to torch runtime
* moved pow from mlops to hlops
* found a better way to do reverse pow
* fixed indentation
* added SQRT llop to triton
* update docs to match new llops
* removed POW operator from assembly codegen
* added sqrt and rsqrt to pow hlop
* rewrote pow function in tensor.py
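A rough sketch of what the rewritten pow looks like conceptually (hypothetical code, assuming the usual sqrt/rsqrt fast paths plus the exp/log identity for the general case):

```python
import math

def pow_sketch(x: float, y: float) -> float:
    if y == 0.5:
        return math.sqrt(x)        # SQRT fast path
    if y == -0.5:
        return 1.0 / math.sqrt(x)  # rsqrt fast path
    # general case, valid for x > 0: x**y == exp(y * log(x))
    return math.exp(y * math.log(x))

assert abs(pow_sketch(2.0, 3.0) - 8.0) < 1e-12
```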
* Adjust tolerance
* Adjust for adamw
* Reduce for Adam too
* removed accidental leftover code
* removed all of the accidental code
* added rsqrt test
* removed pow from mlops again
it was added back when resolving merge conflicts
---------
Co-authored-by: Jacky Lee <jla524@sfu.ca>
* Use generators in any(...) instead of lists for a better best case
* Use generators in all(...) instead of lists
* enable R1729 in .pylintrc
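R1729 (`use-a-generator`) flags list comprehensions inside `any()`/`all()`; the payoff is short-circuiting, as this small comparison shows:

```python
nums = range(1_000_000)
any([n > 5 for n in nums])  # builds the full million-element list first
any(n > 5 for n in nums)    # stops at n == 6: the "better best case"
```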
* revert import sorting
---------
Co-authored-by: Anselm Coogan <anselm@scandit.com>
* perf: remove extra function, include in cached getitem
* perf: only calculate hash once per node
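A sketch of the hash-once idea (hypothetical `Node` class, not the repo's actual one): compute the hash at construction and return the cached value from `__hash__`, since dict and set lookups call it repeatedly.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.cached_hash = hash(key)  # computed exactly once per node

    def __hash__(self):
        return self.cached_hash       # no rehashing on every lookup

    def __eq__(self, other):
        return isinstance(other, Node) and self.key == other.key
```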
---------
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
* new upcast works
* float4 try
* fix unaligned float4
* disallow unaligned access
* upcast dim
* maybe good now
* fix gpu half
* vstore_half4
* fix deep image bugs
* improve symbolic to fix issues
* fix symbolic
* cl test
* this maybe
* gcd of 1 is 1
* real fix for old python
* improve fuzzer
* realize hotspots
* no str check
* minor changes
* make this an assert
* faster and more readable
* nicer self.buffers
* tests for weak op + LAZYCACHE=0
* Dedup params in optimizer
* Passing the same tensor multiple times in the set of learnable params passed to optimizers can result in models completely failing to learn, but no errors are produced. This dedups tensors to avoid the problem.
* Fix types
* Use new variable to satisfy linter
* Use `helpers.dedup` instead of `set()` to dedup params
* Add test for duped params in optimizers
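An order-preserving dedup in the spirit of `helpers.dedup` (a minimal sketch; the real helper may differ): the same object passed twice must survive only once, so each parameter is stepped a single time.

```python
def dedup(items):
    seen, out = set(), []
    for it in items:
        if id(it) not in seen:  # identity-based, so Tensors needn't be hashable
            seen.add(id(it))
            out.append(it)
    return out

a, b = object(), object()
assert dedup([a, b, a]) == [a, b]
```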
* MaskRCNN weights loading
* backbone maybe works
* backbone works, but resnet body atol 1e-3
* RPN call, but very wrong output
* fixed topk
* RPN maybe works, not sure about nms
* Fix cursed modules
* add back editorconfig
* Full call, wrong output
* Full call works
* fix mask
* use NMS from retinanet
* Removing extra funcs
* refactor
* readable
* Add example to run model
* remove filter
* Fix split, batched inference is worse
* Fix image sizes
* Matching reference
* merge master
* add filter on top detections
* cuda backend fixed
* add model eval and spec
* convert images to rgb
* fix eval
* simplify examples code
* remove extra code
* meshgrid using tinygrad
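meshgrid needs nothing beyond broadcasting; a numpy sketch of the rewrite (illustrative, not the MaskRCNN code):

```python
import numpy as np

x, y = np.arange(4.0), np.arange(3.0)
grid_x = np.broadcast_to(x[None, :], (y.size, x.size))  # each row repeats x
grid_y = np.broadcast_to(y[:, None], (y.size, x.size))  # each column repeats y
ref_x, ref_y = np.meshgrid(x, y)
assert (grid_x == ref_x).all() and (grid_y == ref_y).all()
```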
* removing numpy
* roi align, floor, ceil
* remove numpy from level_mapper
* remove numpy from pooler
* Revert "Merge branch 'master' of github.com:kunwar31/tinygrad into mrcnn-inference"
This reverts commit 4b95a3cb49, reversing
changes made to 98f2b1fa2e.
* roi align gather
* fix master merge
* revert to old floor/ceil since ints are present in the domain
* use log2 op
* fix indexes
* weird bug with ints and gpu
* weird bug with ints and gpu
* refactors, add env var for gather
* floor with contiguous, where
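The floor-via-where trick that bullet names, sketched with numpy (hypothetical helper): a truncating cast rounds toward zero, so negative non-integers need a one-step correction.

```python
import numpy as np

def floor_sketch(t):
    truncated = t.astype(np.int64).astype(t.dtype)  # rounds toward zero
    return np.where(truncated > t, truncated - 1, truncated)

t = np.array([1.5, -1.5, 2.0, -0.1])
assert (floor_sketch(t) == np.floor(t)).all()
```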
* refactor topk, sort
* remove staticmethod
* refactor stride
* remove log2 mlop
* realize -> contiguous
* refactor forward
* remove num_classes, stride_in_1x1 from state
* refactor forward
* refactoring
* flake8
* remove numpy in anchor gen; use numpy for gather and nonzero; optimize topk
* keep using tinygrad for smaller gathers
* fix empty tensors
* comments
* move from tensor.py
* resnet test passing
* add coco dataset back
* fix spaces
* add test for log2
* no need to create Tensors
* no need to create Tensors
---------
Co-authored-by: Kunwar Raj Singh <kunwar31@pop-os.localdomain>
* test speed llama
* oops, put it back
* uses the real device codegen
* just do it on the mac
* pp
* is faster?
* Revert "is faster?"
This reverts commit 42db542010.
* disable docker again for less load on CI
* global -> group
* allow None for local_size in custom function
* lil local
* comment on shape
* fix cuda
* smart local cast
* better local heuristic
* fix ptx, and work_dim cleanup
* fix metal
* fix ops test
* fix openpilot jit
* no more optlocal
* might fix metal tests
* try metal now
* see generated metal code
* test free removal. REVERT THIS
* mergable
* Add support for one case of `UOps.CAST` for RDNA3 assembler
* Adds support for casting from `bool` -> `float32`. Seems like a very common operation that is required in many places.
* Fix bool register definition for vector operations
* Use `vcc_lo` instead of `vcc`, which seems to be required since it's configured with wavefront_size=32
* Add vector support for some places that were scalar only in register definition and comparison ops
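The `bool` -> `float32` cast boils down to a select against the condition mask (on RDNA3, a `v_cndmask`-style pick of 1.0 or 0.0 against `vcc_lo`); in Python terms:

```python
import numpy as np

mask = np.array([True, False, True])  # the bool source
as_f32 = np.where(mask, np.float32(1.0), np.float32(0.0))
assert as_f32.dtype == np.float32 and as_f32.tolist() == [1.0, 0.0, 1.0]
```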
* Fix some issues in what seems to be defunct `external_test_image.py`
* Some tests still don't pass for other reasons, but it at least runs now and one broken test is now fixed
* Refactor RDNA3 assembler register definition
* Unify multi-register code across dtypes and combine it with single-register allocation, since they're all untyped registers at the end of the day
* Minor improvements + cleanup to `ops_gpu.py`
* Add some previously undocumented environment variables from `ops_gpu.py` to `env_vars.md`
* Update debug print for OpenCL to print the devices that will be used post-filtering with `CL_EXCLUDE`
* Remove a couple unused or superfluous variables and assignments
* Use `fromimport` shorthand to shave off a couple precious LOC
* Couple small whitespace changes to clean things up
* Revert change to ordering of OpenCL devices
* Small refactor for OpenCL context creation
* Revert "Revert "ops rdna""
This reverts commit 0400315078.
* Revert "Revert "writing 2""
This reverts commit 325a3bf2cf.
* no dump
* 2x 2
* simple asm
* local size
* sub
* lil work
* support args != 3
* assembler work
* generate that
* ptx assembler
* begin index renderer
* max
* ptx loops
* gemms work
* valid works
* asm working a bit more
* close
* passing all ops tests
* ptx is a codegen only, not a backend
* ptx
* float16 support
* rdna goes here
* install types
* make amd disassemble
* ansilen for pretty print
* fix ptx log2/exp2
* assemblyinstruction
* new asm
* working gemm
* fix cmp
* more passing
* mod
* ptx works again
* rdna3 add works
* log exp
* sin is sin 2pi
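"sin is sin 2pi" refers to hardware sin ops that are 2π-periodic in the raw argument; a sketch of the pre-scaling the codegen must emit (hypothetical wrapper):

```python
import math

def hw_sin(x):
    return math.sin(2 * math.pi * x)  # stand-in for an ISA op taking revolutions

def sin(x):
    return hw_sin(x / (2 * math.pi))  # pre-divide so we get plain sin(x)

assert abs(sin(1.0) - math.sin(1.0)) < 1e-9
```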
* fix types
* progress
* loops work
* rdna xyz
* better addressing
* cleanups
* handle exception in early process
* div support
* rdna float4
* locals work
* fix neg index
* cast
* smaller diff
* yaml
* import only if selected
* fromimport
* types
* this all needs rewriting
* a few more