* factor out opencl runtime
* don't use CL outside the runtime
* cuda runtime adds
* final_dimension
* tests pass with CUDA backend
* more cuda
* cuda simpler
* retain old functionality
* linter and typing
* move globalcounters out of runtimes
* oops, GlobalCounters in cuda
* MAX_OUTPUT_SHAPE=3 is fine for CUDA
* Make test run
* Added new tests: sub pow constant_sub
* Fix indentation
* Added one to many lines
* Fix indentation
* Update test_cl_tiler.py
* Delete test_cl_tiler.py
* Rename Normalize and move to nn
* Fix comparison to None error
* Add test for GroupNorm
* Rename test case
* Flip parameters to match PyTorch
* Increase error tolerance
* Fix elementwise_affine on channels
* Match arguments with PyTorch
* Initialize weight and bias only when affine is true
* Is this it?
* A bit cleaner
* Handle case where weight or bias is None
* we typing
* types look good in theory
* most tests pass
* gpu tests pass
* TEST_AST
* delete comments
* i must have written that bug so many times
* bugfix
* don't merge the small ones
* add f to constants
* commits from reduce
* don't GCD the mod nodes
* broken and a hack IMAGE=3
* group for reduce
* fix linter + mypy
* move out test ast
* insource TENSOR_TYPE_TO_NP_TYPE
* does this fix it?
* move imports out
* indexer
* works
* all use indexer
* boolean in the indexer too
* symbolic is a better name than indexer
* better symbolic API
* min and max
* symbolic tests
* work
* more tests
* fix demodder
* __str__ in the superclass
* NumNode
* awesome that works
* still works
* fix up parens
* fix zeroviews
* dead lines
* expr_node
* works
* still works
* refactor to not use __new__ methods
* ugh something went wrong a while ago
* this fixes it
* mod and div at the end
* test
* symbolic
* working
* one linter issue fixed
* other division
* more simplifies
* works
* validhacks
* VALIDHACKS passes thneed
* no str replace stuff
* inline indexes
* NATIVE_EXPLOG and factoring
* factor both ways
* cl indexing
* split on mod, not just full
* onnxlimit
* fix output shape
* op_estimate is a function of the program
* no ones in the index
* four_float4
* ALLOW_4FLOAT4
* test passes
* compute then store
* loads first
* bugfix
* better, but doesn't match
* select xb in smart way
* new test and bugfix
* no change to lazy
* Node fixes linter
* fix opencl with op_estimate
* fix mypy
* revert valid
* remove unused
* add image
* load + store + boring stuff
* image tests pass
* thneed print GFLOPS
* op conv test
* more debugging
* hack for multiview image
* shapetracker creates less views
* disable image tests
* working better
* ugh, lkey not key
* print in DEBUG, and allow views
* works
* simple padding conv2d
* use index for image
* that was bad code
* debug print
* fix types
* less lines
* save lines
* bringing back reshape and permute
* done with E701
* 4x4 works in generic way
* max and sum not vectorizing...
* special case single float
* support comparing to MPS
* improve matmul speed, consider generic principles
* GlobalCounter
* fix op tracking
* faster
* comment that out for now
* err, it needs that
* fix minor issues
* fix global_mem
* chonker will make llvm fast
* work
* better speed tests, we will make them fast
* with the cache add is the same speed
* relu and neg are fast
* fix sum speed
* maximum maxnum?
* hack for gemm opt
* gemm very slow
* zeros like
* test_permute
* shapetracker returns self
* fix shapetracker factorization
* err, int strides
* permutes are faster now in tinygrad than pytorch
* support -1 in expand
* gemm unrolled
* improve final test case
* WIP GEMM
* why isn't GEMM fast?
* revert cache dim
* ffp contract works on clang, not llvm?
* ignore llvm ir
* this makes fma work at least, but no faster
* USE_4x4
* 63 GFLOPS
* 87 GFLOPS
* that wasn't matmul, 44 GFLOPS now
* 82 GFLOPS permuted
* this permute too
* a little speed for the convs
* 45 GFLOPS
* speed tests pass again
* clean up prints
* fix FMA WHAT A WASTE OF TIME
* colors
* moar fair
* GPU
* useless on chonker
* cleanups
* improve factorized shapetracker
* better threshold
* label conv
* work
* ops test pass again
* hot load the index
* run the last view, no need to create
* ZeroView needs a repr for the key to work
* fix segfault on out of bounds
* one more test
* start amx, and llvm.initialize_native_asmparser
* amx works
* nice AMX class
* nicer AMX class
* refactor get_idxs
* amx working
* is slower...
* useless flip
* cache
* SZ_X
* AMX_SZ_X/Y work alone
* Contiguous mlop
* test gemm packed
* PREPARE in packed
* use_amx factor
* prefetch isn't faster
* loop
* same 3ms
* 2.24 ms
* allow double on store in TG
* amx reduce is the same speed as non amx reduce
* include memory bandwidth
* clean up shapetracker
* flip returns stride
* prepare for upstream
* Update ops_llvm.py (#426)
* permutes are yellow and green now
* faster conv
* llvm cleanups
* Show optimised IR under debug 4 (#428)
* ASTKernel class
* Make tinygrad work with older python version (#427)
* Make tinygrad work with older python version
* Use partialmethod instead of partial
* simple chonker is chonking
* remove junk from test speed vs torch
* fix linter and types
* AMX is only here now
* add LLVM tests, it's a valid backend now
* oops, run llvm test
* contiguous_op
* fix loadops compare
* dedup reduceops
Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>
* working exec ast
* exec_ast is staticmethod
* GenericExecAST
* fold that sometimes
* ExplicitExecAST
* exec_ast for GPU
* gpu working
* get_lazyop_shape
* now gpubuffer is ExplicitExecAST
* dedup
* add a type
* RESHAPE in opencl code
* fix linter
* that too for linter
* cleanups
* remove dead code
* GenericShape is less lines
* add ALLOWED_KERNEL_COUNT to tests
* fix mypy
* that's gotta be recursive
* fix opencl shape processing
* remove unneeded lambda
* in progress
* big conv test works
* that's unneeded
* fix opencl with reduce
* rewrite contiguous_view_constant_fold
* clean up mids in loop code
* subidx
* print cl kernel before run
* no reduce, no loop
* Revert "no reduce, no loop"
This reverts commit 92777e40e9.