* New unittest for utils.py
Unit-test fetch in basic ways. More fetch cases are left untested:
downloading files in tests is slow, and mocking would add more
dependencies.
* Remove unused imports
* assign buffer reuse works
* fix assign for torch and cpu
* allow assign from numpy
* fix llvm output_buffer
* add some assign tests
* fix assignment test
* test should fail without lazy
* env var to disable assign
* Rewrote Tensor.__getitem__ to fix negative indices and add support for np.newaxis/None
* Fixed pad2d
* mypy doesn't know about mlops methods
* normal python behavior for out-of-bounds slicing
* type: ignore
* inlined idxfix
* added comment for __getitem__
* Better comments, better tests, and fixed bug in np.newaxis
* remove val expansion
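
For readers of this log, the index handling described in the `__getitem__` commits above (negative indices, `None`/`np.newaxis`, and Python-style clamping of out-of-bounds slices) can be sketched as follows. This is a minimal illustration, not the code from the commit: `normalize_index` is a hypothetical helper name, and it assumes step-1 slices only.

```python
def normalize_index(index, shape):
    """Hypothetical helper: resolve ints, slices, and None against `shape`.

    Returns (bounds, new_shape): one (start, stop) pair per source dimension,
    and the shape of the indexed result.
    """
    if not isinstance(index, tuple):
        index = (index,)
    bounds, new_shape = [], []
    dim = 0  # position in the source shape; None does not consume a dimension
    for item in index:
        if item is None:
            # None / np.newaxis inserts a new axis of size 1
            new_shape.append(1)
            continue
        if dim >= len(shape):
            raise IndexError("too many indices")
        size = shape[dim]
        if isinstance(item, int):
            if not -size <= item < size:
                raise IndexError(f"index {item} out of range for size {size}")
            start = item % size  # maps -1 to size - 1
            bounds.append((start, start + 1))  # an int index drops the axis
        elif isinstance(item, slice):
            # slice.indices clamps to [0, size], like normal Python slicing
            start, stop, step = item.indices(size)
            if step != 1:
                raise NotImplementedError("sketch assumes step 1")
            stop = max(stop, start)  # empty slice instead of negative length
            bounds.append((start, stop))
            new_shape.append(stop - start)
        else:
            raise TypeError(f"unsupported index type: {type(item).__name__}")
        dim += 1
    # dimensions that were not indexed are kept whole
    for size in shape[dim:]:
        bounds.append((0, size))
        new_shape.append(size)
    return bounds, tuple(new_shape)
```

Using `slice.indices` leaves the clamping rules to Python itself, so the sketch cannot drift from list slicing semantics.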
* types for all shapetracker functions
* more typing
* add all the parens to the test
* more types
* fix tests
* very minor speedup
* triton can add
* print stuff from triton
* write out file
* ops triton working
* reduce ops
* sort of works
* Triton bugfixes & implementation of remaining ops (#490)
* padding
* support pow, max, relu, gt0
* allocate return buffer
* Fix reduce
* Add tests for power op
* Fix triton illegal memory accesses and memory leak (#512)
* Fix mypy issue
* Add triton to setup.py
* Replace torch with pycuda
* Use one cuda stream for data transfer and kernels
* Remove triton submodule
* Fix memory leak by using weakrefs for caching
* Fix memory access by adding valid as mask for load
* Fix invalid kernel launches by flattening the grid (#515)
---------
Co-authored-by: Martin Loretz <20306567+martinloretzzz@users.noreply.github.com>
* Refactor getenv into helpers
* Remove unused os
* Fix default value
* Fix more defaults for CI
* Fix bracket
* Revert changes to openpilot/compile.py
* Use getenv from helpers when possible
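
The idea behind the `getenv` refactor above (one shared helper with a typed default, instead of scattered `os.getenv` calls) might look roughly like this. It is a sketch under assumptions: the signature and the default of `0` are guesses for illustration, not taken from the commit.

```python
import os

def getenv(key, default=0):
    """Read environment variable `key`, coerced to the type of `default`.

    Assumed signature for illustration. Note that a bool default would not
    work as expected here, since bool("0") is True.
    """
    return type(default)(os.getenv(key, default))
```

With an integer default, a flag check such as `if getenv("DEBUG") >= 2:` behaves the same whether or not the variable is set.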
* [WIP] Add symbolic tests for correctness
* Fix typo
* Fix expected value for test_and_fold
* Add more tests for symbolic
* It is indeed right
* Clean up
* Check all strings
* Put TODO back
* factor out opencl runtime
* don't use CL outside the runtime
* cuda runtime adds
* final_dimension
* tests pass with CUDA backend
* more cuda
* cuda simpler
* retain old functionality
* linter and typing
* move globalcounters out of runtimes
* oops, GlobalCounters in cuda
* MAX_OUTPUT_SHAPE=3 is fine for CUDA
* Make test run
* Added new tests: sub pow constant_sub
* Fix indentation
* Added one too many lines
* Fix indentation
* Update test_cl_tiler.py
* Delete test_cl_tiler.py
* Rename Normalize and move to nn
* Fix comparison to None error
* Add test for GroupNorm
* Rename test case
* Flip parameters to match PyTorch
* Increase error tolerance
* Fix elementwise_affine on channels
* Match arguments with PyTorch
* Initialize weight and bias only when affine is true
* Is this it?
* A bit cleaner
* Handle case where weight or bias is None
* we typing
* types look good in theory
* most tests pass
* gpu tests pass
* TEST_AST
* delete comments
* i must have written that bug so many times
* bugfix
* don't merge the small ones
* add f to constants
* commits from reduce
* don't GCD the mod nodes
* broken and a hack IMAGE=3
* group for reduce
* fix linter + mypy
* move out test ast
* insource TENSOR_TYPE_TO_NP_TYPE
* does this fix it?
* move imports out
* indexer
* works
* all use indexer
* boolean in the indexer too
* symbolic is a better name than indexer
* better symbolic API
* min and max
* symbolic tests
* work
* more tests
* fix demodder
* __str__ in the superclass
* NumNode
* awesome that works
* still works
* fix up parens
* fix zeroviews
* dead lines
* expr_node
* works
* still works
* refactor to not use __new__ methods
* ugh something went wrong a while ago
* this fixes it
* mod and div at the end
* test
* symbolic
* working
* one linter issue fixed
* other division
* more simplifications
* works
* validhacks
* VALIDHACKS passes thneed
* no str replace stuff
* inline indexes
* NATIVE_EXPLOG and factoring
* factor both ways
* cl indexing
* split on mod, not just full
* onnxlimit
* fix output shape
* op_estimate is a function of the program
* no ones in the index
* four_float4
* ALLOW_4FLOAT4
* test passes
* compute then store
* loads first
* bugfix
* better, but doesn't match
* select xb in smart way
* new test and bugfix
* no change to lazy
* Node fixes linter
* fix opencl with op_estimate
* fix mypy
* revert valid
* remove unused
* add image
* load + store + boring stuff
* image tests pass
* thneed print GFLOPS
* op conv test
* more debugging
* hack for multiview image
* shapetracker creates fewer views
* disable image tests
* working better
* ugh, lkey not key
* print in DEBUG, and allow views
* works
* simple padding conv2d
* use index for image
* that was bad code
* debug print
* fix types
* fewer lines
* save lines
* bringing back reshape and permute
* done with E701
* 4x4 works in generic way
* max and sum not vectorizing...
* special case single float
* support comparing to MPS
* improve matmul speed, consider generic principles
* GlobalCounter
* fix op tracking
* faster
* comment that out for now
* err, it needs that
* fix minor issues
* fix global_mem