* some cleanup
* move continue back
* more more more
* added to CI
* try
* try intentionally break some tests
* wtf
* del True for test
* yay tests broke, now pls no break
* try AGAIN
* gahy
* lol
* try
* move over constant
* moved over MORE
* move shrink over
* trailing lines
* try CUDA CI
* try again
* boom
* oops
* improved comments
* try: disable some flags and disable CUDA
* try breaking tests
* traceback has too much info so add --tb=no
* revert forced CI failure
* add comments and del unused imports
* oooooooo using regular debug try enable tb
* intentionally break tests
* added tb back. Maybe not too verbose
* strip whitespcae
* missed something
* Shape op int32 -> int64
* oops missed something
* add some types
* get rid of crazy 1 liners in pad op
* actually test Split this time LOL
* strip that whitespace
* limit metal buffers
* look at the base, not the srcs
* Revert "Revert "openpilot kernel fix from 209 to 207 (#2006)" (#2065)"
This reverts commit 924ecc4d6a.
* add a test for that
* create cache for q learning
* make linter happy
* global beam
* where it belongs
* bugfix
* ditch the kopt, use the beam
* faster lin and DEBUG=2 okay
* remove kopt, move search to features
* Fix openpilot kernel from 209 to 206
1. Use push_movement_ops conditions in _movement_op. Don't push
PAD or check if the ops are safe to be pushed with PAD
2. Don't push if all the op.buffers are realized
* change ALLOWED_KERNEL_COUNT to 206 for openpilot
* don't push through sourceless buffers
* change the tests to adjust kernel counts for new behaviour
* restore pushing of movement ops through childless buffer
* don't push EXPAND, causes OOM
* allow push of intermediate movement ops
* adding new test behaviour
* modifying external_test_opt for new behaviour
* restore old tests
* Reenable push of EXPAND and introduce new tests
I was wrong intially thinking EXPAND can cause OOM and hence I had
disabled it. Since it is 0 stride and doesn't allocate memory its cool
* Don't push EXPAND above LoadOps LB. This is causing OOM
* Push should be decided on movement root of bufs
To check if ast.op.buffers is sourceless/ realized go the the movement
root and then decide if pushing should be done or not
* refactor for readability
* use .base instead
* don't push expand, bad memory/compute consumption
* restrict push of reshape, seeing improvement
* push reshape if unary without further check
* disable PAD solves convnext kernel count increase
* reenable test_cache_binaryop_transpose
* small nit
* lazy cleanups
* ast functions take in LazyOps
* op instead of self.op
* _base for mops
* fix contiguous
* start schedule
* test_schedule
* fix openpilot
* more tests
* bugfix and test skip
* work
* make sure things get freed
* fix zerosized tensors
* fix failing test
* fix ceil and friends
* fix openpilot
* disable training
* disable test collectives
* Move ops_triton to runtime and remove errors from deprecated code
* Remove deprecated AST Kernel
* Remove deprecated buffer
* Add TritonProgram
* Triton Buffer
* Use RawCUDABuffer
* triton_compile
* Added new parameter
* pass _buf to program
* remove deprecated include
* Added triton tests
* Deprecated includes removed
* remove double print
* Disable float4 support
* Disable float4 support
* variable load fix
* Track local size
* Add pycuda to triton dependencies
* Merge test.yml
* install cuda packages for testing
* merge double package install
* remove emulated from triton tests
* upscale local index to power of 2 and add masking
* cuda envs
* Add TernaryOps
* ConstOp loading
* proper function name
* remove deprecated variables
* get global program from name
* const ops match local shape
* Enable test_nn
* remove deprecated import
* fix linter error
* Add wait logic
* Add local size override
* accumulate local shapes instead of using max shape
* Merge triton tests into global tests
* fix envs in testing
* Old testing routine
* split file into renderer and program
* remove print and starting whitespace
* pretty ptx print on debug 5
* linter errors
* ignore triton saturation tests
* ignore test example
* remove pytorch cpu extra index
* Add triton to existing testing routine
* use triton tests
* disable cuda backend in triton tests
* use cudacpu in tests
* print used device
* Print device default
* Remove print
* ensure we are running triton backend
* update variable signatures
* update dtypes for load
* infinity render fixed
* limit global size
* negative infinity now properly rendered
* split chain with parentheses for and node
* Add option to disable shared memory, disable for triton
* missing import
* Properly index and mask conditional load
* use mask only if not loading a block pointer
* nan support
* fix symbolic tests to include chain split
* proper masking for stores
* Implemented bool dtype
* Add mod
* fix loads for variables with valid range
* merge triton with cuda runtime
* merge from master
* run triton tests with cuda
* Correct target when running from triton
* conftest with triton compiler config
* use triton nightly
* verbose tests for triton
* capture stdout
* fix function depth when exiting multiple loops
* add render valid function for readabilty
* fix mask for local loops
* add _arg_int32 datatype
* fix dims for conditional loads
* enable non float stores
* correct variable dtypes
* fix type for arg_int32
* remove junk
* Added get max function for range based var.max
* remove deprecated code
* Fix triton ptxas path
* Fix testing for CI
* clamp local size by max local size instead of always running max
* Disable matmul test in triton cpu
* rerun tests
* Disable broken test in triton cpu
* whitespace removed
* rerun tests again
* Disable TestSymbolicOps for triton
* update to new uops
* linter fix
* ignore test/extra
* linting fix
* Update tinygrad/renderer/triton.py
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
* remove deprecated line
* quotes type fix
* linter
* Remove unnecesary lines
* UnaryOps.NEG
* dont define constants
* Linting fix
* Disable tests that are broken in ocelot
* remove trailing whitespace
* reduce line count
* linting fix
* update to new uast
* New looping style
* Update to new uast
* make AST runner work with triton
* linting fix
* set renderer var for testing
* disable local for ocelot
* reenable all tests for ocelot
* Pass shared to cuda
* Don't group if the backend doesn't support shared mem
* use working gpuocelot branch
* enable all tests
* enable local for ocelot
* cleanup
* Update test.yml
* update cache key
* reenable test symbolic and extra
* Update test.yml
* Revert "Update test.yml" (rerun tests)
This reverts commit 98c0630ee5.
* Revert "fix symbolic tests to include chain split"
This reverts commit 22a9a4c9cd.
* Revert "split chain with parentheses for and node"
This reverts commit 7499a7004e.
* use global size from linearizer
* rename newvar to dtype to match other renderers
* join program start lines
* simplify code that adds axis to local dims
* assign r[u] in ssa
* We no longer need to replace target in src
* we no longer need to cast indices to int by hand
* Update triton.py(rerun tests)
* Update triton.py(rerun tests)
* Update triton.py(rerun tests)
---------
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* valid hacks
* valid hacks
* valid hacks
* new method
* new method
* handtune
* is gate load breaking?
* lint
ruff
less junk
new approach?
maybe this?
* Make it more clear
* Make it more clear
* Will deal with the linter later
* hack for linter
* subs the idx but dont touch the valid
* Updated the mod rules
* lint hack
* I believe bug fix lets see
* Mod Node left
* revert
* Maybe this wont break?
* revert
* implemented "handtuned garbage"
* revert and use VALIDHACKS
* Lets see the CI
* still broken?
* currently its jungle
* maybe this jungle ?
* This works for everything somehow
* Added test for symbolic
* lint
* final touch
* This still works
* lint
* midway clean
* less garbage
* lint
* final form
* Slow but working way
* lint and other stuff
* lint
* mypy
* Make sure CI test Openpilot valid checks
* test if CI break
* Convert back
* refactor
* refactor
* Managed to reduce openpilot time from 30 secs to 5 secs
* Refactor
* Substitute a node with variable
* flake8
* Comment and refactor
* More comprehensive mod
* refactor
* bug fix
* More shave off
* remove not sure part