* some cleanup
* move continue back
* more more more
* added to CI
* try
* try intentionally break some tests
* wtf
* del True for test
* yay tests broke, now pls no break
* try AGAIN
* gahy
* lol
* try
* move over constant
* moved over MORE
* move shrink over
* trailing lines
* try CUDA CI
* try again
* boom
* oops
* improved comments
* try: disable some flags and disable CUDA
* try breaking tests
* traceback has too much info so add --tb=no
* revert forced CI failure
* add comments and del unused imports
* oooooooo using regular debug, try enabling tb
* intentionally break tests
* added tb back. Maybe not too verbose
* strip whitespace
* missed something
* Shape op int32 -> int64
* oops missed something
* add some types
* get rid of crazy 1 liners in pad op
* actually test Split this time LOL
* strip that whitespace
* limit metal buffers
* look at the base, not the srcs
* Revert "Revert "openpilot kernel fix from 209 to 207 (#2006)" (#2065)"
This reverts commit 924ecc4d6a.
* add a test for that
* create cache for q learning
* make linter happy
* global beam
* where it belongs
* bugfix
* ditch the kopt, use the beam
* faster lin and DEBUG=2 okay
* remove kopt, move search to features
* Fix openpilot kernel from 209 to 206
1. Use push_movement_ops conditions in _movement_op. Don't push PAD, or check that the ops are safe to be pushed with PAD.
2. Don't push if all the op.buffers are realized.
* change ALLOWED_KERNEL_COUNT to 206 for openpilot
* don't push through sourceless buffers
* change the tests to adjust kernel counts for new behaviour
* restore pushing of movement ops through childless buffer
* don't push EXPAND, causes OOM
* allow push of intermediate movement ops
* adding new test behaviour
* modifying external_test_opt for new behaviour
* restore old tests
* Reenable push of EXPAND and introduce new tests
I was initially wrong in thinking EXPAND could cause OOM, hence I had disabled it. Since it is 0 stride and doesn't allocate memory, it's cool.
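For context, a minimal numpy sketch (not tinygrad code) of why EXPAND is free by itself: broadcasting is a zero-stride view over the same buffer, so no new memory is allocated. The OOM the next commits run into presumably comes from compute/memory amplified downstream of the expand, not from the view itself.

```python
import numpy as np

# EXPAND as a zero-stride view: broadcasting (1, 4) -> (1000, 4)
# allocates nothing new; the expanded axis just gets stride 0.
a = np.zeros((1, 4), dtype=np.float32)
b = np.broadcast_to(a, (1000, 4))

print(b.strides)    # (0, 16): stride 0 along the expanded axis
print(b.base is a)  # True: still backed by the original 16-byte buffer
```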
* Don't push EXPAND above LoadOps LB. This is causing OOM
* Push should be decided on movement root of bufs
To check whether ast.op.buffers are sourceless/realized, go to the movement root and then decide whether pushing should be done.
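A hypothetical sketch of that decision, assuming a `base` link from a movement-op buffer back to the buffer it views (the names here are illustrative, not tinygrad's exact API):

```python
# Hypothetical helper: walk movement ops (reshape/permute/expand/...) back
# to the root buffer they view before asking if it is realized.
def movement_root(buf):
    while buf.base is not buf:  # assumed: root buffers are their own base
        buf = buf.base
    return buf

def should_push(ast):
    # only push when no movement root is already realized
    roots = [movement_root(b) for b in ast.op.buffers]
    return not any(r.realized for r in roots)
```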
* refactor for readability
* use .base instead
* don't push expand, bad memory/compute consumption
* restrict push of reshape, seeing improvement
* push reshape if unary without further check
* disable PAD solves convnext kernel count increase
* reenable test_cache_binaryop_transpose
* small nit
* lazy cleanups
* ast functions take in LazyOps
* op instead of self.op
* _base for mops
* fix contiguous
* start schedule
* test_schedule
* fix openpilot
* more tests
* bugfix and test skip
* work
* make sure things get freed
* fix zerosized tensors
* fix failing test
* fix ceil and friends
* fix openpilot
* disable training
* disable test collectives
* add some contiguous
* remove second contig
* Revert "remove second contig"
This reverts commit fc164f7dca1ad75b1e466e4e45a05eca58b7e0e0.
* shm on osx
* can repro bug
* don't contig zeros and ones
* 1
* 83 failed
* learning how git works
* lol idk
* zero shape aaaa
* space lol
* aaa
* test check
* haha
* fixed gather
* 73 failing
* 71 failing
* 68 failing
* added some debug
* fking resize
* lol
* 62 failing
* 58 failing, fucking did nearest resize hell yeah
* clean up
* 56 failing
* janitor duty
* lol
* 53 failing
* hi mom
* 50 failing
* added linear interp, but coord_trans is wrong
* did lin interpolation woohoo
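The coordinate transform being debugged here is likely ONNX Resize's default `half_pixel` mapping, x_orig = (x_out + 0.5) / scale - 0.5. A minimal 1-D sketch of linear resize using it:

```python
import numpy as np

# 1-D linear resize with the "half_pixel" coordinate transform:
# x_orig = (x_out + 0.5) / scale - 0.5, then lerp between the two neighbors.
def linear_resize_1d(x, out_len):
    scale = out_len / x.shape[0]
    coords = (np.arange(out_len) + 0.5) / scale - 0.5
    lo = np.clip(np.floor(coords).astype(int), 0, x.shape[0] - 1)
    hi = np.clip(lo + 1, 0, x.shape[0] - 1)
    w = np.clip(coords - lo, 0.0, 1.0)
    return x[lo] * (1 - w) + x[hi] * w

print(linear_resize_1d(np.array([0.0, 10.0]), 4))  # [0.  2.5 7.5 10.]
```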
* 43 failing
* 40 failing
* temporary Gather fix
* 39 failing
* fixed slice onnxver<10
* 37 failing
* 35 failing
* excluded tests that use float64
* 32 failing with hacks
* added _batchnorm() for 3D/5D batchnorm, 29 failing
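The generalization is over spatial rank: normalize over every axis except channels (axis 1), which covers (N,C,L), (N,C,H,W), and (N,C,D,H,W) alike. A sketch under that assumption, not the repo's `_batchnorm()`:

```python
import numpy as np

# Batchnorm for any spatial rank: reduce over all axes except channels.
def batchnorm(x, gamma, beta, eps=1e-5):
    axes = (0,) + tuple(range(2, x.ndim))
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    shape = (1, -1) + (1,) * (x.ndim - 2)  # broadcastable channel shape
    return gamma.reshape(shape) * (x - mean) / np.sqrt(var + eps) + beta.reshape(shape)

x = np.random.randn(2, 3, 4, 5, 6)  # 5-D input
print(batchnorm(x, np.ones(3), np.zeros(3)).shape)
```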
* changed ALLOWED_KERNEL_COUNT from 199 to 207
* added improved Gather op, reverted ALLOWED_KERNEL_COUNT commit
* support Round op
* added storage_order/indices maxpool, 27 failing
* support maxunpool, 25 failures
* support Gradient, 23 failures
* merged new where
* added Adam
* cleanups
* added Momentum and Nesterov Momentum
* added Adagrad
* support sequence_type, 20 failing
* ugh git
* I give up on cubic interp :D, 9 failing
* sexy 1 liner gather, much improved, wow
* polished gather to make it shine bright like a diamond
* clean 1 liner for gather
* improved readability of gather
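A guess at the shape of that one-liner, sketched in numpy: build a one-hot mask by comparing the indices against an arange, then contract, so gather reduces to elementwise ops plus a reduce with no fancy indexing.

```python
import numpy as np

# Axis-0 gather via an arange mask: (k, n) one-hot rows times (n, m) data.
def gather0(x, idx):
    onehot = (np.arange(x.shape[0]) == idx[:, None]).astype(x.dtype)
    return onehot @ x

x = np.arange(12.0).reshape(4, 3)
print(gather0(x, np.array([2, 0])))  # rows 2 and 0 of x
```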
* uhh
* clean up
* more clean up
* WHITEspace
* implemented SoftmaxCrossEntropyLoss op
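In its basic form (ignoring the weights and ignore_index the ONNX op also allows), SoftmaxCrossEntropyLoss is log-softmax over the class axis followed by the negative log-likelihood of the target class:

```python
import numpy as np

# Log-softmax over the class axis, then pick the target class per sample.
def softmax_ce(logits, targets):
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

print(softmax_ce(np.array([[2.0, 0.5], [0.1, 3.0]]), np.array([0, 1])))
```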
* added comments and cleaned up if statements
* update
* thank based wozeparrot for pow and new GatherElements
* CPU and TORCH all pass | cast float64 -> float32 for all fromCPU()
* _nearest_gather() failing on yolo
* reverted ops_cpu change and added assert in Resize
* added comments for resize for multiple channels
* oops
* merge
* test
* switched np.pad to Tensor.pad for constant padding
* gah
* gah2
* sexy reflect pad with movementops -> add
* delete commented out lines
* edge mode pad sexy as well
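A sketch of that idea in numpy: reflect padding built purely from slices, flips, and concatenation, i.e. the movement-op vocabulary, instead of a dedicated pad kernel.

```python
import numpy as np

# Reflect padding from movement ops only: slice the borders, flip, concat.
def reflect_pad_1d(x, pad):
    left = x[1:pad + 1][::-1]      # mirror of the leading edge (excl. x[0])
    right = x[-pad - 1:-1][::-1]   # mirror of the trailing edge
    return np.concatenate([left, x, right])

x = np.array([1, 2, 3, 4])
print(reflect_pad_1d(x, 2))          # [3 2 1 2 3 4 3 2]
print(np.pad(x, 2, mode="reflect"))  # same result
```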
* trying out model_benchmark
* revert gitignore change lol
* init
* Revert "init"
This reverts commit 682bf2073a.
* wrote cast workaround for CPU, CPU and TORCH all pass
* skipped tests w/ 0 shape for METAL and GPU
* excluded tests for CLANG, CPU, TORCH, CLANG pass
* fixed hacky ConvTranspose
* gotta figure out autopad
* UOps.STORE support cast bool -> float
* small fix for fast gather
* reverted 0 shape skipped tests
* oops missed a file
* added comment
* fixed slice op hack
* First commit to pr
* More trig ops
* More trig ops
* format
* isinf support
* More ops
* changed onnx_ops to use our new gather :D
* Det op bug fix
* rebase
* fixed some tests
* det broken and slow
* fixed compress to use new gather
* implemented argmax argmin
* support variable types in type_proto
* support Upsample and Identity sequence
* we support float64 now and tinygrad supports automatic broadcasting
* added EyeLike op
* resize does support multiple channels now actually
* yolov8 onnx runs successfully
* added batch size 1
* oops
* finally fixed type_proto I think
* fixed some llvm bugs
* del whitespaces
* added ZenginU Format PR
* test
* oops
* added float64 exclude tests back
* more skipped tests
* try
* ok openpilot pass
* flake8 pass
* woooooohooo
* revert external_model_benchmark changes
* perf tested gather
* removed promote types from ops_cpu
* numerical errors from 1681 are fixed
---------
Co-authored-by: ZenginU <umutzengin00@gmail.com>
* Symbolic Shape JIT
update tests
2 variables symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
* try to run commavq
* fix 0 dim, start implementing new ops
- Implement EmbedLayerNormalization
- Implement Attention
* SkipLayerNormalization and FastGelu
* use original torch model, cast inputs
* fix some ops:
- properly do Cast
- Attention: bi- and unidirectional
- FastGelu: add bias before gelu
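FastGelu here follows the ONNX Runtime contrib op: the optional bias is added to the input first, then the tanh-approximated GELU is applied. A minimal sketch:

```python
import numpy as np

# FastGelu: optional bias added before the tanh approximation of GELU,
# 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
def fast_gelu(x, bias=None):
    if bias is not None:
        x = x + bias
    return 0.5 * x * (1.0 + np.tanh(0.7978845608 * (x + 0.044715 * x ** 3)))

print(fast_gelu(np.array([-1.0, 0.0, 1.0])))
```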
* cleanup onnx_ops.py
* add validation option to benchmark
* cleanup imports
* add checks in case onnx2torch implements ops in the future
* run onnx instead of original torch
* just skip gpu on m1
* reactivate the other models
* check for strange params & squash whitespace
* cleanup
* fix causal mask Attention
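For the unidirectional case, the fix amounts to masking future positions so score[i][j] only survives when j <= i; a minimal sketch over a raw score matrix:

```python
import numpy as np

# Causal (unidirectional) attention mask: send scores with j > i to -inf
# before the softmax so each position only attends to the past.
def causal_mask(scores):
    i, j = np.indices(scores.shape[-2:])
    return np.where(j <= i, scores, -np.inf)

print(causal_mask(np.zeros((3, 3))))
```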
* Range doesn't need int cast
* embedding vocab_counter same dtype as input
* no need to cast
* always validate, fix PosixPath ort
---------
Co-authored-by: George Hotz <george@comma.ai>
* testing new memops
* better debugging
* testing padded conv
* branching with load
* refactoring a bit
* first try
* fixing bugs
* fixing some
* eq
* eq2
* do not use x's
* working
* fixing imm
* getting things working
* refactor
* pow not working
* working except one
* refactor: one store mem
* refactor: global load
* refactor: imm
* refactor: cleaning
* fixing big offsets
* refactor with ci
* try ci
* typo
* another typo
* ubuntu default
* forgot git
* do i need git?
* missing packages
* adding python-dev
* with cache?
* buildx action
* buildx name issue?
* maybe now?
* python3
* newline warning
* maybe now
* i actually need this
* ci should work now
* improved caching
* fixing cache
* maybe now it will cache
* this
* testing cache
* trying again
* load
* missing platform
* caching gha
* testing cache
* full testing
* typo
* now?
* why
* adding checkout back
* bad formatting
* fixing convention issues
* supporting python
* adding CI flag
* testing all
* better comments
* adding debugging
* takes 12x longer
* does it output progress now?
* ignore models for speed
* fixing merge
* excluding conv_transpose2d
* only 2 tests cuz it's too slow
* another approach
* let's see
* faster duh
* my bad
* T_T
* typo
* sup
* with output?
* comment test
* comment test
* comment test
* :?
* no comment
* with cache
* back to normal
* testing that ci works
* back to passing
* trying again
* does it create another entry
* does it create another entry?
* build local
* hey
* Revert "excluding conv_transpose2d"
This reverts commit cc7348de03.
* does it cache if done before?
* does it cache?
* done
* adding test ops
* bad formatting
* no need for this
* working static mem
* sum 1d
* add ndim
* better reg import
* fix stack
* back to np
* working except for softmax
* 5 failing
* no progress
* remove keystone
* remove keystone
* testops passing
* cleanups
* more cleanup
* typo
* ci
* ci2
* cond import
* ci3
* ci4
* ci4
* ci5
* ci5
* ci6
* alignment
* test all
* correct test
* err read_unmapped
* passing test
* ignore for speed
* ignore for speed
* ci7
* cleanup
* remove docker
* fixing merge
* fixing bugs
* add skipload for const ops
* comments
* First merge to master: Renderer
* fix emulation
* passing all tests arm64
* cleaning
* fix handcoded binary
* cleaning
* fix errs
* fix runtime arg binary
* clean git diff
* fix and clean
* fixing metal test
* cleaning
* fix metal test
* ci ~8 min
* fix pylint and clang
* cache the files in ops_clang
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* feat: world
* feat: tests
* feat: no more backwards
* feat: recv into
* feat: whoops
* feat: test in ci
* feat: some debug logging
* feat: workflow naming
* feat: need to set pythonpath
* feat: just send to same device
* feat: allreduce
* feat: test
* feat: need contiguous
* feat: test in ci
* feat: exit with correct code
* feat: don't need that
* feat: opencl wait_for just doesn't work
* feat: synchronize on out
* feat: try?
* feat: try again?
* feat: add extra realizes
* feat: print
* feat: seed
* feat: tol
* feat: test ones and zeros
* feat: remove print
* feat: are you just flaky
* feat: separate scatter and gather?
* feat: just try synchronizing
* feat: remove print again
* feat: bring back difference
* feat: no sync
* feat: revert that
* feat: back to wait_for
* fix: typo