* lazy cleanups
* ast functions take in LazyOps
* op instead of self.op
* _base for mops
* fix contiguous
* start schedule
* test_schedule
* fix openpilot
* more tests
* bugfix and test skip
* work
* make sure things get freed
* fix zerosized tensors
* fix failing test
* fix ceil and friends
* fix openpilot
* disable training
* disable test collectives
* add some contiguous
* remove second contig
* Revert "remove second contig"
This reverts commit fc164f7dca1ad75b1e466e4e45a05eca58b7e0e0.
* shm on osx
* can repro bug
* don't contig zeros and ones
* 1
* 83 failed
* learning how git works
* lol idk
* zero shape aaaa
* space lol
* aaa
* test check
* haha
* fixed gather
* 73 failing
* 71 failing
* 68 failing
* added some debug
* fking resize
* lol
* 62 failing
* 58 failling fucking did nearest resize hell yeah
* clean up
* 56 failing
* janitor duty
* lol
* 53 failing
* hi mom
* 50 failing
* added linear interp, but coord_trans is wrong
* did lin interpolation woohoo
* 43 failing
* 40 failing
* temporary Gather fix
* 39 failing
* fixed slice onnxver<10
* 37 failing
* 35 failing
* excluded tests that use float64
* 32 failing with hacks
* added _batchnorm() for 3D 5D batchnorm, 29 failing
* changed ALLOWED_KERNEL_COUNT from 199 to 207
* added improved Gather op, reverted ALLOWED_KERNEL_COUNT commit
* support Round op
* added storage_order/indices maxpool, 27 failing
* support maxunpool, 25 failures
* support Gradient, 23 failures
* merged new where
* added Adam
* cleanups
* added Momentum and Nesterov Momentum
* added Adagrad
* support sequence_type, 20 failing
* ugh git
* I give up on cubic interp :D, 9 failing
* sexy 1 liner gather, much improved, wow
* polished gather to make it shine bright like a diamond
* clean 1 liner for gather
* improved readability of gather
* uhh
* clean up
* more clean up
* WHITEspace
* implemented SoftmaxCrossEntropyLoss op
* added comments and cleaned up if statements
* update
* thank based wozeparrot for pow and new GatherElements
* CPU and TORCH all pass | cast float64 -> float32 for all fromCPU()
* _nearest_gather() failing on yolo
* reverted ops_cpu change and added assert in Resize
* added comments for resize for multiple channels
* oops
* merge
* test
* switched np.pad to Tensor.pad for constant padding
* gah
* gah2
* sexy reflect pad with movementops -> add
* delete commented out lines
* edge mode pad sexy as well
* trying out model_benchmark
* revert gitignore change lol
* init
* Revert "init"
This reverts commit 682bf2073a.
* wrote cast workaround for CPU, CPU and TORCH all pass
* wrote cast workaround for CPU, CPU and TORCH all pass
* skipped tests w/ 0 shape for METAL and GPU
* excluded tests for CLANG, CPU, TORCH, CLANG pass
* fixed hacky ConvTranspose
* gotta figure out autopad
* UOps.STORE support cast bool -> float
* small fix for fast gather
* reverted 0 shape skipped tests
* oops missed a file
* added comment
* fixed slice op hack
* First commit to pr
* More trig ops
* More trig ops
* format
* isinf support
* More ops
* changed onnx_ops to use our new gather :D
* Det op bug fix
* rebase
* fixed some tests
* det broken and slow
* fixed compress to use new gather
* implemented argmax argmin
* support variable types in type_proto
* support Upsample and Identity sequence
* we support float64 now and tinygrad support automatic broadcasting
* added EyeLike op
* resize does support multiple channels now actually
* yolov8 onnx runs successfully
* added batch size 1
* oops
* finally fixed type_proto I think
* fixed some llvm bugs
* del whitespaces
* added ZenginU Format PR
* test
* oops
* added float64 exclude tests back
* more skipped tests
* try
* ok openpilot pass
* flake8 pass
* woooooohooo
* revert external_model_benchmark changes
* perf tested gather
* removed promote types from ops_cpu
* numerical errors from 1681 is fixed
---------
Co-authored-by: ZenginU <umutzengin00@gmail.com>
* Symbolic Shape JIT
update tests
2 variables symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
* try to run commavq
* fix 0 dim, start implementing new ops
- Implement EmbedLayerNormalization
- Implement Attention
* SkipLayerNormalization and FastGelu
* use original torch model, cast inputs
* fix some ops:
- properly do Cast
- Attention: bi- and unidirectional
- FastGelu: add bias before gelu
* cleanup onnx_ops.py
* add validation option to benchmark
* cleanup imports
* add checks incase onnx2torch implements ops in future
* run onnx instead of original torch
* just skip gpu on m1
* reactivate the other models
* check for strange params & squash whitespace
* cleanup
* fix causal mask Attention
* Range doesn't need int cast
* embedding vocab_counter same dtype as input
* no need to cast
* always validate, fix PosixPath ort
---------
Co-authored-by: George Hotz <george@comma.ai>
* testing new memops
* better debugging
* testing padded conv
* branching with load
* refactoring a bit
* first try
* fixing bugs
* fixing some
* eq
* eq2
* do not use x's
* working
* fixing imm
* getting things working
* refactor
* pow not working
* working except one
* refactor: one store mem
* refactor: global load
* refactor: imm
* refactor: cleaning
* fixing big offsets
* refactor with ci
* try ci
* typo
* another typo
* ubuntu default
* forgot git
* do i need git?
* missing packages
* adding python-dev
* with cache?
* buildx action
* buildx name issue?
* maybe now?
* python3
* newline warning
* maybe now
* i actually need this
* ci should work now
* improved caching
* fixing cache
* maybe now it will cache
* this
* testing cache
* trying again
* load
* missing platform
* caching gha
* testing cache
* full testing
* typo
* now?
* why
* adding checkout back
* bad formatting
* fixing convention issues
* supporting python
* adding CI flag
* testing all
* better comments
* adding debugging
* takes 12x longer
* does it output progress now?
* ignore models for speed
* fixing merge
* excluding conv_transpose2d
* only 2 test cuz is to slow
* another approach
* let's see
* faster duh
* my bad
* T_T
* typo
* sup
* with output?
* comment test
* comment test
* comment test
* :?
* no comment
* with cache
* back to normal
* testing that ci works
* back to passing
* trying again
* does it create another entry
* does it create another entry?
* build local
* hey
* Revert "excluding conv_transpose2d"
This reverts commit cc7348de03.
* does it cache if done before?
* does it cache?
* done
* adding test ops
* bad formatting
* no need for this
* working static mem
* sum 1d
* add ndim
* better reg import
* fix stack
* back to np
* working except for softmax
* 5 failing
* no pogress
* remove keystone
* remove keystone
* testops passing
* cleanups
* more cleanup
* typo
* ci
* ci2
* cond import
* ci3
* ci4
* ci4
* ci5
* ci5
* ci6
* aligment
* test all
* correct test
* err read_unmapped
* passing test
* ignore for speed
* ignore for speed
* ci7
* cleanup
* remove docker
* fixing merge
* fixing bugs
* add skipload for const ops
* comments
* First merge to master: Renderer
* fix emulation
* passing all tests arm64
* cleaning
* fix handcoded binary
* cleaning
* fix errs
* fix runtime arg binary
* clean git diff
* fix and clean
* fixing metal test
* cleaning
* fix metal test
* ci ~8 min
* fix pylint and clang
* cache the files in ops_clang
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* feat: world
* feat: tests
* feat: no more backwards
* feat: recv into
* feat: whoops
* feat: test in ci
* feat: some debug logging
* feat: workflow naming
* feat: need to set pythonpath
* feat: just send to same device
* feat: allreduce
* feat: test
* feat: need contiguous
* feat: test in ci
* feat: exit with correct code
* feat: don't need that
* feat: opencl wait_for just doesn't work
* feat: synchronize on out
* feat: try?
* feat: try again?
* feat: add extra realizes
* feat: print
* feat: seed
* feat: tol
* feat: test ones and zeros
* feat: remove print
* feat: are you just flaky
* feat: seperate scatter and gather?
* feat: just try synchronizing
* feat: remove print again
* feat: bring back difference
* feat: no sync
* feat: revert that
* feat: back to wait_for
* fix: typo
* feat: world
* feat: tests
* feat: no more backwards
* feat: recv into
* feat: whoops
* feat: test in ci
* feat: some debug logging
* feat: workflow naming
* feat: need to set pythonpath
* feat: just send to same device
* flake8: Ignore frequent violations, correct infrequent ones
* Ignore some rules in test
* Reorder test ignores
* Lint test + main
* EOF indent
* Include all E71,E72 errors
* Test the failing case in CI
* Revert "Test the failing case in CI"
This reverts commit 110add0a70.
* Push to test!
This reverts commit f317532779.
* ok back to passing
This reverts commit ba5052685f.
* Prove that CI fails when formatting is incorrect.
* Fix formatting
* Remove duplicitous E117 rule
* Use flake8 config for precommit
---------
Co-authored-by: waifairer <waifairer@gmail.com>
* models matrix
* fix typo and install gpu deps
* install llvm deps if needed
* fix
* testops with cuda
* remove pip cache since not work
* cuda env
* install cuda deps
* maybe it will work now
* i can't read
* all tests in matrix
* trim down more
* opencl stuff in matrix
* opencl pip cache
* test split
* change cuda test exclusion
* test
* fix cuda maybe
* add models
* add more n=auto
* third thing
* fix bug
* cache pip more
* change name
* update tests
* try again cause why not
* balance
* try again...
* try apt cache for cuda
* try on gpu:
* try cuda again
* update packages step
* replace libz-dev with zlib1g-dev
* only cache cuda
* why error
* fix gpuocelot bug
* apt cache err
* apt cache to slow?
* opt and image in single runner
* add a couple n=autos
* remove test matrix
* try cuda apt cache again
* libz-dev -> zlib1g-dev
* remove -s since not supported by xdist
* the cache takes too long and doesn't work
* combine webgpu and metal tests
* combine imagenet to c and cpu tests
* torch tests with linters
* torch back by itself
* small windows clang test with torch tests
* fix a goofy windows bug
* im dumb
* bro
* clang with linters
* fix pylint error
* linter not work on windows
* try with clang again
* clang and imagenet?
* install deps
* fix
* fix quote
* clang by itself (windows too slow)
* env vars for imagenet
* cache pip for metal and webgpu tests
* try torch with metal and webgpu
* doesn't work, too long
* remove -v
* try -n=logical
* don't use logical
* revert accidental thing
* remove some prints unless CI
* fix print unless CI
* ignore speed tests for slow tests
* clang windows in matrix (ubuntu being tested in imagenet->c test)
* try manual pip cache
* fix windows pip cache path
* all manual pip cache
* fix pip cache dir for macos
* print_ci function in helpers
* CI as variable, no print_ci
* missed one
* cuda tests with docker image
* remove setup-python action for cuda
* python->python3?
* remove -s -v
* try fix pip cache
* maybe fix
* try to fix pip cache
* is this the path?
* maybe cache pip
* try again
* create wheels dir
* ?
* cuda pip deps in dockerfile
* disable pip cache for clang
* image from ghcr instead of docker hub
* why is clang like this
* fast deps
* try use different caches
* remove the fast thing
* try with lighter image
* remove setup python for cuda
* small docker and cuda fast deps
* ignore a few more tests
* cool docker thing (maybe)
* oops
* quotes
* fix docker command
* fix bug
* ignore train efficientnet test
* remove dockerfile (docker stuff takes too long)
* remove docker stuff and normal cuda
* oops
* ignore the tests for cuda
* does this work
* ignore test_train on slow backends
* add space
* llvm ignore same tests as cuda
* nvm
* ignore lr scheduler tests
* get some stats
* fix ignore bug
* remove extra '
* remove and
* ignore test for llvm
* change ignored tests and durationon all backends
* fix
* and -> or
* ignore some more cuda tests
* finally?
* does this fix it
* remove durations=0
* add some more tests to llvm
* make last pytest more readable
* fix
* don't train efficientnet on cpu
* try w/out pip cache
* pip cache seems to be generally better
* pytest file markers
* try apt fast for cuda
* use quick install for apt-fast
* apt-fast not worth
* apt-get to apt
* fix typo
* suppress warnings
* register markers
* disable debug on fuzz tests
* change marker names
* apt update and apt install in one command
* update marker names in test.yml
* webgpu pytest marker
* Add additional kernel when reducing multiple dimensions at once.
* Faster for smaller inputs
* Whitespace and naming
* Cleaner, guard for Metal only, and max 1 split rather than N
* Draft of different approach
* One additional kernel call for this test (as expected)
* Fuzz test symbolic and shapetracker
This reverts commit d5773ddebff54c1ff608838076f0b4ff126b8aa8.
* mess again
* no tail
* test shapetracker too
* Revert mess and enable all tests
* removed leftover