* Symbolic Shape JIT
update tests
2 variables symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
* try to run commavq
* fix 0 dim, start implementing new ops
- Implement EmbedLayerNormalization
- Implement Attention
* SkipLayerNormalization and FastGelu
* use original torch model, cast inputs
* fix some ops:
- properly do Cast
- Attention: bi- and unidirectional
- FastGelu: add bias before gelu
* cleanup onnx_ops.py
* add validation option to benchmark
* cleanup imports
* add checks incase onnx2torch implements ops in future
* run onnx instead of original torch
* just skip gpu on m1
* reactivate the other models
* check for strange params & squash whitespace
* cleanup
* fix causal mask Attention
* Range doesn't need int cast
* embedding vocab_counter same dtype as input
* no need to cast
* always validate, fix PosixPath ort
---------
Co-authored-by: George Hotz <george@comma.ai>
* testing new memops
* better debugging
* testing padded conv
* branching with load
* refactoring a bit
* first try
* fixing bugs
* fixing some
* eq
* eq2
* do not use x's
* working
* fixing imm
* getting things working
* refactor
* pow not working
* working except one
* refactor: one store mem
* refactor: global load
* refactor: imm
* refactor: cleaning
* fixing big offsets
* refactor with ci
* try ci
* typo
* another typo
* ubuntu default
* forgot git
* do i need git?
* missing packages
* adding python-dev
* with cache?
* buildx action
* buildx name issue?
* maybe now?
* python3
* newline warning
* maybe now
* i actually need this
* ci should work now
* improved caching
* fixing cache
* maybe now it will cache
* this
* testing cache
* trying again
* load
* missing platform
* caching gha
* testing cache
* full testing
* typo
* now?
* why
* adding checkout back
* bad formatting
* fixing convention issues
* supporting python
* adding CI flag
* testing all
* better comments
* adding debugging
* takes 12x longer
* does it output progress now?
* ignore models for speed
* fixing merge
* excluding conv_transpose2d
* only 2 test cuz is to slow
* another approach
* let's see
* faster duh
* my bad
* T_T
* typo
* sup
* with output?
* comment test
* comment test
* comment test
* :?
* no comment
* with cache
* back to normal
* testing that ci works
* back to passing
* trying again
* does it create another entry
* does it create another entry?
* build local
* hey
* Revert "excluding conv_transpose2d"
This reverts commit cc7348de03.
* does it cache if done before?
* does it cache?
* done
* adding test ops
* bad formatting
* no need for this
* working static mem
* sum 1d
* add ndim
* better reg import
* fix stack
* back to np
* working except for softmax
* 5 failing
* no pogress
* remove keystone
* remove keystone
* testops passing
* cleanups
* more cleanup
* typo
* ci
* ci2
* cond import
* ci3
* ci4
* ci4
* ci5
* ci5
* ci6
* aligment
* test all
* correct test
* err read_unmapped
* passing test
* ignore for speed
* ignore for speed
* ci7
* cleanup
* remove docker
* fixing merge
* fixing bugs
* add skipload for const ops
* comments
* First merge to master: Renderer
* fix emulation
* passing all tests arm64
* cleaning
* fix handcoded binary
* cleaning
* fix errs
* fix runtime arg binary
* clean git diff
* fix and clean
* fixing metal test
* cleaning
* fix metal test
* ci ~8 min
* fix pylint and clang
* cache the files in ops_clang
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* feat: world
* feat: tests
* feat: no more backwards
* feat: recv into
* feat: whoops
* feat: test in ci
* feat: some debug logging
* feat: workflow naming
* feat: need to set pythonpath
* feat: just send to same device
* feat: allreduce
* feat: test
* feat: need contiguous
* feat: test in ci
* feat: exit with correct code
* feat: don't need that
* feat: opencl wait_for just doesn't work
* feat: synchronize on out
* feat: try?
* feat: try again?
* feat: add extra realizes
* feat: print
* feat: seed
* feat: tol
* feat: test ones and zeros
* feat: remove print
* feat: are you just flaky
* feat: seperate scatter and gather?
* feat: just try synchronizing
* feat: remove print again
* feat: bring back difference
* feat: no sync
* feat: revert that
* feat: back to wait_for
* fix: typo
* feat: world
* feat: tests
* feat: no more backwards
* feat: recv into
* feat: whoops
* feat: test in ci
* feat: some debug logging
* feat: workflow naming
* feat: need to set pythonpath
* feat: just send to same device
* flake8: Ignore frequent violations, correct infrequent ones
* Ignore some rules in test
* Reorder test ignores
* Lint test + main
* EOF indent
* Include all E71,E72 errors
* Test the failing case in CI
* Revert "Test the failing case in CI"
This reverts commit 110add0a70.
* Push to test!
This reverts commit f317532779.
* ok back to passing
This reverts commit ba5052685f.
* Prove that CI fails when formatting is incorrect.
* Fix formatting
* Remove duplicitous E117 rule
* Use flake8 config for precommit
---------
Co-authored-by: waifairer <waifairer@gmail.com>
* models matrix
* fix typo and install gpu deps
* install llvm deps if needed
* fix
* testops with cuda
* remove pip cache since not work
* cuda env
* install cuda deps
* maybe it will work now
* i can't read
* all tests in matrix
* trim down more
* opencl stuff in matrix
* opencl pip cache
* test split
* change cuda test exclusion
* test
* fix cuda maybe
* add models
* add more n=auto
* third thing
* fix bug
* cache pip more
* change name
* update tests
* try again cause why not
* balance
* try again...
* try apt cache for cuda
* try on gpu:
* try cuda again
* update packages step
* replace libz-dev with zlib1g-dev
* only cache cuda
* why error
* fix gpuocelot bug
* apt cache err
* apt cache to slow?
* opt and image in single runner
* add a couple n=autos
* remove test matrix
* try cuda apt cache again
* libz-dev -> zlib1g-dev
* remove -s since not supported by xdist
* the cache takes too long and doesn't work
* combine webgpu and metal tests
* combine imagenet to c and cpu tests
* torch tests with linters
* torch back by itself
* small windows clang test with torch tests
* fix a goofy windows bug
* im dumb
* bro
* clang with linters
* fix pylint error
* linter not work on windows
* try with clang again
* clang and imagenet?
* install deps
* fix
* fix quote
* clang by itself (windows too slow)
* env vars for imagenet
* cache pip for metal and webgpu tests
* try torch with metal and webgpu
* doesn't work, too long
* remove -v
* try -n=logical
* don't use logical
* revert accidental thing
* remove some prints unless CI
* fix print unless CI
* ignore speed tests for slow tests
* clang windows in matrix (ubuntu being tested in imagenet->c test)
* try manual pip cache
* fix windows pip cache path
* all manual pip cache
* fix pip cache dir for macos
* print_ci function in helpers
* CI as variable, no print_ci
* missed one
* cuda tests with docker image
* remove setup-python action for cuda
* python->python3?
* remove -s -v
* try fix pip cache
* maybe fix
* try to fix pip cache
* is this the path?
* maybe cache pip
* try again
* create wheels dir
* ?
* cuda pip deps in dockerfile
* disable pip cache for clang
* image from ghcr instead of docker hub
* why is clang like this
* fast deps
* try use different caches
* remove the fast thing
* try with lighter image
* remove setup python for cuda
* small docker and cuda fast deps
* ignore a few more tests
* cool docker thing (maybe)
* oops
* quotes
* fix docker command
* fix bug
* ignore train efficientnet test
* remove dockerfile (docker stuff takes too long)
* remove docker stuff and normal cuda
* oops
* ignore the tests for cuda
* does this work
* ignore test_train on slow backends
* add space
* llvm ignore same tests as cuda
* nvm
* ignore lr scheduler tests
* get some stats
* fix ignore bug
* remove extra '
* remove and
* ignore test for llvm
* change ignored tests and durationon all backends
* fix
* and -> or
* ignore some more cuda tests
* finally?
* does this fix it
* remove durations=0
* add some more tests to llvm
* make last pytest more readable
* fix
* don't train efficientnet on cpu
* try w/out pip cache
* pip cache seems to be generally better
* pytest file markers
* try apt fast for cuda
* use quick install for apt-fast
* apt-fast not worth
* apt-get to apt
* fix typo
* suppress warnings
* register markers
* disable debug on fuzz tests
* change marker names
* apt update and apt install in one command
* update marker names in test.yml
* webgpu pytest marker
* Add additional kernel when reducing multiple dimensions at once.
* Faster for smaller inputs
* Whitespace and naming
* Cleaner, guard for Metal only, and max 1 split rather than N
* Draft of different approach
* One additional kernel call for this test (as expected)
* Fuzz test symbolic and shapetracker
This reverts commit d5773ddebff54c1ff608838076f0b4ff126b8aa8.
* mess again
* no tail
* test shapetracker too
* Revert mess and enable all tests
* removed leftover
* Rename in files
* Move files
* Moved to extra/datasets as suggested
* Changes to files
* Fixed stupid mistake
---------
Co-authored-by: terafo <terafo@protonmail.com>
* Fixes + improved test coverage for helpers.py
- added exception handling in `proc`, if an exception was thrown, the thread would hang
- made `_early_exec_process` catch any Exception, before if an exception was thrown before the process was started, it would hand the thread
* Made `_early_exec_process` catch any Exception
Otherwise, if an exception was thrown before the process was started, it would hang the thread. For example a type error for an argument passed to `subprocess.check_output`
* Fixed `from tinygrad.helpers import Timing` import
oops, for some reason my IDE cleaned that import from extra/helpers.
* Fixed import in llama.py
Another one that I skipped by accident, mybad
* Extracted a class for tests of early exec
* Normalize line endings, windows uses /r/n
* Made `cross_process` not a daemon
* fold expands that precede a reduce if the reduction is on the same axis as the expansion
* add deterministic test for SIMPLIFY_SUM_RESHAPE_EXPAND_SUM optimization
* add a test case to make sure we don't fold reduce-expand-reduce on different axes
* new upcast works
* float4 try
* fix unaligned float4
* disallow unaligned access
* upcast dim
* maybe good now
* fix gpu half
* vstore_half4
* fix deep image bugs
* improve symbolic to fix issues
* fix symbolic
* cl test
* this maybe
* gcd of 1 is 1
* real fix for old python
* improve fuzzer
* refactor: print formatting for llama timing, report median and individual runs
* feat: back to mean
* fix: whitespace
* fix: add mean to print
---------
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
* test speed llama
* oops, put it back
* uses the real device codegen
* just do it on the mac
* pp
* is faster?
* Revert "is faster?"
This reverts commit 42db542010.
* disable docker again for less load on CI
* Add support for one case of `UOps.CAST` for RDNA3 assembler
* Adds support for casting from `bool` -> `float32`. Seems like a very common operation that is required in many places.
* Fix bool register definition for vector operations
* Use `vcc_lo` instead of `vcc` which seems to be required since it's configured to use wavefront_size=32
* Add vector support for some places that were scalar only in register definition and comparison ops
* Fix some issues in what seems to be defunct `external_test_image.py`
* Some tests still don't pass for other reasons, but it at least runs now and one broken test is now fixed
* Refactor RDNA3 assembler register definition
* Unify multi-registor code between dtypes and combine with single-register allocation since they're all untyped registers at the end of the day
* added SPPF module from yolov8
* added conv_block, bottleneck modules
* cleaned modules
* c2f example
* spf changes
* C2f
* fixed and tested bottleneck
* improved detect class
* tested spf and conv
* checked c2f
* DFL structure
* fixed dfl
* added dist2bbox function
* added dist2bbox function
* added and tested make_anchors function for the head
* keeping functions above
* creating the detection head
* fixing head
* untested blocks a. scale_boxes b. clip_boxes c. xywh2xyxy d. box_iou
* head works
* structure fixx
* added darknet (backbone)
* yolov8 neck, and intialize bias function while detection
* fixed spacing
* yolov8 class, init bias, and fixed c2f
* forward pass almost working
* fixed net structure
* init bias not needed, forward pass working
* load weights boilerplate
* load weights done?
* all variants loading!
* post process: clip_boxes, scale_boxes, xywh2xyxy, and box_iou(untested)
* fix scale_boxes
* box_iou fixed and tested
* created the pre nms function
* fix nms
* fixed load weights, apparently the latest commit broke something, excluding num_batches_tracked
* added letterbox and pre_tranform for pre_process function
* fixed letterbox, pre_transform and added preprocess function
* custom NMS done, integrated prepare_boxes and nms, improved box_iou
* added postprocess function till parsing
* added draw_bounding_boxes_and_save function
* testing full flow
* using fetch for class names
* fixed make_anchors + all tinygrad now
* added command line arguments, weight downloading
* single image for now only
* made draw boxes more efficient
* made NMS functions efficient
* made compute_transform better
* v8 working now, inference is done
* prints objects detected in console now
* fixed image loading (pre processing)
* batch post processing
* created initial tests
* fixes bounding box thickness AND added get_detected_classes_with_frequency function
* cleaning for testing
* two tests
* added url option for image, removed need for specifiying arguments
* tests complete, but lots on things are printed on screen by ultralytics
* remove parse arguments
* fixed weight location
* fixed colours of classes, and black font when high brightness
* minor changes
* TODOs for later
* removed use of torch, using .npz weights
* fixed tests
* one path for fetch
* preprocess now in tinygrad, plus test fix for that
* updated tests
* fix tests
* no class labels needed
* Add files via upload
* Update showcase.md
* Update showcase.md
* added safe tensors as weights, and tests fix for that
* safe tensors test
* using safe_load
* using tinygrad functions now to load weights
* update tests
---------
Co-authored-by: r3sist-uniq <amanmatreja@gmail.com>
Co-authored-by: r3sist <72573738+r3sist-uniq@users.noreply.github.com>
* resolved some slice test errors and added some more debugging logs
* use same device in cumsum
* increased float priority
* onnx debug ouput match input