* Symbolic Shape JIT
update tests
2 variables symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
* move assembly, assembly_ptx
* successful but broken rendering of ptx asm
* clear ins before render asm
* slightly less broken :')
* we needed thread syncs
* fix float16 loading, rounding modifiers and other casting stuff, passing casts_from_half
* Fix runtime_args for gpuocelot
* our casts were flipped on both ends
* more casting
* add ternary where op
* dealing with storing/loading bool
* add test for casting to bool from negative
* Fix args.valid on ConstOp
* add to CI, TODO: fix runtime_args for test_uops
* fix placement of runtime_args to work with lazy.Device
* undo ci changes so I can push
* fix lints
* start cleanup and fix things we broke fixing lints
* add checks for PTX specifc asm instructions
* revert added test -- doesn't pass on llvm
* skip tests for underflow,overflow
* another fix for how we're setting runtime args
* Less broken cleanup
* add to CI
* add more env variables for ci test
* fix ci to install pycuda for ptx
* ci: copy cuda test command
* cleanup
* assert to make sure we're actually running ptx in ci
* remove test assert
* move is_ptx arg
* move assembly, assembly_ptx back to extras
* fix imports
* initial merge fixes
* clear registers, fix UOps.LOAD with invalid value
* draft merge fixes
* remove prints
* quick lint and merge fixes
* cleanup
* remove PTXProgram wrapper
* final cleanup
* temp change for ci rerun
* ci rerun
* rollback ISA version
* testing new memops
* better debugging
* testing padded conv
* branching with load
* refactoring a bit
* first try
* fixing bugs
* fixing some
* eq
* eq2
* do not use x's
* working
* fixing imm
* getting things working
* refactor
* pow not working
* working except one
* refactor: one store mem
* refactor: global load
* refactor: imm
* refactor: cleaning
* fixing big offsets
* refactor with ci
* try ci
* typo
* another typo
* ubuntu default
* forgot git
* do i need git?
* missing packages
* adding python-dev
* with cache?
* buildx action
* buildx name issue?
* maybe now?
* python3
* newline warning
* maybe now
* i actually need this
* ci should work now
* improved caching
* fixing cache
* maybe now it will cache
* this
* testing cache
* trying again
* load
* missing platform
* caching gha
* testing cache
* full testing
* typo
* now?
* why
* adding checkout back
* bad formatting
* fixing convention issues
* supporting python
* adding CI flag
* testing all
* better comments
* adding debugging
* takes 12x longer
* does it output progress now?
* ignore models for speed
* fixing merge
* excluding conv_transpose2d
* only 2 test cuz is to slow
* another approach
* let's see
* faster duh
* my bad
* T_T
* typo
* sup
* with output?
* comment test
* comment test
* comment test
* :?
* no comment
* with cache
* back to normal
* testing that ci works
* back to passing
* trying again
* does it create another entry
* does it create another entry?
* build local
* hey
* Revert "excluding conv_transpose2d"
This reverts commit cc7348de03.
* does it cache if done before?
* does it cache?
* done
* adding test ops
* bad formatting
* no need for this
* working static mem
* sum 1d
* add ndim
* better reg import
* fix stack
* back to np
* working except for softmax
* 5 failing
* no pogress
* remove keystone
* remove keystone
* testops passing
* cleanups
* more cleanup
* typo
* ci
* ci2
* cond import
* ci3
* ci4
* ci4
* ci5
* ci5
* ci6
* aligment
* test all
* correct test
* err read_unmapped
* passing test
* ignore for speed
* ignore for speed
* ci7
* cleanup
* remove docker
* fixing merge
* fixing bugs
* add skipload for const ops
* comments
* First merge to master: Renderer
* fix emulation
* passing all tests arm64
* cleaning
* fix handcoded binary
* cleaning
* fix errs
* fix runtime arg binary
* clean git diff
* fix and clean
* fixing metal test
* cleaning
* fix metal test
* ci ~8 min
* fix pylint and clang
* cache the files in ops_clang
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* feat: world
* feat: tests
* feat: no more backwards
* feat: recv into
* feat: whoops
* feat: test in ci
* feat: some debug logging
* feat: workflow naming
* feat: need to set pythonpath
* feat: just send to same device
* feat: allreduce
* feat: test
* feat: need contiguous
* feat: test in ci
* feat: exit with correct code
* feat: don't need that
* feat: opencl wait_for just doesn't work
* feat: synchronize on out
* feat: try?
* feat: try again?
* feat: add extra realizes
* feat: print
* feat: seed
* feat: tol
* feat: test ones and zeros
* feat: remove print
* feat: are you just flaky
* feat: seperate scatter and gather?
* feat: just try synchronizing
* feat: remove print again
* feat: bring back difference
* feat: no sync
* feat: revert that
* feat: back to wait_for
* fix: typo
* feat: world
* feat: tests
* feat: no more backwards
* feat: recv into
* feat: whoops
* feat: test in ci
* feat: some debug logging
* feat: workflow naming
* feat: need to set pythonpath
* feat: just send to same device
* add stable diffusion and llama
* pretty in CI
* was CI not true
* that
* CI=true, wtf
* pythonpath
* debug=1
* oops, wrong place
* uops test broken for wgpu
* wgpu tests flaky
* flake8: Ignore frequent violations, correct infrequent ones
* Ignore some rules in test
* Reorder test ignores
* Lint test + main
* EOF indent
* Include all E71,E72 errors
* Test the failing case in CI
* Revert "Test the failing case in CI"
This reverts commit 110add0a70.
* Push to test!
This reverts commit f317532779.
* ok back to passing
This reverts commit ba5052685f.
* Prove that CI fails when formatting is incorrect.
* Fix formatting
* Remove duplicitous E117 rule
* Use flake8 config for precommit
---------
Co-authored-by: waifairer <waifairer@gmail.com>
* models matrix
* fix typo and install gpu deps
* install llvm deps if needed
* fix
* testops with cuda
* remove pip cache since not work
* cuda env
* install cuda deps
* maybe it will work now
* i can't read
* all tests in matrix
* trim down more
* opencl stuff in matrix
* opencl pip cache
* test split
* change cuda test exclusion
* test
* fix cuda maybe
* add models
* add more n=auto
* third thing
* fix bug
* cache pip more
* change name
* update tests
* try again cause why not
* balance
* try again...
* try apt cache for cuda
* try on gpu:
* try cuda again
* update packages step
* replace libz-dev with zlib1g-dev
* only cache cuda
* why error
* fix gpuocelot bug
* apt cache err
* apt cache to slow?
* opt and image in single runner
* add a couple n=autos
* remove test matrix
* try cuda apt cache again
* libz-dev -> zlib1g-dev
* remove -s since not supported by xdist
* the cache takes too long and doesn't work
* combine webgpu and metal tests
* combine imagenet to c and cpu tests
* torch tests with linters
* torch back by itself
* small windows clang test with torch tests
* fix a goofy windows bug
* im dumb
* bro
* clang with linters
* fix pylint error
* linter not work on windows
* try with clang again
* clang and imagenet?
* install deps
* fix
* fix quote
* clang by itself (windows too slow)
* env vars for imagenet
* cache pip for metal and webgpu tests
* try torch with metal and webgpu
* doesn't work, too long
* remove -v
* try -n=logical
* don't use logical
* revert accidental thing
* remove some prints unless CI
* fix print unless CI
* ignore speed tests for slow tests
* clang windows in matrix (ubuntu being tested in imagenet->c test)
* try manual pip cache
* fix windows pip cache path
* all manual pip cache
* fix pip cache dir for macos
* print_ci function in helpers
* CI as variable, no print_ci
* missed one
* cuda tests with docker image
* remove setup-python action for cuda
* python->python3?
* remove -s -v
* try fix pip cache
* maybe fix
* try to fix pip cache
* is this the path?
* maybe cache pip
* try again
* create wheels dir
* ?
* cuda pip deps in dockerfile
* disable pip cache for clang
* image from ghcr instead of docker hub
* why is clang like this
* fast deps
* try use different caches
* remove the fast thing
* try with lighter image
* remove setup python for cuda
* small docker and cuda fast deps
* ignore a few more tests
* cool docker thing (maybe)
* oops
* quotes
* fix docker command
* fix bug
* ignore train efficientnet test
* remove dockerfile (docker stuff takes too long)
* remove docker stuff and normal cuda
* oops
* ignore the tests for cuda
* does this work
* ignore test_train on slow backends
* add space
* llvm ignore same tests as cuda
* nvm
* ignore lr scheduler tests
* get some stats
* fix ignore bug
* remove extra '
* remove and
* ignore test for llvm
* change ignored tests and durationon all backends
* fix
* and -> or
* ignore some more cuda tests
* finally?
* does this fix it
* remove durations=0
* add some more tests to llvm
* make last pytest more readable
* fix
* don't train efficientnet on cpu
* try w/out pip cache
* pip cache seems to be generally better
* pytest file markers
* try apt fast for cuda
* use quick install for apt-fast
* apt-fast not worth
* apt-get to apt
* fix typo
* suppress warnings
* register markers
* disable debug on fuzz tests
* change marker names
* apt update and apt install in one command
* update marker names in test.yml
* webgpu pytest marker