* Fix sm89 PTX=1 compilation
The minimum PTX version that supports sm89 is 7.8 (the same version also
supports sm90); without this, ptxas fails when running tinygrad with
PTX=1 on an RTX 4090.
* Use int(arch[3:]) for forward compat with SM10.0 if that happens
* init multidevice cuda graph
* cuda just works!
* clean
* linter happier
* linters happy
* update transfer inputs
* do not change free
* useless check for cuda
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
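A minimal sketch of the sm89 fix described at the top of this list: the arch string is parsed numerically and compared against 89 to pick the PTX version. The helper name `ptx_version_for`, the assumed `sm_NN` arch format, and the "7.5" fallback for older architectures are illustrative assumptions, not taken from the tinygrad source; only the 7.8 minimum for sm89/sm90 comes from the message above.

```python
def ptx_version_for(arch: str) -> str:
  # arch is assumed to look like "sm_89"; arch[3:] strips the "sm_" prefix
  sm = int(arch[3:])
  # PTX 7.8 is the minimum version that supports sm89 (and sm90);
  # the "7.5" fallback for older targets is an assumed placeholder
  return "7.8" if sm >= 89 else "7.5"

if __name__ == "__main__":
  for arch in ("sm_86", "sm_89", "sm_90", "sm_100"):
    print(arch, ptx_version_for(arch))
```

Comparing integers rather than strings is what gives the forward compatibility: as strings, "sm_100" sorts before "sm_89", so a lexicographic comparison would pick the older PTX version for a three-digit arch.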
* ptx float4 implementation
* remove from cache when trimming uops
* Gate for float4
* Linting fix
* disable test reasonable time for ptx
* import getenv
* Update uops.py
* linter
* Add div test for half
* upcast if op does not support operation
* fix offset
* Run only if dtype supported
* zero out registers when accessing by pred + cleanup
* Remove trailing whitespace
* revert
* spacing fix
* move cache clearing outside loop
* did this suddenly start working?
* unused import removed
* Remove cast
* Use pattern matching
* linting
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* remove HIP in core tinygrad
The CI test uses device RHIP and the HSA compiler (LinearizerOpt), so it is fine to remove HIP from tc.
Also updated the README and the EMULATE tc test flag.
* EMULATE_CUDA
* where fold try 2
* assign fold
* test_where_fold works
* add gated store support to ops_python
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* It works?
* Clamp correctly
* Refactor
* Make code better
* Undo some stuff
* First step to trying to make floats work
* Floats work in Python op but not metal because int div is different
Python integer division was implemented as //, which rounds towards
negative infinity, but C integer division rounds towards 0, so there
is an off-by-1 division error
* arange does cumsum with ints and then multiplies by step
This is so loop optimization can remain int only
* Undo a lot of symbolic changes
* Final check
* Cleanup
* There can be multiple phis
* Fix multiple phi op removal
* const sets dtype correctly
* Fix bugs
* Fix a couple bugs and add loop vars to resolve
* missed one
* Don't trim too many ops
* Fix symbolic test
* Use ones instead of full
* Delete test
* Lint passes
* max node error
* Small updates to loop logic
* Remove unnecessary changes
* We are getting somewhere
* Simple case
* Fix
* rm, prn
* Better
* If NumNode doesn't work then continue
* clamp is needed for arange(256)
* Move everything into the optim fn
* Replace correctly
* Order optimizations better
* Delete
* mypy
* Test for simplification
* Rename
* Fix test
* update test description
* Undo more
* Cleanup
* No replaced_ops map
* Fix lint
* AssertionError
* back again
* Reinstate assertion
* Return true and make diff not as big
* Bigger range for test
* Change cumsum impl
* fix bug
* make big cumsum work
* lint
* Undo cumsum 2-stage removal
* No while helper
* optional min/max clamping
* floats work
* rm giant arange test
* fix python cast None
* Check phi parents
* one phi allowed per where
* Fix one phi per where
* Rework iteration
* Delete assertions
* convert to int
* Try mul -1 instead of neg for hip..?
* Remove one phi per where requirements
* one accum only
* Lint
* should simplify a loop at a time
* Don't get rid of loop explicitly
* Need to iterate backwards
* lint
* unary neg
* Make optim work for onnx and sum_pad_collapse
* Better message
* filter alu ops correctly
* Fix the limiter
* lint and simplify
* Add it back
* off by one error
* test wheres and phis
* test max ops and non-if stuff
* <=
* cast_scalar
* Oops
* Change test
* Pass loop uops instead of a modified map
* Cut param transfer between linearizer and uops
* Fix issues
* Fix lint
* fix efficientnet python 3.8 invalid syntax
* distinct vars in seen_vars
* accurate var names
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
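Two of the notes in this list can be sketched in plain Python: the floor-versus-truncate integer division mismatch, and arange built from an integer cumsum that is multiplied by the step afterwards. The names `c_idiv` and `arange_via_cumsum` are hypothetical, and the exact formulation (a cumsum of ones, minus one, times step, plus start) is an assumption about the shape of the change, not the tinygrad implementation.

```python
import math
from itertools import accumulate

def c_idiv(a: int, b: int) -> int:
  # C-style integer division: truncate toward zero
  q = abs(a) // abs(b)
  return q if (a < 0) == (b < 0) else -q

def arange_via_cumsum(start, stop, step):
  # number of elements, as in ceil((stop - start) / step)
  n = max(0, math.ceil((stop - start) / step))
  # integer-only counter 0..n-1 from a cumsum of ones
  counter = [c - 1 for c in accumulate([1] * n)]
  # the (possibly float) step is applied only after the integer cumsum
  return [start + i * step for i in counter]

if __name__ == "__main__":
  print(-7 // 2, c_idiv(-7, 2))  # floor division vs truncating division
  print(arange_via_cumsum(0, 1, 0.25))
```

The two division results differ only when the operands have opposite signs and the division is inexact, which is where the off-by-1 shows up.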
When compilation succeeds but runtime fails due to thread limits
on METAL, this allows a beam search to proceed by treating the
runtime failure the same way as a compile failure.
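A rough sketch of the idea, assuming a search loop that times each candidate: a runtime error is mapped to the same "infinitely slow" result as a compile error, so the candidate is dropped and the search continues. `time_candidate`, `beam_step`, `compile_fn`, and `run_fn` are hypothetical names, and the exception types are assumptions; this is not tinygrad's search code.

```python
import math

def time_candidate(candidate, compile_fn, run_fn) -> float:
  try:
    prog = compile_fn(candidate)
  except Exception:
    # compile failure: candidate is unusable
    return math.inf
  try:
    # run_fn is assumed to return the measured runtime in seconds
    return run_fn(prog)
  except RuntimeError:
    # runtime failure (e.g. a thread limit) is treated like a compile failure
    return math.inf

def beam_step(candidates, compile_fn, run_fn, beam_width=2):
  timed = [(time_candidate(c, compile_fn, run_fn), c) for c in candidates]
  # drop failed candidates, keep the fastest beam_width survivors
  viable = [(t, c) for t, c in timed if t != math.inf]
  return sorted(viable, key=lambda tc: tc[0])[:beam_width]
```

Returning infinity instead of raising keeps one bad candidate from aborting the whole search.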
* run test_linearizer_failures on PYTHON backend
only test 1, some have hanging issues and gated store is not implemented
* --durations=20
* two less slow ones