* Create UnaryOps.RECIP and BinaryOps.IDIV and changing uses of BinaryOps.DIV
* Delete unused import
* Add cstyle renderer
* Fix formatting text
* Fix test error due to bad implementation of renderer
* Add PTX support
* Add RECIP to LLVMIR
* Remove BinaryOps.DIV from symbolic test
* Change some test and fix C floor division
* Change references to DIV for the RECIP or IDIV
* Add mimic idiv for symbolic test
* Restore floor
* Mimic idiv
* cast to int
* Fix some test and renderer
* Remove DIV for render nodes
* Resolve issue with div
* Add TestRenderer
* Fix test
* fix error
* Fix PAD test
* Fix div implementation
* Remove DIV
* Add upcast to rshift, due to use of MUL and RECIP on DIV
* Fix linter
* Remove complete BinaryOps.DIV
* Fix lint
* Fix some test
* Revert mul modification
* Fix tests
* Fix CLANG for uops
* Revert IDIV function
* Minor fix
* modify pattern matching rule to support nan
* Fix UNSAFE_PADS_OPS to add UnaryOps.RECIP
* Remove const folding for IDIV and fix PTX
* Complete remove IDIV from extra
* Remove test_div from TestFloatUOps due to test on recip
* Fix linearizer
* fix
* Fix test_22
* Fix llvm
* Apply trunc function for llvmlit
* use floor instead of trunc
* Use correct type
* Generate new fuzz db
* Fix rshift, do not cast to float to support idiv
* Return upcast=false to rshift
* Add to unsafepad BinaryOps.IDIV
* Remove RECIP override for CUDA
* add atol / rtol for the test
* Remove cast to int on IDIV
* Regenerate sops
* delete sops.gz
* regenerate
* regenerate
* regenerate
* Reduce margins
* pass atol and rtol as parametersg for _test_metrics
* regenerated dataset
* Regenerate
* Remove duplicated
* Revert changes on extra
* Remove changes extra and NOQA for test
* Remove E501
* Remove and change line
* Remove E501
* Fix atan2
* Revert import and E501
* Remove E501
* Add hrcp to halp ops
* Remove 1 of hrcp
* Remove last DIV and add type check on uops for IDIV
* Fix new tests
* Fix tests and custom function
* Regenerate dataset
* Regenerate dataset
* Revert dataset
* Change generate dataset script
* Remove line
* Change IDIV, type checker validate if x,y and z are int
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* basic tests
* cleanup
* pylint
* ruff
* use define acc as a proxy for rendered reductions
* use define acc as a proxy for rendered reductions
* recursive reduceop rendering via ast_parse
* linters + cleanup
* fixing late buf loading
* plus linters
* removing extra line
* linters
* does this break ci?
* added tests and if add end change
* typo in add_ends
* linters
* removing comments
* allow endifs to be inserted before the end of the graph
* find add ENDIF before next BARRIER
* removing tests with manual ENDIF + linters
* specifically the next barrier aftr the store of the local result
* Revert "specifically the next barrier aftr the store of the local result"
This reverts commit b288a5c3ce.
* keeping up to date
* linters + merge changes
* cleaning up old bad decisions
* linters and opts
* mrged linearizer tests
* fixing merge issues
* removing the big ugly uop test (functionality tested end-to-end by test_linearizer additions
* small diff fixes
* updating linearizer to work without uops.add( ... cachable)
* linters
* comment in multireduce tests
* skipping tests without locals
* full tests
* linters
* load_cache[key] fix for multiple accs
* linters
* assert only one reduceop
* fix loop_scope test to actually cause an issue
* self.load_cache[key] key for DEFINE_ACC changed to use a string to make sure each acc is unique
* updated tests
* fixing merge
* removing debug prints
* complete merge fix
* linters
* diff cleanup
* adding tests in
* give each reduce it's own local buffer
* gpu=1 changes
* store and load locals with upcasting
* modifying test?
* make multireduce_netsted_local_upcast test match single reduce shapes
* removing todo
* cleaning up the diff
* unroll test
* unroll and upcast tests
* fix gpu
* seq and self.load_cache[key] cleaning
* linters
* padto works
* merge fixes
* fixes
* add skips for amd
* linters + seq
* cleaning & more tests
* softmax tests
* linters
* [run_process_replay]
* add new tests back
This reverts commit 19dec22e01.
* more hardcoded -1s
* fix ptx
* Fix name for loop in ptx
* cleaning up the diff
* cleaning up the uops diff
* nv ci is too slow
---------
Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: Szymon Ożóg <58388001+SzymonOzog@users.noreply.github.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
* mockgpu nv
* works
* comment that out
* fix merge
* setup gpuocelot
* install packages
* not run all of them
* passes
* fix ci
* almost
* should pass
* linter
* linter 2
* try this?
* ugn, not supported
* ci
* remove ticket from description
* better descs
* Separate cast and bitcast
* Fix lint
* No more arg[0]
* Revert "No more arg[0]"
This reverts commit dee6911335513f092fe2cbb9684e8a9d26aad964.
* CAST/BITCAST arg is the dtype only, no more tuple
* No image bitcast, regenerate dataset
* Small fixes
* Adjust adds between WHERE and PHI
* Not much better
* undo recursive change
* hm
* iterate over where, not factored op
* oo
* consts only for loop
* UNdo var name change
* update
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias. now it splits it properly into 8 and the
remaining 2 into the correct local stride
* test_linearizer_failure: add failure 27 from a gpt2 kernel
found during a full fuzz test of applied_opts combos to a
depth of 4 on the gpt2 kernels w/o GROUPTOP.
added additional examples to failure 26 that don't have GROUPTOP
* add other platform failure
two followups after this. (1) if a buffer is never accessed in kernel, it can be removed from input (2) real_size can be smaller conditional on valid being true (the old validhack stuff)
* It works?
* Clamp correctly
* Refactor
* Make code better
* Undo some stuff
* First step to trying to make floats work
* Floats work in Python op but not metal because int div is different
Python integerdivision was implemented as // which rounds towards
negative infinity, but C integer division rounds towards 0 so there
is an off-by-1 division error
* arange does cumsum with ints and then multiplies by step
This is so loop optimization can remain int only
* Undo a lot of symbolic changes
* Final check
* Cleanup
* There can be multiple phis
* Fix multiple phi op removal
* const sets dtype correctly
* Fix bugs
* Fix a couple bugs and add loop vars to resolve
* missed one
* Don't trim too many ops
* Fix symbolic test
* Use ones instead of full
* Delete test
* Lint passes
* max node error
* Small updates to loop logic
* Remove unnecessary changes
* We are getting somewhere
* Simple case
* Fix
* rm, prn
* Better
* If NumNode doesn't work then continue
* clamp is needed for arange(256)
* Move everything into the optim fn
* Replace correctly
* Order optimizations better
* Delete
* mypy
* Test for simplification
* Rename
* Fix test
* update test description
* Undo more
* Cleanup
* No replaced_ops map
* Fix lint
* AssertionError
* back again
* Reinstate assertion
* Return true and make diff not as big
* Bigger range for test
* Change cumsum impl
* fix bug
* make big cumsum work
* lint
* Undo cumsum 2-stage removal
* No while helper
* optional min/max clamping
* floats work
* rm giant arange test
* fix python cast None
* Check phi parents
* one phi allowed per where
* Fix one phi per where
* Rework iteration
* Delete assertions
* convert to int
* Try mul -1 instead of neg for hip..?
* Remove one phi per where requirements
* one accum only
* Lint
* should simplify a loop at a time
* Don't get rid of loop explcitly
* Need to iterate backwards
* lint
* unary neg
* Make optim work for onnx and sum_pad_collapse
* Better message
* filter alu ops correctly
* Fix the limiter
* lint and simplify
* Add it back
* off by one error
* test wheres and phis
* test max ops and non-if stuff
* <=
* cast_scalar
* Oops
* Change test
* Pass loop uops instead of a modified map
* Cut param transfer between linearizer and uops
* Fix issues
* Fix lint
* fix efficientnet python 3.8 invalid syntax
* distinct vars in seen_vars
* accurate var names
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
be more specific about invalid kernel opt, used that in test_linearizer_failures.
make BEAM kernel search work even with assertion disabled.
`BEAM=2 python3 -O examples/llama.py --temperature=0 --count=10 --prompt="Hello." --timing`
* add FUZZ_NTH to fuzz_linearizer
also update tests in test_linearizer_failures to not just run on METAL
* update failures for HIP/HSA
* test_failure_21 LLVM PADTO
need to remove SUB since it's possible to have (const - (const - const)) in test/test_ops.py::TestOps::test_cos,
in which case cannot remove the parens of children