* Prevent const folding in test_payne_hanek_reduction
* Do not use list as a default parameter
* Bitcast constant folding
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* remove Tensor._to_const_val
added a TODO for advanced indexing on const, which was the last place that checked const in Tensor
* that is not folding now
* one more
* LazyBuffer = UOp
* try 4 at this diff
* skip optimization tests p1
* raise kernel count expectations
* BIND isn't the _only_ uop that can become a tensor
* fix test_ones_sum on symbolic
* bump openpilot, correctness first
* offset on assign is fine
* uop is immutable
* what if this was higher
* more optimization skips
* instant fold const copy
* test_multitensor shouldn't expect buffer for unrealized
* move copy folder to upats
* start BUFFER_VIEW
* kinda BUFFER_VIEW
* Revert "kinda BUFFER_VIEW"
This reverts commit 94b4fe3040.
* BUFFER_VIEW try 2
* linter and missed _device
* pylint
* keep Ops.CONTIGUOUS
* always BUFFER_VIEW disk
* test
* cpu isn't a real device
* buffer references after del
* add that back
* start bringing some of these back
* more test updates
* simpler simplify copy
* subbuffer everything
* this is fine with buffer view
* cleanup the diff in test/ 1
* copy is one thing
* diff pruning
* diff pruning 2
* oh bind unbinds way too early
* extra
* more diff pruning
* more const folding
* experiment with symbolic here
* Revert "experiment with symbolic here"
This reverts commit cb87d61f7a.
* Revert "more const folding"
This reverts commit 2a7d258a2b.
* Revert VALID early folding
This reverts commit 4074f52317.
* storing const is fine
* fix test_prefer_half_buffer
* iterate on test_real_world
* this fixes test_train_mnist memory, breaks everything else
* Revert "this fixes test_train_mnist memory, breaks everything else"
This reverts commit dccfcbe068.
* always expect buffer to exist here
* temp debug: something is mutating lazydata in compile3
* Revert "temp debug: something is mutating lazydata in compile3"
This reverts commit 71400f0d55.
* everything back to normal
* compile3
* compile3 test
* start captured jit work, that test passes
* finalized memory skip set
* linter err
* back to base here
* tiny metaop cleanup
* print tensor
* 4th time this unbind got me
* green pickle
* tensor_variable sanity
* cast sanity
* link from the reds
* COPY sanity + minor repr change
* you can exist
* enable test_winograd
* bye bye nbytes
* danger, uop is mutating
* real become
* delete those from uop init
* put it in buffer init
* buffer inits with so much stuff
* buffer pickle try 2
* toposort can't be a cached property
* fix test_schedule_gc_with_inputs
* remove all @unittest.skip(gc)
* Revert "remove all @unittest.skip(gc)"
This reverts commit 9d8d92dd85.
* reenable real world + test_schedule_gc
* test: RUN_PROCESS_REPLAY=0
* fix pickle jit
* test changes
* reenable test_lru_alloc and TestTrain
* fix imagedtype
* bring pr back
* reenable 3 gc tests
* test_schedule better diff
* disable SPLIT_REDUCEOP
* test_save_all_dtypes looks fixed
* fix metadata
* skip that one
* fix viz by not pickling buffers
* simple test for const folding
* bring split reduceop back
* add simplify_alu
* simplify_binop fixes a test
* fix cast folding
* disable that test
* that test looks fine
* changes from delete_lazy pruning p1
* cast folding and children base
* test: cast folding from pruning branch
* green test_sgd_4convs_fuse_conv_bw
* enable some indexing folding
* test_complex_backward is fixed
* prune more, 295 -> 233
* fix test_multi_const_folding_literal
* fix double copy
* early become test
* ooooops
* clean up ctx in all big_graph
* fix openpilot 208 kernels
* train_cifar is fine now
* fix CAST_BEFORE_VIEW
* even faker const
* back to 13
* mark expectedFailure
* fine don't create them
* test_multi_const_folding_tensor
---------
Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* most of the work from the uops2 branch
* schedule
* realize
* kernel
* lowerer
* search
* green
* merge uops with ops
* Revert "merge uops with ops"
This reverts commit 1408a59f12.
* fix benchmark
* remove extra dedup
* [WIP] Added an approximate implementation of Sin (FP32, FP64) passing all tests on the Clang runtime
* Map nan/-inf/inf to 1.0 in order to avoid doing as_const(math.inf)
* [WIP] Added support for LLVM IR
* cleaned up the code for mypy and the linter
* [WIP] Updated fp64 support (bitwise shift causes a compilation error), fixed the linter issue.
* [Add] added fast=true mode which disables the slow Payne-Hanek reduction
* [Fix] failure to compute elements when shape includes zero
* [WIP] Added BinaryOps.ADD/BinaryOps.OR to assembly
* [WIP] update the assembly for ptx
* Enables fast=True when the device is one of PTX, NV, or CUDA, to avoid slow bitwise ops (as lv3 reduction is not required).
* [WIP] Added an approximation of LOG2/EXP2 (FP32, FP64)
* [Fix] Cyclic dependencies in xlog2
* [Fix] Cyclic dependency in the graph of exp2 and log2 (passing test_symbolic_ops.py)
* [Fix] keep using higher precision for exp2, but the cyclic graph issue remains to be fixed...
* [Refactor] removed is_metal option. xsin does not rely on fp64 in fp32 mode.
* [WIP] fp16 xsin implementation passing all tests. (still needs to be refactored)
* [WIP] Added fp16 exp2 implementation
* [WIP] Increased the precision of Log2 from 3.5 ULP to 1.0 ULP, and added FP16 Log2 approximation.
* stashed the changes for FP16 sin
* [Fix] Patch for FP16 Sin/Exp2. (updated the dtype_via, fp32_p, and lower)
* [Refactor] migration to fastmath.py, some code simplification, renamed APIs in fastmath, et al.
* [Refactor] Added the function polyN to clean up N-term polynomial approximation.
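A minimal sketch of what an N-term polynomial helper along the lines of polyN could look like, written as plain-Python Horner evaluation (the name, argument order, and coefficient ordering here are illustrative assumptions, not necessarily the actual tinygrad API):

```python
from functools import reduce

# Hypothetical polyN sketch: evaluate an N-term polynomial with Horner's scheme.
# coeffs are ordered from highest to lowest degree.
def polyN(x: float, coeffs: list[float]) -> float:
  # ((c0*x + c1)*x + c2)*x + ... : one multiply-add per coefficient
  return reduce(lambda acc, c: acc * x + c, coeffs, 0.0)

# example: 3*x**2 + 2*x + 1 at x = 2.0 evaluates to 17.0
assert polyN(2.0, [3.0, 2.0, 1.0]) == 17.0
```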
* [Patch] Increase fp64 precision when using ldexp3k if possible, and patch fp16 exp2
* [Patch] added bitcast_forward option
* [Patch] resolved cyclic graph
* patch: fix cyclic graph
* set bitcast_forward=True in ilogb2k
* bitcast_forward for multi.py
* E501
* Break into multiple small PRs
* [Patch] FP16 -> FP64 upcast is no longer required since xlog2 uses quad-precision polyN
* [Patch] NV still requires FP64 for xlog2
* updated schedule test
* updated the count of kernels
* [Update] Removed all bitwise ops (SHL/SHR), tweaked the NaN handling of log2, passing all tests except for AMD.
* Bitcast: make them API-compatible
* [Update] force the use of bitcast
* updated the constant folding count
* [Patch] Create a mask for exp2 using x <= Inf, which evaluates to True as long as x is a real value
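The reasoning here: for IEEE floats, x <= inf holds for every real value (including +/-inf), and only NaN fails the comparison, so it doubles as a cheap "is this a real value" mask. A tiny plain-Python illustration of that property:

```python
import math

# x <= inf is True for every real float, including +/-inf themselves;
# only NaN fails the comparison, so it can serve as a "real value" mask for exp2.
for x in [0.0, -1.5, 1e300, math.inf, -math.inf]:
  assert x <= math.inf
assert not (math.nan <= math.inf)  # NaN compares False with everything
```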
* [Update] isNaN(x)-free log2 algorithm, passing PTX tests; METAL with fastmath enabled handles NaN well, and the AMD backend will not crash.
* xsin avoids calling payne_hanek_reduction, which is slow to compile; stable diffusion compilation now passes in a realistic time
* some minor simplifications to the Payne-Hanek reduction
* [Refactor] refactored some redundant parts of Payne-Hanek
* [Refactor] more readable Payne-Hanek impl
* [Refactor] improved the code consistency of Payne-Hanek
* [experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)
* Revert "[experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)"
This reverts commit 0eee08b87c.
* use allow_buffer_view
* let's support multilazytensor
* updated the count of kernels
* [test] added the jit tests for approx ops
* keep failing constant folding tests in the suite, marked as expectedFailure
* make the timeout deadline explicit when testing the approx jit timeout
* [WIP] Simplified the implementation of xsin, never times out
* [Refactor] Improved the consistency of the approx sin implementation, passing timeout tests
* integrated xexp2_base into xexp2
* Set switch_over=39800.0
* delete: is_buffer_fastmath_supported
* sin: compute against abs(x)
* some cleanups
* fix typo
* removed the space between param and dtype
* allow 514 kernels on CI for sd
* [Refactor] no need to upcast at ldexp3k
* [Refactor] added some comments and references to help understand the code.
* [Fix] 1.0 ULP Sine Approximation for FP16
* [Update] assume e != 0
* use pow2if instead of ldexp3k to fuse the payne_hanek reduction into one kernel
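For context, a pow2if-style helper typically constructs 2**q for an integer q by writing the biased exponent directly into the float32 bit pattern rather than going through ldexp; a hedged plain-Python sketch of that idea (illustrative only, not tinygrad's implementation, and valid only for q in the normal exponent range):

```python
import struct

# Hypothetical pow2if sketch: place (q + bias) into the float32 exponent field
# (bias 127, exponent bits 23..30). Valid for normal results only,
# i.e. roughly -126 <= q <= 127.
def pow2if(q: int) -> float:
  return struct.unpack('<f', struct.pack('<I', (q + 127) << 23))[0]

assert pow2if(3) == 8.0
assert pow2if(-2) == 0.25
```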
* check that approximated sin/log2/exp are fused into one kernel
* clean up changes
* test amd exp
* some code cleanup and test sigmoid
* fix: enabled payne_hanek for fp16 to achieve higher accuracy
* fix: payne_hanek always accumulates the value with uint64, and fp16 sin is fused into a single kernel
* [Refactor] Rename: fastmath -> transcendental
* [Refactor] Added TRANSCENDENTAL, moved the gate function to function.py
* updated const folding tests
* TRANSCENDENTAL as a ContextVar, removed old test of Cody-Waite reduction, added assertions, et al.
* Add: unittest.main()
* Import TRANSCENDENTAL instead of getenv
* Refactor: Added dtype check when TRANSCENDENTAL=2, more context var
* Patch: xlog2, break expt(2, 32) x 2 -> expt(2, 16) x 4 for fp16 math
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>