* beam compare 2
* found issue maybe
* correct, not fail
* full rand
* less numpy
* extra simplify doesn't fix it
* reorder
* no numpy
* check in reverse
* test new tensor behavior
* better error msg
* remove check_process_replay
* that can go to the top
* add assert back
* [run_process_replay]
* checkout code [run_process_replay]
* temp [run_process_replay]
* revert temp [run_process_replay]
* ahh this is why [run_process_replay]
* revert temp [run_process_replay]
* [Patch] Removed weird NaN handling in xlog2 that resulted in different output around 1e-203
* Patch: compare the value of xlog(x) using y, allowing x <= 1e-200
* mypy
* fuzzer tests for log2
* fix tests: use approximate dbl_min, fp64 fails on NV
* update: gradually increment the scale (if y is not inf)
* fixes on transcendental: fix for fp64 payne hanek, refactor for fp16 sin
* revert the changes on test
* refactor on xsin: removed cody_waite_reduction, always use payne_hanek
* Revert "refactor on xsin: removed cody_waite_reduction, always use payne_hanek"
This reverts commit 2fd401f251.
* still need cody_waite_reduction for the very small range
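For context, a minimal sketch of what a 2-term Cody-Waite reduction does; the constants and names here are illustrative, not the repo's cody_waite_reduction:

```python
import math

# Split pi/2 into a "high" part whose low mantissa bits are zero, so k*PIO2_HI
# is exact for small k, plus a "low" correction. Subtracting the two parts
# separately keeps bits that the naive x - k*(pi/2) would lose.
PIO2_HI = float.fromhex("0x1.921fb54p+0")  # top bits of pi/2
PIO2_LO = math.pi / 2 - PIO2_HI            # the remaining bits

def cody_waite_mod_pio2(x: float) -> tuple[int, float]:
    k = round(x * (2.0 / math.pi))
    return k, (x - k * PIO2_HI) - k * PIO2_LO
```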
* test: added a regression test for transcendental sin
* test: found the worst case ULP of 3.5 only in numpy
* give the input as a valid dtype
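The ULP figures quoted in these tests can be measured with a helper along these lines (illustrative, not the repo's test code):

```python
import math

def ulp_error(approx: float, reference: float) -> float:
    # error in units of the last place of the reference value
    return abs(approx - reference) / math.ulp(reference)
```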
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* replays
* what's in there
* can it be up there
* sha is enough
* insert sha as the key
* fix str
* update reset utils
* that nested try/except was terrible
* github_context can go
* test: use const
* hotfix: base
* asserts
* don't push through reshape
* cleanup
* don't need the cache
* test_reduceop_reshape_dont_push and test_index_fused are next
* improve single kernel indexing
* metadata in graph (#5399)
* indexing is O(1)
* add failing test
* ugh, that all needs to be replaced with symbolic
* broken on ptx, it's fine
---------
Co-authored-by: wozeparrot <wozeparrot@gmail.com>
* indexing getting better [run_process_replay] [no_assert]
* fix test
* test_arange_2_reduce is a simpler test
* put that print back, NOOPT
* don't merge reduces (they could be different reduces)
* FUSE_AS_ONE_KERNEL
* fix tests
* fix test_var_multireduce
* w/e put that there
* fails on others too
* fix test, revert UNMUL change
* in case order matters
* one kernel indexing works
* one kernel indexing works (test other)
* more transcend math tests in ci
test large inputs to trig functions that hit a different reduction algorithm, and test TRANSCENDENTAL=2 for all backends
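A sketch of the kind of check described here; the inputs, tolerance, and env-var usage are assumptions, not the repo's actual test:

```python
import math, os
os.environ["TRANSCENDENTAL"] = "2"  # force the approximate implementations
from tinygrad import Tensor

for x in [1e10, 1e15, 30.0 * math.pi]:  # large inputs hit the slower reduction path
    got = Tensor([x]).sin().item()
    assert abs(got - math.sin(x)) < 1e-5, (x, got, math.sin(x))
```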
* no CUDACPU
* try that
* [WIP] Added an approximate implementation of sin (FP32, FP64) passing all tests on the Clang runtime
* Map nan/-inf/inf to 1.0 in order to avoid doing as_const(math.inf)
* [WIP] Added support for LLVM IR
* cleaned up the code for the mypy and linter
* [WIP] Updated fp64 support (bitwise shift causes a compilation error), fixed linter issue.
* [Add] added fast=True mode which disables the slow Payne-Hanek reduction
* [Fix] failure to compute elements when the shape includes zero
* [WIP] Added BinaryOps.ADD/BinaryOps.OR to assembly
* [wip] update the assembly for ptx
* Enables fast=True when device is one of PTX, NV, CUDA, to avoid slow bitwise ops (as lv3 reduction is not required).
* [WIP] Added an approximation of LOG2/EXP2 (FP32, FP64)
* [Fix] Cyclic dependencies in xlog2
* [Fix] Cyclic dependency in the graph of exp2 and log2 (passing test_symbolic_ops.py)
* [Fix] keep using higher precision for exp2, but the cyclic graph issue remains to be fixed...
* [Refactor] removed is_metal option. xsin does not rely on fp64 when fp32 mode.
* [WIP] fp16 xsin implementation passing all tests. (still needs to be refactored)
* [WIP] Added fp16 exp2 implementation
* [WIP] Increased the precision of Log2 from 3.5 ULP to 1.0 ULP, and added FP16 Log2 approximation.
* stashed the changes for FP16 sin
* [Fix] Patch for FP16 Sin/Exp2. (updated the dtype_via, fp32_p, and lower)
* [Refactor] migration to fastmath.py, some code simplification, renamed apis in fastmath, et al.
* [Refactor] Added the function polyN to clean-up N-terms polynomial approximation.
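polyN is Horner's rule; a float-level sketch of the idea (the real helper builds ops on tensors/uops rather than evaluating floats):

```python
import functools, math

def polyN(x: float, coeffs: list[float]) -> float:
    # c[0]*x^(n-1) + c[1]*x^(n-2) + ... + c[n-1], one multiply-add per term
    return functools.reduce(lambda acc, c: acc * x + c, coeffs, 0.0)

# example: the first Taylor terms of sin, x^5/120 - x^3/6 + x, in Horner form
assert abs(polyN(0.1, [1/120, 0.0, -1/6, 0.0, 1.0, 0.0]) - math.sin(0.1)) < 1e-9
```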
* [Patch] Increase fp64 precision when ldexp3k if possible, and patch for fp16 exp2
* [Patch] added bitcast_forward option
* [Patch] resolved cycle graph
* patch fix cycle graph
* set bitcast_forward=True in ilogb2k
* bitcast_forward for multi.py
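bitcast_forward gates exactly this kind of trick; a hedged sketch of what an ilogb2k-style exponent read computes (SLEEF-inspired naming; the repo's version emits a bitcast op instead of struct-packing):

```python
import struct

def ilogb2k(d: float) -> int:
    # pull the biased exponent straight out of the float64 bit pattern
    bits = struct.unpack("<Q", struct.pack("<d", d))[0]
    return ((bits >> 52) & 0x7FF) - 0x3FF  # 11 exponent bits, bias 1023

assert ilogb2k(8.0) == 3 and ilogb2k(0.5) == -1
```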
* E501
* Break into multiple small PRs
* [Patch] FP16 -> FP64 upcast is no longer required since xlog2 uses quad-precision polyN
* [Patch] NV still requires FP64 for xlog2
* updated schedule test
* updated the count of kernels
* [Update] Removed all bitwise ops (SHL/SHR), tweaked the NaN handling of log2, passing all tests except for AMD.
* Bitcast: make them api-compatible
* [update] force to use bitcast
* updated the count of constant folding
* [Patch] Create a mask for exp2 using x <= Inf, which is True as long as x is a real value
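The trick in plain Python: for every real x (including +/-inf) the comparison x <= inf is True, and it is False only for NaN, so it doubles as an is-not-NaN mask without an explicit isnan():

```python
import math

for x in (1.5, 0.0, -math.inf, math.inf):
    assert x <= math.inf           # True for any real value
assert not (math.nan <= math.inf)  # False only for NaN
```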
* [Update] isnan(x)-free log2 algorithm: passes the PTX tests, METAL with fastmath enabled handles NaN well, and the AMD backend no longer crashes.
* xsin avoids calling payne_hanek_reduction, which is slow to compile; stable diffusion now compiles in a realistic time
* some minor simplification to payne hanek reduction
* [refactor] refactored some redundant parts of payne hanek
* [refactor] more readable payne hanek impl
* [refactor] improved the code consistency of payne hanek
* [experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)
* Revert "[experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)"
This reverts commit 0eee08b87c.
* use allow_buffer_view
* let's support multilazytensor
* updated the count of kernels
* [test] added the jit tests for approx ops
* keep the failing constant folding tests running, marked with expectedFailure
* make the timeout deadline explicit when testing the approx jit timeout
* [WIP] Simplified the implementation of xsin, never times out
* [Refactor] Improved the consistency of approx sin implementation, passing time out tests
* integrated xexp2_base into xexp2
* Set switch_over=39800.0
* delete: is_buffer_fastmath_supported
* sin: compute against abs(x)
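What computing against abs(x) buys, sketched with math.sin standing in for the polynomial kernel: sin is odd, so evaluate on |x| and restore the sign, halving the range the approximation must cover:

```python
import math

def sin_via_abs(x: float) -> float:
    return math.copysign(1.0, x) * math.sin(abs(x))
```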
* some cleanups
* fix typo
* removed the space between param and dtype
* allow 514 kernels on CI for sd
* [refactor] no need to upcast at ldexp3k
* [refactor] added some comments and references to help understand the code.
* [Fix] 1.0 ULP Sine Approximation for FP16
* [update] assume e != 0
* use pow2if instead of ldexp3k to fuse payne_hanek reduction into one kernel
* check if approximated sin/log2/exp are fused into one kernel
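A hedged sketch of what a pow2if-style constant computes (SLEEF-inspired naming; the repo builds the equivalent out of uops): construct 2**q for integer q by writing the biased exponent directly into a float32 bit pattern, with no ldexp-style multiply chain:

```python
import struct

def pow2if(q: int) -> float:
    assert -126 <= q <= 127, "stay within the normal float32 exponent range"
    return struct.unpack("<f", struct.pack("<I", (q + 127) << 23))[0]

assert pow2if(3) == 8.0 and pow2if(-1) == 0.5
```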
* clean up changes
* test amd exp
* some code cleanup and test sigmoid
* fix: enabled payne_hanek for fp16 to achieve higher accuracy
* fix: payne_hanek always accumulates the value with uint64, and fp16 sin is fused into a single kernel
* [Refactor] Rename: fastmath -> transcendental
* [Refactor] Added TRANSCENDENTAL, Moved the gate function to function.py
* updated const folding tests
* TRANSCENDENTAL as a ContextVar, removed old test of cody waite reduction, added assertions, et al.
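Assuming the ContextVar lands in tinygrad.helpers like the other context vars (exact location may differ), usage would look roughly like this sketch:

```python
from tinygrad.helpers import Context, TRANSCENDENTAL
from tinygrad import Tensor

with Context(TRANSCENDENTAL=2):      # force the approximations on any backend
    y = Tensor([1e10]).sin().item()  # takes the transcendental path
# outside the block, TRANSCENDENTAL.value is back to its default
```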
* Add: unittest.main()
* Import TRANSCENDENTAL instead of getenv
* Refactor: Added dtype check when TRANSCENDENTAL=2, more context var
* Patch: xlog2, break expt(2, 32) x 2 -> expt(2, 16) x 4 for fp16 math
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
* st to uops function
* lowerer
* uops reduce
* uops reduce
* acc_number correct
* reduce unroll
* complete unroll
* do upcasts
* handle multioutput
* define_accs
* fix valid
* get grouped dims
* revert lin
* minor
* fixup_ast
* group for reduce
* group works now
* all forwards pass
* all ops tests pass
* fix clang
* mypy
* lil cleanups, no image yet
* ugh, variables everywhere
* bugfix
* counters and name fix
* use symbolic, not uops
* cleanups
* Fix tests
* linearizer tests
* expands
* float4 expand load
* tests pass
* woooo, float4 test
* test ops works again
* one more lin test
* more lin tests
* bypass
* fix tests
* something like this
* const in defineacc
* uops get_reduce_acc
* move around
* allow consts in the LOAD/STORE
* each axis should only appear once, 21 failures
* 16 failures
* fix some image
* optional float4
* onnx tests
* gate the stores
* add reorder
* fix terrible skip function
* tc work
* opt add/mul merge
* fix float4 tests
* tiny tweak, 9 failing
* 7 test failures
* start tc, but i don't think this will work
* progress on tensorcores
* note
* fix ops tests
* closer on tc
* weeee...one tensor core works
* still works, more generic
* large WMMA works
* tc test passes
* use WMMA as accumulator
* basic tc tests passing
* small gemm padded works
* 4 failures
* 3 tests failing
* super barrier
* now two tests failing
* one test failing
* cleanups, add reduce to UopGraph
* remove the linearizer
* remove unused
* lil cleanups
* Lowerer everywhere
* remove test that no longer exists
* image indexing
* llvm fix
* fix metal
* fix image
* fix images
* might fix ptx
* fix image type mismatch
* more tests pass
* CAST -> VECTORIZE
* forgot that one
* fix TestOps.test_flip_eye_crash
* locals shouldn't be image dtype
* change less files
* test fix
* fix recursive expands
* touches
* MULACC support in python
* delete unneeded
* alu before contract
* bug fixes
* tests
* no var multireduce
* simpler tc
* metal works in new style
* working on AMD and METAL
* fix amd
* shot in the dark, fix amd
* something for CUDA
* CUDA WORKS from the docs
* comment
* correct merge
* cleanups + ptx fix + get_reduce_acc
* local alias isn't used anymore
* add store sanity check
* fix for AMD
* cleanups and single expand pass
* more correct with acc_cache
* tests should pass
* block on WMMA
* tests pass
* merge contract and reduce
* contractor fixes issue
* multicontract
* pre expand wmma (same as a reduce)
* expand wmma and only take one
* all expands
* comments and whitespace
* deep pat test
* lint
* min diff
* min lines
* nothing
* is res extra
* cleanup2
* add res back
* reduce lines
* type anno
---------
Co-authored-by: qazal <qazal.software@gmail.com>