Commit Graph

37 Commits

Author SHA1 Message Date
George Hotz
4679f9fb44 add detach to graph [pr] (#8221)
* add detach to graph [pr]

* accept failure
2024-12-13 14:21:32 -08:00
qazal
9044b0746a delete lazy [pr] (#7801)
* LazyBuffer = UOp

* try 4 at this diff

* skip optimization tests p1

* raise kernel count expectations

* BIND isn't the _only_ uop that can become a tensor

* fix test_ones_sum on symbolic

* bump openpilot, correctness first

* offset on assign is fine

* uop is immutable

* what if this was higher

* more optimization skips

* instant fold const copy

* test_multitensor shouldn't expect buffer for unrealized

* move copy folder to upats

* start BUFFER_VIEW

* kinda BUFFER_VIEW

* Revert "kinda BUFFER_VIEW"

This reverts commit 94b4fe3040.

* BUFFER_VIEW try 2

* linter and missed _device

* pylint

* keep Ops.CONTIGUOUS

* always BUFFER_VIEW disk

* test

* cpu isn't a real device

* buffer references afte del

* add that back

* start bringing some of these back

* more test updates

* simpler simplify copy

* subbufer everything

* this is fine with buffer view

* cleanup the diff in test/ 1

* copy is one thing

* diff pruning

* diff pruning 2

* oh bind unbinds way too early

* extra

* more diff pruning

* more const folding

* experiment with symbolic here

* Revert "experiment with symbolic here"

This reverts commit cb87d61f7a.

* Revert "more const folding"

This reverts commit 2a7d258a2b.

* Revert VALID early folding

This reverts commit 4074f52317.

* storing const is fine

* fix test_prefer_half_buffer

* iterate on test_real_world

* this fixes test_train_mnist memory, breaks everything else

* Revert "this fixes test_train_mnist memory, breaks everything else"

This reverts commit dccfcbe068.

* always expect buffer to exist here

* temp debug: something is mutating lazydata in compile3

* Revert "temp debug: something is mutating lazydata in compile3"

This reverts commit 71400f0d55.

* everything back to normal

* compile3

* compile3 test

* start captured jit work, that test passes

* finalized memory skip set

* linter err

* back to base here

* tiny metaop cleanup

* print tensor

* 4th type this unbind got me

* green pickle

* tensor_variable sanity

* cast sanity

* link from the reds

* COPY sanity + minor repr change

* you can exist

* enable test_winograd

* bye bye nbytes

* danger, uop is mutating

* real become

* delete those from uop init

* put it in buffer init

* buffer inits with so much stuff

* buffer pickle try 2

* toposort can't be a cached property

* fix test_schedule_gc_with_inputs

* remove all @unittest.skip(gc)

* Revert "remove all @unittest.skip(gc)"

This reverts commit 9d8d92dd85.

* reenable real world + test_schedule_gc

* test: RUN_PROCESS_REPLAY=0

* fix pickle jit

* test changes

* reenable test_lru_alloc and TestTrain

* fix imagedtype

* bring pr back

* reenable 3 gc tests

* test_schedule better diff

* disable SPLIT_REDUCEOP

* test_save_all_dtypes looks fixed

* fix metadata

* skip that one

* fix viz by not pickling buffers

* simple test for const folding

* bring split reduceop back

* add simplify_alu

* simplify_binop fixes a test

* fix cast folding

* disable that test

* that test looks fine

* changes from delete_lazy pruning p1

* cast folding and children base

* test: cast folding from pruning branch

* green test_sgd_4convs_fuse_conv_bw

* enable some indexing folding

* test_complex_backward is fixed

* prune more, 295 -> 233

* fix test_multi_const_folding_literal

* fix double copy

* early become test

* ooooops

* clean up ctx in all big_graph

* fix openpilot 208 kernels

* train_cifar is fine now

* fix CAST_BEFORE_VIEW

* ever faker const

* back to 13

* mark expectedFailure

* fine don't create them

* test_multi_const_folding_tensor

---------

Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-12-12 05:05:19 +08:00
qazal
b894657aa7 assert the same things without mutating or accessing internal ops state [pr] (#8157)
* don't mutate internal state in test_lazybuffer

* fix test_schedule internals

* save time

* third si

* fine sometimes buffer_view isn't there
2024-12-11 22:01:27 +08:00
qazal
df84dc6444 unrelated test fixups from delete_lazy [pr] (#8088)
* unrelated test fixups from delete_lazy [pr]

* fine if it's scheduled later
2024-12-06 17:31:02 +02:00
qazal
435a51e10c reduce folding simple tests [pr] (#8040)
* reduce folding simple tests [pr]

* test for view and realized src pattern

* realize / buffer behavior
2024-12-05 12:22:45 +08:00
ignaciosica
597a239e28 Remove UnaryOps, BinaryOps, TernaryOps, MetaOps [pr] (#7725)
* remove unaryops

* remove ternaryops

* remove metaops

* hotfix

* remove binaryops

* hotfix: test_pattern_matcher

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-11-16 20:56:56 +08:00
chenyu
aeb1301bab enable a few tests that work now (#7721)
should mark the ones that are expected to work with expectedFailure, and delete and ones that are not expected to work
2024-11-15 14:30:52 -05:00
George Hotz
205befa788 move is_dtype_supported to device [pr] (#7575) 2024-11-07 20:38:03 +08:00
Ahmed Harmouche
36488a2a43 Use is_dtype_supported in more places in tests (#7529) 2024-11-04 09:21:15 -05:00
George Hotz
c8bf09b7d4 s/UOps/Ops (#7500)
* s/UOps/Ops [pr]

* fix
2024-11-03 11:26:10 +08:00
chenyu
b76f0c875e lazy const fold idiv 1 (#6285) 2024-08-26 10:29:59 -04:00
chenyu
590c0922b6 Tensor.prod (#6250)
* Tensor.prod

a new reduce op!

* onnx ReduceProd
2024-08-23 10:06:32 -04:00
chenyu
21d6739237 remove UnaryOps.NEG from lazy.py (#6193)
* remove UnaryOps.NEG from lazy.py

* neg is no longer unary
2024-08-19 18:41:28 -04:00
qazal
28c75bf2a6 merge uops with ops (#6111)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-08-16 18:17:57 -04:00
qazal
c23d44c779 AST is UOp (#6030)
* most of the work from the uops2 branch

* schedule

* realize

* kernel

* lowerer

* search

* green

* merge uops with ops

* Revert "merge uops with ops"

This reverts commit 1408a59f12.

* fix benchmark

* remove extra dedup
2024-08-16 22:09:00 +03:00
George Hotz
fa7e734b49 MetaOps.KERNEL (#5543) 2024-07-17 19:41:23 -07:00
George Hotz
6707c778d0 scheduleitem is not Tuple [run_process_replay] (#5425)
* scheduleitem is not Tuple [run_process_replay]

* fix tests

* fix op + fuzzers

* fix mop test
2024-07-12 15:13:19 -07:00
chenyu
2396ab9b33 more transcend cleanup [run_process_replay] (#5369)
fix test name, less # noqa: E501 and removed the cast
2024-07-10 23:05:03 -04:00
George Hotz
0215c952c5 Move transcendental to UOp level (#5367)
* move uopgraph to file [run_process_replay]

* transcendental uops

* tests pass

* no skip

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-10 19:06:25 -07:00
hikettei
320e7ed935 Approximations for SIN/LOG2/EXP2 passing all tests. (#5187)
* [WIP] Added an approximated implementation of Sin(FP32, FP64) passing all tests on Clang runtime

* Map nan/-inf/inf as 1.0 in order to avoid doing as_const(math.inf)

* [WIP] Added a support for LLVM IR

* cleaned up the code for the mypy and linter

* [WIP] Updated fp64 supports (bitwise shift causes the compilation error), fixed linter issue.

* [Add] added fast=true mode which disables the payne-hanek reduction which is slow

* [Fix] fails to compute elements when shape includes zero

* [WIP] Added BinaryOps.ADD/BinaryOps.OR to assembly

* [wip] update the assembly for ptx

* Enables fast=True when device is one of PTX, NV, CUDA, to avoid slow bitwise ops (as lv3 reduction is not required).

* [WIP] Added an approximation of LOG2/EXP2 (FP32, FP64)

* [Fix] Cyclic dependencies existing in xlog2

* [Fix] Cycle dependency in the graph of exp2, and log2. (passing test_symbolic_ops.py)

* [Fix] keep using higher precision for exp2, but cycle graph issue remained to be fixed...

* [Refactor] removed is_metal option. xsin does not rely on fp64 when fp32 mode.

* [WIP] fp16 xsin implementation passing all tests. (still needs to be refactored)

* [WIP] Added fp16 exp2 implementation

* [WIP] Increased the precision of Log2 from 3.5 ULP to 1.0 ULP, and added FP16 Log2 approximation.

* stashed the changes for FP16 sin

* [Fix] Patch for FP16 Sin/Exp2. (updated the dtype_via, fp32_p, and lower)

* [Refactor] migration to fastmath.py, some code simplification, renamed apis in fastmath, et al.

* [Refactor] Added the function polyN to clean-up N-terms polynomial approximation.

* [Patch] Increase fp64 precision when ldexp3k if possible, and patch for fp16 exp2

* [Patch] added bitcast_forward option

* [Patch] resolved cycle graph

* patch fix cycle graph

* set bitcast_forward=True in ilogb2k

* bitcast_forward for multi.py

* E501

* Break into multiple small PRs

* [Patch] FP16 -> FP64 upcast is not anymore required since xlog2 use quad precision polyN

* [Patch] NV still required FP64 for xlog2

* updated schedule test

* updated the count of kernels

* [Update] Removed all bitwise ops (SHL/SHR), tweaked the nan manipulation of log2, passing all tests except for AMD.

* Bitcast: make them api-compatible

* [update] force to use bitcast

* updated the count of constant folding

* [Patch] Creating a mask for exp2 using x <= Inf satisfies True as long as x is a real value

* [Update] isNaN(x) Free log2 algorithm, passing PTX tests, METAL with fastmath enabled is able to handle nan well, amd backend will not crash.

* xsin is reluctant to call payne_hanek_reduction which is slow to compile, passing stable diffusion compilation in a realistic time

* some minor simplification to payne hanek reduction

* [refactor] refactored some rebundant parts existing in payne hanek

* [refactor] more readable payne hanek impl

* [refactor] improved the code consistency of payne hanek

* [experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)

* Revert "[experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)"

This reverts commit 0eee08b87c.

* use allow_buffer_view

* lets support multilazytensor

* updated the count of kernels

* [test] added the jit tests for approx ops

* keep failed constant folding tests tested, added expectedFailure

* explict the timeout deadline when testing approx jit timeout

* [WIP] Simplified the implementation of xsin, never timeouts

* [Refactor] Improved the consistency of approx sin implementation, passing time out tests

* integrated xexp2_base into xexp2

* Set switch_over=39800.0

* delete: is_buffer_fastmath_supported

* sin: compute against abs(x)

* some cleanups

* fix typo

* removed the space between param and dtype

* allow 514 kernels on CI for sd

* [refactor] no need to upcast ad ldexp3k

* [refactor] added some comments, references to help understanding the code.

* [Fix] 1.0 ULP Sine Approximation for FP16

* [update] assume e != 0

* use pow2if instead of ldexp3k to fuse payne_hanek reduction into one

* check if approximated sin/log2/exp are fused into one

* clean up changes

* test amd exp

* some code cleanup and test sigmoid

* fix: enabled payne_hanek for fp16 to achieve higher acc

* fix: payne_hanek always accumlates the value with uint64, and fp16 sin is fused to a single kernel

* [Refactor] Rename: fastmath -> transcendental

* [Refactor] Added TRANSCENDENTAL, Moved the gate function to function.py

* updated const folding tests

* TRANSCENDENTAL as a ContextVar, removed old test of cody waite reduction, added assertions, et al.

* Add: unittest.main()

* Import TRANSCENDENTAL instead of getenv

* Refactor: Added dtype check when TRANSCENDENTAL=2, more context var

* Patch: xlog2, break expt(2, 32) x 2 -> expt(2, 16) x 4 for fp16 math

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-10 16:44:58 -07:00
qazal
981afb114f safely fold NEG in lazy.py (#5135)
* safe

* add test
2024-06-24 19:40:37 -04:00
chenyu
36a1f38049 lazy folding: mul -1 is neg, and neg neg is noop (#4472) 2024-05-08 01:52:22 -04:00
chenyu
c508eb7425 revert the removal of CAST_BEFORE_VIEW (#4471)
this brings most of the memory gain for resnet back.
2024-05-08 00:14:29 -04:00
chenyu
f363f39e83 fix dtype of const folded sum (#4349)
const folding sum should return in the same dtype the same as regular sum, which can be different from input dtype
2024-04-29 11:40:45 -04:00
George Hotz
ba7314c26b cleanup lbs (#4163) 2024-04-12 22:32:16 -07:00
chenyu
a7c6864260 remove CAST_BEFORE_VIEW (#4152)
* remove CAST_BEFORE_VIEW

testing perf, also this might have issue with assign?

* remove all
2024-04-13 01:05:08 -04:00
geohotstan
1a1dd1c1a7 add and enable tests for indexing const folding (#4068)
* enable test in test_indexing

* added tests

* rename stuff

* del a test case cuz it's loadops.copy
2024-04-04 10:46:28 -04:00
chenyu
406cb5fd90 const fold ReduceOps (#4059) 2024-04-03 14:39:28 -04:00
chenyu
fe03725b21 const fold cast unrealized_unpadded_const (#4047)
* const fold unrealized_unpadded_const

changed the underlying arg directly

* CAST_BEFORE_VIEW folds some

* fix const index in getitem
2024-04-03 12:31:24 -04:00
chenyu
f61ed869f5 Use exec_alu for lazy const folding (#4039) 2024-04-02 20:52:05 -04:00
chenyu
85edc493b0 uops const fold rules to prevent tautological compare warnings (#4041)
* uops const fold rules to prevent tautological compare warnings

`bool < false` is false, `true < bool` is false, `a == a` is true, `a != a` is false

* not true for nan

* and nan does not work with llvm

* full truth table test

* revert a==a

* comments and indents
2024-04-02 16:45:58 -04:00
chenyu
82440d3416 don't call contiguous for unpadded const into multi tensor (#4032)
* don't call contiguous for unpadded const into multi tensor

fixed multi const folding for sharded const.
still wip, need to be careful that this does not break multi device cache somewhere

* ehh need a memory test for that

* simple sharded memory test
2024-04-01 19:22:14 -04:00
chenyu
77a68fc52f test examples for multi tensor const folding (#4031)
works with literal const operand now because it's copied to each shard and handled by lazy.
does not work for sharded const
2024-04-01 16:53:43 -04:00
chenyu
379d52548d const fold left const operand for ADD and MUL (#4029)
* const fold left const operand for ADD and MUL

* neg have dtype issue
2024-04-01 15:09:04 -04:00
chenyu
0e02d074bd fix Tensor.pow folding for exponent 0 and 1 (#4025) 2024-03-31 19:57:23 -04:00
chenyu
d3f27761b0 move const folding of ADD/SUB/MUL from tensor to lazy (#4020)
* move const folding of ADD/SUB/MUL from tensor to lazy

will do div and pow separately.

* fix onnx adding with None
2024-03-31 16:35:36 -04:00
chenyu
7f859593b8 fix _to_const_val and const folding around it (#4017)
* fix _to_const_val and const folding around it

is_unrealized_contiguous_const is too strict and almost never hit if const is expanded.
suffice to check if there's no pad

* that test is folded

* test_const_folding
2024-03-31 13:09:23 -04:00