5015 Commits

Author SHA1 Message Date
George Hotz
3e40211e45 add UOP_IS_SYMBOLIC [run_process_replay] [no_assert] (#5386)
* cleanup a few things in uops [run_process_replay] [no_assert]

* add optional UOP_IS_SYMBOLIC
2024-07-11 10:48:45 -07:00
nimlgen
b3790b759b nv cleanup gpfifo setup (#5382)
* nv cleanup gpfifo setup

* save lines
2024-07-11 17:50:52 +03:00
chenyu
416f838a1a hotfix tqdm respects total=0 if set (#5380)
if you insist total=0, it should use 0 instead of inferring from iterable. matched tqdm
2024-07-11 10:30:12 -04:00
nimlgen
2ba96d4c29 nv use mv_address (#5381)
* nv use mv_address

* unused import
2024-07-11 16:45:03 +03:00
nimlgen
bd77efda2f add HWCommandQueue base class for hcq devices (#5303)
* add HWCommandQueue as base queue for hcq devices

* try this

* fixes

* comments

* linter

* linter2

* linter

* linter

* fixed

* revert this
2024-07-11 16:19:13 +03:00
qazal
dc3ea78560 hotfix: faster UOps.END* insert [run_process_replay] (#5377)
* is this faster

* p2

* don't waste lines
2024-07-11 13:20:19 +03:00
qazal
004366b193 context aware process replay [run_process_replay] (#5378)
* test tc as ctx var

* remove from opts

* process replay

* pop variable

* B -> Variable

* fix re-assign

* pop temp vars

* move TRANSCENDENTAL=2
2024-07-11 13:07:28 +03:00
qazal
45e1b9d5e3 use TC options as ContextVars [run_process_replay] (#5379)
* delete from renderer

* move to ctx
2024-07-11 12:01:36 +03:00
qazal
289fd2e940 Lowerer cleanup 2 [run_process_replay] (#5376)
* test outbufs delete

* comments

* valid is bool
2024-07-11 10:56:53 +03:00
qazal
9ca2d96b6b delete extra check in DEFINE_ACC [run_process_replay] (#5375)
2024-07-11 10:49:03 +03:00
George Hotz
3e9f200905 KernelInfo + cleanups [run_process_replay] (#5372)
2024-07-10 21:00:31 -07:00
chenyu
2396ab9b33 more transcend cleanup [run_process_replay] (#5369)
fix test name, less # noqa: E501 and removed the cast
2024-07-10 23:05:03 -04:00
George Hotz
909ad72c53 remove getattr [run_process_replay] (#5370)
* remove getattr [run_process_replay]

* don't waste lines
2024-07-10 19:42:17 -07:00
George Hotz
0215c952c5 Move transcendental to UOp level (#5367)
* move uopgraph to file [run_process_replay]

* transcendental uops

* tests pass

* no skip

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-10 19:06:25 -07:00
chenyu
64986f949c more transcend math tests in ci (#5368)
* more transcend math tests in ci

test large input to trig functions that hit different reduction algo, and test TRANSCENDENTAL=2 for all backend

* no CUDACPU

* try that
2024-07-10 21:19:09 -04:00
wozeparrot
c9b3ae6bbf fix llama.py chat mode assert (#5366)
2024-07-10 18:06:14 -07:00
George Hotz
d13654a820 move uopgraph to file [run_process_replay] (#5364)
* move uopgraph to file [run_process_replay]

* fix print tree test
2024-07-10 17:34:50 -07:00
hikettei
320e7ed935 Approximations for SIN/LOG2/EXP2 passing all tests. (#5187)
* [WIP] Added an approximated implementation of Sin(FP32, FP64) passing all tests on Clang runtime

* Map nan/-inf/inf as 1.0 in order to avoid doing as_const(math.inf)

* [WIP] Added a support for LLVM IR

* cleaned up the code for the mypy and linter

* [WIP] Updated fp64 supports (bitwise shift causes the compilation error), fixed linter issue.

* [Add] added fast=true mode which disables the payne-hanek reduction which is slow

* [Fix] fails to compute elements when shape includes zero

* [WIP] Added BinaryOps.ADD/BinaryOps.OR to assembly

* [wip] update the assembly for ptx

* Enables fast=True when device is one of PTX, NV, CUDA, to avoid slow bitwise ops (as lv3 reduction is not required).

* [WIP] Added an approximation of LOG2/EXP2 (FP32, FP64)

* [Fix] Cyclic dependencies existing in xlog2

* [Fix] Cycle dependency in the graph of exp2, and log2. (passing test_symbolic_ops.py)

* [Fix] keep using higher precision for exp2, but cycle graph issue remained to be fixed...

* [Refactor] removed is_metal option. xsin does not rely on fp64 when fp32 mode.

* [WIP] fp16 xsin implementation passing all tests. (still needs to be refactored)

* [WIP] Added fp16 exp2 implementation

* [WIP] Increased the precision of Log2 from 3.5 ULP to 1.0 ULP, and added FP16 Log2 approximation.

* stashed the changes for FP16 sin

* [Fix] Patch for FP16 Sin/Exp2. (updated the dtype_via, fp32_p, and lower)

* [Refactor] migration to fastmath.py, some code simplification, renamed apis in fastmath, et al.

* [Refactor] Added the function polyN to clean-up N-terms polynomial approximation.

* [Patch] Increase fp64 precision when ldexp3k if possible, and patch for fp16 exp2

* [Patch] added bitcast_forward option

* [Patch] resolved cycle graph

* patch fix cycle graph

* set bitcast_forward=True in ilogb2k

* bitcast_forward for multi.py

* E501

* Break into multiple small PRs

* [Patch] FP16 -> FP64 upcast is not anymore required since xlog2 use quad precision polyN

* [Patch] NV still required FP64 for xlog2

* updated schedule test

* updated the count of kernels

* [Update] Removed all bitwise ops (SHL/SHR), tweaked the nan manipulation of log2, passing all tests except for AMD.

* Bitcast: make them api-compatible

* [update] force to use bitcast

* updated the count of constant folding

* [Patch] Creating a mask for exp2 using x <= Inf satisfies True as long as x is a real value

* [Update] isNaN(x) Free log2 algorithm, passing PTX tests, METAL with fastmath enabled is able to handle nan well, amd backend will not crash.

* xsin is reluctant to call payne_hanek_reduction which is slow to compile, passing stable diffusion compilation in a realistic time

* some minor simplification to payne hanek reduction

* [refactor] refactored some redundant parts existing in payne hanek

* [refactor] more readable payne hanek impl

* [refactor] improved the code consistency of payne hanek

* [experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)

* Revert "[experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)"

This reverts commit 0eee08b87c.

* use allow_buffer_view

* lets support multilazytensor

* updated the count of kernels

* [test] added the jit tests for approx ops

* keep failed constant folding tests tested, added expectedFailure

* make the timeout deadline explicit when testing approx jit timeout

* [WIP] Simplified the implementation of xsin, never timeouts

* [Refactor] Improved the consistency of approx sin implementation, passing time out tests

* integrated xexp2_base into xexp2

* Set switch_over=39800.0

* delete: is_buffer_fastmath_supported

* sin: compute against abs(x)

* some cleanups

* fix typo

* removed the space between param and dtype

* allow 514 kernels on CI for sd

* [refactor] no need to upcast ad ldexp3k

* [refactor] added some comments, references to help understanding the code.

* [Fix] 1.0 ULP Sine Approximation for FP16

* [update] assume e != 0

* use pow2if instead of ldexp3k to fuse payne_hanek reduction into one

* check if approximated sin/log2/exp are fused into one

* clean up changes

* test amd exp

* some code cleanup and test sigmoid

* fix: enabled payne_hanek for fp16 to achieve higher acc

* fix: payne_hanek always accumulates the value with uint64, and fp16 sin is fused to a single kernel

* [Refactor] Rename: fastmath -> transcendental

* [Refactor] Added TRANSCENDENTAL, Moved the gate function to function.py

* updated const folding tests

* TRANSCENDENTAL as a ContextVar, removed old test of cody waite reduction, added assertions, et al.

* Add: unittest.main()

* Import TRANSCENDENTAL instead of getenv

* Refactor: Added dtype check when TRANSCENDENTAL=2, more context var

* Patch: xlog2, break expt(2, 32) x 2 -> expt(2, 16) x 4 for fp16 math

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-10 16:44:58 -07:00
George Hotz
7c0a657f08 hotfix: put process replay back
2024-07-10 16:32:50 -07:00
George Hotz
7a014d5435 move ast to the kernel (#5362)
* move ast to the kernel

* locals aren't image

* comment
2024-07-10 16:22:26 -07:00
wozeparrot
245d83a392 more tinybox docs (#5361)
2024-07-10 16:13:24 -07:00
George Hotz
6972a2569f Linearizer -> Lowerer (#4957)
* st to uops function

* lowerer

* uops reduce

* uops reduce

* acc_number correct

* reduce unroll

* complete unroll

* do upcasts

* handle multioutput

* define_accs

* fix valid

* get grouped dims

* revert lin

* minor

* fixup_ast

* group for reduce

* group works now

* all forwards pass

* all ops tests pass

* fix clang

* mypy

* lil cleanups, no image yet

* ugh, variables everywhere

* bugfix

* counters and name fix

* use symbolic, not uops

* cleanups

* Fix tests

* linearizer tests

* expands

* float4 expand load

* tests pass

* woooo, float4 test

* test ops works again

* one more lin test

* more lin tests

* bypass

* fix tests

* something like this

* const in defineacc

* uops get_reduce_acc

* move around

* allow consts in the LOAD/STORE

* each axis should only appear once, 21 failures

* 16 failures

* fix some image

* optional float4

* onnx tests

* gate the stores

* add reorder

* fix terrible skip function

* tc work

* opt add/mul merge

* fix float4 tests

* tiny tweak, 9 failing

* 7 test failures

* start tc, but i don't think this will work

* progress on tensorcores

* note

* fix ops tests

* closer on tc

* weeee...one tensor core works

* still works, more generic

* large WMMA works

* tc test passes

* use WMMA as accumulator

* basic tc tests passing

* small gemm padded works

* 4 failures

* 3 tests failing

* super barrier

* now two tests failing

* one test failing

* cleanups, add reduce to UopGraph

* remove the linearizer

* remove unused

* lil cleanups

* Lowerer everywhere

* remove test that doesn't exist now

* image indexing

* llvm fix

* fix metal

* fix image

* fix images

* might fix ptx

* fix image type mismatch

* more tests pass

* CAST -> VECTORIZE

* forgot that one

* fix TestOps.test_flip_eye_crash

* locals shouldn't be image dtype

* change less files

* test fix

* fix recursive expands

* touches

* MULACC support in python

* delete unneeded

* alu before contract

* bug fixes

* tests

* no var multireduce

* simpler tc

* metal works in new style

* working on AMD and METAL

* fix amd

* shot in the dark, fix amd

* something for CUDA

* CUDA WORKS from the docs

* comment

* correct merge

* cleanups + ptx fix + get_reduce_acc

* local alias isn't used anymore

* add store sanity check

* fix for AMD

* cleanups and single expand pass

* more correct with acc_cache

* tests should pass

* block on WMMA

* tests pass

* merge contract and reduce

* contractor fixes issue

* multicontract

* pre expand wmma (same as a reduce)

* expand wmma and only take one

* all expands

* comments and whitespace
2024-07-10 15:07:42 -07:00
chenyu
204b6169ca minimize view _reshape_mask API [run_process_replay] (#5359)
* minimize view _reshape_mask API [run_process_replay]

_reshape_mask is only determined by mask, old_shape, new_shape. it does not need to input the whole view

* combine
2024-07-10 17:13:01 -04:00
wozeparrot
fa873df9c1 bring tinychat more inline with tinyos' version (#5358)
2024-07-10 13:13:52 -07:00
Roelof van Dijk
84839f6c58 mini refactor: simpler view.unbind (#5356)
* refactor: simpler unbind

* restore var_unboundvar_val
2024-07-10 09:29:56 -04:00
chenyu
f01192b8cd tiny View.unbind cleanup [run_process_replay] (#5355)
`Variable.val` asserts if it's None, so unbind does not need to check if `.val` is not None
2024-07-09 23:37:53 -04:00
chenyu
cc80005377 clean up view _reshape_mask [run_process_replay] (#5354)
return the new mask if reshaping is possible and None if not, instead of a mask and a bool
2024-07-09 18:48:32 -04:00
chenyu
322c37e621 use helpers.JIT in llama and gpt2 examples (#5350)
* use helpers.JIT in llama and gpt2 examples

replaced getenv("JIT"), effectively made gpt2 default jit

* fix test_gpt2
2024-07-09 15:04:43 -04:00
Elias Wahl
097268fab3 Add layerwise performance bench for bert (#5349)
* add bert bench

* don't disable by default

* remove lr

* linter
2024-07-09 15:03:25 -04:00
nimlgen
1678199b15 add update_copy to hcq spec (#5348)
* add update_copy to hcq spec

* fix amd
2024-07-09 20:44:44 +03:00
chenyu
9504db1a57 remove the realize in _rebuild_tensor_v2 (#5347)
no longer needed
2024-07-09 12:28:52 -04:00
qazal
1f5de80eba multi reduce Tensor.var passing verify_lazyop (#5346)
* what about this

* reset late gate
2024-07-09 17:20:17 +03:00
kormann
3d452195e4 [bug fix] nested commutative pattern _match [run_process_replay] [no_assert] (#5340)
* deep pat test

* lint

* min diff

* min lines

* nothing

* is res extra

* cleanup2

* add res back

* reduce lines

* type anno

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-07-09 16:38:39 +03:00
nimlgen
e815c57039 use hcq_profile in nv/amd program (#5344)
2024-07-09 15:56:06 +03:00
qazal
bee96a19ff fuzz uop schedules (#5345)
* basic blocks + cleanups

* fixups

* elif is better for future me

* fuzz_schedule_max_paths

* fix linter
2024-07-09 15:24:56 +03:00
Ian Paul
d5a68ae6b3 Simple abstractions3.py fix (#5343)
* abstractions3.py fix

* Add abstractions3.py to CI tests
2024-07-09 13:48:42 +03:00
nimlgen
a2a9bfd2ec nv correct error messages with ptx (#5341)
* nv correct error messages with ptx

* return compile error
2024-07-09 10:39:39 +03:00
George Hotz
c13da83f12 tests from lowerer branch (#5339)
* tests from lowerer branch

* Update test_image_dtype.py

* Update test_image_dtype.py

* Update test_image_dtype.py
2024-07-08 21:23:19 -07:00
chenyu
4ceab5d2b1 fix PTX match rule for gated LOAD (#5338)
* test padto sum with bool tensor and bool acc dtype

make sure bool tensor acc with gate is handled correctly

* broken in PTX

* fix ptx
2024-07-08 22:25:03 -04:00
chenyu
a80f2df1bd fix some PTX tests (#5337)
fix broken PTX tests in test_linearizer and test_uops. there are tests that were skipped and broken because it runs only with CUDA=1 and we run PTX with NV=1 now
2024-07-08 21:33:05 -04:00
wozeparrot
9150a6be7a tensor metadata (#5271)
2024-07-08 17:45:40 -07:00
chenyu
7f642aa7ed minor PTX matcher cleanup [run_process_replay] (#5336)
* minor PTX matcher cleanup [run_process_replay]

uop.cast syntactic sugar and some newline/space cleanup

* comment
2024-07-08 19:19:20 -04:00
chenyu
0f0940225a fix Tensor.all and Tensor.any for PTX (#5335)
supported boolean acc and boolean phi. and rewrite boolean max to uint8 max
2024-07-08 18:15:04 -04:00
Roelof van Dijk
053c706961 refactor: expr_view on View (#5315)
2024-07-08 11:47:34 -07:00
kormann
2349d837fb Fix scope order in graph toposort [run_process_replay] (#5330)
* fix

* test

* nothing
2024-07-08 11:46:15 -07:00
chenyu
631bc974a0 raise line count limit to 8500 (#5331)
2024-07-08 14:00:28 -04:00
Timmy
bb7746985f multireduce scheduler tests (#5141)
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-07-08 20:28:55 +03:00
nimlgen
bb2222e488 nv default for ampere & ada (#5329)
2024-07-08 19:01:27 +03:00
nimlgen
51d6f372e4 nv get classes based on device (#5325)
* nv get classes

* support in mockgpu

* choose sm based on gpu

* fix

* fix

* fix arch
2024-07-08 18:25:05 +03:00
chenyu
7d049fc20c move getting 0 and min value of a dtype to dtype.py (#5328)
cleanup getting base case for reduce ops
[run_process_replay]
2024-07-08 10:51:56 -04:00