Commit Graph

10633 Commits

Author SHA1 Message Date
Roelof van Dijk
b18aa00bba refactor: consolidate replace [run_process_replay] (#5403) 2024-07-12 07:36:57 -07:00
chenyu
497274f663 add float64 to test_dtype_alu dtypes_float (#5410)
* add float64 to test_dtype_alu dtypes_float

* CUDACPU float64 crashes

* real NV failed
2024-07-12 10:21:32 -04:00
qazal
31fcc516dc more process replay tooling (#5407)
* replays

* what's in there

* can it be up there

* sha is enough

* insert sha as the key

* fix str

* update reset utils

* that nested try/except was terrible

* github_context can go
2024-07-12 13:11:34 +03:00
Roelof van Dijk
6ec7dbc287 ci: parallelize uops tests (#5405) 2024-07-12 11:22:41 +03:00
qazal
e22b377839 generalize FUSE_AS_ONE_KERNEL in the scheduler (#5397)
* test: use const

* hotfix: base

* asserts

* dont push through reshape

* cleanup

* dont need the cache

* test_reduceop_reshape_dont_push and test_index_fused are next
2024-07-12 10:23:16 +03:00
chenyu
6e0a523078 repro slow resnet kernel with 4 global dims (#5402)
* repro slow resnet kernel with 4 global dims

* fix ruff
2024-07-11 23:31:15 -04:00
George Hotz
8390feb7b9 optim.OptimizerGroup in hlb_cifar (#5401) 2024-07-11 20:14:36 -07:00
George Hotz
01fbd18209 metal compile fail 2024-07-11 19:27:05 -07:00
George Hotz
3a2b5a75d2 improve single kernel indexing (#5398)
* improve single kernel indexing

* metadata in graph (#5399)

* indexing is O(1)

* add failing test

* ugh, that all needs to be replaced with symbolic

* broken on ptx, it's fine

---------

Co-authored-by: wozeparrot <wozeparrot@gmail.com>
2024-07-11 19:00:57 -07:00
wozeparrot
c24d495ef9 metadata in handcode_opt (#5400) 2024-07-11 17:45:34 -07:00
wozeparrot
c60838594c metadata in graph (#5399) 2024-07-11 17:02:12 -07:00
George Hotz
c2da4454cd indexing getting better (#5389)
* indexing getting better [run_process_replay] [no_assert]

* fix test

* test_arange_2_reduce is a simpler test

* put that print back, NOOPT

* don't merge reduces (they could be different reduces)

* FUSE_AS_ONE_KERNEL

* fix tests

* fix test_var_multireduce

* w/e put that there

* fails on others too

* fix test, revert UNMUL change

* in case order matters

* one kernel indexing works

* one kernel indexing works (test other)
2024-07-11 16:41:51 -07:00
qazal
9712d9ffb6 pass lowering errors if not asserting process replay (#5395)
* pass lowering errors if not asserting process replay

* ProcessReplayError
2024-07-11 19:09:12 -04:00
wozeparrot
a02b38c0ac download openimages by running it (#5396) 2024-07-11 16:06:13 -07:00
qazal
0421f5d83e hotfix: compare test_var_multireduce against numpy (#5394) 2024-07-11 18:57:08 -04:00
qazal
b91a0ccdc3 make [run_process_replay] [no_assert] the default (#5390) 2024-07-11 22:36:59 +03:00
George Hotz
e8191479a3 add bigint type for indexing [run_process_replay] (#5387) 2024-07-11 11:37:10 -07:00
George Hotz
5232e405ce hotfix: add BS to beautiful_mnist 2024-07-11 10:55:05 -07:00
George Hotz
3e40211e45 add UOP_IS_SYMBOLIC [run_process_replay] [no_assert] (#5386)
* cleanup a few things in uops [run_process_replay] [no_assert]

* add optional UOP_IS_SYMBOLIC
2024-07-11 10:48:45 -07:00
nimlgen
b3790b759b nv cleanup gpfifo setup (#5382)
* nv cleanup gpfifo setup

* save lines
2024-07-11 17:50:52 +03:00
chenyu
416f838a1a hotfix tqdm respects total=0 if set (#5380)
if you insist total=0, it should use 0 instead of inferring from iterable. matched tqdm
2024-07-11 10:30:12 -04:00
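The `total=0` behaviour described in 416f838a1a can be sketched as below. This is a minimal illustration, not the project's code: `resolve_total` is a hypothetical helper name, and the point is only that an explicit `total` (including 0) should take priority over a length inferred from the iterable. One way such a bug can arise is writing `total or len(iterable)`, which treats 0 as "not set".

```python
from typing import Iterable, Optional


def resolve_total(iterable: Optional[Iterable] = None, total: Optional[int] = None) -> Optional[int]:
    """Hypothetical helper: decide which total a progress bar should display."""
    # An explicit total wins, even when it is 0 (compare against None, not truthiness).
    if total is not None:
        return total
    # Otherwise fall back to the iterable's length when it has one.
    if iterable is not None and hasattr(iterable, "__len__"):
        return len(iterable)
    # Unknown length (e.g. a generator): no total available.
    return None
```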
nimlgen
2ba96d4c29 nv use mv_address (#5381)
* nv use mv_address

* unused import
2024-07-11 16:45:03 +03:00
nimlgen
bd77efda2f add HWCommandQueue base class for hcq devices (#5303)
* add HWCommandQueue as base queue for hcq devices

* try this

* fixes

* comments

* linter

* linter2

* linter

* linter

* fixed

* revert this
2024-07-11 16:19:13 +03:00
qazal
dc3ea78560 hotfix: faster UOps.END* insert [run_process_replay] (#5377)
* is this faster

* p2

* don't waste lines
2024-07-11 13:20:19 +03:00
qazal
004366b193 context aware process replay [run_process_replay] (#5378)
* test tc as ctx var

* remove from opts

* process replay

* pop variable

* B -> Variable

* fix re-assign

* pop temp vars

* move TRANSCENDENTAL=2
2024-07-11 13:07:28 +03:00
qazal
45e1b9d5e3 use TC options as ContextVars [run_process_replay] (#5379)
* delete from renderer

* move to ctx
2024-07-11 12:01:36 +03:00
qazal
289fd2e940 Lowerer cleanup 2 [run_process_replay] (#5376)
* test outbufs delete

* comments

* valid is bool
2024-07-11 10:56:53 +03:00
qazal
9ca2d96b6b delete extra check in DEFINE_ACC [run_process_replay] (#5375) 2024-07-11 10:49:03 +03:00
George Hotz
3e9f200905 KernelInfo + cleanups [run_process_replay] (#5372) 2024-07-10 21:00:31 -07:00
chenyu
2396ab9b33 more transcend cleanup [run_process_replay] (#5369)
fix test name, less # noqa: E501 and removed the cast
2024-07-10 23:05:03 -04:00
George Hotz
909ad72c53 remove getattr [run_process_replay] (#5370)
* remove getattr [run_process_replay]

* don't waste lines
2024-07-10 19:42:17 -07:00
George Hotz
0215c952c5 Move transcendental to UOp level (#5367)
* move uopgraph to file [run_process_replay]

* transcendental uops

* tests pass

* no skip

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-10 19:06:25 -07:00
chenyu
64986f949c more transcend math tests in ci (#5368)
* more transcend math tests in ci

test large input to trig functions that hit different reduction algo, and test TRANSCENDENTAL=2 for all backend

* no CUDACPU

* try that
2024-07-10 21:19:09 -04:00
wozeparrot
c9b3ae6bbf fix llama.py chat mode assert (#5366) 2024-07-10 18:06:14 -07:00
George Hotz
d13654a820 move uopgraph to file [run_process_replay] (#5364)
* move uopgraph to file [run_process_replay]

* fix print tree test
2024-07-10 17:34:50 -07:00
hikettei
320e7ed935 Approximations for SIN/LOG2/EXP2 passing all tests. (#5187)
* [WIP] Added an approximated implementation of Sin(FP32, FP64) passing all tests on Clang runtime

* Map nan/-inf/inf as 1.0 in order to avoid doing as_const(math.inf)

* [WIP] Added a support for LLVM IR

* cleaned up the code for the mypy and linter

* [WIP] Updated fp64 supports (bitwise shift causes the compilation error), fixed linter issue.

* [Add] added fast=true mode which disables the payne-hanek reduction which is slow

* [Fix] fails to compute elements when shape includes zero

* [WIP] Added BinaryOps.ADD/BinaryOps.OR to assembly

* [wip] update the assembly for ptx

* Enables fast=True when device is one of PTX, NV, CUDA, to avoid slow bitwise ops (as lv3 reduction is not required).

* [WIP] Added an approximation of LOG2/EXP2 (FP32, FP64)

* [Fix] Cyclic dependencies existing in xlog2

* [Fix] Cycle dependency in the graph of exp2, and log2. (passing test_symbolic_ops.py)

* [Fix] keep using higher precision for exp2, but cycle graph issue remained to be fixed...

* [Refactor] removed is_metal option. xsin does not rely on fp64 when fp32 mode.

* [WIP] fp16 xsin implementation passing all tests. (still needs to be refactored)

* [WIP] Added fp16 exp2 implementation

* [WIP] Increased the precision of Log2 from 3.5 ULP to 1.0 ULP, and added FP16 Log2 approximation.

* stashed the changes for FP16 sin

* [Fix] Patch for FP16 Sin/Exp2. (updated the dtype_via, fp32_p, and lower)

* [Refactor] migration to fastmath.py, some code simplification, renamed apis in fastmath, et al.

* [Refactor] Added the function polyN to clean-up N-terms polynomial approximation.

* [Patch] Increase fp64 precision when ldexp3k if possible, and patch for fp16 exp2

* [Patch] added bitcast_forward option

* [Patch] resolved cycle graph

* patch fix cycle graph

* set bitcast_forward=True in ilogb2k

* bitcast_forward for multi.py

* E501

* Break into multiple small PRs

* [Patch] FP16 -> FP64 upcast is not anymore required since xlog2 use quad precision polyN

* [Patch] NV still required FP64 for xlog2

* updated schedule test

* updated the count of kernels

* [Update] Removed all bitwise ops (SHL/SHR), tweaked the nan manipulation of log2, passing all tests except for AMD.

* Bitcast: make them api-compatible

* [update] force to use bitcast

* updated the count of constant folding

* [Patch] Creating a mask for exp2 using x <= Inf satisfies True as long as x is a real value

* [Update] isNaN(x) Free log2 algorithm, passing PTX tests, METAL with fastmath enabled is able to handle nan well, amd backend will not crash.

* xsin is reluctant to call payne_hanek_reduction which is slow to compile, passing stable diffusion compilation in a realistic time

* some minor simplification to payne hanek reduction

* [refactor] refactored some redundant parts existing in payne hanek

* [refactor] more readable payne hanek impl

* [refactor] improved the code consistency of payne hanek

* [experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)

* Revert "[experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)"

This reverts commit 0eee08b87c.

* use allow_buffer_view

* lets support multilazytensor

* updated the count of kernels

* [test] added the jit tests for approx ops

* keep failed constant folding tests tested, added expectedFailure

* make the timeout deadline explicit when testing approx jit timeout

* [WIP] Simplified the implementation of xsin, never timeouts

* [Refactor] Improved the consistency of approx sin implementation, passing time out tests

* integrated xexp2_base into xexp2

* Set switch_over=39800.0

* delete: is_buffer_fastmath_supported

* sin: compute against abs(x)

* some cleanups

* fix typo

* removed the space between param and dtype

* allow 514 kernels on CI for sd

* [refactor] no need to upcast ad ldexp3k

* [refactor] added some comments, references to help understanding the code.

* [Fix] 1.0 ULP Sine Approximation for FP16

* [update] assume e != 0

* use pow2if instead of ldexp3k to fuse payne_hanek reduction into one

* check if approximated sin/log2/exp are fused into one

* clean up changes

* test amd exp

* some code cleanup and test sigmoid

* fix: enabled payne_hanek for fp16 to achieve higher acc

* fix: payne_hanek always accumulates the value with uint64, and fp16 sin is fused to a single kernel

* [Refactor] Rename: fastmath -> transcendental

* [Refactor] Added TRANSCENDENTAL, Moved the gate function to function.py

* updated const folding tests

* TRANSCENDENTAL as a ContextVar, removed old test of cody waite reduction, added assertions, et al.

* Add: unittest.main()

* Import TRANSCENDENTAL instead of getenv

* Refactor: Added dtype check when TRANSCENDENTAL=2, more context var

* Patch: xlog2, break expt(2, 32) x 2 -> expt(2, 16) x 4 for fp16 math

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-10 16:44:58 -07:00
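The `polyN` bullet in 320e7ed935 mentions a helper for N-term polynomial approximation. A common way to evaluate such a polynomial is Horner's method, sketched below. This is an assumption about the general technique, not the commit's implementation: `poly_n` and its argument order (coefficients from highest degree to lowest) are hypothetical.

```python
from functools import reduce
from typing import Sequence


def poly_n(x: float, coeffs: Sequence[float]) -> float:
    """Hypothetical sketch: evaluate a polynomial with Horner's method.

    coeffs are ordered from the highest-degree term down to the constant term,
    so [a, b, c] means a*x**2 + b*x + c.
    """
    # Each step multiplies the running value by x and adds the next coefficient.
    return reduce(lambda acc, c: acc * x + c, coeffs, 0.0)
```

Horner's form needs one multiply and one add per coefficient, which keeps the expression graph for an N-term approximation small.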
George Hotz
7c0a657f08 hotfix: put process replay back 2024-07-10 16:32:50 -07:00
George Hotz
7a014d5435 move ast to the kernel (#5362)
* move ast to the kernel

* locals aren't image

* comment
2024-07-10 16:22:26 -07:00
wozeparrot
245d83a392 more tinybox docs (#5361) 2024-07-10 16:13:24 -07:00
George Hotz
6972a2569f Linearizer -> Lowerer (#4957)
* st to uops function

* lowerer

* uops reduce

* uops reduce

* acc_number correct

* reduce unroll

* complete unroll

* do upcasts

* handle multioutput

* define_accs

* fix valid

* get grouped dims

* revert lin

* minor

* fixup_ast

* group for reduce

* group works now

* all forwards pass

* all ops tests pass

* fix clang

* mypy

* lil cleanups, no image yet

* ugh, variables everywhere

* bugfix

* counters and name fix

* use symbolic, not uops

* cleanups

* Fix tests

* linearizer tests

* expands

* float4 expand load

* tests pass

* woooo, float4 test

* test ops works again

* one more lin test

* more lin tests

* bypass

* fix tests

* something like this

* const in defineacc

* uops get_reduce_acc

* move around

* allow consts in the LOAD/STORE

* each axis should only appear once, 21 failures

* 16 failures

* fix some image

* optional float4

* onnx tests

* gate the stores

* add reorder

* fix terrible skip function

* tc work

* opt add/mul merge

* fix float4 tests

* tiny tweak, 9 failing

* 7 test failures

* start tc, but i don't think this will work

* progress on tensorcores

* note

* fix ops tests

* closer on tc

* weeee...one tensor core works

* still works, more generic

* large WMMA works

* tc test passes

* use WMMA as accumulator

* basic tc tests passing

* small gemm padded works

* 4 failures

* 3 tests failing

* super barrier

* now two tests failing

* one test failing

* cleanups, add reduce to UopGraph

* remove the linearizer

* remove unused

* lil cleanups

* Lowerer everywhere

* remove test that doesn't exist now

* image indexing

* llvm fix

* fix metal

* fix image

* fix images

* might fix ptx

* fix image type mismatch

* more tests pass

* CAST -> VECTORIZE

* forgot that one

* fix TestOps.test_flip_eye_crash

* locals shouldn't be image dtype

* change less files

* test fix

* fix recursive expands

* touches

* MULACC support in python

* delete unneeded

* alu before contract

* bug fixes

* tests

* no var multireduce

* simpler tc

* metal works in new style

* working on AMD and METAL

* fix amd

* shot in the dark, fix amd

* something for CUDA

* CUDA WORKS from the docs

* comment

* correct merge

* cleanups + ptx fix + get_reduce_acc

* local alias isn't used anymore

* add store sanity check

* fix for AMD

* cleanups and single expand pass

* more correct with acc_cache

* tests should pass

* block on WMMA

* tests pass

* merge contract and reduce

* contractor fixes issue

* multicontract

* pre expand wmma (same as a reduce)

* expand wmma and only take one

* all expands

* comments and whitespace
2024-07-10 15:07:42 -07:00
chenyu
204b6169ca minimize view _reshape_mask API [run_process_replay] (#5359)
* minimize view _reshape_mask API [run_process_replay]

_reshape_mask is only determined by mask, old_shape, new_shape. it does not need to input the whole view

* combine
2024-07-10 17:13:01 -04:00
wozeparrot
fa873df9c1 bring tinychat more inline with tinyos' version (#5358) 2024-07-10 13:13:52 -07:00
Roelof van Dijk
84839f6c58 mini refactor: simpler view.unbind (#5356)
* refactor: simpler unbind

* restore var_unboundvar_val
2024-07-10 09:29:56 -04:00
chenyu
f01192b8cd tiny View.unbind cleanup [run_process_replay] (#5355)
`Variable.val` asserts if it's None, so unbind does not need to check if `.val` is not None
2024-07-09 23:37:53 -04:00
chenyu
cc80005377 clean up view _reshape_mask [run_process_replay] (#5354)
return the new mask if reshaping is possible and None if not, instead of a mask and a bool
2024-07-09 18:48:32 -04:00
chenyu
322c37e621 use helpers.JIT in llama and gpt2 examples (#5350)
* use helpers.JIT in llama and gpt2 examples

replaced getenv("JIT"), effectively made gpt2 default jit

* fix test_gpt2
2024-07-09 15:04:43 -04:00
Elias Wahl
097268fab3 Add layerwise performance bench for bert (#5349)
* add bert bench

* dont disable by default

* remove lr

* linter
2024-07-09 15:03:25 -04:00
nimlgen
1678199b15 add update_copy to hcq spec (#5348)
* add update_copy to hcq spec

* fix amd
2024-07-09 20:44:44 +03:00
chenyu
9504db1a57 remove the realize in _rebuild_tensor_v2 (#5347)
no longer needed
2024-07-09 12:28:52 -04:00
qazal
1f5de80eba multi reduce Tensor.var passing verify_lazyop (#5346)
* what about this

* reset late gate
2024-07-09 17:20:17 +03:00