Commit Graph

75 Commits

Author SHA1 Message Date
George Hotz
07b350a8f4 new uops is an actual graph (#4560)
* new uops is an actual graph

* it's way slower

* simpler

* fix define acc

* render_loop unique

* ops test pass

* add pattern matcher back, there's bugs

* rewrite

* use priority queue

* recursive children

* fix tests

* fix tests with SINK

* fix abstractions

* fix assembly

* simpler

* link define_acc

* fix DEFINE_ACC placement

* type verify

* full cmp

* fix cmp

* ACCESS_ACC

* insert DEFINE_ACC

* fix PHI

* recursive rewrite

* fix many tests

* sum collapse

* more patterns

* correct change

* fold arange

* fix that lin test

* space

* big folding rule works

* close

* has more maxes, meh

* cached node replace

* set changed

* simplest folding yet

* works

* works

* DIV

* all tests pass

* del

* fuzz linearizer fails

* sum_collapse

* test depth 2 cf

* fix lin test 14

* fix clang depth

* disable that

* failure 14 is fixed

* fix ptx

* failure 27 is fixed

* fix llama

* run_cnt

* Revert "Optimize PTX gated loads index calculation (#4304)"

This reverts commit d97d5a7689.

* fix uops loop

* fix ptx bugs

* add barrier

* print

* mem_type in ptx direct

* bypass tests that fail in CI but pass locally

* ptx remove ptr_ar

* more ptx passing

* fix ptx tests

* assert compile support

* remove  model inference benchmark from red
2024-05-17 18:00:18 -07:00
Szymon Ożóg
d97d5a7689 Optimize PTX gated loads index calculation (#4304)
* WIP but working

* Cleanup

* Remove float4 pred and alt

* Cleanup

* this is somehow slowin it down

* Simplify

* add define var to ignore when optimizing gates

* Update assembly.py

* Test for optimizing gated loads

* Cleanup

* Fix NEG needed before if

* Remove unused parameters

* Update assembly.py

* Fix for cachable gone

---------

Co-authored-by: oz <oz@oz-MS-7B86.NAT.gliwice.vectranet.pl>
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-05-13 10:14:01 -07:00
George Hotz
02327b8adf simple stuff from new_uops branch (#4563) 2024-05-12 22:18:05 -07:00
George Hotz
2f970a4fc2 all realize 2 (#4527)
* all realize 2

* tests fixup

* fix more tests

* fix openpilot

* fix tests

* unneeded
2024-05-10 22:43:09 -07:00
George Hotz
347a3acb37 add renderer class (#4524)
* add renderer class

* tests pass

* fix pylint

* fix tensor cores
2024-05-10 21:40:02 -07:00
George Hotz
4eef1ee9bf move renderer into options (#4514)
* move renderer into options

* fix tests

* renders are functions
2024-05-10 10:01:51 -07:00
George Hotz
c9e84ed0da refactor to Program class (#4476)
* refactor to Program class

* switch to Program

* fix tests

* smaller diff

* self.p

* more tests

* fix metal test

* tests

* fix openpilot

* move that to linearizer

* p.launchdims
2024-05-09 17:29:07 -07:00
George Hotz
f635c4d273 fix define global (#4383)
* fix define global

* remove name from DEFINE_GLOBAL

* fix fuzzing

* fix ptx

* fix python
2024-05-01 22:32:56 -04:00
Szymon Ożóg
f1ebcffb87 Ptx beam fix (#4296)
* Fix beam search for PTX

* fix ptr arm test
2024-04-25 15:39:39 -04:00
George Hotz
bbda20c0db CompiledASTRunner -> CompiledRunner (#4148) 2024-04-11 08:49:52 -07:00
Szymon Ożóg
ba118abfec improved caching for pointer arithmetics in ptx (#3922)
* improved caching for pointer arithmetics

* Add test for pointer arithmetics caching

* Refactor test
2024-04-04 07:33:48 -07:00
chenyu
fe03725b21 const fold cast unrealized_unpadded_const (#4047)
* const fold unrealized_unpadded_const

changed the underlying arg directly

* CAST_BEFORE_VIEW folds some

* fix const index in getitem
2024-04-03 12:31:24 -04:00
chenyu
793ab0512e use ctypes to truncate float64 and float32 in uops (#3986)
this fixed the softmax.argmax bug for ops_python as the float is truncated to float32
2024-03-28 23:56:50 -04:00
chenyu
c4c243f79d update test_uops _equal to use assert_allclose (#3981)
it handles nan
2024-03-28 22:14:45 -04:00
chenyu
b47f6cebb2 LinearizerOptions -> CompilerOptions (#3978) 2024-03-28 17:50:23 -04:00
George Hotz
42b9d999ea Buffer isn't always allocated (#3974)
* buffer alloc

* allocate

* missing allocates

* last one
2024-03-28 13:33:47 -07:00
chenyu
6c7df1445b enforce UOps.CONST arg has python type based on dtype (#3952)
added an assert in uops, remove the cast in renderer
2024-03-27 01:41:38 -04:00
George Hotz
68ca4d4276 split to schedule.py (#3949)
* split to schedule.py

* split
2024-03-26 21:02:46 -07:00
George Hotz
150ea2eb76 create engine folder and move code (#3948)
* retry

* older tf

* that
2024-03-26 20:38:03 -07:00
Arseny Kapoulkine
514c43201d Fix issues with pointer provenance in load/store through ALU (#3916)
* Track pointer provenance in load/store through ALU

Previously load/store could be incorrectly rendered into
ld.global/st.global when the input was an ALU op that performed an
address computation with DEFINE_LOCAL on one of the arguments.

* Simplify the load provenance workaround

The issue is that we can render the same code twice, and on the second
run the opstream is already modified so that vin[0] isn't a DEFINE_*,
which overwrites initially correct .shared wth .global.

* Add a couple tests for basic local use

* Skip local tests on LLVM since it doesn't implement DEFINE_LOCAL
2024-03-25 14:41:05 -07:00
George Hotz
bf3e1c4df2 support pickling tensors and others (#3787)
* test pickle tensors

* pickle unrealized tensor

* pickle jit, don't save Device in every CompiledASTRunner

* real test of pickle, move delete
2024-03-17 18:29:14 -07:00
chenyu
a2d3cf64a5 move is_dtype_supported to test.helpers (#3762)
* move is_dtype_supported to test.helpers

updated all places that check if float16 is supports

* fix tests
2024-03-15 14:33:26 -04:00
chenyu
75d4344cda UOps.BITCAST (#3747)
* UOps.BITCAST

implicitly fixed no const folding for bitcast

* python backend

* ptx

* consistent llvm
2024-03-14 21:00:35 -04:00
chenyu
9a00a453c7 add test case for uop cast constant fold (#3746)
and a expected failed bitcast fold test case. Will fix with UOps.BITCAST refactor
2024-03-14 20:00:27 -04:00
George Hotz
2024b24f35 add some graph tests (#3702)
* add some graph tests

* PatternMatcher class

* speedup

* const cast test

* fix tests

* itertools chain
2024-03-12 09:49:47 -07:00
George Hotz
44a67bf783 constant folding (#3675)
* constant fold

* bool math

* fix ptx
2024-03-10 14:47:24 -07:00
George Hotz
25aede6fd9 truncate for exec_alu (#3674) 2024-03-10 14:19:04 -07:00
chenyu
906cc3a69b cleanup tests Device[Device.DEFAULT] is always Compiled (#3645) 2024-03-07 11:15:42 -05:00
George Hotz
81baf3eed3 bring ptx back (#3623)
* bring ptx back

* ptx back

* fix define var

* fix a few bugs

* bugfixes

* fixes

* fix llvm bug

* fix test bug
2024-03-06 13:34:21 -08:00
qazal
eb83e2d3a0 decouple buffer mutability from cstyle (#3617)
* buffer mutability as an arg

* update test_uops
2024-03-05 06:20:59 -08:00
Patrick Tsai
bc562c4747 Python div alu behavior differs slightly from others (#3596)
* Divide op rounding for negatives

* extra space

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-03 10:48:25 -08:00
George Hotz
aa9b013d79 add constant folding for WHERE in uops (#3584)
* add constant folding for WHERE in uops

* prereqs for generic constant folding

* fix test

* disable slow overflow logic

* make that test faster
2024-03-02 10:37:14 -08:00
George Hotz
bd9c2ced07 define var can be removed from vars to keep (#3549)
* define var can be removed

* sint

* oops, didn't store
2024-02-29 17:44:19 -08:00
George Hotz
83cdc85790 add index to DEFINE_GLOBAL (#3542)
* remove DEFINE_GLOBAL from uops with side effects

* add index to DEFINE_GLOBAL

* bugfix

* better var name
2024-02-29 15:22:26 -08:00
geohotstan
9268a8b154 remove MULACC (#3459)
* init

* removed mulacc

* is uoptimize the problem?

* lol hax make work temporarily fix l8er

* revert extra/ changes

* clean up

* flaky metal tests?

* add back mulacc for metal

* revert last commit

* try skipping linearizer_failure tests

* skip flammit tests... cuz tests all work locally

* try narrow down exact linearizer failure test

* try 2

* try 4

* generated code is the exact same wtf why CI fails

* code for 15 and 17 are exact same with or without mulacc, this should pass

* try only 1 failure

* try garbage collecting lol...

* try del variables lol

* try gcing after del lol...

* is diskcache the problem???

* try disabling opts cache idk

* try remove hack

* try disable github metal cache...

* try CACHELEVEL=0 :D idk anymore

* try increase newCommandQueueWithMaxCommandBufferCount_, im almost out of ideas...

* revert

* actually not a HACK

* oops
2024-02-29 07:40:40 -05:00
Carson Radtke
15df9406d6 fix exec_alu(UnaryOps.SQRT, <...>, (0,)) + add test (#3487)
* fix exec_alu(UnaryOps.SQRT, <...>, (0,)) + add test

* sqrt(0) != nan

* fix tabs
2024-02-23 18:28:00 +01:00
George Hotz
3c728d1082 compiler support (#3260)
* compiler support

* revert that

* fix tests
2024-01-26 23:36:40 -08:00
George Hotz
91a1b2bd7a the runner does the build (#3220) 2024-01-23 18:45:43 -08:00
George Hotz
228f30b96a multitensor jit (#3149)
* initial multitensor jit support and tests

* Added graphs to multitensor jit and updated tests

* update unbind api

* fix set device, add TinyJit to resnet

* update_stats includes device

---------

Co-authored-by: ramenguy99 <ramenguy99@gmail.com>
2024-01-16 09:09:15 -08:00
George Hotz
1f9aee8b6f remove numpy from device (#3123)
* remove numpy from device

* fix tests

* np item

* cleanups

* simplify with as_buffer

* no toCPU

* tinygradic

* cast to scalar
2024-01-14 19:36:05 -08:00
George Hotz
374f7659a7 remove unused reciprocal (#3053)
* remove unused reciprocal

* comment
2024-01-09 08:59:04 -08:00
chenyu
ae112c9dbe fix some long lines in tests (#3006)
* fix some long lines in tests

* better
2024-01-03 23:53:33 -05:00
George Hotz
e7a432b479 search refactor (#2969)
* minor search cleanup

* now that saves lines

* fix
2024-01-01 17:39:26 -08:00
George Hotz
a280cfe169 move dtypes to dtype.py (#2964)
* move dtypes to dtype.py

* fix urllib
2024-01-01 14:58:48 -08:00
chenyu
765f8b05e5 TernaryOps.WHERE has vin[0] as bool and BinaryOps.CMPLT always outputs bool (#2782)
* vin[0] to where is always bool

* due to better hack

* update test

* fix test_uops
2023-12-15 14:51:51 -05:00
chenyu
c0f76ed4ea transformer kvcache and mask have same dtype as input (#2771)
* transformer kvcache and mask have same dtype as input

* don't use `=0` in cstyle ternary where

* (bool)

* where float16 test
2023-12-14 22:41:51 -05:00
chenyu
57017c87e9 remove duplicated dtype in DEFINE_GLOBAL args (#2768)
now DEFINE_GLOBAL uop.arg[1] is always the same as uop.dtype, we can remove the one in arg and just use uop.dtype
2023-12-14 15:42:36 -05:00
chenyu
5235cdee3d remove _arg_int32 internal type (#2767)
in DEFINE_GLOBAL, PtrDtype(int32) is buffer and int32 is int
2023-12-14 14:17:14 -05:00
George Hotz
6d6eb9302d ruff checks the max line length is 150 (#2734)
* ruff checks the max line length is 150

* fix tensor.py

* a lot more

* done
2023-12-12 17:34:47 -08:00
Ahmed Harmouche
4b01839774 support vals on WebGPU, run more tests (#2668)
* Vals on webgpu, run more tests

* Skip slow tests, run symbolic ops tests

* Balance out tests
2023-12-07 16:45:21 -08:00