George Hotz
07b350a8f4
new uops is an actual graph (#4560)
...
* new uops is an actual graph
* it's way slower
* simpler
* fix define acc
* render_loop unique
* ops test pass
* add pattern matcher back, there's bugs
* rewrite
* use priority queue
* recursive children
* fix tests
* fix tests with SINK
* fix abstractions
* fix assembly
* simpler
* link define_acc
* fix DEFINE_ACC placement
* type verify
* full cmp
* fix cmp
* ACCESS_ACC
* insert DEFINE_ACC
* fix PHI
* recursive rewrite
* fix many tests
* sum collapse
* more patterns
* correct change
* fold arange
* fix that lin test
* space
* big folding rule works
* close
* has more maxes, meh
* cached node replace
* set changed
* simplest folding yet
* works
* works
* DIV
* all tests pass
* del
* fuzz linearizer fails
* sum_collapse
* test depth 2 cf
* fix lin test 14
* fix clang depth
* disable that
* failure 14 is fixed
* fix ptx
* failure 27 is fixed
* fix llama
* run_cnt
* Revert "Optimize PTX gated loads index calculation (#4304)"
This reverts commit d97d5a7689.
* fix uops loop
* fix ptx bugs
* add barrier
* print
* mem_type in ptx direct
* bypass tests that fail in CI but pass locally
* ptx remove ptr_ar
* more ptx passing
* fix ptx tests
* assert compile support
* remove model inference benchmark from red
2024-05-17 18:00:18 -07:00
Szymon Ożóg
d97d5a7689
Optimize PTX gated loads index calculation (#4304)
...
* WIP but working
* Cleanup
* Remove float4 pred and alt
* Cleanup
* this is somehow slowing it down
* Simplify
* add define var to ignore when optimizing gates
* Update assembly.py
* Test for optimizing gated loads
* Cleanup
* Fix NEG needed before if
* Remove unused parameters
* Update assembly.py
* Fix for cachable gone
---------
Co-authored-by: oz <oz@oz-MS-7B86.NAT.gliwice.vectranet.pl>
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-05-13 10:14:01 -07:00
George Hotz
02327b8adf
simple stuff from new_uops branch (#4563)
2024-05-12 22:18:05 -07:00
George Hotz
2f970a4fc2
all realize 2 (#4527)
...
* all realize 2
* tests fixup
* fix more tests
* fix openpilot
* fix tests
* unneeded
2024-05-10 22:43:09 -07:00
George Hotz
347a3acb37
add renderer class (#4524)
...
* add renderer class
* tests pass
* fix pylint
* fix tensor cores
2024-05-10 21:40:02 -07:00
George Hotz
4eef1ee9bf
move renderer into options (#4514)
...
* move renderer into options
* fix tests
* renders are functions
2024-05-10 10:01:51 -07:00
George Hotz
c9e84ed0da
refactor to Program class (#4476)
...
* refactor to Program class
* switch to Program
* fix tests
* smaller diff
* self.p
* more tests
* fix metal test
* tests
* fix openpilot
* move that to linearizer
* p.launchdims
2024-05-09 17:29:07 -07:00
George Hotz
f635c4d273
fix define global (#4383)
...
* fix define global
* remove name from DEFINE_GLOBAL
* fix fuzzing
* fix ptx
* fix python
2024-05-01 22:32:56 -04:00
Szymon Ożóg
f1ebcffb87
Ptx beam fix (#4296)
...
* Fix beam search for PTX
* fix ptr arm test
2024-04-25 15:39:39 -04:00
George Hotz
bbda20c0db
CompiledASTRunner -> CompiledRunner (#4148)
2024-04-11 08:49:52 -07:00
Szymon Ożóg
ba118abfec
improved caching for pointer arithmetics in ptx (#3922)
...
* improved caching for pointer arithmetics
* Add test for pointer arithmetics caching
* Refactor test
2024-04-04 07:33:48 -07:00
chenyu
fe03725b21
const fold cast unrealized_unpadded_const (#4047)
...
* const fold unrealized_unpadded_const
changed the underlying arg directly
* CAST_BEFORE_VIEW folds some
* fix const index in getitem
2024-04-03 12:31:24 -04:00
chenyu
793ab0512e
use ctypes to truncate float64 and float32 in uops (#3986)
...
this fixed the softmax.argmax bug for ops_python as the float is truncated to float32
2024-03-28 23:56:50 -04:00
chenyu
c4c243f79d
update test_uops _equal to use assert_allclose (#3981)
...
it handles nan
2024-03-28 22:14:45 -04:00
chenyu
b47f6cebb2
LinearizerOptions -> CompilerOptions (#3978)
2024-03-28 17:50:23 -04:00
George Hotz
42b9d999ea
Buffer isn't always allocated (#3974)
...
* buffer alloc
* allocate
* missing allocates
* last one
2024-03-28 13:33:47 -07:00
chenyu
6c7df1445b
enforce UOps.CONST arg has python type based on dtype (#3952)
...
added an assert in uops, remove the cast in renderer
2024-03-27 01:41:38 -04:00
George Hotz
68ca4d4276
split to schedule.py (#3949)
...
* split to schedule.py
* split
2024-03-26 21:02:46 -07:00
George Hotz
150ea2eb76
create engine folder and move code (#3948)
...
* retry
* older tf
* that
2024-03-26 20:38:03 -07:00
Arseny Kapoulkine
514c43201d
Fix issues with pointer provenance in load/store through ALU (#3916)
...
* Track pointer provenance in load/store through ALU
Previously load/store could be incorrectly rendered into
ld.global/st.global when the input was an ALU op that performed an
address computation with DEFINE_LOCAL on one of the arguments.
* Simplify the load provenance workaround
The issue is that we can render the same code twice, and on the second
run the opstream is already modified so that vin[0] isn't a DEFINE_*,
which overwrites the initially correct .shared with .global.
* Add a couple tests for basic local use
* Skip local tests on LLVM since it doesn't implement DEFINE_LOCAL
2024-03-25 14:41:05 -07:00
George Hotz
bf3e1c4df2
support pickling tensors and others (#3787)
...
* test pickle tensors
* pickle unrealized tensor
* pickle jit, don't save Device in every CompiledASTRunner
* real test of pickle, move delete
2024-03-17 18:29:14 -07:00
chenyu
a2d3cf64a5
move is_dtype_supported to test.helpers (#3762)
...
* move is_dtype_supported to test.helpers
updated all places that check if float16 is supported
* fix tests
2024-03-15 14:33:26 -04:00
chenyu
75d4344cda
UOps.BITCAST (#3747)
...
* UOps.BITCAST
implicitly fixed no const folding for bitcast
* python backend
* ptx
* consistent llvm
2024-03-14 21:00:35 -04:00
chenyu
9a00a453c7
add test case for uop cast constant fold (#3746)
...
and an expected-failure bitcast fold test case. Will fix with UOps.BITCAST refactor
2024-03-14 20:00:27 -04:00
George Hotz
2024b24f35
add some graph tests (#3702)
...
* add some graph tests
* PatternMatcher class
* speedup
* const cast test
* fix tests
* itertools chain
2024-03-12 09:49:47 -07:00
George Hotz
44a67bf783
constant folding (#3675)
...
* constant fold
* bool math
* fix ptx
2024-03-10 14:47:24 -07:00
George Hotz
25aede6fd9
truncate for exec_alu (#3674)
2024-03-10 14:19:04 -07:00
chenyu
906cc3a69b
cleanup tests Device[Device.DEFAULT] is always Compiled (#3645)
2024-03-07 11:15:42 -05:00
George Hotz
81baf3eed3
bring ptx back (#3623)
...
* bring ptx back
* ptx back
* fix define var
* fix a few bugs
* bugfixes
* fixes
* fix llvm bug
* fix test bug
2024-03-06 13:34:21 -08:00
qazal
eb83e2d3a0
decouple buffer mutability from cstyle (#3617)
...
* buffer mutability as an arg
* update test_uops
2024-03-05 06:20:59 -08:00
Patrick Tsai
bc562c4747
Python div alu behavior differs slightly from others (#3596)
...
* Divide op rounding for negatives
* extra space
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-03 10:48:25 -08:00
George Hotz
aa9b013d79
add constant folding for WHERE in uops (#3584)
...
* add constant folding for WHERE in uops
* prereqs for generic constant folding
* fix test
* disable slow overflow logic
* make that test faster
2024-03-02 10:37:14 -08:00
George Hotz
bd9c2ced07
define var can be removed from vars to keep (#3549)
...
* define var can be removed
* sint
* oops, didn't store
2024-02-29 17:44:19 -08:00
George Hotz
83cdc85790
add index to DEFINE_GLOBAL (#3542)
...
* remove DEFINE_GLOBAL from uops with side effects
* add index to DEFINE_GLOBAL
* bugfix
* better var name
2024-02-29 15:22:26 -08:00
geohotstan
9268a8b154
remove MULACC (#3459)
...
* init
* removed mulacc
* is uoptimize the problem?
* lol hax make work temporarily fix l8er
* revert extra/ changes
* clean up
* flaky metal tests?
* add back mulacc for metal
* revert last commit
* try skipping linearizer_failure tests
* skip flammit tests... cuz tests all work locally
* try narrow down exact linearizer failure test
* try 2
* try 4
* generated code is the exact same wtf why CI fails
* code for 15 and 17 are exact same with or without mulacc, this should pass
* try only 1 failure
* try garbage collecting lol...
* try del variables lol
* try gcing after del lol...
* is diskcache the problem???
* try disabling opts cache idk
* try remove hack
* try disable github metal cache...
* try CACHELEVEL=0 :D idk anymore
* try increase newCommandQueueWithMaxCommandBufferCount_, im almost out of ideas...
* revert
* actually not a HACK
* oops
2024-02-29 07:40:40 -05:00
Carson Radtke
15df9406d6
fix exec_alu(UnaryOps.SQRT, <...>, (0,)) + add test (#3487)
...
* fix exec_alu(UnaryOps.SQRT, <...>, (0,)) + add test
* sqrt(0) != nan
* fix tabs
2024-02-23 18:28:00 +01:00
George Hotz
3c728d1082
compiler support (#3260)
...
* compiler support
* revert that
* fix tests
2024-01-26 23:36:40 -08:00
George Hotz
91a1b2bd7a
the runner does the build (#3220)
2024-01-23 18:45:43 -08:00
George Hotz
228f30b96a
multitensor jit (#3149)
...
* initial multitensor jit support and tests
* Added graphs to multitensor jit and updated tests
* update unbind api
* fix set device, add TinyJit to resnet
* update_stats includes device
---------
Co-authored-by: ramenguy99 <ramenguy99@gmail.com>
2024-01-16 09:09:15 -08:00
George Hotz
1f9aee8b6f
remove numpy from device (#3123)
...
* remove numpy from device
* fix tests
* np item
* cleanups
* simplify with as_buffer
* no toCPU
* tinygradic
* cast to scalar
2024-01-14 19:36:05 -08:00
George Hotz
374f7659a7
remove unused reciprocal (#3053)
...
* remove unused reciprocal
* comment
2024-01-09 08:59:04 -08:00
chenyu
ae112c9dbe
fix some long lines in tests (#3006)
...
* fix some long lines in tests
* better
2024-01-03 23:53:33 -05:00
George Hotz
e7a432b479
search refactor (#2969)
...
* minor search cleanup
* now that saves lines
* fix
2024-01-01 17:39:26 -08:00
George Hotz
a280cfe169
move dtypes to dtype.py (#2964)
...
* move dtypes to dtype.py
* fix urllib
2024-01-01 14:58:48 -08:00
chenyu
765f8b05e5
TernaryOps.WHERE has vin[0] as bool and BinaryOps.CMPLT always outputs bool (#2782)
...
* vin[0] to where is always bool
* due to better hack
* update test
* fix test_uops
2023-12-15 14:51:51 -05:00
chenyu
c0f76ed4ea
transformer kvcache and mask have same dtype as input (#2771)
...
* transformer kvcache and mask have same dtype as input
* don't use `=0` in cstyle ternary where
* (bool)
* where float16 test
2023-12-14 22:41:51 -05:00
chenyu
57017c87e9
remove duplicated dtype in DEFINE_GLOBAL args (#2768)
...
now DEFINE_GLOBAL uop.arg[1] is always the same as uop.dtype, we can remove the one in arg and just use uop.dtype
2023-12-14 15:42:36 -05:00
chenyu
5235cdee3d
remove _arg_int32 internal type (#2767)
...
in DEFINE_GLOBAL, PtrDtype(int32) is buffer and int32 is int
2023-12-14 14:17:14 -05:00
George Hotz
6d6eb9302d
ruff checks the max line length is 150 (#2734)
...
* ruff checks the max line length is 150
* fix tensor.py
* a lot more
* done
2023-12-12 17:34:47 -08:00
Ahmed Harmouche
4b01839774
support vals on WebGPU, run more tests (#2668)
...
* Vals on webgpu, run more tests
* Skip slow tests, run symbolic ops tests
* Balance out tests
2023-12-07 16:45:21 -08:00