Commit Graph

76 Commits

Author SHA1 Message Date
chenyu
cb4cfc078a parameterize multitensor tests for reduce (#3181)
uneven shards reduce is incorrect now
2024-01-19 14:03:01 -05:00
chenyu
b2571d586c hypothesis.st -> hypothesis.strat (#3179)
leave `st` for shapetracker
2024-01-19 11:55:26 -05:00
George Hotz
ca0beeef38 Christopherm99 ptx (#3139)
* get basic ptx impl working

* test ops passing

* mypy

* dont hardcode target

* more walrus

* ptx in ci

* bool cast and f16 load/store

* weird numpy bug and f16 cast tolerance

* cast half to bool

* fix 1 byte load/store

* disable half for ptx

* fix args and enable xid

* fix non-ptr args

* allow bitcast

* mypy

* cleanups

* midcast use allclose

* add xor

* Revert "disable half for ptx"

This reverts commit 73391c05fd.

* enable float16

* mypy

* no more crashing in ci

* fix ci

* minor cleanups

* use new fn for ptx compiler

* no diskcache in ptx compile

* use rn instead of rz

* save some lines

* new DEFINE_GLOBAL syntax

* line length

* new llvm

* cmpeq

* minor fix

* cast in mulacc

* update test_recursive_add to check line count

* mypy

* remove llvmir.py

* fix bool const

* wip

* cleanups

* working

* llvm in separate pr

* cleanups

* more cleanups

* fix ci

* use in_features directly in nn.Linear.__init__ bound check (#3050)

* use in_features directly in nn.Linear.__init__ bound check

get rid of the unnecessary check of isinstance int

* that is always int

* long lines

* Device._buffers -> Device._devices (#3052)

backend devices used to be called buffers

* make Embedding device aware for multigpu (#3051)

* make Embedding device aware for multigpu

* split line instead of ignore because that's cheating

* add test incomplete

* add test complete

* remove comment

* fix white space

* remove nn.Embedding

* remove unused reciprocal (#3053)

* remove unused reciprocal

* comment

* unit tests for Device.canonicalize (#3055)

* add multigpu test for RMSNorm (#3056)

* need all gather

* add two multigpu test scenarios for RMSNorm

* No extra vars call (#3054)

* remove unused reciprocal

* comment

* remove unneeded call to vars

* free speedup

* explicit lazybuffer caching (#3058)

* hotfix: remove useless slow assert from ShapeTracker

* Speed tweaks (#3059)

* base doesn't have to be a function

* no double fetch

* pop, don't check

* make the gc happy

* avoid hasattr

* cache canonicalize

* remove assert, faster base

* don't redefine that every time

* fix gpt2 attention with start_pos = 0 (#3061)

* fix gpt2 attention with start_pos size 1

test cases taken from ll_transformer branch

* fix interpreted

* Tensor.cat with 0 shape tensors (#3062)

* Tensor.cat with 0 shape tensors

supported both 0 in cat axis (for a subset of input), or 0 in non-cat axis (all needs to be 0)

* no shp

* test scaled dot product attention (#3063)

* add test

* add initial test for scaled dot product attention

* test pass for scaled dot product attention

* cached size (#3060)

* cached size

* simplify simplify

* 0 doesn't have base

* fix test

* cleaner cache

* hmm, metal is flaky on this...might be real(ish) but useless as test

* short circuit reshape/expand properly

* better reshape bypass

* hotfix: use is for enum compare

* hotfix: use is for enum compare, a few more

* speedtweaks3: apply shouldn't use the tensor constructor (#3065)

* speedtweaks3: apply shouldn't use the tensor constructor

* replace 0 size with CONST, not 0 in shape

* update gh actions (#3033)

* update checkout actions

* update upload artifact

* update setup python

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>

* unbind view or shapetracker also returns var_val (#3067)

* unbind view or shapetracker also returns var_val

4% faster for llama compile time

* one line less

* unbound_views

* hotfix: examples/transformer.py

* jit autorealizes output (#3069)

* early gate the graph (#3070)

* simpler idxs_to_idx (#3071)

* filter_strides -> canonicalize_strides (#3072)

* fix onehot and jit in examples/transformer (#3073)

trained to 0.999 in < 6 seconds on M1 Max consistently

* better test demonstration (#3077)

* a better test demonstration

* fix white space

* Tensor.expand resolves the new_shape before shortcut return (#3078)

similar to how reshape is done. also updated shrink shortcut criteria to read similar to pad

* minor cleanups of lazy.py (#3080)

* wmma: clean up device specific tensor core code (#3081)

* mem_estimate is always int, not symbolic (#3083)

* mem_estimate is always int, not symbolic

op_estimate can be symbolic, but mem_estimate is always int, thus we don't need to sym_infer it.
fixed some long lines too. update_stats is a very big function

* operator does not need underscores

* cat works (#3086)

* hotfix disable flaky mac runner wino cifar (#3087)

* remove the third merging state in view._merge_dims (#3085)

no logic depends on state == 0 or state == 2

* minor cleanup of View.reshape (#3088)

* minor cleanup of View.reshape

removed some redundant logic

* new_strides

* revert that

* use BEAM=2 instead of BEAM=4 in cuda ci gpt2 (#3089)

BEAM=2 is faster and less search time. investigating why BEAM2+BEAM4 is slower than BEAM2 alone

* use device from LinearizerOptions in kernel search (#3090)

* use device from LinearizerOptions in kernel search

removed all Device.DEFAULT in search.py

* pass device string for parallel pickle

* device for interpreted backends in LinearizerOptions

* update jit type annotation post lazy rewrite (#3091)

* add multigpu support for llama attention (#3064)

* add llama attention test for multigpu

* test fails

* kv cache trying to shrink on sharded axis

* mask None works for scale dot product

* kv cache seems to be working but scale dot product breaks

* scaled dot product works, but the last linear layer failed

* running into the reshape case where it could be wrong for multigpu

* making sure it was the reshape

* adding contiguous doesn't solve

* need to shard more properly

* remove reshape test

* minor adjustment to scale dot product attention test

* weights are sharded wrong

* continue fix new weight sharding

* clean up

* fix attention when start_pos is 0

* remove print

* add TODOs for the best multigpu interface

* bugfix do not reset shapetracker of 0 size lazybuffer (#3096)

it might be coming from an expand, and resetting results in an incorrect stride. caught by interpreted backend

* One hot in tensor.py (#3093)

* onehot in Tensor.py

* one_hot tests

* works for all shapes, not just 1

* pylint

* not a static method

* moved around, num_classes mandatory

* pylint

* pylint

* space & moving

* formatting

* moved tests

* fix broadcasted logic if there's 0 in shapes (#3097)

* fix broadcasted logic if there's 0 in shapes

should always expand into 0, not the other way around. fixed matmul with 0 in input shapes.
for forwards for now though, backward is more involved and would need to change 0 size shortcuts

* fix tests

* replace with tensor op (#3099)

* fix gpt2 with empty prompt (#3100)

logits would be empty so need to replace that with ones before sampling, also cannot reshape with -1 when there's 0 in other axes

* Revert "fix gpt2 with empty prompt" (#3101)

* fix gpt2 with empty prompt take 2 (#3102)

logits would be empty so need to replace that with ones before sampling, also cannot reshape with -1 when there's 0 in other axes

* wmma: enable METAL half tensor cores and clean up cstyle (#3095)

* wmma: enable METAL half tensor cores and clean up cstyle

* revert simple_matmul rand changes and break line in tensor

* added metal fp16->fp32 tensor core

* add half @ half to mac benchmark (#3103)

* flag to profile mixtral - 1.7 tok/s now (#3104)

* update NumNode.__hash__ to be hash(self.b) (#3105)

with this, `a:=NumNode(x) == b` implies `hash(a) == hash(b)`

* catch runtime error in search._time_program (#3106)

return inf if search encountered runtime errors.

* no exceptions in __del__ when module creation is failed in hip/cuda (#3107)

* failed test case due to cast resets shapetracker (#3109)

cast implicitly resets shapetracker and makes it contiguous (for disk tensor), which fails for Interpreted backend if inputs contain non-contiguous st.

* cleanup ops_disk type annotation and redundant str cast (#3110)

* minor cleanup of test_disk_tensor (#3112)

* add Tensor.var (#3114)

also updated MeanVarianceNormalization and made test_ops test tensors of var and std smaller

* move sample inside jit for beautiful_mnist (#3115)

also removed .realize() for jit functions since jit does it automatically now. a little more beautiful

* minor cleanups of onnx_ops (#3116)

* fix conversation: llama generates token not prob now (#3120)

* add device options for tests in multigpu (#3121)

* make DType a dataclass (#3111)

* remove np from DType

* convert to dataclass

* remove dunder hash, eq, ne overrides from ImageDType

* is dataclass required for PtrDType?

* fix GPU tests

* reduce lines

* revert changes to np

* minor cleanup

* hotfix: ptrdtype compare was broken

* move fromcpu out of lazy.py (#3122)

* move fromcpu out of lazy.py

* fix abstractions2

* remove numpy from device (#3123)

* remove numpy from device

* fix tests

* np item

* cleanups

* simplify with as_buffer

* no toCPU

* tinygradic

* cast to scalar

* remove numpy from ops_torch (#3124)

updated mnist test to cast label to int8 and avoid hacking cast issue of torch uint8

* Fix backward fn for `<` and `==` (#3037)

* fix no grad fn for < and ==

* remove 2 line breaks

* Remove deprecated autograd variable

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>

* separate try except blocks in onnx2torch in model benchmark (#3126)

exceptions can be raised from either model conversion or an individual backend failing. openpilot on torch mps works, but does not work with torch cpu.
separate the exception block so that the benchmark can include torch mps for openpilot.

* update env_vars.md (#3127)

mostly removed deprecated ones. not clear how to maintain this especially for extra/examples

* update test_ptr_ne (#3130)

* remove np from metal graph (#3129)

* dtype fmt (#3132)

* dtype fmt

* three ways to access

* fix off-by-one error in st_equal (#3131)

* fix off by one error

* whitespace

* no numpy (#3134)

* fast resnet eval (#3135)

* fast resnet eval

* fix HIP multidevice graph

* neater expression for devices

* lines

* add decorator test

* remove LLVMOPT

* move ptx

* Update ops_cuda.py

---------

Co-authored-by: Christopher Milan <chrismilan@ucla.edu>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: Yixiang Gao <yixiangg310573@gmail.com>
Co-authored-by: jxdv <virgoj@protonmail.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: SnakeOnex <sheeproman@gmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Jyotirmaya Mahanta <jyotirmaya.mahanta@gmail.com>
Co-authored-by: Guy Leroy <g.m.leroy@outlook.com>
Co-authored-by: Paul Gustafson <paul.gustafson@theambrusgroup.com>
2024-01-15 16:44:20 -08:00
Jyotirmaya Mahanta
2ef09ca641 update test_ptr_ne (#3130) 2024-01-15 11:36:29 -05:00
George Hotz
c5a941d466 webgl backend in extra (#3041)
* WebGL WIP

* 84% of ops passing test

* tests passing 100%

* Cleanup, refactor

* Shave off some lines

* Work on dtypes

* TestOps at 100% again

* Efficient net shaders compile in browser webgl2

* Compile all efficientnet shaders in browser

* Create empty textures for tensor buffers

* Run program. Up next weight loading

* Exported WebGL model working

* Add tests, refactor

* Explicit cast alu for GLSL

* Fix CI tests

* WebGL efficientnet demo

* Compile and run yolov8 in browser

* Fix imports

* Simplify yolo compile

* Fix bool*bool and cast cmplt to float

* More tests

* Do std tests pass on CI?

* Skip std tests on CI

* Remove explicit_cast_alu hack, and solve it in code_for_op

* Move to new dtype-less alloc api

* Remove local size hack: optimize local_size only if device has local

* Remove glsl.py, and move content to cstyle

* dont_use_locals in opts

* Fix dtype tests

* type_map in CStyleLanguage

* Make core changes smaller, cleaner, refactor export_model and demo

* Skip pad_slice

* Simplify: render_const, render_conditional

* solve bool alu for other binops, cleaner ops_webgl

* Fix noopt hack

* Remove some skipIfs

* WebGL image hack

* type_names is a better name

* global_max

* Fix dtype import

* Fix type_names -> type_map

* Fix lint

* Remove webgpu, back to 5k lines (#3040)

* remove webgpu

* max 5000 lines

* revert those to master

* retain that cstyle

---------

Co-authored-by: Ahmed Harmouche <ahmedharmouche92@gmail.com>
2024-01-08 09:29:13 -08:00
George Hotz
f432ec9c33 Bitcast hip fix + fix mixtral (#3022)
* fix bitcast in hip

* wrong dtype for precast, double COPY
2024-01-05 14:51:25 -08:00
chenyu
9f39165188 correct (dtype, device) in test_dtype.is_dtype_supported (#3007)
corrected dtypes for TORCH and float64 support
2024-01-04 00:25:37 -05:00
chenyu
ff5399f053 move one last dtype test from test_helpers to test_dtype (#2975) 2024-01-02 12:37:56 -05:00
George Hotz
a280cfe169 move dtypes to dtype.py (#2964)
* move dtypes to dtype.py

* fix urllib
2024-01-01 14:58:48 -08:00
George Hotz
063f465604 simpler webgpu (#2956)
* simpler webgpu

* skip that test
2024-01-01 10:28:59 -08:00
chenyu
54629b56d2 minor cleanup in kernel and linearizer (#2937)
* minor cleanup in kernel and linearizer

less long line, spaces and colocate variables

* no deadline in hypothesis test
2023-12-26 12:05:32 -05:00
qazal
dca5e4fe74 tensor == tensor should be bool (#2916)
* return bool

* add tests to the type spec

* fix multinomial

* fix tril

* fix round

* fix NegativeLogLikelihoodLoss

* rm debug

* webgpu

* more webgpu

* bitwise or for adding two bools

* onnx ops dont need to cast anymore

* Revert "bitwise or for adding two bools"

This reverts commit b413babffa.

* workaround for metal neg

* just the tests in the type spec
2023-12-25 12:38:47 -05:00
chenyu
8a8aed23d2 test dtypes of return values of cumsum, argmax/min, multinomial (#2933)
* test dtypes of return values of cumsum, argmax/min, multinomial

cumsum behaves like sum, and functions that return an index return in dtypes.default_int

* because webgpu is different
2023-12-25 11:33:17 -05:00
chenyu
b55b55d56e use at least int32 and uint32 for sum output (#2926)
* use at least int32 and uint32 for sum output

* use the correct type for acc

* fix opencl

* llvm mulacc
2023-12-24 01:14:54 -05:00
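As an illustration of why a `sum` over a narrow integer dtype benefits from a wider accumulator, here is a minimal pure-Python sketch. It is not tinygrad code: `wrap_int8`, `sum_int8_acc`, and `sum_wide_acc` are hypothetical helpers, and the int8 wraparound is simulated by hand.

```python
def wrap_int8(x: int) -> int:
    """Simulate two's-complement wraparound of a signed 8-bit integer."""
    return (x + 128) % 256 - 128


def sum_int8_acc(values):
    """Accumulate in a simulated int8 register (assumption: wraps on overflow)."""
    acc = 0
    for v in values:
        acc = wrap_int8(acc + v)
    return acc


def sum_wide_acc(values):
    """Accumulate in a wide integer; a Python int stands in for int32 here."""
    return sum(values)


values = [100, 100, 100]        # each element fits in int8
narrow = sum_int8_acc(values)   # wraps around and no longer equals the true sum
wide = sum_wide_acc(values)     # the true sum, which fits comfortably in int32
print(narrow, wide)
```

Each element is representable in int8, but their total is not, so only the wide accumulator returns the mathematically correct result.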
chenyu
50927defad s/lazydata.realized/lazydata.base.realized/g (#2914)
* s/lazydata.realized/lazydata.base.realized/g

* not that
2023-12-22 14:45:13 -05:00
chenyu
3855432265 don't use numpy to create Tensor(None) (#2909)
* don't use numpy to create Tensor(None)

empty suffices

* parentheses
2023-12-22 01:07:44 -05:00
chenyu
a543d8bea8 fuzz default dtypes for some test_dtype tests (#2906)
* fuzz default dtypes for some test_dtype tests

* ocd

* setUp and tearDown
2023-12-21 22:00:21 -05:00
chenyu
264fe9c93f clean up test_dtype.py (#2827)
make is_dtype_supported a pure function and clean up long lines
2023-12-18 16:06:09 -05:00
chenyu
0723f26c80 dtypes.default_float and dtypes.default_int (#2824) 2023-12-18 12:21:44 -05:00
chenyu
8aab19ce3d Tensor.full of bool has dtypes.bool (#2823) 2023-12-18 10:51:17 -05:00
Maksym Sobolyev
887f3d9933 Make torch backend more usable, fix bfloat support in the llvm backend (#2765)
* Uncripple dtype tests, TestBFloat16DType never actually runs.

* Fix conversion from/to bfloat16.

Call cast() recursively, so that it works for any type combo.

* Run this test on torch backend as well.

* Add torch.bfloat16.

* Add support for ushort and uint.

* Convert np.uint32 to np.int32 when loading.

* Fix warning.
2023-12-17 14:04:26 -05:00
chenyu
baa94d6142 Tensor(False) has dtypes.bool (#2805) 2023-12-16 19:04:08 -05:00
chenyu
88ff1edcf0 fix tensor creation with a list and dtype bfloat16 (#2795)
it went through numpy and numpy does not have bfloat16.

also added broadcasted with a python bool.
2023-12-16 10:06:47 -05:00
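For context on the bfloat16 commit above: a bfloat16 value has the same layout as the upper 16 bits of an IEEE 754 float32, which is why it can be handled without a native NumPy dtype. The sketch below uses only the standard library and is not the tinygrad implementation; the function names are hypothetical, and the conversion truncates rather than rounding to nearest.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Return the bfloat16 bit pattern of x by truncating its float32 encoding."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits


def bf16_bits_to_float(bits16: int) -> float:
    """Expand a bfloat16 bit pattern back to a Python float via float32."""
    (x,) = struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))
    return x


bits = float_to_bf16_bits(3.14159)
print(hex(bits), bf16_bits_to_float(bits))
```

The round trip loses precision because only 7 explicit mantissa bits survive, while the exponent range of float32 is preserved.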
chenyu
bb6f7b6172 rsqrt is self.reciprocal().sqrt() (#2790)
(1/self) is incorrect for int tensor
2023-12-16 01:58:05 -05:00
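The commit above does not say why `(1/self)` fails for int tensors. One plausible reading, stated here as an assumption, is that the division stays in integer arithmetic and truncates before the square root is taken. A minimal pure-Python sketch of that effect, with hypothetical function names and no tinygrad calls:

```python
import math


def rsqrt_int_division(x: int) -> float:
    """Assumed failure mode: 1 / x evaluated with integer (floor) division."""
    return math.sqrt(1 // x)


def rsqrt_float_reciprocal(x: int) -> float:
    """Reciprocal computed in floating point before the square root."""
    return math.sqrt(1.0 / x)


print(rsqrt_int_division(4), rsqrt_float_reciprocal(4))
```

With integer division, the reciprocal of any integer greater than 1 collapses to 0, so the result carries no information about the input.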
chenyu
c5fa9eb36e int / List[int] data -> dtypes.int32 (#2789) 2023-12-16 01:25:44 -05:00
chenyu
dad4ee4539 use least_upper_dtype mlops to upcast the output type in mlops (#2788)
* InterpretedFlopCounter uses least_upper_dtype for output dtype

* fix target dtype check

* fix that
2023-12-15 23:46:57 -05:00
chenyu
1bc378c3d6 _broadcasted handles the python number types (#2785)
* _broadcasted handles the python number types

* disable that test
2023-12-15 22:43:27 -05:00
chenyu
0703075357 bf16 is float (#2786)
* add bfloat16 to is_float check

* and test
2023-12-15 21:41:30 -05:00
qazal
66f07d97e2 don't auto-cast half to float in unary functions (#2776)
* least upper float

* dont cast to the same thing

* tests for least_upper_float

* add regression tests to test_dtype_alu

* the call is pretty cheap probably cache is too much overhead
2023-12-15 10:11:47 -05:00
chenyu
66d9eb10b6 arange default dtype to int and zeros/ones default to float (#2769) 2023-12-14 17:53:00 -05:00
chenyu
5235cdee3d remove _arg_int32 internal type (#2767)
in DEFINE_GLOBAL, PtrDtype(int32) is buffer and int32 is int
2023-12-14 14:17:14 -05:00
chenyu
2ef33abd20 some unary functions cast int input into float (#2740)
* some unary functions cast int input into float

* precision

* image dtype
2023-12-13 00:10:29 -05:00
George Hotz
6d6eb9302d ruff checks the max line length is 150 (#2734)
* ruff checks the max line length is 150

* fix tensor.py

* a lot more

* done
2023-12-12 17:34:47 -08:00
chenyu
00b611c156 simplify type promotion - remove weak types (#2730) 2023-12-12 16:12:57 -05:00
chenyu
ef6e942a23 dtype promotion helpers (#2724)
* dtype promotion helpers

* better tests

* space
2023-12-11 23:14:23 -05:00
Christopher Mauri Milan
0232db294d fix tolist issue (#2723) 2023-12-11 19:14:00 -08:00
chenyu
4075208127 some dtype creation spec test cases (#2722) 2023-12-11 19:33:49 -05:00
qazal
a43bc78804 fix dtypes helpers for integers (#2716)
* scalar

* maybe do this instead

* Revert "scalar"

everything is a scalar

* add tests in test_dtype

* fuzz testing + fix unsigned ints

* fuzz everything
2023-12-11 09:28:19 -08:00
qazal
be09cc87c1 Bitcast support / fast bf16 load (#2011)
* bitcast renderers

* fast llama load

* make it one kernel

* regression testing p1: re-enable test_dtype for all backends

fix GPU

* regression testing p2: fuzz all possible cases against numpy

remove handcoded tests since the fuzzer covers them

* define ushort

* fix indent, probably need flake8 back for CI to catch

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-12-05 16:19:28 -08:00
George Hotz
8c67eb1c92 GPT bugfixes (#2624)
* simple fixes

* fix exp2

* fixed

* parallel beam for CUDA

* fix image dtypes
2023-12-05 11:42:28 -08:00
George Hotz
d87a246439 move to new cached fetch (#2493)
* move to new cached fetch

* extra.utils is over

* loads

* bump download cache

* bump timeout
2023-11-28 17:36:55 -08:00
George Hotz
9e07824542 move device to device.py (#2466)
* move device to device.py

* pylint test --disable R,C,W,E --enable E0611

* fix tests
2023-11-27 11:34:37 -08:00
George Hotz
8ff2e13550 From teeny (#2426)
* changes from teenygrad work

* support not supporting ImageDType/PtrDType

* fixups from teeny
2023-11-24 12:50:56 -08:00
qazal
b6aaf12df7 Internal cast 2 with more tests (#2257)
* Change linearizer to parse CAST

* Oneliner renders for cstyle and triton

* LLVM cast and ALU implementation

* pylint fixes

* cast in gep

* remove printbufs

* use cast for post-load ops

* get rid of parse_cast

* partially supported vectorized dtypes for initial dev

* render phi as the dtype

* Revert "partially supported vectorized dtypes for initial dev"

This reverts commit 1bf1a818a3.

* Revert "render phi as the dtype"

This reverts commit d08cb270b4.

* reenable triton tests

* no vstore_half if dtype is already half

* upcast max
2023-11-10 10:42:39 -08:00
George Hotz
330484c072 Revert "Internal casting support (#2046)" (#2256)
This reverts commit 7e1d08b2ae.
2023-11-09 21:27:13 -08:00
qazal
7e1d08b2ae Internal casting support (#2046)
* Change linearizer to parse CAST

* Oneliner renders for cstyle and triton

* LLVM cast and ALU implementation

* pylint fixes

* cast in gep

* remove printbufs

* use cast for post-load ops

* get rid of parse_cast

* partially supported vectorized dtypes for initial dev

* render phi as the dtype

* Revert "partially supported vectorized dtypes for initial dev"

This reverts commit 1bf1a818a3.

* Revert "render phi as the dtype"

This reverts commit d08cb270b4.

* reenable triton tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-11-09 21:02:32 -08:00
qazal
2465d5d267 fix ops tests in test_dtype (#2237)
* fix test ops

* decompose the err from test_ops

* skipTest skips the entire test, we dont want that

* handle cases with the same priority

* add int16 to torch map
2023-11-09 15:17:43 -08:00
qazal
be5f185ac0 Higher test coverage for dtypes (#2156)
* refactor unit tests for dtypes

* add missing dtypes in llvmir.py and lib.py

* skip torch tests

* webgpu

* cleaner skips

* fix llvm bool casting issue using compare

* llvm 100% passing

* llvm segfault

* TEMP decrease timeout mins to 11

debug

* add bf16 to setup

* skip half tests in cuda cpu

* check for CUDACPU instead

* add int16 to triton dtypes

* u16 for triton

* remove debug - diff is still hard to read

* derive from base class TestDType

* enhance test_upcast and downcast by running on every possible version

* dummy commit to rerun the flakey test

* skip the correct tests for CUDA

* bf16 should be skipped in the common TestDType cases

* re-enable bf16

* more consistent structure

* tiny changes to is_dtype_supported 1

* tiny changes 2

add reason

* fuzz

* fuzzer p2

* run fp32 twice

* remove duplicate fp32 run

* clang: use stdbool

* skip triton on bool casts

* merge and resolve conflicts
2023-10-30 22:38:42 -07:00
qazal
a7439af786 Fix llvm int->bool cast (#2164)
* add to ir

* add test case

* minimize diff

* todo

* enable fast math

* added both False and True case
2023-10-30 15:28:23 -07:00
George Hotz
1bf4aef0f5 fix image dtype cmp (#2089)
* fix image dtype cmp

* print that with debug 3
2023-10-16 17:52:38 -07:00