Commit Graph

3361 Commits

SnakeOnex
025fbf4e80 One hot in tensor.py (#3093)
* onehot in Tensor.py

* one_hot tests

* works for all shapes, not just 1

* pylint

* not a static method

* moved around, num_classes mandatory

* pylint

* pylint

* space & moving

* formatting

* moved tests
2024-01-12 13:31:18 -05:00
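
A minimal usage sketch of the Tensor.one_hot added above (#3093); per the commit notes num_classes is mandatory and it works for inputs of any shape. The exact dtype of the result is an assumption.

    # Sketch of the Tensor.one_hot method from #3093; num_classes assumed mandatory.
    from tinygrad.tensor import Tensor

    labels = Tensor([0, 2, 1])   # integer class indices, any shape
    onehot = labels.one_hot(4)   # appends a num_classes axis
    print(onehot.shape)          # (3, 4), with 1s at each class index and 0s elsewhere
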
chenyu
7086d77db1 bugfix do not reset shapetracker of 0 size lazybuffer (#3096)
it might be coming from an expand, and resetting results in incorrect strides. Caught by the interpreted backend.
2024-01-11 23:22:52 -05:00
Yixiang Gao
13e872b53f add multigpu support for llama attention (#3064)
* add llama attention test for multigpu

* test fails

* kv cache trying to shrink on sharded axis

* mask None works for scaled dot product

* kv cache seems to be working but scaled dot product breaks

* scaled dot product works, but the last linear layer failed

* running into the reshape case where it could be wrong for multigpu

* making sure it was the reshape

* adding contiguous doesn't solve

* need to shard more properly

* remove reshape test

* minor adjustment to scaled dot product attention test

* weights are sharded wrong

* continue fixing new weight sharding

* clean up

* fix attention when start_pos is 0

* remove print

* add TODOs for the best multigpu interface
2024-01-11 16:31:02 -08:00
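
The work above is largely about how the llama weights and KV cache get sharded across devices. A rough sketch of the kind of multi-device sharding involved, assuming the Tensor.shard API of the time; the device names and axis choice are illustrative only, not the llama code itself.

    # Illustrative only: shard a tensor across two devices and run an op on it.
    # Device names and the shard() call are assumptions about tinygrad's multigpu API.
    from tinygrad.tensor import Tensor

    devices = ("GPU:0", "GPU:1")                  # hypothetical device pair
    w = Tensor.ones(8, 8).shard(devices, axis=0)  # each device holds a 4x8 slice
    y = (w * 2).sum()                             # per-device results are combined on realize
    print(y.numpy())
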
chenyu
dcf7ecaaff update jit type annotation post lazy rewrite (#3091) 2024-01-11 15:49:30 -05:00
chenyu
0fe6904351 use device from LinearizerOptions in kernel search (#3090)
* use device from LinearizerOptions in kernel search

removed all Device.DEFAULT in search.py

* pass device string for parallel pickle

* device for interpreted backends in LinearizerOptions
2024-01-11 14:46:03 -05:00
chenyu
93e3f952aa use BEAM=2 instead of BEAM=4 in cuda ci gpt2 (#3089)
BEAM=2 is faster and takes less search time. Investigating why BEAM=2+BEAM=4 is slower than BEAM=2 alone.
2024-01-11 13:21:06 -05:00
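
For context, BEAM is an environment variable / ContextVar that sets the beam-search width used when searching kernel optimizations. A hedged sketch of setting it from inside Python, assuming the Context helper applies to it:

    # Hedged sketch: run a matmul with beam-search width 2, as the CI change above uses.
    # Assumes BEAM is a ContextVar settable via tinygrad.helpers.Context.
    from tinygrad.helpers import Context
    from tinygrad.tensor import Tensor

    with Context(BEAM=2):
      out = (Tensor.randn(64, 64) @ Tensor.randn(64, 64)).realize()
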
chenyu
f502c9b08f minor cleanup of View.reshape (#3088)
* minor cleanup of View.reshape

removed some redundant logic

* new_strides

* revert that
2024-01-11 13:05:54 -05:00
chenyu
f40299c3fe remove the third merging state in view._merge_dims (#3085)
no logic depends on state == 0 or state == 2
2024-01-11 12:07:43 -05:00
chenyu
7f9590d357 hotfix disable flaky mac runner wino cifar (#3087) 2024-01-11 11:57:05 -05:00
Yixiang Gao
adcc844755 cat works (#3086) 2024-01-11 08:25:20 -08:00
chenyu
cdeab9ad97 mem_estimate is always int, not symbolic (#3083)
* mem_estimate is always int, not symbolic

op_estimate can be symbolic, but mem_estimate is always int, thus we don't need to sym_infer it.
fixed some long lines too. update_stats is a very big function

* operator does not need underscores
2024-01-10 23:39:51 -05:00
Francis Lam
162fa61a32 wmma: clean up device specific tensor core code (#3081) 2024-01-10 21:03:09 -05:00
chenyu
d218d13885 minor cleanups of lazy.py (#3080) 2024-01-10 20:17:56 -05:00
chenyu
56dda33fc6 Tensor.expand resolves the new_shape before shortcut return (#3078)
Similar to how reshape is done. Also updated the shrink shortcut criteria to read similarly to pad.
2024-01-10 14:29:15 -05:00
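
For context on what "resolving the new_shape" means: a -1 in the requested shape keeps the existing dimension, so the target shape has to be computed even when the call turns out to be a no-op. A small sketch (shapes assumed):

    from tinygrad.tensor import Tensor

    t = Tensor.ones(3, 1)
    print(t.expand(3, 4).shape)    # (3, 4): the size-1 axis is broadcast
    print(t.expand(-1, 4).shape)   # (3, 4): -1 resolves to the existing dim first
    print(t.expand(-1, -1).shape)  # (3, 1): resolves to the same shape, the shortcut case
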
Yixiang Gao
6842476ca6 better test demonstration (#3077)
* a better test demonstration

* fix white space
2024-01-10 10:50:52 -08:00
chenyu
507e0afba0 fix onehot and jit in examples/transformer (#3073)
trained to 0.999 in < 6 seconds on M1 Max consistently
2024-01-10 02:22:41 -05:00
chenyu
4342fccc83 filter_strides -> canonicalize_strides (#3072) 2024-01-10 01:06:48 -05:00
chenyu
023f5df0e9 simpler idxs_to_idx (#3071) 2024-01-10 00:30:10 -05:00
George Hotz
2495ca95c7 early gate the graph (#3070) 2024-01-09 20:17:13 -08:00
George Hotz
ff0d6e4551 jit autorealizes output (#3069) 2024-01-09 20:10:22 -08:00
George Hotz
ae83733431 hotfix: examples/transformer.py 2024-01-09 19:28:09 -08:00
chenyu
145718a90f unbind view or shapetracker also returns var_val (#3067)
* unbind view or shapetracker also returns var_val

4% faster for llama compile time

* one line less

* unbound_views
2024-01-09 21:45:05 -05:00
jxdv
ef3aa6d7fb update gh actions (#3033)
* update checkout actions

* update upload artifact

* update setup python

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-01-09 17:52:22 -08:00
George Hotz
3f80c1a098 speedtweaks3: apply shouldn't use the tensor constructor (#3065)
* speedtweaks3: apply shouldn't use the tensor constructor

* replace 0 size with CONST, not 0 in shape
2024-01-09 17:42:33 -08:00
George Hotz
0abe72b677 hotfix: use is for enum compare, a few more 2024-01-09 16:53:13 -08:00
George Hotz
b2b5849f74 hotfix: use is for enum compare 2024-01-09 16:47:27 -08:00
George Hotz
ac3f246c11 cached size (#3060)
* cached size

* simplify simplify

* 0 doesn't have base

* fix test

* cleaner cache

* hmm, metal is flaky on this...might be real(ish) but useless as test

* short circuit reshape/expand properly

* better reshape bypass
2024-01-09 16:37:37 -08:00
Yixiang Gao
73b72b8de2 test scaled dot product attention (#3063)
* add test

* add initial test for scaled dot product attention

* test pass for scaled dot product attention
2024-01-09 14:30:57 -08:00
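
A hedged sketch of the kind of check such a test makes: Tensor.scaled_dot_product_attention against a manual softmax(q @ k^T / sqrt(d)) @ v reference. Shapes and tolerances here are illustrative.

    import numpy as np
    from tinygrad.tensor import Tensor

    q, k, v = [Tensor.randn(1, 4, 8, 16) for _ in range(3)]  # (batch, heads, seq, head_dim)
    out = q.scaled_dot_product_attention(k, v)

    # manual reference: softmax(q @ k^T / sqrt(head_dim)) @ v
    ref = (q.matmul(k.transpose(-2, -1)) / (16 ** 0.5)).softmax(-1).matmul(v)
    np.testing.assert_allclose(out.numpy(), ref.numpy(), atol=1e-5, rtol=1e-5)
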
chenyu
55ac2a2cf7 Tensor.cat with 0 shape tensors (#3062)
* Tensor.cat with 0 shape tensors

supports both 0 in the cat axis (for a subset of inputs) and 0 in a non-cat axis (where all inputs need to be 0)

* no shp
2024-01-09 16:54:06 -05:00
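
A small sketch of the two cases described above (shapes assumed): a zero-length input along the cat axis is fine for a subset of inputs, while a zero in a non-cat axis has to appear in every input.

    from tinygrad.tensor import Tensor

    a, b = Tensor.ones(2, 3), Tensor.ones(0, 3)
    print(a.cat(b, dim=0).shape)   # (2, 3): zero-length input along the cat axis

    c, d = Tensor.ones(2, 0), Tensor.ones(3, 0)
    print(c.cat(d, dim=0).shape)   # (5, 0): zero in the non-cat axis, shared by all inputs
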
chenyu
f0d7ad8aaa fix gpt2 attention with start_pos = 0 (#3061)
* fix gpt2 attention with start_pos size 1

test cases taken from ll_transformer branch

* fix interpreted
2024-01-09 16:14:55 -05:00
George Hotz
39b91131bc Speed tweaks (#3059)
* base doesn't have to be a function

* no double fetch

* pop, don't check

* make the gc happy

* avoid hasattr

* cache canonicalize

* remove assert, faster base

* don't redefine that every time
2024-01-09 11:34:17 -08:00
George Hotz
bf6281f316 hotfix: remove useless slow assert from ShapeTracker 2024-01-09 10:56:36 -08:00
George Hotz
4b687af98f explicit lazybuffer caching (#3058) 2024-01-09 10:52:37 -08:00
George Hotz
2c6f2e899d No extra vars call (#3054)
* remove unused reciprocal

* comment

* remove unneeded call to vars

* free speedup
(tag: v0.8.0)
2024-01-09 09:52:58 -08:00
Yixiang Gao
259bf9bffc add multigpu test for RMSNorm (#3056)
* need all gather

* add two multigpu test scenarios for RMSNorm
2024-01-09 09:52:51 -08:00
chenyu
dab8214103 unit tests for Device.canonicalize (#3055) 2024-01-09 12:47:20 -05:00
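
For reference, the behavior such unit tests would pin down, assuming the Device.canonicalize of the time: None falls back to the default device and device strings are upper-cased. Sketch only; the import path and the ":0" handling are assumptions.

    from tinygrad.device import Device   # import path assumed

    assert Device.canonicalize(None) == Device.DEFAULT  # None -> default device
    assert Device.canonicalize("gpu") == "GPU"          # names are upper-cased
    assert Device.canonicalize("gpu:0") == "GPU"        # a ":0" suffix is dropped (assumed)
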
George Hotz
374f7659a7 remove unused reciprocal (#3053)
* remove unused reciprocal

* comment
2024-01-09 08:59:04 -08:00
Yixiang Gao
a686663657 make Embedding device aware for multigpu (#3051)
* make Embedding device aware for multigpu

* split line instead of ignore because that's cheating

* add test incomplete

* add test complete

* remove comment

* fix white space

* remove nn.Embedding
2024-01-08 20:09:26 -08:00
chenyu
19298e7a3f Device._buffers -> Device._devices (#3052)
backend devices used to be called buffers
2024-01-08 21:30:38 -05:00
chenyu
4f4e8634b8 use in_features directly in nn.Linear.__init__ bound check (#3050)
* use in_features directly in nn.Linear.__init__ bound check

get rid of the unnecessary isinstance int check

* that is always int

* long lines
2024-01-08 19:32:35 -05:00
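
The "bound check" is the bias-initialization bound in nn.Linear.__init__; a rough sketch of the idea after this change, not the exact tinygrad source: the bound is computed from in_features directly, which is always a plain int, so no isinstance check is needed.

    import math
    from tinygrad.tensor import Tensor

    # Rough sketch only, not the exact nn.Linear source.
    def make_linear(in_features: int, out_features: int, bias: bool = True):
      weight = Tensor.kaiming_uniform(out_features, in_features, a=math.sqrt(5))
      bound = 1 / math.sqrt(in_features)  # in_features is always int, no isinstance check
      b = Tensor.uniform(out_features, low=-bound, high=bound) if bias else None
      return weight, b
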
chenyu
ee6a73826b clean up test_nn.py (#3049)
used the Tensor.train decorator, reordered so tinygrad instances always come first, and removed a redundant idx cast
2024-01-08 18:45:03 -05:00
chenyu
3eb3664074 fix nn.Embedding with empty length input (#3048) 2024-01-08 18:08:36 -05:00
George Hotz
7ea2e0035b move children for speed (#3047)
* move children for speed

* no children anymore
2024-01-08 15:02:32 -08:00
George Hotz
655c6f61d3 St real size (#3046)
* track the size in the lazybuffer

* shapetracker real size

* lint
2024-01-08 14:44:53 -08:00
chenyu
1d730b8853 remove ACCUM_FP32 in simple_matmul.py (#3045)
* remove ACCUM_FP32 in simple_matmul.py

accumulation for half inputs is always in float

* move test llama compile speed to metal
2024-01-08 17:37:57 -05:00
George Hotz
47d67da830 track the size in the lazybuffer (#3044) 2024-01-08 13:44:55 -08:00
George Hotz
c003be7309 Revert "track size in shapetracker" (#3043)
* Revert "track size in shapetracker (#3026)"

This reverts commit a8ba1ac08f.

* st.size
2024-01-08 13:13:39 -08:00
George Hotz
50754f1494 add caches there (#3042)
* add caches there

* no curl
2024-01-08 13:02:16 -08:00
George Hotz
c5a941d466 webgl backend in extra (#3041)
* WebGL WIP

* 84% of ops passing test

* tests passing 100%

* Cleanup, refactor

* Shave off some lines

* Work on dtypes

* TestOps at 100% again

* Efficient net shaders compile in browser webgl2

* Compile all efficientnet shaders in browser

* Create empty textures for tensor buffers

* Run program. Up next weight loading

* Exported WebGL model working

* Add tests, refactor

* Explicit cast alu for GLSL

* Fix CI tests

* WebGL efficientnet demo

* Compile and run yolov8 in browser

* Fix imports

* Simplify yolo compile

* Fix bool*bool and cast cmplt to float

* More tests

* Do std tests pass on CI?

* Skip std tests on CI

* Remove explicit_cast_alu hack, and solve it in code_for_op

* Move to new dtype-less alloc api

* Remove local size hack: optimize local_size only if device has local

* Remove glsl.py, and move content to cstyle

* dont_use_locals in opts

* Fix dtype tests

* type_map in CStyleLanguage

* Make core changes smaller, cleaner, refactor export_model and demo

* Skip pad_slice

* Simplify: render_const, render_conditional

* solve bool alu for other binops, cleaner ops_webgl

* Fix noopt hack

* Remove some skipIfs

* WebGL image hack

* type_names is a better name

* global_max

* Fix dtype import

* Fix type_names -> type_map

* Fix lint

* Remove webgpu, back to 5k lines (#3040)

* remove webgpu

* max 5000 lines

* revert those to master

* retain that cstyle

---------

Co-authored-by: Ahmed Harmouche <ahmedharmouche92@gmail.com>
2024-01-08 09:29:13 -08:00
George Hotz
8cbcd1b342 Remove webgpu, back to 5k lines (#3040)
* remove webgpu

* max 5000 lines
2024-01-08 09:10:07 -08:00