Yixiang Gao
6842476ca6
better test demonstration ( #3077 )
* a better test demonstration
* fix white space
2024-01-10 10:50:52 -08:00
chenyu
507e0afba0
fix onehot and jit in examples/transformer ( #3073 )
trained to 0.999 in < 6 seconds on M1 Max consistently
2024-01-10 02:22:41 -05:00
chenyu
4342fccc83
filter_strides -> canonicalize_strides ( #3072 )
2024-01-10 01:06:48 -05:00
chenyu
023f5df0e9
simpler idxs_to_idx ( #3071 )
2024-01-10 00:30:10 -05:00
George Hotz
2495ca95c7
early gate the graph ( #3070 )
2024-01-09 20:17:13 -08:00
George Hotz
ff0d6e4551
jit autorealizes output ( #3069 )
2024-01-09 20:10:22 -08:00
George Hotz
ae83733431
hotfix: examples/transformer.py
2024-01-09 19:28:09 -08:00
chenyu
145718a90f
unbind view or shapetracker also returns var_val ( #3067 )
* unbind view or shapetracker also returns var_val
4% faster for llama compile time
* one line less
* unbound_views
2024-01-09 21:45:05 -05:00
jxdv
ef3aa6d7fb
update gh actions ( #3033 )
* update checkout actions
* update upload artifact
* update setup python
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-01-09 17:52:22 -08:00
George Hotz
3f80c1a098
speedtweaks3: apply shouldn't use the tensor constructor ( #3065 )
* speedtweaks3: apply shouldn't use the tensor constructor
* replace 0 size with CONST, not 0 in shape
2024-01-09 17:42:33 -08:00
George Hotz
0abe72b677
hotfix: use is for enum compare, a few more
2024-01-09 16:53:13 -08:00
George Hotz
b2b5849f74
hotfix: use is for enum compare
2024-01-09 16:47:27 -08:00
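A minimal illustration of why the two hotfixes above switch to `is` for enum comparison: enum members are singletons, so an identity check is correct and skips `__eq__` dispatch. The `Ops` enum below is a hypothetical stand-in, not the codebase's actual enum.

    from enum import Enum, auto

    class Ops(Enum):  # hypothetical stand-in for the real ops enums
      ADD = auto()
      MUL = auto()

    op = Ops.ADD
    assert op is Ops.ADD       # identity check: correct for singletons, no __eq__ call
    assert op is not Ops.MUL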
George Hotz
ac3f246c11
cached size ( #3060 )
* cached size
* simplify simplify
* 0 doesn't have base
* fix test
* cleaner cache
* hmm, metal is flaky on this...might be real(ish) but useless as test
* short circuit reshape/expand properly
* better reshape bypass
2024-01-09 16:37:37 -08:00
Yixiang Gao
73b72b8de2
test scaled dot product attention ( #3063 )
* add test
* add initial test for scaled dot product attention
* test pass for scaled dot product attention
2024-01-09 14:30:57 -08:00
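A hedged sketch of the kind of check that test adds: tinygrad's Tensor.scaled_dot_product_attention compared against a plain softmax(QK^T/sqrt(d))V reference computed in numpy. Shapes and tolerances here are illustrative assumptions.

    import numpy as np
    from tinygrad import Tensor

    q, k, v = [Tensor.rand(2, 4, 8, 16) for _ in range(3)]
    out = q.scaled_dot_product_attention(k, v).numpy()

    # numpy reference: softmax(q @ k^T / sqrt(d)) @ v
    qn, kn, vn = q.numpy(), k.numpy(), v.numpy()
    scores = qn @ kn.swapaxes(-1, -2) / np.sqrt(qn.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    np.testing.assert_allclose(out, weights @ vn, atol=1e-4, rtol=1e-4)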
chenyu
55ac2a2cf7
Tensor.cat with 0 shape tensors ( #3062 )
* Tensor.cat with 0 shape tensors
supports 0 in the cat axis (for a subset of the inputs) and 0 in a non-cat axis (where all inputs need to be 0)
* no shp
2024-01-09 16:54:06 -05:00
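A hedged example of the behavior described in that commit: a tensor with size 0 along the cat axis can now participate in Tensor.cat, and the result has the expected shape.

    from tinygrad import Tensor

    a = Tensor.ones(2, 3)
    empty = Tensor.ones(0, 3)            # size 0 in the cat axis
    print(a.cat(empty, dim=0).shape)     # (2, 3)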
chenyu
f0d7ad8aaa
fix gpt2 attention with start_pos = 0 ( #3061 )
* fix gpt2 attention with start_pos size 1
test cases taken from ll_transformer branch
* fix interpreted
2024-01-09 16:14:55 -05:00
George Hotz
39b91131bc
Speed tweaks ( #3059 )
* base doesn't have to be a function
* no double fetch
* pop, don't check
* make the gc happy
* avoid hasattr
* cache canonicalize
* remove assert, faster base
* don't redefine that every time
2024-01-09 11:34:17 -08:00
George Hotz
bf6281f316
hotfix: remove useless slow assert from ShapeTracker
2024-01-09 10:56:36 -08:00
George Hotz
4b687af98f
explicit lazybuffer caching ( #3058 )
2024-01-09 10:52:37 -08:00
George Hotz
2c6f2e899d
No extra vars call ( #3054 )
* remove unused reciprocal
* comment
* remove unneeded call to vars
* free speedup
v0.8.0
2024-01-09 09:52:58 -08:00
Yixiang Gao
259bf9bffc
add multigpu test for RMSNorm ( #3056 )
* need all gather
* add two multigpu test scenarios for RMSNorm
2024-01-09 09:52:51 -08:00
chenyu
dab8214103
unit tests for Device.canonicalize ( #3055 )
2024-01-09 12:47:20 -05:00
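Hedged examples of what those unit tests presumably exercise; the exact normalization rules below (upper-casing and dropping a trailing ":0") are assumptions about Device.canonicalize's behavior.

    from tinygrad import Device

    print(Device.canonicalize(None))      # falls back to Device.DEFAULT
    print(Device.canonicalize("gpu"))     # "GPU"
    print(Device.canonicalize("gpu:0"))   # "GPU" (a ":0" suffix is assumed to be dropped)
    print(Device.canonicalize("gpu:1"))   # "GPU:1"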
George Hotz
374f7659a7
remove unused reciprocal ( #3053 )
* remove unused reciprocal
* comment
2024-01-09 08:59:04 -08:00
Yixiang Gao
a686663657
make Embedding device aware for multigpu ( #3051 )
* make Embedding device aware for multigpu
* split line instead of ignore because that's cheating
* add test incomplete
* add test complete
* remove comment
* fix white space
* remove nn.Embedding
2024-01-08 20:09:26 -08:00
chenyu
19298e7a3f
Device._buffers -> Device._devices ( #3052 )
backend devices used to be called buffers
2024-01-08 21:30:38 -05:00
chenyu
4f4e8634b8
use in_features directly in nn.Linear.__init__ bound check ( #3050 )
* use in_features directly in nn.Linear.__init__ bound check
get rid of the unnecessary isinstance int check
* that is always int
* long lines
2024-01-08 19:32:35 -05:00
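A hedged sketch of what that change implies for nn.Linear.__init__: the bias bound is computed from in_features directly instead of from weight.shape[1], which removed the need for an isinstance-int check. This is an illustrative reconstruction, not the file's exact contents.

    import math
    from tinygrad import Tensor

    class LinearSketch:  # illustrative stand-in for nn.Linear
      def __init__(self, in_features: int, out_features: int, bias: bool = True):
        self.weight = Tensor.kaiming_uniform(out_features, in_features, a=math.sqrt(5))
        bound = 1 / math.sqrt(in_features)   # uses in_features directly
        self.bias = Tensor.uniform(out_features, low=-bound, high=bound) if bias else None

      def __call__(self, x: Tensor) -> Tensor:
        return x.linear(self.weight.transpose(), self.bias)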
chenyu
ee6a73826b
clean up test_nn.py ( #3049 )
used the Tensor.train decorator, reordered so tinygrad instances always come first, and removed a redundant idx cast
2024-01-08 18:45:03 -05:00
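A hedged sketch of the first part of that cleanup, assuming Tensor.train can be applied as a decorator as the message says: wrap training-mode code in it instead of toggling Tensor.training by hand. The function body below is illustrative.

    from tinygrad import Tensor

    @Tensor.train()               # sets Tensor.training for the duration of the call
    def training_step(x: Tensor) -> Tensor:
      assert Tensor.training      # ops like dropout key off this flag
      return x.dropout(0.5)

    training_step(Tensor.rand(4, 4)).realize()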
chenyu
3eb3664074
fix nn.Embedding with empty length input ( #3048 )
2024-01-08 18:08:36 -05:00
George Hotz
7ea2e0035b
move children for speed ( #3047 )
* move children for speed
* no children anymore
2024-01-08 15:02:32 -08:00
George Hotz
655c6f61d3
St real size ( #3046 )
* track the size in the lazybuffer
* shapetracker real size
* lint
2024-01-08 14:44:53 -08:00
chenyu
1d730b8853
remove ACCUM_FP32 in simple_matmul.py ( #3045 )
* remove ACCUM_FP32 in simple_matmul.py
accumulation for half inputs is always in float
* move test llama compile speed to metal
2024-01-08 17:37:57 -05:00
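A hedged sketch of the claim in that message, checking a half matmul against a float32 reference; if accumulation for half inputs happens in float, the maximum error stays small. Sizes are illustrative.

    import numpy as np
    from tinygrad import Tensor

    a, b = Tensor.rand(256, 256).half(), Tensor.rand(256, 256).half()
    out = (a @ b).numpy().astype(np.float32)
    ref = a.numpy().astype(np.float32) @ b.numpy().astype(np.float32)
    print(np.abs(out - ref).max())   # expected to be small if accumulation is in float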
George Hotz
47d67da830
track the size in the lazybuffer ( #3044 )
2024-01-08 13:44:55 -08:00
George Hotz
c003be7309
Revert "track size in shapetracker" ( #3043 )
* Revert "track size in shapetracker (#3026 )"
This reverts commit a8ba1ac08f .
* st.size
2024-01-08 13:13:39 -08:00
George Hotz
50754f1494
add caches there ( #3042 )
* add caches there
* no curl
2024-01-08 13:02:16 -08:00
George Hotz
c5a941d466
webgl backend in extra ( #3041 )
* WebGL WIP
* 84% of ops passing test
* tests passing 100%
* Cleanup, refactor
* Shave off some lines
* Work on dtypes
* TestOps at 100% again
* Efficient net shaders compile in browser webgl2
* Compile all efficientnet shaders in browser
* Create empty textures for tensor buffers
* Run program. Up next weight loading
* Exported WebGL model working
* Add tests, refactor
* Explicit cast alu for GLSL
* Fix CI tests
* WebGL efficientnet demo
* Compile and run yolov8 in browser
* Fix imports
* Simplify yolo compile
* Fix bool*bool and cast cmplt to float
* More tests
* Do std tests pass on CI?
* Skip std tests on CI
* Remove explicit_cast_alu hack, and solve it in code_for_op
* Move to new dtype-less alloc api
* Remove local size hack: optimize local_size only if device has local
* Remove glsl.py, and move content to cstyle
* dont_use_locals in opts
* Fix dtype tests
* type_map in CStyleLanguage
* Make core changes smaller, cleaner, refactor export_model and demo
* Skip pad_slice
* Simplify: render_const, render_conditional
* solve bool alu for other binops, cleaner ops_webgl
* Fix noopt hack
* Remove some skipIfs
* WebGL image hack
* type_names is a better name
* global_max
* Fix dtype import
* Fix type_names -> type_map
* Fix lint
* Remove webgpu, back to 5k lines (#3040)
* remove webgpu
* max 5000 lines
* revert those to master
* retain that cstyle
---------
Co-authored-by: Ahmed Harmouche <ahmedharmouche92@gmail.com>
2024-01-08 09:29:13 -08:00
George Hotz
8cbcd1b342
Remove webgpu, back to 5k lines ( #3040 )
* remove webgpu
* max 5000 lines
2024-01-08 09:10:07 -08:00
George Hotz
cf2eea961c
more beautiful_cartpole with exposed hparams
2024-01-07 17:41:09 -08:00
Yixiang Gao
44618427f1
add bf16 type_map for both cuda and hip ( #3036 )
* add typemap bfloat16 for cuda and hip
* add render_dtype
* add def in CStyleLanguage
* fix def
* save one line
* add header file for cuda bf16
2024-01-07 14:26:55 -08:00
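A self-contained, hypothetical sketch of the idea in that PR: each backend maps a dtype to the type name spelled in generated C-style source, with bfloat16 getting a backend-specific name (and, per the last bullet, CUDA needing a header). The dictionaries and function below are illustrative, not the codebase's actual type_map.

    # hypothetical per-backend dtype-name -> C type name maps
    CUDA_TYPE_MAP = {"bfloat16": "__nv_bfloat16", "half": "half", "float": "float"}
    HIP_TYPE_MAP  = {"bfloat16": "hip_bfloat16",  "half": "half", "float": "float"}

    def render_dtype(type_map: dict, dtype_name: str) -> str:
      # fall back to the dtype's own name when a backend has no special spelling
      return type_map.get(dtype_name, dtype_name)

    assert render_dtype(CUDA_TYPE_MAP, "bfloat16") == "__nv_bfloat16"
    assert render_dtype(HIP_TYPE_MAP, "bfloat16") == "hip_bfloat16"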
chenyu
ef5f545fd8
add more Tensor.clip test cases ( #3034 )
* add more Tensor.clip test cases
add cases for same low/high and both negative etc
* case min > max
2024-01-07 13:08:59 -05:00
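Hedged examples of the edge cases those new tests cover: identical low and high bounds, both bounds negative, and min greater than max.

    from tinygrad import Tensor

    t = Tensor([-3.0, -1.0, 0.0, 2.0])
    print(t.clip(-1, -1).numpy())   # same low/high: every value collapses to -1
    print(t.clip(-2, -1).numpy())   # both bounds negative
    print(t.clip(1, -1).numpy())    # min > max, the last case the commit mentions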
chenyu
c9371f0d31
hotfix llama conversation mode ( #3031 )
without contiguous on keys and values, it runs but the update is incorrect
2024-01-06 16:57:07 -05:00
chenyu
fa707c81e5
move beautiful cartpole action sampling inside jit ( #3028 )
tested by getting 3 full scores in a row
2024-01-06 00:39:55 -05:00
George Hotz
ebb81e8f11
hotfix: st.size() -> st.size in llama
2024-01-05 20:18:52 -08:00
George Hotz
a8ba1ac08f
track size in shapetracker ( #3026 )
* track size in shapetracker
* shapetracker adapter
* size is an int
* create Buffer with st.size
* only compare the views for the jit
* fix webgpu
2024-01-05 20:15:53 -08:00
chenyu
138c17c094
enable argmax tests for METAL/WEBGPU in CI ( #3027 )
not sure why it was skipped but works now in CI
2024-01-05 21:43:00 -05:00
George Hotz
2a2d3233d2
add test that the compiler isn't used ( #3025 )
* add test that the compiler isn't used
* one print_tree
* improve speed with st size cache
* switch to gpt-2
2024-01-05 17:24:01 -08:00
chenyu
520406cf3a
add Tensor.unflatten and Tensor.flatten(end_dim) ( #3023 )
simplified cases when splitting a dim or merging dims in a prefix
2024-01-05 17:55:29 -05:00
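A hedged usage sketch of the two additions, assuming PyTorch-like semantics: flatten(end_dim) merges a contiguous run of dims, and unflatten splits one dim into several.

    from tinygrad import Tensor

    t = Tensor.rand(2, 3, 4, 5)
    print(t.flatten(1, 2).shape)         # (2, 12, 5): merge dims 1..2
    print(t.unflatten(1, (3, 1)).shape)  # (2, 3, 1, 4, 5): split dim 1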
George Hotz
f432ec9c33
Bitcast hip fix + fix mixtral ( #3022 )
* fix bitcast in hip
* wrong dtype for precast, double COPY
2024-01-05 14:51:25 -08:00
chenyu
eda43767de
use Scalar = Union[float, int, bool] in tensor.py ( #3021 )
unify the type spec for Tensor creation functions and broadcasted elementwise ops that take python scalar
2024-01-05 13:56:26 -05:00
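A minimal sketch of the alias that commit standardizes on; the helper below is illustrative of how both a Tensor-creation argument and a broadcasted elementwise op can share one scalar type.

    from typing import Union
    from tinygrad import Tensor

    Scalar = Union[float, int, bool]   # the alias tensor.py standardizes on

    def full_then_add(t: Tensor, fill: Scalar, addend: Scalar) -> Tensor:
      # illustrative helper: a creation function and a broadcasted op, both taking a Scalar
      return Tensor.full(t.shape, fill) + addend

    print(full_then_add(Tensor.rand(2, 2), 1, 0.5).numpy())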
George Hotz
60abc62a3f
fast hip read ( #3014 )
* fast hip read
* hip read faster
* fix tests
* to_mv
* simplify
* bump to 6k lines
2024-01-05 10:33:13 -08:00
chenyu
4465ef28c5
add test_softmax to test_ops ( #3020 )
* add test_softmax to test_ops
somehow it was not tested
* too many buffers in softmax backward for WEBGPU
2024-01-05 11:19:49 -05:00
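A hedged sketch in the style of test_ops, assuming the existing helper_test_op utility with its usual (shapes, torch_fn, tinygrad_fn) signature and that it is run from the repo root; the shape is illustrative. It compares tinygrad's softmax against torch's.

    import torch
    from test.test_ops import helper_test_op

    helper_test_op([(45, 65)],
                   lambda x: torch.nn.functional.softmax(x, dim=-1),
                   lambda x: x.softmax(-1))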