3328 Commits

Author SHA1 Message Date
George Hotz
2c6f2e899d No extra vars call (#3054)
* remove unused reciprocal

* comment

* remove unneeded call to vars

* free speedup
v0.8.0
2024-01-09 09:52:58 -08:00
Yixiang Gao
259bf9bffc add multigpu test for RMSNorm (#3056)
* need all gather

* add two multigpu test scenarios for RMSNorm
2024-01-09 09:52:51 -08:00
chenyu
dab8214103 unit tests for Device.canonicalize (#3055) 2024-01-09 12:47:20 -05:00
George Hotz
374f7659a7 remove unused reciprocal (#3053)
* remove unused reciprocal

* comment
2024-01-09 08:59:04 -08:00
Yixiang Gao
a686663657 make Embedding device aware for multigpu (#3051)
* make Embedding device aware for multigpu

* split line instead of ignore because that's cheating

* add test incomplete

* add test complete

* remove comment

* fix white space

* remove nn.Embedding
2024-01-08 20:09:26 -08:00
chenyu
19298e7a3f Device._buffers -> Device._devices (#3052)
backend devices used to be called buffers
2024-01-08 21:30:38 -05:00
chenyu
4f4e8634b8 use in_features directly in nn.Linear.__init__ bound check (#3050)
* use in_features directly in nn.Linear.__init__ bound check

get rid of the unnecessary check of isinstance int

* that is always int

* long lines
2024-01-08 19:32:35 -05:00
chenyu
ee6a73826b clean up test_nn.py (#3049)
used Tensor.train decorator, reordered to always tinygrad instances first, and removed redundant idx cast
2024-01-08 18:45:03 -05:00
chenyu
3eb3664074 fix nn.Embedding with empty length input (#3048) 2024-01-08 18:08:36 -05:00
George Hotz
7ea2e0035b move children for speed (#3047)
* move children for speed

* no children anymore
2024-01-08 15:02:32 -08:00
George Hotz
655c6f61d3 St real size (#3046)
* track the size in the lazybuffer

* shapetracker real size

* lint
2024-01-08 14:44:53 -08:00
chenyu
1d730b8853 remove ACCUM_FP32 in simple_matmul.py (#3045)
* remove ACCUM_FP32 in simple_matmul.py

accumulation for half inputs is always in float

* move test llama compile speed to metal
2024-01-08 17:37:57 -05:00
George Hotz
47d67da830 track the size in the lazybuffer (#3044) 2024-01-08 13:44:55 -08:00
George Hotz
c003be7309 Revert "track size in shapetracker" (#3043)
* Revert "track size in shapetracker (#3026)"

This reverts commit a8ba1ac08f.

* st.size
2024-01-08 13:13:39 -08:00
George Hotz
50754f1494 add caches there (#3042)
* add caches there

* no curl
2024-01-08 13:02:16 -08:00
George Hotz
c5a941d466 webgl backend in extra (#3041)
* WebGL WIP

* 84% of ops passing test

* tests passing 100%

* Cleanup, refactor

* Shave off some lines

* Work on dtypes

* TestOps at 100% again

* Efficient net shaders compile in browser webgl2

* Compile all efficientnet shaders in browser

* Create empty textures for tensor buffers

* Run program. Up next weight loading

* Exported WebGL model working

* Add tests, refactor

* Explicit cast alu for GLSL

* Fix CI tests

* WebGL efficientnet demo

* Compile and run yolov8 in browser

* Fix imports

* Simplify yolo compile

* Fix bool*bool and cast cmplt to float

* More tests

* Do std tests pass on CI?

* Skip std tests on CI

* Remove explicit_cast_alu hack, and solve it in code_for_op

* Move to new dtype-less alloc api

* Remove local size hack: optimize local_size only if device has local

* Remove glsl.py, and move content to cstyle

* dont_use_locals in opts

* Fix dtype tests

* type_map in CStyleLanguage

* Make core changes smaller, cleaner, refactor export_model and demo

* Skip pad_slice

* Simplify: render_const, render_conditional

* solve bool alu for other binops, cleaner ops_webgl

* Fix noopt hack

* Remove some skipIfs

* WebGL image hack

* type_names is a better name

* global_max

* Fix dtype import

* Fix type_names -> type_map

* Fix lint

* Remove webgpu, back to 5k lines (#3040)

* remove webgpu

* max 5000 lines

* revert those to master

* retain that cstyle

---------

Co-authored-by: Ahmed Harmouche <ahmedharmouche92@gmail.com>
2024-01-08 09:29:13 -08:00
George Hotz
8cbcd1b342 Remove webgpu, back to 5k lines (#3040)
* remove webgpu

* max 5000 lines
2024-01-08 09:10:07 -08:00
George Hotz
cf2eea961c more beautiful_cartpole with exposed hparams 2024-01-07 17:41:09 -08:00
Yixiang Gao
44618427f1 add bf16 type_map for both cuda and hip (#3036)
* add typemap bfloat16 for cuda and hip

* add render_dtype

* add def in CStyleLanguage

* fix def

* save one line

* add header file for cuda bf16
2024-01-07 14:26:55 -08:00
chenyu
ef5f545fd8 add more Tensor.clip test cases (#3034)
* add more Tensor.clip test cases

add cases for same low/high and both negative etc

* case min > max
2024-01-07 13:08:59 -05:00
chenyu
c9371f0d31 hotfix llama conversation mode (#3031)
without contiguous on keys and values, it runs but the update is incorrect
2024-01-06 16:57:07 -05:00
chenyu
fa707c81e5 move beautiful cartpole action sampling inside jit (#3028)
tested by getting 3 full scores in a row
2024-01-06 00:39:55 -05:00
George Hotz
ebb81e8f11 hotfix: st.size() -> st.size in llama 2024-01-05 20:18:52 -08:00
George Hotz
a8ba1ac08f track size in shapetracker (#3026)
* track size in shapetracker

* shapetracker adapter

* size is an int

* create Buffer with st.size

* only compare the views for the jit

* fix webgpu
2024-01-05 20:15:53 -08:00
chenyu
138c17c094 enable argmax tests for METAL/WEBGPU in CI (#3027)
not sure why it was skipped but works now in CI
2024-01-05 21:43:00 -05:00
George Hotz
2a2d3233d2 add test that the compiler isn't used (#3025)
* add test that the compiler isn't used

* one print_tree

* improve speed with st size cache

* switch to gpt-2
2024-01-05 17:24:01 -08:00
chenyu
520406cf3a add Tensor.unflatten and Tensor.flatten(end_dim) (#3023)
simplified cases when splitting a dim, or merging dims in prefix
2024-01-05 17:55:29 -05:00
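As a rough illustration of the entry above, here is a shape-only sketch of what flattening a range of dims and unflattening one dim do. The helpers flatten_shape and unflatten_shape are hypothetical names invented for this note. They are not tinygrad functions, and they assume the usual torch-style semantics (end_dim inclusive, negative indices allowed) rather than anything confirmed by the commit.

```python
from math import prod

def flatten_shape(shape, start_dim=0, end_dim=-1):
    """Hypothetical helper: merge dims start_dim..end_dim (inclusive) into one."""
    n = len(shape)
    start, end = start_dim % n, end_dim % n  # normalize negative indices
    return shape[:start] + (prod(shape[start:end + 1]),) + shape[end + 1:]

def unflatten_shape(shape, dim, sizes):
    """Hypothetical helper: split one dim into the given sizes."""
    dim = dim % len(shape)
    if prod(sizes) != shape[dim]:
        raise ValueError("sizes must multiply to the size of the dim being split")
    return shape[:dim] + tuple(sizes) + shape[dim + 1:]

merged = flatten_shape((2, 3, 4, 5), start_dim=1, end_dim=2)  # (2, 12, 5)
restored = unflatten_shape(merged, 1, (3, 4))                 # (2, 3, 4, 5)
```

The two operations are inverses of each other on shapes, which is why the sketch round-trips back to the original shape.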
George Hotz
f432ec9c33 Bitcast hip fix + fix mixtral (#3022)
* fix bitcast in hip

* wrong dtype for precast, double COPY
2024-01-05 14:51:25 -08:00
chenyu
eda43767de use Scalar = Union[float, int, bool] in tensor.py (#3021)
unify the type spec for Tensor creation functions and broadcasted elementwise ops that take python scalar
2024-01-05 13:56:26 -05:00
George Hotz
60abc62a3f fast hip read (#3014)
* fast hip read

* hip read faster

* fix tests

* to_mv

* simplify

* bump to 6k lines
2024-01-05 10:33:13 -08:00
chenyu
4465ef28c5 add test_softmax to test_ops (#3020)
* add test_softmax to test_ops

somehow it was not tested

* too many buffers in softmax backward for WEBGPU
2024-01-05 11:19:49 -05:00
chenyu
7c80b78be9 cleanup gpt2 build function (#3018) 2024-01-04 23:14:53 -05:00
chenyu
55e52abeba minor cleanup of matvec in hand_coded_optimizations (#3015)
remove noop isinstance check and fix long lines
2024-01-04 19:43:49 -05:00
chenyu
f88506e630 move gpt2/llama sampling inside the model call (#3013)
* move gpt2/llama sampling inside the model call

* argmax uses one more kernel
2024-01-04 17:01:50 -05:00
George Hotz
c2a044ed83 disk_read_speed example 2024-01-04 13:59:43 -08:00
Yixiang Gao
8a63f26a0f make LR scheduler work with multigpu (#3011)
* add a failing test for LR scheduler when using multigpu

* fix calculation order and unnecessary tensor created for float

* min_lr is no longer tensor
2024-01-04 12:10:56 -08:00
chenyu
8524493748 minor gpt2 cleanup (#3012) 2024-01-04 13:53:18 -05:00
chenyu
2b6670d2ea separate entry for HALF hlb_cifar10 in benchmark (#3010) 2024-01-04 13:24:10 -05:00
chenyu
5337211058 llvm CMPEQ 2024-01-04 13:12:22 -05:00
chenyu
b8c30eb358 no midcast MULACC for llvm 2024-01-04 13:12:22 -05:00
chenyu
91665ef143 rewrite MUL CAST SUM to CAST MULACC 2024-01-04 13:12:22 -05:00
chenyu
ab7dfd637b use float for acc dtype for half tensor sum
we previously only upcast uint and int, and half was using half for acc.
change to acc in float for precision. but cast the result back to half to match torch/jax output dtype
2024-01-04 13:12:22 -05:00
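To see why the entry above accumulates half sums in float, a minimal standard-library sketch may help. It emulates half and float32 rounding with struct, which is an assumption made for illustration and not how tinygrad implements the accumulator. The function names are hypothetical.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_half_acc(values):
    """Sum half inputs with a half accumulator (the behavior before the change)."""
    acc = 0.0
    for v in values:
        acc = to_half(acc + to_half(v))
    return acc

def sum_float_acc(values):
    """Sum half inputs with a float32 accumulator, then cast the result to half."""
    acc = 0.0
    for v in values:
        acc = to_float32(acc + to_half(v))
    return to_half(acc)

ones = [1.0] * 4096
half_result = sum_half_acc(ones)    # 2048.0: 2048 + 1 rounds back to 2048 in half
float_result = sum_float_acc(ones)  # 4096.0: exact in float32, representable in half
```

The half accumulator stalls once the running total reaches 2048, because the spacing between adjacent half values there is 2. The float32 accumulator keeps the intermediate total exact, and only the final result is cast back to half.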
chenyu
6fa285b943 touchup onnx xor and not (#3008) 2024-01-04 02:02:42 -05:00
geohotstan
57817028bb removed redundant dtype hacks in onnx_ops (#2939)
* updated most dtype hacks in onnx_ops

* temporarily revert dequantizelinear change

* I think this is right...

* MORE FIXES WOOOO NEW DTYPE IS AWESOME

* ok

* oops missed a print

* half -> float32 for CI

* is npdtype

* some more

* fix if ordering

* more clean ups

* final cleanups

* casting to half not allowed

* k nvm

* revert ArgMax change

* only GPU

* llvm begone

* teeny tiny change

* fix: attempt to add cast tests

* try this

* fix dequantizelinear

* revert some stuff

* tests pass pls

* less lines in onnx_tests

* oops missed string tensor tests

* clean up

* try: revert default behavior changes

* fix: disabled Cast and Castlike tests

* docs: small changes

* fix: fixed isNaN op and enabled associated tests

* fix: forgot about float16

* done

* update disabled test

* gah missed another float16

* disable rest of failing tests

* rm extra line

* try...

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-01-04 01:45:24 -05:00
chenyu
9f39165188 correct (dtype, device) in test_dtype.is_dtype_supported (#3007)
corrected dtypes for TORCH and float64 support
2024-01-04 00:25:37 -05:00
chenyu
ae112c9dbe fix some long lines in tests (#3006)
* fix some long lines in tests

* better
2024-01-03 23:53:33 -05:00
George Hotz
7e191fbb86 hotfix: don't jitcache with 1 kernel. improvements to hip sniffer 2024-01-03 19:17:08 -08:00
George Hotz
bcc1aa21ac make disk simpler (#3002)
* make disk simpler

* upd ops_disk

* works on osx too

* revert ops_hip
2024-01-03 17:46:21 -08:00
George Hotz
9699c8c90b don't alloc for InterpretedASTRunner (#2999) 2024-01-03 17:05:53 -08:00
chenyu
bca0b95ee3 bump shapetracker simplify message to DEBUG >= 5 (#2998) 2024-01-03 20:00:36 -05:00