Commit Graph

1363 Commits

Author SHA1 Message Date
George Hotz
41efaa848c move graph.py and jit.py into features (#3376)
* move graph.py into features

* move jit into features

* fix quickstart
2024-02-12 17:34:34 +01:00
Yoshinori Sano
98c732cf9d fix metal compile error in extra/gemm (#3365) 2024-02-10 12:54:41 +01:00
terafo
3752e97c8f Fix: Always cast ONNX Slice op arguments into ints (#3317)
* fix: ensure that axes and steps are always ints

* Cast everything in tinygrad

---------

Co-authored-by: terafo <terafo@protonmail.com>
2024-02-04 18:40:48 -05:00
chenyu
9b8c1a0408 Tensor.batchnorm works more than 2d and reuse in onnx (#3284) 2024-01-30 19:02:45 -05:00
chenyu
7816c3b692 onnx update for trilu and argmax (#3283)
* support 0 in shape for tril and triu

* select_last_index for ArgMax and ArgMin

* pass **kwargs
2024-01-30 18:39:16 -05:00
Francis Lam
4273aabe31 extra/gemm: add a simple_conv.py along with correctness check (#3236)
* extra/gemm: add a simple_conv.py along with correctness check

The goal is to easily test tensor core triggering situations

* test: add tests for acc_dtype handling and fixed typing
2024-01-26 19:06:57 -08:00
George Hotz
473935125a use comgr to compile (#3248)
* use comgr to compile

* fast

* bfloat16

* move comgr to it's own file

* cleaner style

* comgr in new place

* comgr free + dtype cleanup
2024-01-26 18:27:49 -08:00
George Hotz
03a6bc59c1 move autogen to runtime/autogen (#3254) 2024-01-26 12:44:19 -08:00
George Hotz
a3869ffd46 move gpuctypes in tree (#3253)
* move gpuctypes in tree

* fix mypy

* regex exclude

* autogen sh

* mypy exclude

* does that fix it

* fix mypy

* add hip confirm

* verify all autogens

* build clang2py

* opencl headers

* gpu on 22.04
2024-01-26 12:25:03 -08:00
chenyu
bc92c4cc32 onnx Einsum, CumSum, DepthToSpace, SpaceToDepth (#3252)
* onnx Einsum, CumSum, DepthToSpace, SpaceToDepth

Einsum inner product and `...` are not supported

* --durations=20
2024-01-26 10:47:53 -05:00
chenyu
e45ffdb6cf cleanup onnx (#3249)
* add onnx test_reduce_log_sum_exp

* more reuse

* more

* stuff

* good CenterCropPad

* imports

* good ArrayFeatureExtractor

* pretty good Pad

* stuff

* stuff

* onnx.py

* Atan

* pass int8 test

* dtype related

* fastmath stuff

* Resize linear

* fix CI

* move back
2024-01-25 20:39:59 -05:00
Ahmed Harmouche
168b1f879c Fix hip_matmul gemm in extra (#3241) 2024-01-25 16:03:04 -08:00
geohotstan
3628bea910 fix: big round even rounder round (#3242)
* fix: big round even rounder round

* fix: variable name lol

* feat: 1 less potential cast

* consistant naming (im just spaming commits now)

* LOL MISSED ONNX ANOTHER COMMIT

* test: fix test_ops and remove _round

* test: tensor methods oops
2024-01-25 12:24:15 -05:00
geohotstan
b0b5eba535 fix _round in onnx_ops to look more like new Tensor.round (#3239)
* fix: _round in onnxops

* fix: minor things

* fix: no more n

* fix: smol

* fix: smoller
2024-01-25 01:18:58 -05:00
chenyu
afeadbedc9 touch up Tensor.round and Tensor.neg (#3228) 2024-01-24 12:29:37 -05:00
geohotstan
842053873d fix neg logical_not inconsistencies (#3222)
* try

* test: add logical_not tests

* gah im retarded, but this doesn't match types for const()

* fix: can't we jsut do this?

* big change: I don't actually know what I'm doing

* WOOO IM JUST CHANGING EVERYTHING WOW probably gon revert later

* BYE BYE noqa: E501

* fix: less lines and add test

* fix: rm 2 redundant tests

* fix: eq with False so we don't unintentionally implicit upcast, but it's bool anyways so w/e
2024-01-24 11:48:40 -05:00
chenyu
485332935e ring copy example (#3185)
* ring copy example

* use ones for init
2024-01-19 23:34:30 -05:00
George Hotz
c80884884e event driven hip (#3160)
* event driven hip

* simpler, src makes copy

* pass mypy
2024-01-18 14:35:18 -08:00
Max-We
0338903429 Update kits19.py (#3166) 2024-01-18 08:33:50 -08:00
George Hotz
743b36f0ce hotfix: copy size is in bytes 2024-01-17 16:44:15 +00:00
George Hotz
a72b1b6d65 sharding for llama (#3151)
* shard llama

* sharding works

* simpler

* simpler

* consume option

* disable that test

* save a line

---------

Co-authored-by: George Hotz <george@tinygrad.org>
2024-01-16 19:28:00 -08:00
George Hotz
ca0beeef38 Christopherm99 ptx (#3139)
* get basic ptx impl working

* test ops passing

* mypy

* dont hardcode target

* more walrus

* ptx in ci

* bool cast and f16 load/store

* weird numpy bug and f16 cast tolerance

* cast half to bool

* fix 1 byte load/store

* disable half for ptx

* fix args and enable xid

* fix non-ptr args

* allow bitcast

* mypy

* cleanups

* midcast use allclose

* add xor

* Revert "disable half for ptx"

This reverts commit 73391c05fd.

* enable float16

* mypy

* no more crashing in ci

* fix ci

* minor cleanups

* use new fn for ptx compiler

* no diskcache in ptx compile

* use rn instead of rz

* save some lines

* new DEFINE_GLOBAL syntax

* line length

* new llvm

* cmpeq

* minor fix

* cast in mulacc

* update test_recursive_add to check line count

* mypy

* remove llvmir.py

* fix bool const

* wip

* cleanups

* working

* llvm in separate pr

* cleanups

* more cleanups

* fix ci

* use in_features directly in nn.Linear.__init__ bound check (#3050)

* use in_features directly in nn.Linear.__init__ bound check

get rid of the unnecessary check of isinstance int

* that is always int

* long lines

* Device._buffers -> Device._devices (#3052)

backend devices used to be called buffers

* make Embedding device aware for multigpu (#3051)

* make Embedding device aware for multigpu

* split line instead of igore because that's cheating

* add test incomplete

* add test complete

* remove comment

* fix white space

* remove nn.Embedding

* remove unused reciprocal (#3053)

* remove unused reciprocal

* comment

* unit tests for Device.canonicalize (#3055)

* add multigpu test for RMSNorm (#3056)

* need all gather

* add two multigpu test scenarios for RMSNorm

* No extra vars call (#3054)

* remove unused reciprocal

* comment

* remove unneeded call to vars

* free speedup

* explicit lazybuffer caching (#3058)

* hotfix: remove useless slow assert from ShapeTracker

* Speed tweaks (#3059)

* base doesn't have to be a function

* no double fetch

* pop, don't check

* make the gc happy

* avoid hasattr

* cache canonicalize

* remove assert, faster base

* don't redefine that every time

* fix gpt2 attention with start_pos = 0 (#3061)

* fix gpt2 attention with start_pos size 1

test cases taken from ll_transformer branch

* fix interpreted

* Tensor.cat with 0 shape tensors (#3062)

* Tensor.cat with 0 shape tensors

supported both 0 in cat axis (for a subset of input), or 0 in non-cat axis (all needs to be 0)

* no shp

* test scaled dot product attention (#3063)

* add test

* add initial test for scaled dot product attention

* test pass for scaled dot product attention

* cached size (#3060)

* cached size

* simplify simplify

* 0 doesn't have base

* fix test

* cleaner cache

* hmm, metal is flaky on this...might be real(ish) but useless as test

* short circuit reshape/expand properly

* better reshape bypass

* hotfix: use is for enum compare

* hotfix: use is for enum compare, a few more

* speedtweaks3: apply shouldn't use the tensor constructor (#3065)

* speedtweaks3: apply shouldn't use the tensor constructor

* replace 0 size with CONST, not 0 in shape

* update gh actions (#3033)

* update checkout actions

* update upload artifact

* update setup python

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>

* unbind view or shapetracker also returns var_val (#3067)

* unbind view or shapetracker also returns var_val

4% faster for llama compile time

* one line less

* unbound_views

* hotfix: examples/transformer.py

* jit autorealizes output (#3069)

* early gate the graph (#3070)

* simpler idxs_to_idx (#3071)

* filter_strides -> canonicalize_strides (#3072)

* fix onehot and jit in examples/transformer (#3073)

trained to 0.999 in < 6 seconds on M1 Max consistently

* better test demonstration (#3077)

* a better test demonstration

* fix white space

* Tensor.expand resolves the new_shape before shortcut return (#3078)

similar to how reshape is done. also updated shrink shortcut criteria to read similar to pad

* minor cleanups of lazy.py (#3080)

* wmma: clean up device specific tensor core code (#3081)

* mem_estimate is always int, not symbolic (#3083)

* mem_estimate is always int, not symbolic

op_estimate can be symbolic, but mem_estimate is always int, thus we don't need to sym_infer it.
fixed some long lines too. update_stats is a very big function

* operator does not need underscores

* cat works (#3086)

* hotfix disable flaky mac runner wino cifar (#3087)

* remove the third merging state in view._merge_dims (#3085)

no logic depends on state == 0 or state == 2

* minor cleanup of View.reshape (#3088)

* minor cleanup of View.reshape

removed some redundant logic

* new_strides

* revert that

* use BEAM=2 instead of BEAM=4 in cuda ci gpt2 (#3089)

BEAM=2 is faster and less search time. investigating why BEAM2+BEAM4 is slower than BEAM2 alone

* use device from LinearizerOptions in kernel search (#3090)

* use device from LinearizerOptions in kernel search

removed all Device.DEFAULT in search.py

* pass device string for parallel pickle

* device for interpreted backends in LinearizerOptions

* update jit type annotation post lazy rewrite (#3091)

* add mutigpu support for llama attention (#3064)

* add llama attention test for multigpu

* test fails

* kv cache trying to shrink on sharded axis

* mask None works for scale dot product

* kv cache seems to be working but scale dot product breaks

* scaled dot product works, but the last linear layer failed

* running into the reshape case where it could be wrong for multigpu

* making sure it was the reshape

* adding contiguous doesn't solve

* need to shard more properly

* remove reshape test

* minor adjustment to scale dot product attention test

* weights are sharded wrong

* continue fix new weight sharding

* clean up

* fix attention when start_pos is 0

* remove print

* add TODOs for the best mutigpu interface

* bugfix do not reset shapetracker of 0 size lazybuffer (#3096)

it might be coming from an expand, and resetting results incorrect stride. caught by interpreted backend

* One hot in tensor.py (#3093)

* onehot in Tensor.py

* one_hot tests

* works for all shapes, not just 1

* pylint

* not a static method

* moved around, num_classes mandatory

* pylint

* pylint

* space & moving

* formatting

* moved tests

* fix broadcasted logic if there's 0 in shapes (#3097)

* fix broadcasted logic if there's 0 in shapes

should always expand into 0, not the other way around. fixed matmul with 0 in input shapes.
for forwards for now though, backward is more involved and would need to change 0 size shortcuts

* fix tests

* replace with tensor op (#3099)

* fix gpt2 with empty prompt (#3100)

logits would be empty so need to replace that with ones before sampling, also cannot reshape with -1 when there's 0 in other axes

* Revert "fix gpt2 with empty prompt" (#3101)

* fix gpt2 with empty prompt take 2 (#3102)

logits would be empty so need to replace that with ones before sampling, also cannot reshape with -1 when there's 0 in other axes

* wmma: enable METAL half tensor cores and clean up cstyle (#3095)

* wmma: enable METAL half tensor cores and clean up cstyle

* revert simple_matmul rand changes and break line in tensor

* added metal fp16->fp32 tensor core

* add half @ half to mac benchmark (#3103)

* flag to profile mixtral - 1.7 tok/s now (#3104)

* update NumNode.__hash__ to be hash(self.b) (#3105)

with this, `a:=NumNode(x) == b` implies `hash(a) == hash(b)`

* catch runtime error in search._time_program (#3106)

return inf if search encountered runtime errors.

* no exceptions in __del__ when module creation is failed in hip/cuda (#3107)

* failed test case due to cast resets shapetracker (#3109)

cast implicitly resets shapetracker and makes it contiguous (for disk tensor), which fails for Interpreted backend if inputs contain non-contiguous st.

* cleanup ops_disk type annotation and redundant str cast (#3110)

* minor cleanup of test_disk_tensor (#3112)

* add Tensor.var (#3114)

also updated MeanVarianceNormalization and made test_ops test tensors of var and std smaller

* move sample inside jit for beautiful_mnist (#3115)

also removed .realize() for jit functions since jit does it automatically now. a little more beautiful

* minor cleanups of onnx_ops (#3116)

* fix conversation: llama generates token not prob now (#3120)

* add device options for tests in multigpu (#3121)

* make DType a dataclass (#3111)

* remove np from DType

* convert to dataclass

* remove dunder hash, eq, ne overrides from ImageDType

* is dataclass required for PtrDType?

* fix GPU tests

* reduce lines

* revert changes to np

* minor cleanup

* hotfix: ptrdtype compare was broken

* move fromcpu out of lazy.py (#3122)

* move fromcpu out of lazy.py

* fix abstractions2

* remove numpy from device (#3123)

* remove numpy from device

* fix tests

* np item

* cleanups

* simplify with as_buffer

* no toCPU

* tinygradic

* cast to scalar

* remove numpy from ops_torch (#3124)

updated mnist test to cast label to int8 and avoid hacking cast issue of torch uint8

* Fix backward fn for `<` and `==` (#3037)

* fix no grad fn for < and ==

* remove 2 line breaks

* Remove deprecated autograd variable

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>

* separate try except blocks in onnx2torch in model benchmark (#3126)

exceptions can be raised from either model conversion or individual backend failed. openpilot on torch mps works, but does not work with torch cpu.
seperate the expcetion block so that the benchmark can inlcude torch mps for openpilot.

* update env_vars.md (#3127)

mostly removed deprecated ones. not clear how to maintain this especially for extra/examples

* update test_ptr_ne (#3130)

* remove np from metal graph (#3129)

* dtype fmt (#3132)

* dtype fmt

* three ways to access

* fix off-by-one error in st_equal (#3131)

* fix off by one error

* whitespace

* no numpy (#3134)

* fast resnet eval (#3135)

* fast resnet eval

* fix HIP multidevice graph

* neater expression for devices

* lines

* add decorator test

* remove LLVMOPT

* move ptx

* Update ops_cuda.py

---------

Co-authored-by: Christopher Milan <chrismilan@ucla.edu>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: Yixiang Gao <yixiangg310573@gmail.com>
Co-authored-by: jxdv <virgoj@protonmail.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: SnakeOnex <sheeproman@gmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Jyotirmaya Mahanta <jyotirmaya.mahanta@gmail.com>
Co-authored-by: Guy Leroy <g.m.leroy@outlook.com>
Co-authored-by: Paul Gustafson <paul.gustafson@theambrusgroup.com>
2024-01-15 16:44:20 -08:00
George Hotz
a464909d79 fast resnet eval (#3135)
* fast resnet eval

* fix HIP multidevice graph

* neater expression for devices

* lines

* add decorator test
2024-01-15 14:15:18 -08:00
chenyu
db965a0c74 remove numpy from ops_torch (#3124)
updated mnist test to cast label to int8 and avoid hacking cast issue of torch uint8
2024-01-14 22:46:57 -05:00
chenyu
152ef7fc79 minor cleanups of onnx_ops (#3116) 2024-01-14 02:15:24 -05:00
chenyu
a313e63a9b add Tensor.var (#3114)
also updated MeanVarianceNormalization and made test_ops test tensors of var and std smaller
2024-01-14 01:11:08 -05:00
Francis Lam
ddbdb52f77 wmma: enable METAL half tensor cores and clean up cstyle (#3095)
* wmma: enable METAL half tensor cores and clean up cstyle

* revert simple_matmul rand changes and break line in tensor

* added metal fp16->fp32 tensor core
2024-01-12 16:25:28 -05:00
SnakeOnex
0c49d38ba7 replace with tensor op (#3099) 2024-01-12 14:13:40 -05:00
Yixiang Gao
13e872b53f add mutigpu support for llama attention (#3064)
* add llama attention test for multigpu

* test fails

* kv cache trying to shrink on sharded axis

* mask None works for scale dot product

* kv cache seems to be working but scale dot product breaks

* scaled dot product works, but the last linear layer failed

* running into the reshape case where it could be wrong for multigpu

* making sure it was the reshape

* adding contiguous doesn't solve

* need to shard more properly

* remove reshape test

* minor adjustment to scale dot product attention test

* weights are sharded wrong

* continue fix new weight sharding

* clean up

* fix attention when start_pos is 0

* remove print

* add TODOs for the best mutigpu interface
2024-01-11 16:31:02 -08:00
chenyu
507e0afba0 fix onehot and jit in examples/transformer (#3073)
trained to 0.999 in < 6 seconds on M1 Max consistently
2024-01-10 02:22:41 -05:00
George Hotz
ae83733431 hotfix: examples/transformer.py 2024-01-09 19:28:09 -08:00
chenyu
1d730b8853 remove ACCUM_FP32 in simple_matmul.py (#3045)
* remove ACCUM_FP32 in simple_matmul.py

accumate for half inputs is always in float

* move test llama compile speed to metal
2024-01-08 17:37:57 -05:00
George Hotz
c003be7309 Revert "track size in shapetracker" (#3043)
* Revert "track size in shapetracker (#3026)"

This reverts commit a8ba1ac08f.

* st.size
2024-01-08 13:13:39 -08:00
George Hotz
c5a941d466 webgl backend in extra (#3041)
* WebGL WIP

* 84% of ops passing test

* tests passing 100%

* Cleanup, refactor

* Shave off some lines

* Work on dtypes

* TestOps at 100% again

* Efficient net shaders compile in browser webgl2

* Compile all efficientnet shaders in browser

* Create empty textures for tensor buffers

* Run program. Up next weight loading

* Exported WebGL model working

* Add tests, refactor

* Explicit cast alu for GLSL

* Fix CI tests

* WebGL efficientnet demo

* Compile and run yolov8 in browser

* Fix imports

* Simplify yolo compile

* Fix bool*bool and cast cmplt to float

* More tests

* Do std tests pass on CI?

* Skip std tests on CI

* Remove explicit_cast_alu hack, and solve it in code_for_op

* Move to new dtype-less alloc api

* Remove local size hack: optimize local_size only if device has local

* Remove glsl.py, and move content to cstyle

* dont_use_locals in opts

* Fix dtype tests

* type_map in CStyleLanguage

* Make core changes smaller, cleaner, refactor export_model and demo

* Skip pad_slice

* Simplify: render_const, render_conditional

* solve bool alu for other binops, cleaner ops_webgl

* Fix noopt hack

* Remove some skipIfs

* WebGL image hack

* type_names is a better name

* global_max

* Fix dtype import

* Fix type_names -> type_map

* Fix lint

* Remove webgpu, back to 5k lines (#3040)

* remove webgpu

* max 5000 lines

* revert those to master

* retain that cstyle

---------

Co-authored-by: Ahmed Harmouche <ahmedharmouche92@gmail.com>
2024-01-08 09:29:13 -08:00
George Hotz
8cbcd1b342 Remove webgpu, back to 5k lines (#3040)
* remove webgpu

* max 5000 lines
2024-01-08 09:10:07 -08:00
chenyu
c9371f0d31 hotfix llama conversation mode (#3031)
without contiguous on keys and values, it runs but the update is incorrect
2024-01-06 16:57:07 -05:00
George Hotz
a8ba1ac08f track size in shapetracker (#3026)
* track size in shapetracker

* shapetracker adapter

* size is an int

* create Buffer with st.size

* only compare the views for the jit

* fix webgpu
2024-01-05 20:15:53 -08:00
George Hotz
60abc62a3f fast hip read (#3014)
* fast hip read

* hip read faster

* fix tests

* to_mv

* simplify

* bump to 6k lines
2024-01-05 10:33:13 -08:00
chenyu
f88506e630 move gpt2/llama sampling inside the model call (#3013)
* move gpt2/llama sampling inside the model call

* argmax uses one more kernel
2024-01-04 17:01:50 -05:00
George Hotz
c2a044ed83 disk_read_speed example 2024-01-04 13:59:43 -08:00
Yixiang Gao
8a63f26a0f make LR scheduler work with multigpu (#3011)
* add a failing test for LR scheduler when using multigpu

* fix calculation order and unnecessary tensor created for float

* min_lr is no longer tensor
2024-01-04 12:10:56 -08:00
chenyu
6fa285b943 touchup onnx xor and not (#3008) 2024-01-04 02:02:42 -05:00
geohotstan
57817028bb removed redundant dtype hacks in onnx_ops (#2939)
* updated most dtype hacks in onnx_ops

* temporarily revert dequantizelinear change

* I think this is right...

* MORE FIXES WOOOO NEW DTYPE IS AWESOME

* ok

* oops missed a print

* half -> float32 for CI

* is npdtype

* some more

* fix if ordering

* more clean ups

* final cleanups

* casting to half not allowed

* k nvm

* revert ArgMax change

* only GPU

* llvm begone

* teeny tiny change

* fix: attempt to add cast tests

* try this

* fix dequantizelinear

* revert some stuff

* tests pass pls

* less lines in onnx_tests

* oops missed string tensor tests

* clean up

* try: revert default behavior changes

* fix: disabled Cast and Castlike tests

* docs: small changes

* fix: fixed isNaN op and enabled associated tests

* fix: forgot about float16

* done

* update disabled test

* gah missed another float16

* disable rest of failing tests

* rm extra line

* try...

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-01-04 01:45:24 -05:00
George Hotz
7e191fbb86 hotfix: don't jitcache with 1 kernel. improvements to hip sniffer 2024-01-03 19:17:08 -08:00
George Hotz
753a7ecc05 Hip driver (#2992)
* start hip driver

* fix hip llama

* make HIP default if we can

* don't change those
2024-01-03 12:53:47 -08:00
chenyu
b1d9e54ea3 regenerate kernel ast dataset (#2968)
added back the log ast function and removed hacks that work around the old dataset
2024-01-01 20:26:17 -05:00
George Hotz
a280cfe169 move dtypes to dtype.py (#2964)
* move dtypes to dtype.py

* fix urllib
2024-01-01 14:58:48 -08:00
George Hotz
c81ce9643d move globalcounters to ops (#2960)
* move globalcounters to ops

* missed a few

* sick of that failing
2024-01-01 14:21:02 -08:00
chenyu
ad4472e6e8 cleanup llama apply_rotary_emb and other helpers (#2950)
* cleanup llama apply_rotary_emb and other helpers

used ellipsis and other higher level tensor function.
disabled the half @ half -> half tensor core as it fails uop dtype checks

* keep hip 8x8->8 wmma
2023-12-29 11:39:15 -05:00
chenyu
61e255d197 use max for gpt2 and llama (#2949)
not using argmax yet because there's a multinomial outside of function.
2023-12-28 23:26:00 -05:00