1363 Commits

Author SHA1 Message Date
Comma Device
9e2af0a972 too far with the OPTWG 2023-01-24 13:14:59 -06:00
Comma Device
3590848b93 a little more local workgroup options 2023-01-24 12:50:27 -06:00
Comma Device
4b74752c42 fix hotspots by improving the workgroup optimizer 2023-01-24 12:46:28 -06:00
George Hotz
fd760a390a fix incremental time 2023-01-24 10:19:04 -08:00
George Hotz
a949de873b reduce 2.0 (#469)
* reduce 2.0

* works

* hacks

* DEBUG=3 for shapes

* fix types

* 0s weren't being folded

* cleaner

* last_reduce is no longer needed

* comments and cleanup
2023-01-23 15:11:13 -08:00
George Hotz
f1196984e6 harmless to intertwine the math and the stores 2023-01-21 09:31:56 -08:00
George Hotz
708215d06b Typing (#468)
* we typing

* types look good in theory

* most tests pass

* gpu tests pass

* TEST_AST

* delete comments

* i must have written that bug so many times

* bugfix

* don't merge the small ones

* add f to constants

* commits from reduce

* don't GCD the mod nodes

* broken and a hack IMAGE=3

* group for reduce

* fix linter + mypy

* move out test ast

* insource TENSOR_TYPE_TO_NP_TYPE

* does this fix it?

* move imports out
2023-01-21 09:09:22 -08:00
George Hotz
0881d504c1 move shapetracker (#466)
* move shapetracker

* shapetracker test

* move ast

* move a few things

* fix print kernel

* fix test

* symbolic fixups
2023-01-19 09:56:31 -08:00
George Hotz
9245f4650a indexer changes for master 2023-01-18 18:02:02 -08:00
George Hotz
49c6e6d472 Latest attempt to add image (#462)
* add image

* load + store + boring stuff:

* image tests pass

* thneed print GFLOPS

* op conv test

* more debugging

* hack for multiview image

* shapetracker creates less views

* disable image tests

* working better

* ugh, lkey not key

* print in DEBUG, and allow views

* works

* simple padding conv2d

* use index for image

* that was bad code

* debug print

* fix types

* less lines

* save lines
2023-01-12 17:36:30 -08:00
George Hotz
281b0db773 three from image 2023-01-12 12:26:58 -08:00
George Hotz
9ff6c532eb Prereqs for IMAGE=1 (#461)
* contig

* move ast, debug prog

* add Token

* cleanup reduce

* exec_ast
2023-01-11 20:18:42 -08:00
George Hotz
fff1f046b0 Simple version of the new GPU backend (#458)
* newgpu

* more to delete

* hmm, tests pass with constant folding

* fix lint/type

* fix constant folding

* comment and rerun tests

* lazy touchups

* fix graph_batchnorm test

* smaller transformer to fix OOM

* Revert "smaller transformer to fix OOM"

This reverts commit a44ef8edc2.

* no func cache

* introspect

* touchups

* CLASTKernel

* ugh, it was lru_cache

* codegen

* spacing

* old gpu still in opencl

* typing fix
2023-01-10 19:16:02 -08:00
George Hotz
fad7cba590 move batchnorm to Tensor 2023-01-09 18:00:16 -08:00
George Hotz
4885fce56e shapetracker from newgpu (#456)
* shapetracker from newgpu

* touchup ops

* test

* testst

* thneed deletes unused inputs

* test

* bugfix
2023-01-09 12:40:01 -08:00
George Hotz
b8c94a67c9 Simple chonker (#431)
* chonker will make llvm fast

* work

* better speed tests, we will make them fast

* with the cache add is the same speed

* relu and neg are fast

* fix sum speed

* maximum maxnum?

* hack for gemm opt

* gemm very slow

* zeros like

* test_permute

* shapetracker returns self

* fix shapetracker factorization

* err, int strides

* permutes are faster now in tinygrad than pytorch

* support -1 in expand

* gemm unrolled

* improve final test case

* WIP GEMM

* why isn't GEMM fast?

* revert cache dim

* ffp contract works on clang, not llvm?

* ignore llvm ir

* this makes fma work at least, but no faster

* USE_4x4

* 63 GFLOPS

* 87 GFLOPS

* that wasn't matmul, 44 GFLOPS now

* 82 GFLOPS permuted

* this permute too

* a little speed for the convs

* 45 GFLOPS

* speed tests pass again

* clean up prints

* fix FMA WHAT A WASTE OF TIME

* colors

* moar fair

* GPU

* useless on chonker

* cleanups

* improve factorized shapetracker

* better threshold

* label conv

* work

* ops test pass again

* hot load the index

* run the last view, no need to create

* ZeroView needs a repr for the key to work

* fix segfault on out of bounds

* one more test

* start amx, and llvm.initialize_native_asmparser

* amx works

* nice AMX class

* nicer AMX class

* refactor get_idxs

* amx working

* is slower...

* useless flip

* cache

* SZ_X

* AMX_SZ_X/Y work alone

* Contiguous mlop

* test gemm packed

* PREPARE in packed

* use_amx factor

* prefetch isn't faster

* loop

* same 3ms

* 2.24 ms

* allow double on store in TG

* amx reduce is the same speed as non amx reduce

* include memory bandwidth

* clean up shapetracker

* flip returns stride

* prepare for upstream

* Update ops_llvm.py (#426)

* permutes are yellow and green now

* faster conv

* llvm cleanups

* Show optimised IR under debug 4 (#428)

* ASTKernel class

* Make tinygrad work with older python version (#427)

* Make tinygrad work with older python version

* Use partialmethod instead of partial

* smiple chonker is chonking

* remove junk from test speed vs torch

* fix linker and types

* AMX is only here now

* add LLVM tests, it's a valid backend now

* oops, run llvm test

* contiguous_op

* fix loadops compare

* dedup reduceops

Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>
2022-11-10 23:17:09 -08:00
George Hotz
2cc1d970c6 updates from the chonker branch 2022-11-07 21:12:08 -08:00
George Hotz
d878065ece Gemm (#416)
* gemm

* off by factor of 5

* 50 GFLOPS

* works

* 91 gflops

* working at 50G

* works

* iy

* 150 GFLOPS

* 150 GFLOPS

* N=2048 is still fast

* threading soon

* multithread

* pinning

* throttling is sad

* Align matrices to cacheline width (#361)

Co-authored-by: cloud <Cloud11665@gmail.com>
2022-11-06 10:07:28 -08:00
George Hotz
6a8fb53304 move ops.py into lazy.py (#402)
* move ops.py into lazy.py

* fix graph and linter

* ugh, didn't add
2022-10-25 13:58:03 -07:00
George Hotz
8e22d5ee67 replace networkx with defaultdict 2022-10-20 19:36:43 -07:00
George Hotz
63f9c55156 really dumb bug 2022-10-20 17:07:47 -07:00
George Hotz
1bec4651b3 fix nonstatic weights 2022-10-20 17:04:14 -07:00
George Hotz
bb288e6938 safe_numpy and warning for broken matmul 2022-10-20 15:40:22 -07:00
George Hotz
50c95c7d9a add assert to catch issue in attention 2022-10-20 15:13:00 -07:00
George Hotz
26c78ccf7d remove useless buffer 2022-10-20 14:07:28 -07:00
George Hotz
a18c1f3178 zero out the inputs 2022-10-20 13:46:52 -07:00
George Hotz
ace8db29f8 ReduceSum 2022-10-20 12:48:14 -07:00
George Hotz
c400ee0beb refactoring thneed (#400)
* refactoring thneed

* continue

* minor update

* looks like it's working

* big refactor

* confirm thneed got the right output

* code is there but it's broken

* works now

* always OPTWG, input -> dat

* fix type issue
2022-10-20 12:35:59 -07:00
YassineYousfi
ae0f9b17df openpilot: new models and onnx ops (#401)
* ngrl stuff

* fngrl

* fix typo in compile script

* workflow dispatch

* new models in tests

* dont need to up this threshold

Co-authored-by: HaraldSchafer <harald.the.engineer@gmail.com>
2022-10-20 11:49:19 -07:00
George Hotz
ff11c4316b move get_parameters to optim.py 2022-09-25 13:16:58 -04:00
Jacky Lee
2c01a66265 Reshape dataset from fetch_mnist (#390) 2022-09-24 21:16:29 -04:00
George Hotz
271446e3eb set requires_grad to None (#387)
* set requires_grad to None

* some things need gradients

* hmm, why was get_parameters filtering
2022-09-21 11:16:02 -04:00
YassineYousfi
2f0f91ba3d support float16 onnx weights (#384) 2022-09-15 09:12:18 -04:00
YassineYousfi
1a7bdc51f8 support more onnx ops (#376)
* broadcast from right to left

* add another broadcasted add test

* more onnx ops

* use float32 range in clip
2022-09-07 15:15:24 -07:00
George Hotz
0516359af8 fix stupid OPENCL=1 OOM 2022-09-06 14:29:23 -07:00
George Hotz
4dadd95e3c fix tests hopefully, more stable diffusion 2022-09-03 10:38:31 -07:00
George Hotz
c01a8c5c2d stable diffusion start 2022-09-03 10:08:42 -07:00
George Hotz
a3fc64a585 fix batchnorm folding in openpilot compile 2022-08-31 13:04:49 -07:00
George Hotz
dc7af8c3ac thneed run float32 2022-08-28 11:03:35 -07:00
George Hotz
b132de677d tinygrad.nn (#367)
* tinygrad.nn

* flake8

* working on pylint

* more pylint

* more pylint

* pylint passes

* networkx

* mypy can't infer that type

* junk
2022-08-18 07:41:00 -07:00
George Hotz
f76d41812b prune graph 2022-07-17 15:38:43 -07:00
George Hotz
eda6f071b2 default opt level 2 2022-07-17 14:54:40 -07:00
George Hotz
73b0471b25 join expands 2022-07-17 13:42:05 -07:00
George Hotz
d04b274cd2 noop removal can replace with reshape 2022-07-16 08:32:42 -07:00
George Hotz
2720ef49ca extra and test and tuple 2022-07-07 10:01:33 -07:00
George Hotz
81b73f97a3 Optiimzation (#355)
* constant folding into kernels

* that opt worth it?

* fix mypy

* ast one kernel

* save 2 lines in conv kernel

* debug print kernel count

* cl debugging

* early realize inputs

* refactor Device
2022-07-04 08:58:57 -07:00
George Hotz
7276f8d6bf improve constant folding, detach before moving tensor 2022-07-02 15:29:40 -07:00
George Hotz
8cf1aed0f4 don't track_running_stats, parameters must require_grad 2022-07-02 14:38:45 -07:00
George Hotz
49c954b389 comments 2022-06-26 17:20:25 -07:00
George Hotz
83d50e2687 move to extra.onnx 2022-06-21 19:43:44 -07:00