Commit Graph

119 Commits

Author SHA1 Message Date
George Hotz
0881d504c1 move shapetracker (#466)
* move shapetracker

* shapetracker test

* move ast

* move a few things

* fix print kernel

* fix test

* symbolic fixups
2023-01-19 09:56:31 -08:00
George Hotz
3a3400e3a2 more from indexer 2023-01-18 18:11:51 -08:00
George Hotz
f1378b3ea1 fix linter and force allocation on hostbuf 2023-01-11 21:04:21 -08:00
George Hotz
3ea38cac72 IMAGE == 1, add reshape to the ast 2023-01-11 20:56:03 -08:00
George Hotz
74fb772e2a rawcpu is junk, replace with llvm 2023-01-10 19:18:45 -08:00
George Hotz
fff1f046b0 Simple version of the new GPU backend (#458)
* newgpu

* more to delete

* hmm, tests pass with constant folding

* fix lint/type

* fix constant folding

* comment and rerun tests

* lazy touchups

* fix graph_batchnorm test

* smaller transformer to fix OOM

* Revert "smaller transformer to fix OOM"

This reverts commit a44ef8edc2.

* no func cache

* introspect

* touchups

* CLASTKernel

* ugh, it was lru_cache

* codegen

* spacing

* old gpu still in opencl

* typing fix
2023-01-10 19:16:02 -08:00
George Hotz
4885fce56e shapetracker from newgpu (#456)
* shapetracker from newgpu

* touchup ops

* test

* testst

* thneed deletes unused inputs

* test

* bugfix
2023-01-09 12:40:01 -08:00
George Hotz
5e07d4669d the speedy chonker is going to replace the old chonker (#432)
* bringing back reshape and permute

* done with E701

* 4x4 works in generic way

* max and sum not vectorizing...

* special case single float

* support comparing to MPS

* improve matmul speed, consider generic principles

* GlobalCounter

* fix op tracking

* faster

* comment that out for now

* err, it needs that

* fix minor issues

* fix global_mem
2022-11-11 18:34:24 -08:00
George Hotz
d2273d2cc4 s/contiguous_op/contiguous 2022-11-11 00:07:05 -08:00
George Hotz
b8c94a67c9 Simple chonker (#431)
* chonker will make llvm fast

* work

* better speed tests, we will make them fast

* with the cache add is the same speed

* relu and neg are fast

* fix sum speed

* maximum maxnum?

* hack for gemm opt

* gemm very slow

* zeros like

* test_permute

* shapetracker returns self

* fix shapetracker factorization

* err, int strides

* permutes are faster now in tinygrad than pytorch

* support -1 in expand

* gemm unrolled

* improve final test case

* WIP GEMM

* why isn't GEMM fast?

* revert cache dim

* ffp contract works on clang, not llvm?

* ignore llvm ir

* this makes fma work at least, but no faster

* USE_4x4

* 63 GFLOPS

* 87 GFLOPS

* that wasn't matmul, 44 GFLOPS now

* 82 GFLOPS permuted

* this permute too

* a little speed for the convs

* 45 GFLOPS

* speed tests pass again

* clean up prints

* fix FMA WHAT A WASTE OF TIME

* colors

* moar fair

* GPU

* useless on chonker

* cleanups

* improve factorized shapetracker

* better threshold

* label conv

* work

* ops test pass again

* hot load the index

* run the last view, no need to create

* ZeroView needs a repr for the key to work

* fix segfault on out of bounds

* one more test

* start amx, and llvm.initialize_native_asmparser

* amx works

* nice AMX class

* nicer AMX class

* refactor get_idxs

* amx working

* is slower...

* useless flip

* cache

* SZ_X

* AMX_SZ_X/Y work alone

* Contiguous mlop

* test gemm packed

* PREPARE in packed

* use_amx factor

* prefetch isn't faster

* loop

* same 3ms

* 2.24 ms

* allow double on store in TG

* amx reduce is the same speed as non amx reduce

* include memory bandwidth

* clean up shapetracker

* flip returns stride

* prepare for upstream

* Update ops_llvm.py (#426)

* permutes are yellow and green now

* faster conv

* llvm cleanups

* Show optimised IR under debug 4 (#428)

* ASTKernel class

* Make tinygrad work with older python version (#427)

* Make tinygrad work with older python version

* Use partialmethod instead of partial

* smiple chonker is chonking

* remove junk from test speed vs torch

* fix linker and types

* AMX is only here now

* add LLVM tests, it's a valid backend now

* oops, run llvm test

* contiguous_op

* fix loadops compare

* dedup reduceops

Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>
2022-11-10 23:17:09 -08:00
George Hotz
8c849e637c that was in there twice, DEBUG>=4 to see loop opt 2022-10-30 15:31:39 -07:00
George Hotz
cfdf803b52 fix llvm vectorization by add analysis passes from the target machine 2022-10-30 15:28:36 -07:00
George Hotz
2f602a92ff seperate STRIDED and EXPAND 2022-10-30 13:23:58 -07:00
George Hotz
4b6097f81d more amx notes 2022-10-29 14:04:10 -07:00
George Hotz
fdb43fe553 gemm is 1.7 TFLOPS on a single M1 core 2022-10-29 13:42:33 -07:00
George Hotz
52bfbc31be vectorization 2022-10-29 12:47:52 -07:00
George Hotz
e473d35f90 llvm doesn't vectorize 2022-10-29 11:59:48 -07:00
George Hotz
86eb06eb76 accurate flop estimation 2022-10-28 19:13:20 -07:00
George Hotz
dd543fbc7a MovementOps is unused 2022-10-28 18:26:08 -07:00
George Hotz
71b336503f no RESHAPEs in the AST 2022-10-28 18:25:30 -07:00
George Hotz
b65b70812a Exec AST (#404)
* working exec ast

* exec_ast is staticmethod

* GenericExecAST

* fold that sometimes

* ExplicitExecAST

* exec_ast for GPU

* gpu working

* get_lazyop_shape

* now gpubuffer is ExplicitExecAST

* dedup

* add a type

* RESHAPE in opencl code

* fix linter

* that too for linter

* cleanups

* remove dead code

* GenericShape is less lines

* add ALLOWED_KERNEL_COUNT to tests

* fix mypy

* that's gotta be recursive

* fix opencl shape processing

* remove unneeded lambda
2022-10-28 08:27:03 -07:00
George Hotz
6a15fd3844 LLVM Backend take 2 (#403)
* take 2 llvm

* get_lazybuffers -> get_buffers

* llvm tests pass

* fix type issues and refactor LLVM
2022-10-26 20:32:31 -07:00
George Hotz
6a8fb53304 move ops.py into lazy.py (#402)
* move ops.py into lazy.py

* fix graph and linter

* ugh, didn't add
2022-10-25 13:58:03 -07:00
George Hotz
1bec4651b3 fix nonstatic weights 2022-10-20 17:04:14 -07:00
George Hotz
9f8c414589 might fix tests 2022-10-20 16:27:11 -07:00
George Hotz
fd6ba8e7ac don't recopy backing 2022-10-20 16:06:11 -07:00
George Hotz
0514594083 fix openpilot test 2022-10-20 11:56:26 -07:00
George Hotz
b7f748c15a Fix GPU 2**31 virtual size limit (#392)
* in progress

* big conv test works

* that's unneeded

* fix opencl with reduce

* rewrite contiguous_view_constant_fold

* clean up mids in loop code

* subidx

* print cl kernel before run

* no reduce, no loop

* Revert "no reduce, no loop"

This reverts commit 92777e40e9.
2022-10-05 00:55:20 -04:00
George Hotz
8382c51c12 always MATMUL, test the ops in OPENCL 2022-10-01 13:31:29 -04:00
Ollin Boer Bohan
3b1767e013 Fix OpenCL Metal texture issues (#378)
* Fix OpenCL Metal texture issues

Tile CL images when needed, to fit into the 16384 max Metal image size;
gets me to ~4.8s/iteration for SD on M1 Pro with OPENCL=1 FLOAT16=1.

* Minor cleanup

* Fix mish in CI, or no-op?

* Is mish being framed?

* It would help if any of this reproduced locally

* ???

* OPT is reverted; use original mish

* Cleanup post-review

* Fix some shape usage

* Tiler tests, shouldn't oom or overflow either

* Can't CL if there's no CL?

* Run tiler tests even if GPU=1

* relu6 segfault binary chop; revert test

* relu6 segfault binary chop; revert accel

* relu6 segfault binary chop; revert . (???)

* end relu6 segfault binary chop; repo's haunted
2022-09-29 01:21:54 -04:00
Comma Device
75f937227a add barrier 2022-09-13 11:39:48 -04:00
George Hotz
3c3534736e fix matmul kernel and tests 2022-09-13 08:31:04 -07:00
Comma Device
62e9419206 fix test failure on MATMUL=1 backward pass 2022-09-13 11:18:52 -04:00
George Hotz
0516359af8 fix stupid OPENCL=1 OOM 2022-09-06 14:29:23 -07:00
George Hotz
f215534a64 1100 lines, but sane linter rules 2022-09-06 13:47:45 -07:00
George Hotz
f683b26eef bring back native exp log 2022-09-06 07:59:04 -07:00
George Hotz
d6f499fd69 improve opencl, why is it OOMing 2022-09-05 20:14:31 -07:00
Comma Device
c07bf72d6e save free 200ms 2022-08-31 20:31:42 -04:00
Comma Device
a734df98fa TEST_ENET for openpilot compiler 2022-08-31 13:23:36 -04:00
George Hotz
e194ae0c1d typos 2022-08-30 19:52:21 -07:00
George Hotz
5efab7cf1d add reciprocal 2022-08-29 18:00:24 -07:00
George Hotz
dc7af8c3ac thneed run float32 2022-08-28 11:03:35 -07:00
Comma Device
9678cb8a1a hmm, the native exp/log breaks it too much 2022-08-22 17:13:08 -07:00
George Hotz
2162cd3383 fix typing 2022-08-22 16:25:15 -07:00
Comma Device
e0a8d0f836 image input works 2022-08-22 16:04:17 -07:00
George Hotz
18340e7d30 remove from_image 2022-08-22 15:52:26 -07:00
Comma Device
1b5f4e52d9 refactor getters 2022-08-22 13:29:08 -07:00
George Hotz
a8734df030 add openpilot tests to tinygrad 2022-08-21 12:03:37 -07:00
George Hotz
b132de677d tinygrad.nn (#367)
* tinygrad.nn

* flake8

* working on pylint

* more pylint

* more pylint

* pylint passes

* networkx

* mypy can't infer that type

* junk
2022-08-18 07:41:00 -07:00
George Hotz
783c120a8c rawcpu (#365)
* rawcpu

* add should work when we respect shapetracker

* now that's true

* still have to handle shapetracker

* copyin

* Fix mypy
2022-08-17 11:33:20 +02:00