4505 Commits

Author SHA1 Message Date
George Hotz
a6de94b444 test partial sum 2023-01-22 21:28:40 -08:00
George Hotz
708215d06b Typing (#468)
* we typing

* types look good in theory

* most tests pass

* gpu tests pass

* TEST_AST

* delete comments

* i must have written that bug so many times

* bugfix

* don't merge the small ones

* add f to constants

* commits from reduce

* don't GCD the mod nodes

* broken and a hack IMAGE=3

* group for reduce

* fix linter + mypy

* move out test ast

* insource TENSOR_TYPE_TO_NP_TYPE

* does this fix it?

* move imports out
2023-01-21 09:09:22 -08:00
George Hotz
b29614592a first conv/second conv 2023-01-19 13:26:11 -08:00
George Hotz
3d697577b2 print_ast 2023-01-19 13:22:03 -08:00
George Hotz
0881d504c1 move shapetracker (#466)
* move shapetracker

* shapetracker test

* move ast

* move a few things

* fix print kernel

* fix test

* symbolic fixups
2023-01-19 09:56:31 -08:00
George Hotz
2b47ee401f Symbolic for indexes (#464)
* indexer

* works

* all use indexer

* boolean in the indexer too

* symbolic is a better name than indexer

* better symbolic API

* min and max

* symbolic tests

* work

* more tests

* fix demodder

* __str__ in the superclass

* NumNode

* awesome that works

* still works

* fix up parens

* fix zeroviews

* dead lines

* expr_node

* works

* still works

* refactor to not use __new__ methods

* ugh something went wrong a while ago

* this fixes it

* mod and div at the end

* test

* symbolic

* working

* one linter issue fixed

* other division

* more simplifys

* works

* validhacks

* VALIDHACKS passes thneed

* no str replace stuff

* inline indexes

* NATIVE_EXPLOG and factoring

* factor both ways

* cl indexing

* split on mod, not just full

* onnxlimit

* fix output shape

* op_estimate is a function of the program

* no ones in the index

* four_float4

* ALLOW_4FLOAT4

* test passes

* compute then store

* loads first

* bugfix

* better, but doesn't match

* select xb in smart way

* new test and bugfix

* no change to lazy

* Node fixes linter

* fix opencl with op_estimate

* fix mypy

* revert valid

* remove unused
2023-01-19 07:21:30 -08:00
George Hotz
9245f4650a indexer changes for master 2023-01-18 18:02:02 -08:00
George Hotz
287699c32c simplify ones after axis splitting 2023-01-14 10:51:43 -08:00
George Hotz
49c6e6d472 Latest attempt to add image (#462)
* add image

* load + store + boring stuff:

* image tests pass

* thneed print GFLOPS

* op conv test

* more debugging

* hack for multiview image

* shapetracker creates less views

* disable image tests

* working better

* ugh, lkey not key

* print in DEBUG, and allow views

* works

* simple padding conv2d

* use index for image

* that was bad code

* debug print

* fix types

* less lines

* save lines
2023-01-12 17:36:30 -08:00
George Hotz
fff1f046b0 Simple version of the new GPU backend (#458)
* newgpu

* more to delete

* hmm, tests pass with constant folding

* fix lint/type

* fix constant folding

* comment and rerun tests

* lazy touchups

* fix graph_batchnorm test

* smaller transformer to fix OOM

* Revert "smaller transformer to fix OOM"

This reverts commit a44ef8edc2.

* no func cache

* introspect

* touchups

* CLASTKernel

* ugh, it was lru_cache

* codegen

* spacing

* old gpu still in opencl

* typing fix
2023-01-10 19:16:02 -08:00
George Hotz
bfd4f4e35c testdocker 2023-01-09 12:41:52 -08:00
George Hotz
4885fce56e shapetracker from newgpu (#456)
* shapetracker from newgpu

* touchup ops

* test

* testst

* thneed deletes unused inputs

* test

* bugfix
2023-01-09 12:40:01 -08:00
cloud11665
4fb97b8de0 don't fail when termcolor is not installed (#436) 2022-11-14 16:45:06 -08:00
George Hotz
5e07d4669d the speedy chonker is going to replace the old chonker (#432)
* bringing back reshape and permute

* done with E701

* 4x4 works in generic way

* max and sum not vectorizing...

* special case single float

* support comparing to MPS

* improve matmul speed, consider generic principles

* GlobalCounter

* fix op tracking

* faster

* comment that out for now

* err, it needs that

* fix minor issues

* fix global_mem
2022-11-11 18:34:24 -08:00
George Hotz
b8c94a67c9 Simple chonker (#431)
* chonker will make llvm fast

* work

* better speed tests, we will make them fast

* with the cache add is the same speed

* relu and neg are fast

* fix sum speed

* maximum maxnum?

* hack for gemm opt

* gemm very slow

* zeros like

* test_permute

* shapetracker returns self

* fix shapetracker factorization

* err, int strides

* permutes are faster now in tinygrad than pytorch

* support -1 in expand

* gemm unrolled

* improve final test case

* WIP GEMM

* why isn't GEMM fast?

* revert cache dim

* ffp contract works on clang, not llvm?

* ignore llvm ir

* this makes fma work at least, but no faster

* USE_4x4

* 63 GFLOPS

* 87 GFLOPS

* that wasn't matmul, 44 GFLOPS now

* 82 GFLOPS permuted

* this permute too

* a little speed for the convs

* 45 GFLOPS

* speed tests pass again

* clean up prints

* fix FMA WHAT A WASTE OF TIME

* colors

* moar fair

* GPU

* useless on chonker

* cleanups

* improve factorized shapetracker

* better threshold

* label conv

* work

* ops test pass again

* hot load the index

* run the last view, no need to create

* ZeroView needs a repr for the key to work

* fix segfault on out of bounds

* one more test

* start amx, and llvm.initialize_native_asmparser

* amx works

* nice AMX class

* nicer AMX class

* refactor get_idxs

* amx working

* is slower...

* useless flip

* cache

* SZ_X

* AMX_SZ_X/Y work alone

* Contiguous mlop

* test gemm packed

* PREPARE in packed

* use_amx factor

* prefetch isn't faster

* loop

* same 3ms

* 2.24 ms

* allow double on store in TG

* amx reduce is the same speed as non amx reduce

* include memory bandwidth

* clean up shapetracker

* flip returns stride

* prepare for upstream

* Update ops_llvm.py (#426)

* permutes are yellow and green now

* faster conv

* llvm cleanups

* Show optimised IR under debug 4 (#428)

* ASTKernel class

* Make tinygrad work with older python version (#427)

* Make tinygrad work with older python version

* Use partialmethod instead of partial

* simple chonker is chonking

* remove junk from test speed vs torch

* fix linker and types

* AMX is only here now

* add LLVM tests, it's a valid backend now

* oops, run llvm test

* contiguous_op

* fix loadops compare

* dedup reduceops

Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>
2022-11-10 23:17:09 -08:00
George Hotz
1271f19a2b factorizing shapetracker from chonker 2022-11-09 16:36:38 -08:00
George Hotz
9781b4c3af rename test functions to helper_ 2022-11-07 21:27:56 -08:00
George Hotz
9884be2ad5 ugh, that too 2022-11-07 21:21:35 -08:00
George Hotz
537a9eb414 fix termcolor import 2022-11-07 21:19:08 -08:00
George Hotz
2cc1d970c6 updates from the chonker branch 2022-11-07 21:12:08 -08:00
George Hotz
db2da22a04 stop blowing up floats 2022-10-30 16:47:16 -07:00
George Hotz
8afc643bb1 fix bug in ops test, it was cheating somehow 2022-10-30 16:43:24 -07:00
George Hotz
2f602a92ff separate STRIDED and EXPAND 2022-10-30 13:23:58 -07:00
George Hotz
544cb0a069 oops, remove while(1) 2022-10-29 14:05:13 -07:00
George Hotz
fdb43fe553 gemm is 1.7 TFLOPS on a single M1 core 2022-10-29 13:42:33 -07:00
George Hotz
52bfbc31be vectorization 2022-10-29 12:47:52 -07:00
George Hotz
e473d35f90 llvm doesn't vectorize 2022-10-29 11:59:48 -07:00
George Hotz
7909786dbf one more opt test 2022-10-28 18:37:53 -07:00
George Hotz
294ab9e2f8 more test opt 2022-10-28 18:04:12 -07:00
George Hotz
f885ceb695 test speed w/o bias 2022-10-28 11:22:15 -07:00
George Hotz
df31dde174 hasattr and DeviceBuffer type fixups 2022-10-28 09:05:45 -07:00
George Hotz
b65b70812a Exec AST (#404)
* working exec ast

* exec_ast is staticmethod

* GenericExecAST

* fold that sometimes

* ExplicitExecAST

* exec_ast for GPU

* gpu working

* get_lazyop_shape

* now gpubuffer is ExplicitExecAST

* dedup

* add a type

* RESHAPE in opencl code

* fix linter

* that too for linter

* cleanups

* remove dead code

* GenericShape is less lines

* add ALLOWED_KERNEL_COUNT to tests

* fix mypy

* that's gotta be recursive

* fix opencl shape processing

* remove unneeded lambda
2022-10-28 08:27:03 -07:00
George Hotz
6a15fd3844 LLVM Backend take 2 (#403)
* take 2 llvm

* get_lazybuffers -> get_buffers

* llvm tests pass

* fix type issues and refactor LLVM
2022-10-26 20:32:31 -07:00
George Hotz
10921a60c4 more imports from llvm branch 2022-10-26 18:02:36 -07:00
Drew Hintz
a4ad1d774a enable tests in test_ops.py that are disabled but now work. (#396)
remove custom tolerances that don't appear to be needed.
2022-10-13 09:58:53 -07:00
George Hotz
793edf8900 touchup 2022-10-10 16:13:34 -07:00
George Hotz
d54a45b50d measure speed vs torch 2022-10-10 16:06:00 -07:00
George Hotz
b7f748c15a Fix GPU 2**31 virtual size limit (#392)
* in progress

* big conv test works

* that's unneeded

* fix opencl with reduce

* rewrite contiguous_view_constant_fold

* clean up mids in loop code

* subidx

* print cl kernel before run

* no reduce, no loop

* Revert "no reduce, no loop"

This reverts commit 92777e40e9.
2022-10-05 00:55:20 -04:00
George Hotz
392e57aea7 ugh, why did that fail 2022-10-01 13:38:43 -04:00
George Hotz
7a61dc7ee9 test_sd_big_conv 2022-10-01 13:26:05 -04:00
Ollin Boer Bohan
3b1767e013 Fix OpenCL Metal texture issues (#378)
* Fix OpenCL Metal texture issues

Tile CL images when needed, to fit into the 16384 max Metal image size;
gets me to ~4.8s/iteration for SD on M1 Pro with OPENCL=1 FLOAT16=1.

* Minor cleanup

* Fix mish in CI, or no-op?

* Is mish being framed?

* It would help if any of this reproduced locally

* ???

* OPT is reverted; use original mish

* Cleanup post-review

* Fix some shape usage

* Tiler tests, shouldn't oom or overflow either

* Can't CL if there's no CL?

* Run tiler tests even if GPU=1

* relu6 segfault binary chop; revert test

* relu6 segfault binary chop; revert accel

* relu6 segfault binary chop; revert . (???)

* end relu6 segfault binary chop; repo's haunted
2022-09-29 01:21:54 -04:00
George Hotz
e737513c52 external_test_opt 2022-09-28 23:29:41 -04:00
George Hotz
650c011646 notrain test 2022-09-28 23:27:20 -04:00
George Hotz
af87d692e4 should this be 10? 2022-09-28 23:25:52 -04:00
George Hotz
0fd459b24e ugh, global state 2022-09-28 23:10:49 -04:00
George Hotz
fa4eff9cc1 Device.GPU isn't defined 2022-09-28 23:00:15 -04:00
George Hotz
0b6537a572 fix tests 2022-09-28 22:57:58 -04:00
George Hotz
726cca78cd fix bn folding issue, add new test 2022-09-28 22:52:18 -04:00
George Hotz
60df954377 Fix weight init: this work? (#391)
* this work?

* glorot uniform

* requires_grad broke

* propagate the None correctly

* so this weight init works

* ahh, i think it's this

* can't beat this

* glorot is best for ae

* remove comments
2022-09-25 16:46:33 -04:00
George Hotz
271446e3eb set requires_grad to None (#387)
* set requires_grad to None

* some things need gradients

* hmm, why was get_parameters filtering
2022-09-21 11:16:02 -04:00