Commit Graph

4667 Commits

Author SHA1 Message Date
George Hotz
bd8a5c2ced Simple CUDA Runtime (#480)
* factor out opencl runtime

* don't use CL outside the runtime

* cuda runtime adds

* final_dimension

* tests pass with CUDA backend

* more cuda

* cuda simpler

* retain old functionality

* linter and typing

* move globalcounters out of runtimes

* oops, GlobalCounters in cuda

* MAX_OUTPUT_SHAPE=3 is fine for CUDA
2023-01-27 16:26:24 -08:00
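For context, a minimal sketch of what a thin CUDA runtime layer has to do (compile a kernel from source, allocate device buffers, launch a grid). It uses pycuda purely for illustration; this is not the code from #480:

```python
import numpy as np
import pycuda.autoinit                      # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# compile a tiny elementwise kernel from CUDA C source
mod = SourceModule(r"""
__global__ void add(float *out, const float *a, const float *b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = a[i] + b[i];
}
""")
add = mod.get_function("add")

n = 1 << 20
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
out = np.empty_like(a)

# device buffers + host->device copies
a_gpu, b_gpu, out_gpu = (cuda.mem_alloc(x.nbytes) for x in (a, b, out))
cuda.memcpy_htod(a_gpu, a)
cuda.memcpy_htod(b_gpu, b)

# launch, then copy the result back and check it
threads = 256
add(out_gpu, a_gpu, b_gpu, np.int32(n),
    block=(threads, 1, 1), grid=((n + threads - 1) // threads, 1))
cuda.memcpy_dtoh(out, out_gpu)
assert np.allclose(out, a + b)
```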
George Hotz
1b624a5051 DeviceBuffer has abstract methods 2023-01-25 19:16:23 -08:00
George Hotz
44e96c58b4 touch up pytorch speed tests 2023-01-25 18:11:26 -08:00
calledit
a0af1045bf Some new tests (#440)
* Make test run

* Added new tests: sub pow constant_sub

* Fix indentation

* Added one too many lines

* Fix indentation

* Update test_cl_tiler.py

* Delete test_cl_tiler.py
2023-01-25 15:40:19 -08:00
George Hotz
e37424424f first little attempt at search 2023-01-25 11:49:29 -08:00
George Hotz
335a261a2e test for slow kernel 2023-01-25 10:25:22 -08:00
George Hotz
487685919b Revert "Rename Normalize and move to nn (#415)" (#474)
This reverts commit d768acb6a9.
2023-01-25 07:50:04 -08:00
Jacky Lee
d768acb6a9 Rename Normalize and move to nn (#415)
* Rename Normalize and move to nn

* Fix comparison to None error

* Add test for GroupNorm

* Rename test case

* Flip parameters to match PyTorch

* Increase error tolerance

* Fix elementwise_affine on channels

* Match arguments with PyTorch

* Initialize weight and bias only when affine is true

* Is this it?

* A bit cleaner

* Handle case where weight or bias is None
2023-01-25 07:47:59 -08:00
George Hotz
6d7658db12 delete opencl <celebration> 2023-01-24 14:18:35 -08:00
George Hotz
5d350d4883 the ast test is actually a test now 2023-01-24 07:53:24 -08:00
George Hotz
6fe9edf30f torch cuda is very fast 2023-01-23 16:24:46 -08:00
George Hotz
a949de873b reduce 2.0 (#469)
* reduce 2.0

* works

* hacks

* DEBUG=3 for shapes

* fix types

* 0s weren't being folded

* cleaner

* last_reduce is no longer needed

* comments and cleanup
2023-01-23 15:11:13 -08:00
George Hotz
a6de94b444 test partial sum 2023-01-22 21:28:40 -08:00
George Hotz
708215d06b Typing (#468)
* we typing

* types look good in theory

* most tests pass

* gpu tests pass

* TEST_AST

* delete comments

* i must have written that bug so many times

* bugfix

* don't merge the small ones

* add f to constants

* commits from reduce

* don't GCD the mod nodes

* broken and a hack IMAGE=3

* group for reduce

* fix linter + mypy

* move out test ast

* insource TENSOR_TYPE_TO_NP_TYPE

* does this fix it?

* move imports out
2023-01-21 09:09:22 -08:00
George Hotz
b29614592a first conv/second conv 2023-01-19 13:26:11 -08:00
George Hotz
3d697577b2 print_ast 2023-01-19 13:22:03 -08:00
George Hotz
0881d504c1 move shapetracker (#466)
* move shapetracker

* shapetracker test

* move ast

* move a few things

* fix print kernel

* fix test

* symbolic fixups
2023-01-19 09:56:31 -08:00
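As an aside on what a "shapetracker" is doing in these commits, here is a toy sketch of the shape/strides bookkeeping, where movement ops like permute only reorder strides and a logical index maps to a flat offset. Illustrative only; not tinygrad's ShapeTracker API:

```python
import numpy as np

def strides_for(shape):
    # row-major (C-order) strides, in elements
    strides, acc = [], 1
    for s in reversed(shape):
        strides.append(acc)
        acc *= s
    return tuple(reversed(strides))

def offset(idx, strides):
    # logical index -> flat memory offset
    return sum(i * st for i, st in zip(idx, strides))

shape   = (2, 3, 4)
strides = strides_for(shape)                    # (12, 4, 1)

# a permute reorders shape and strides; the underlying data never moves
perm     = (2, 0, 1)
pstrides = tuple(strides[p] for p in perm)      # (1, 12, 4)

x = np.arange(24).reshape(shape)
assert x.transpose(perm)[3, 1, 2] == x.flat[offset((3, 1, 2), pstrides)]
```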
George Hotz
2b47ee401f Symbolic for indexes (#464)
* indexer

* works

* all use indexer

* boolean in the indexer too

* symbolic is a better name than indexer

* better symbolic API

* min and max

* symbolic tests

* work

* more tests

* fix demodder

* __str__ in the superclass

* NumNode

* awesome that works

* still works

* fix up parens

* fix zeroviews

* dead lines

* expr_node

* works

* still works

* refactor to not use __new__ methods

* ugh something went wrong a while ago

* this fixes it

* mod and div at the end

* test

* symbolic

* working

* one linter issue fixed

* other division

* more simplifys

* works

* validhacks

* VALIDHACKS passes thneed

* no str replace stuff

* inline indexes

* NATIVE_EXPLOG and factoring

* factor both ways

* cl indexing

* split on mod, not just full

* onnxlimit

* fix output shape

* op_estimate is a function of the program

* no ones in the index

* four_float4

* ALLOW_4FLOAT4

* test passes

* compute then store

* loads first

* bugfix

* better, but doesn't match

* select xb in smart way

* new test and bugfix

* no change to lazy

* Node fixes linter

* fix opencl with op_estimate

* fix mypy

* revert valid

* remove unused
2023-01-19 07:21:30 -08:00
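A toy sketch of the range-based simplification a symbolic index engine like this performs (the Variable class here is an illustrative stand-in, not the PR's actual classes): a modulo or floor division can be folded away whenever the variable's whole range already fits under the modulus or divisor:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variable:
    name: str
    min: int
    max: int            # inclusive upper bound on the variable's value

def mod(v: Variable, m: int) -> Variable:
    # if the whole range already fits below the modulus, `v % m` is just `v`
    if 0 <= v.min and v.max < m:
        return v
    return Variable(f"({v.name}%{m})", 0, m - 1)

def floordiv(v: Variable, d: int) -> Variable:
    # if the whole range is below the divisor, `v // d` folds to the constant 0
    if 0 <= v.min and v.max < d:
        return Variable("0", 0, 0)
    return Variable(f"({v.name}//{d})", v.min // d, v.max // d)

gidx = Variable("gidx", 0, 31)
print(mod(gidx, 64).name)       # gidx  (the modulo disappears from the kernel)
print(floordiv(gidx, 64).name)  # 0     (the division folds to a constant)
```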
George Hotz
9245f4650a indexer changes for master 2023-01-18 18:02:02 -08:00
George Hotz
287699c32c simplify ones after axis splitting 2023-01-14 10:51:43 -08:00
George Hotz
49c6e6d472 Latest attempt to add image (#462)
* add image

* load + store + boring stuff

* image tests pass

* thneed print GFLOPS

* op conv test

* more debugging

* hack for multiview image

* shapetracker creates less views

* disable image tests

* working better

* ugh, lkey not key

* print in DEBUG, and allow views

* works

* simple padding conv2d

* use index for image

* that was bad code

* debug print

* fix types

* less lines

* save lines
2023-01-12 17:36:30 -08:00
George Hotz
fff1f046b0 Simple version of the new GPU backend (#458)
* newgpu

* more to delete

* hmm, tests pass with constant folding

* fix lint/type

* fix constant folding

* comment and rerun tests

* lazy touchups

* fix graph_batchnorm test

* smaller transformer to fix OOM

* Revert "smaller transformer to fix OOM"

This reverts commit a44ef8edc2.

* no func cache

* introspect

* touchups

* CLASTKernel

* ugh, it was lru_cache

* codegen

* spacing

* old gpu still in opencl

* typing fix
2023-01-10 19:16:02 -08:00
George Hotz
bfd4f4e35c testdocker 2023-01-09 12:41:52 -08:00
George Hotz
4885fce56e shapetracker from newgpu (#456)
* shapetracker from newgpu

* touchup ops

* test

* testst

* thneed deletes unused inputs

* test

* bugfix
2023-01-09 12:40:01 -08:00
cloud11665
4fb97b8de0 don't fail when termcolor is not installed (#436) 2022-11-14 16:45:06 -08:00
George Hotz
5e07d4669d the speedy chonker is going to replace the old chonker (#432)
* bringing back reshape and permute

* done with E701

* 4x4 works in generic way

* max and sum not vectorizing...

* special case single float

* support comparing to MPS

* improve matmul speed, consider generic principles

* GlobalCounter

* fix op tracking

* faster

* comment that out for now

* err, it needs that

* fix minor issues

* fix global_mem
2022-11-11 18:34:24 -08:00
George Hotz
b8c94a67c9 Simple chonker (#431)
* chonker will make llvm fast

* work

* better speed tests, we will make them fast

* with the cache add is the same speed

* relu and neg are fast

* fix sum speed

* maximum maxnum?

* hack for gemm opt

* gemm very slow

* zeros like

* test_permute

* shapetracker returns self

* fix shapetracker factorization

* err, int strides

* permutes are faster now in tinygrad than pytorch

* support -1 in expand

* gemm unrolled

* improve final test case

* WIP GEMM

* why isn't GEMM fast?

* revert cache dim

* ffp contract works on clang, not llvm?

* ignore llvm ir

* this makes fma work at least, but no faster

* USE_4x4

* 63 GFLOPS

* 87 GFLOPS

* that wasn't matmul, 44 GFLOPS now

* 82 GFLOPS permuted

* this permute too

* a little speed for the convs

* 45 GFLOPS

* speed tests pass again

* clean up prints

* fix FMA WHAT A WASTE OF TIME

* colors

* moar fair

* GPU

* useless on chonker

* cleanups

* improve factorized shapetracker

* better threshold

* label conv

* work

* ops test pass again

* hot load the index

* run the last view, no need to create

* ZeroView needs a repr for the key to work

* fix segfault on out of bounds

* one more test

* start amx, and llvm.initialize_native_asmparser

* amx works

* nice AMX class

* nicer AMX class

* refactor get_idxs

* amx working

* is slower...

* useless flip

* cache

* SZ_X

* AMX_SZ_X/Y work alone

* Contiguous mlop

* test gemm packed

* PREPARE in packed

* use_amx factor

* prefetch isn't faster

* loop

* same 3ms

* 2.24 ms

* allow double on store in TG

* amx reduce is the same speed as non amx reduce

* include memory bandwidth

* clean up shapetracker

* flip returns stride

* prepare for upstream

* Update ops_llvm.py (#426)

* permutes are yellow and green now

* faster conv

* llvm cleanups

* Show optimised IR under debug 4 (#428)

* ASTKernel class

* Make tinygrad work with older python version (#427)

* Make tinygrad work with older python version

* Use partialmethod instead of partial

* simple chonker is chonking

* remove junk from test speed vs torch

* fix linker and types

* AMX is only here now

* add LLVM tests, it's a valid backend now

* oops, run llvm test

* contiguous_op

* fix loadops compare

* dedup reduceops

Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>
2022-11-10 23:17:09 -08:00
George Hotz
1271f19a2b factorizing shapetracker from chonker 2022-11-09 16:36:38 -08:00
George Hotz
9781b4c3af rename test functions to helper_ 2022-11-07 21:27:56 -08:00
George Hotz
9884be2ad5 ugh, that too 2022-11-07 21:21:35 -08:00
George Hotz
537a9eb414 fix termcolor import 2022-11-07 21:19:08 -08:00
George Hotz
2cc1d970c6 updates from the chonker branch 2022-11-07 21:12:08 -08:00
George Hotz
db2da22a04 stop blowing up floats 2022-10-30 16:47:16 -07:00
George Hotz
8afc643bb1 fix bug in ops test, it was cheating somehow 2022-10-30 16:43:24 -07:00
George Hotz
2f602a92ff separate STRIDED and EXPAND 2022-10-30 13:23:58 -07:00
George Hotz
544cb0a069 oops, remove while(1) 2022-10-29 14:05:13 -07:00
George Hotz
fdb43fe553 gemm is 1.7 TFLOPS on a single M1 core 2022-10-29 13:42:33 -07:00
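A back-of-the-envelope check of that figure: a square N x N x N matmul does 2*N**3 FLOPs, so at 1.7 TFLOPS a 1024^3 GEMM should take roughly 1.26 ms:

```python
# 2*N**3 FLOPs per N x N x N matmul (one multiply + one add per inner-product term)
N = 1024
flops = 2 * N**3                       # ~2.15 GFLOP
ms = flops / 1.7e12 * 1e3              # time at the claimed 1.7 TFLOPS
print(f"{flops/1e9:.2f} GFLOP -> {ms:.2f} ms")   # 2.15 GFLOP -> 1.26 ms
```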
George Hotz
52bfbc31be vectorization 2022-10-29 12:47:52 -07:00
George Hotz
e473d35f90 llvm doesn't vectorize 2022-10-29 11:59:48 -07:00
George Hotz
7909786dbf one more opt test 2022-10-28 18:37:53 -07:00
George Hotz
294ab9e2f8 more test opt 2022-10-28 18:04:12 -07:00
George Hotz
f885ceb695 test speed w/o bias 2022-10-28 11:22:15 -07:00
George Hotz
df31dde174 hasattr and DeviceBuffer type fixups 2022-10-28 09:05:45 -07:00
George Hotz
b65b70812a Exec AST (#404)
* working exec ast

* exec_ast is staticmethod

* GenericExecAST

* fold that sometimes

* ExplicitExecAST

* exec_ast for GPU

* gpu working

* get_lazyop_shape

* now gpubuffer is ExplicitExecAST

* dedup

* add a type

* RESHAPE in opencl code

* fix linter

* that too for linter

* cleanups

* remove dead code

* GenericShape is less lines

* add ALLOWED_KERNEL_COUNT to tests

* fix mypy

* that's gotta be recursive

* fix opencl shape processing

* remove unneeded lambda
2022-10-28 08:27:03 -07:00
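A toy interpreter in the spirit of this change (illustrative only; LazyOp and exec_ast here are simplified stand-ins, not the real classes): an op tree whose leaves are numpy arrays is evaluated by recursing into the sources and then applying the node's op:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class LazyOp:
    op: str           # e.g. "ADD", "RELU", "SUM"
    src: tuple        # children: LazyOp nodes or numpy arrays (the leaves)

OPS = {
    "ADD":  lambda a, b: a + b,
    "MUL":  lambda a, b: a * b,
    "RELU": lambda a: np.maximum(a, 0),
    "SUM":  lambda a: a.sum(),
}

def exec_ast(node):
    # leaves are concrete buffers; otherwise evaluate children, then apply the op
    if isinstance(node, np.ndarray):
        return node
    return OPS[node.op](*[exec_ast(s) for s in node.src])

x = np.array([-1.0, 2.0, 3.0])
ast = LazyOp("SUM", (LazyOp("RELU", (LazyOp("ADD", (x, x)),)),))
print(exec_ast(ast))   # relu(x + x).sum() == 10.0
```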
George Hotz
6a15fd3844 LLVM Backend take 2 (#403)
* take 2 llvm

* get_lazybuffers -> get_buffers

* llvm tests pass

* fix type issues and refactor LLVM
2022-10-26 20:32:31 -07:00
George Hotz
10921a60c4 more imports from llvm branch 2022-10-26 18:02:36 -07:00
Drew Hintz
a4ad1d774a enable tests in test_ops.py that are disabled but now work. (#396)
remove custom tolerances that don't appear to be needed.
2022-10-13 09:58:53 -07:00
George Hotz
793edf8900 touchup 2022-10-10 16:13:34 -07:00
George Hotz
d54a45b50d measure speed vs torch 2022-10-10 16:06:00 -07:00
George Hotz
b7f748c15a Fix GPU 2**31 virtual size limit (#392)
* in progress

* big conv test works

* that's unneeded

* fix opencl with reduce

* rewrite contiguous_view_constant_fold

* clean up mids in loop code

* subidx

* print cl kernel before run

* no reduce, no loop

* Revert "no reduce, no loop"

This reverts commit 92777e40e9.
2022-10-05 00:55:20 -04:00
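One common way to stay under a 2**31 work-size limit (illustrative only; not necessarily the mechanism this PR uses) is to split a too-large flat launch across two dimensions and rebuild the index inside the kernel:

```python
LIMIT = 2**31
total = 3 * 2**31                    # more work items than one dimension allows
dim0  = 2**16
dim1  = (total + dim0 - 1) // dim0   # ceil-divide the rest into a second dimension
assert dim0 < LIMIT and dim1 < LIMIT and dim0 * dim1 >= total
print(dim0, dim1)                    # 65536 98304
# inside the kernel the flat index would be rebuilt as gid1 * dim0 + gid0
```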