1501 Commits

Author | SHA1 | Message | Date
George Hotz | 42256c0d9d | rocm sniffer dumps code | 2023-05-05 18:36:53 +00:00
George Hotz | f2a964f447 | nocopy (#764) | 2023-05-05 09:32:06 -07:00
George Hotz | 3a2011ab2d | rocm sniffer | 2023-05-04 22:22:39 +00:00
George Hotz | a55c4f5000 | better rocm build scripts | 2023-05-04 09:14:05 +00:00
George Hotz | 987b1aaf96 | rocm build scripts | 2023-05-04 08:45:23 +00:00
George Hotz | ed33a89d52 | no werror in archprobe | 2023-05-03 19:34:17 +00:00
George Hotz | 7ecf4dff68 | multi cl_queue (#762) | 2023-05-03 12:15:28 -07:00
  * multi cl_queue
  * only platforms 1
  * gpus first, then cpus
  * put device on underlying buffer
  * cl_queue array
George Hotz | 3b933b0a2f | rocm setup script | 2023-05-03 16:01:17 +00:00
George Hotz | 59d0d168cd | FLOAT16 off works | 2023-04-19 15:34:56 -07:00
George Hotz | 3d15769a8f | 50 TFLOPS cuda matmul | 2023-04-19 14:38:24 -07:00
George Hotz | 0b5a0b9ba4 | winograd comment | 2023-04-16 03:36:51 -07:00
George Hotz | 8b777af571 | metal_conv gets over 10.4 TFLOPS... | 2023-04-15 03:31:22 -07:00
George Hotz | d66e682205 | metal matmul from tcores branch | 2023-04-14 23:29:29 -07:00
Sohaib | 70b9072663 | add Pad onnx operator and rework _padding (#740) | 2023-04-06 17:07:36 +05:30
George Hotz | 94e2c49c35 | test_cacheline_size that works in both places | 2023-03-30 06:47:20 +04:00
George Hotz | b05c2828f7 | better cacheline test | 2023-03-30 06:08:54 +04:00
George Hotz | 76db1af6fc | better archprobe | 2023-03-30 05:52:00 +04:00
George Hotz | 20894991ed | good changes from the M1 Tensor Core project (#730) | 2023-03-29 05:11:02 +04:00
  * good changes
  * working except llvm
  * llvm types
  * nice acc
  * archprobe
  * lang.float4
  * use self.acc for late acc
  * fix store bug
George Hotz | 68e45fca18 | metal_matmul: bw and torch sync | 2023-03-23 08:02:04 -07:00
George Hotz | bd6c3c31a9 | compare to torch | 2023-03-22 23:58:37 -07:00
George Hotz | c3a3db75c7 | fix metal matmul example | 2023-03-22 23:42:51 -07:00
George Hotz | b12b60af20 | fix binop, other tests failure (#723) | 2023-03-22 18:15:07 -07:00
  * fix binop, other tests failure
  * that was a bad idea
  * better layernorm
  * inference kernel count tests
  * new style reshape pushing
  * fixup replacement
  * 199 kernels is okay. fix flops
  * push reshape through unaryops only
  * GRAPH=2 draws the phantom ops
  * found resnet issue
  * non working test
  * mul is cheaper than div
  * OPT inflation
  * SHUFFLE_PAD_OPS in OPT=2
Fernando Vidal | 73bd0b217b | add int64 as supported dtype from numpy (#699) | 2023-03-18 17:15:04 -07:00
  * add int64 as supported dtype from numpy
    Without this, examples/transformer.py didn't run. With this change it runs successfully.
  * Update helpers.py
  * Update transformer.py
  * Update training.py
George Hotz | f5467cfedc | Devicebufferless (#708) | 2023-03-18 14:40:23 -07:00
  * runs one metal kernel
  * conv2d works
  * ops tests are passing
  * const folding
  * all ops work
  * pre commit always passes
  * torch works
  * working still
  * fix graph test
  * tests passing
  * image almost works
  * image conv works
  * most images
  * fix custom
  * fix assignment
  * fix compile enet
  * clean up comments
  * fix realize return value
  * include shapetracker in LB repr
  * copy should make a copy
  * reenable method cache
  * fix lna
  * dtypes in graph
  * forward only for IMAGE=2
  * simple realize
  * getting close
  * fixup new api, it's good except the kernel count
  * back to 197 kernels
  * tests should pass
  * go to a real float
  * no type_on_cpu
  * fix the docs
  * put shapetracker back in its proper place
Kirill | 0532025b04 | Fix llama 13B weights loading (#700) | 2023-03-15 08:59:52 -07:00
  * Fix llama 13B weights loading
  * refactor more
  * add test
  * test storage offset
  * fix spacing
  * fix strides
  * llama 13B working?
  * yolo?
  * better test for seeks
George Hotz | 15e0b56e39 | compile works (#688) | 2023-03-12 11:01:25 -07:00
  * compile works
  * runtimes
  * line count
  * fix custom, to tg dtype
  * meh, that's fine with lazy import
Kirill | af7745073f | Add comments to SD (#686) | 2023-03-12 10:56:49 -07:00
  * Add explanation for empty lambdas
  * Fix my_unpickle if pytorch_lightning is installed
  * oops
George Hotz | 6c3675c01c | _mmap loads to gpu fast | 2023-03-11 23:00:13 -08:00
George Hotz | 803b0aef28 | track memory for numpy/torch | 2023-03-11 20:39:10 -08:00
Diogo | 784afc6c6f | Eq magic function support (#683) | 2023-03-11 10:31:46 -08:00
  * add eq magic func
  * changed from eq to __eq__
  * ignore type for linter
  * mypy doesn't like descriptions :(
George Hotz | 01f39b19dc | move to shapetracker.py | 2023-03-11 07:50:07 -08:00
George Hotz | f3ac52aee8 | Mypyc (#680) | 2023-03-11 07:33:30 -08:00
  * building shapetracker
  * default ENABLE_METHOD_CACHE
  * symbolic compiles
  * improve types
  * tensor compiles
  * oops, that's a bug
  * best of both worlds
  * find legit typing bugs
  * pad2d can take list or tuple
  * sub 200ms when compiled
George Hotz | d7cb8e3e56 | multithreaded fake_torch_load_zipped | 2023-03-10 19:16:27 -08:00
George Hotz | b1206bcb18 | third try at torch loading (#677) | 2023-03-10 19:11:29 -08:00
  * third try at torch loading
  * numpy fixed
  * fix enet compile
  * load_single_weight supports empty weights
  * oops, CPU wasn't the default
  * so many bugs
George Hotz | 4780f9a6df | llama runs (slowly) in master | 2023-03-10 17:36:51 -08:00
George Hotz | 1826ff6b89 | dtypes nice and clean (#673) | 2023-03-10 16:56:07 -08:00
  * add dtype class
  * dtypes
  * buffers are lazy
  * dtype is tracked by lazybuffer and GenericShape
  * fix types in llvm
  * llvm store
  * dtype tests
  * fix tests maybe
  * fix flop counter
  * fix CI
  * CI fix and check format
  * fix dtype and dtype check
  * fix custom test
  * fix test graph
George Hotz | d26345595d | more llama stuff | 2023-03-10 10:48:10 -08:00
George Hotz | 1a039306d2 | good changes from llama branch (#671) | 2023-03-09 20:51:22 -08:00
  * good changes from llama
  * transpose behavior changed
George Hotz | d8dda2af3a | openpilot fixups | 2023-03-06 14:14:44 -08:00
George Hotz | a77d792aff | Codegen gpu cleanups (#640) | 2023-03-04 15:31:51 -08:00
  * cleanups
  * fixups
  * handle pre upcasted global buffers
  * early is just required
  * delete junk from hand coded opt
  * implicit upcast_in_mid_reduce
  * speedup
  * fix exec w validhacks
  * reorder opt
  * only need to check the output for that
  * return total runtime from kernels if debugging
Patrick Geneva | 117111825c | Fix windows file permission error (#634) | 2023-03-04 09:23:55 -08:00
George Hotz | 528cb3b3b9 | fix ast test | 2023-03-04 07:49:25 -08:00
George Hotz | 893f136fe0 | lines from helpers | 2023-03-03 23:07:46 -08:00
George Hotz | c53efb3635 | optimize for CL (#633) | 2023-03-03 22:00:09 -08:00
  * required opt
  * simplify
  * works
  * shift_to_last
  * required is fine
  * print shape in colored
  * better shape
  * args was wrong
  * debugs
  * fix empty shape
  * colored shape printer
Diogo | 52204a7b88 | adding comparison operators (#616) | 2023-03-02 08:10:44 -08:00
  * Less, LessOrEqual, Greater, GreaterOrEqual, Equal
  * lint fix
  * using built in functions
  * overriding __eq__ breaks things
  * backwards pass for less - forward only tests
  * one other spot
  * removing backwards for comparison ops to match pytorch
  * raise runtime error
  * more tests for comparison ops
  * fixed the lineup
  * added number upcast tests
George Hotz | d062cc82b8 | put restrict back | 2023-03-01 21:34:45 -08:00
George Hotz | bfcec234a2 | Refactor ASTs (#622) | 2023-03-01 18:57:29 -08:00
  * ugh worst branch name
  * compiler refactor continues
  * scc -> cloc
  * buf -> _buf
  * finish _buf, and program -> runtime
  * gpu is still working, clang isn't
  * clang in new style
  * ops_metal
  * something broke it
  * improve metal
  * clean up tons of cl crap
  * hack fix sync
  * cleaner gpu
  * gpu metal clang
  * cleanups
  * minor refactor
  * GPUCodegen
  * fix up LLVM
  * blind CUDA refactor
  * codegen / runtime
  * keep ops naming
  * linter passes
  * woah, llvm was allocing 4x what it needed to
  * bugfixes
  * fix openpilot compiler
  * fix compile_efficientnet
  * method cache should fix tests
  * deal with duped functions
George Hotz | 7e6edfbc64 | unbreak onnx conv padding | 2023-02-28 13:55:03 -08:00
George Hotz | 7d556ca7e0 | avg/max pool work in N-D | 2023-02-28 13:38:27 -08:00
George Hotz | d584bae5c0 | fine, openpilot can have 197 kernels | 2023-02-27 11:48:36 -08:00