Commit Graph

10633 Commits

Author SHA1 Message Date
George Hotz
463dece63e auto arg dtypes (#1623) 2023-08-22 10:22:40 -07:00
George Hotz
db8344ab83 add noalias to llvm (#1622) 2023-08-22 09:26:01 -07:00
chenyu
89e13f2f04 support symbols in shrink (#1611) 2023-08-22 09:08:21 -07:00
George Hotz
718ced296c move state to nn/state (#1619) 2023-08-22 07:36:24 -07:00
Umut Zengin
1e93fd5449 Readability for unreadable functions (#1610)
* cleaned

* typing

* typing

* if format

* if format

* mypy

* update argmax

* argmax more readable

* More stable def pad

* lint
2023-08-22 07:09:08 -07:00
George Hotz
86a32ffb1a lt sum (#1617) 2023-08-21 21:19:16 -07:00
George Hotz
c64c47a6ae test arange simple 2023-08-21 20:16:17 -07:00
George Hotz
4f459841bc Symbolic JIT for GPT2 (#1613)
* not fast yet

* simpler

* symbolic jit

* fp16 GOPS and GB
2023-08-21 19:44:57 -07:00
Yixiang Gao
4f02491cd4 add cpu if torch tensor (#1609) 2023-08-21 16:57:59 -07:00
Umut Zengin
f720682beb np.argmax to Tensor.argmax (#1608)
* to tensor argmax

* removed keepdim

* training update
2023-08-21 15:22:29 -07:00
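A rough, illustrative sketch of the call-site change this commit describes, assuming the Tensor.argmax API from #1576 (not code taken from the commit itself):

# illustrative only: replace a numpy round-trip with tinygrad's own argmax
from tinygrad.tensor import Tensor

t = Tensor([[1.0, 5.0, 2.0],
            [7.0, 0.0, 3.0]])
# before: np.argmax(t.numpy(), axis=1)
# after:
print(t.argmax(axis=1).numpy())  # -> [1 0]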
George Hotz
4ea00bad38 track down llama bug 2023-08-21 15:14:21 -07:00
Roelof van Dijk
b02f77b354 perf: faster broadcasted (#1601)
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-21 14:21:46 -07:00
Yixiang Gao
4d54afb6df sparse cat cross entropy (#1597)
* add sparse cat cross entropy

* minor fix

* add log_softmax into loss function

* add test

* update docs

* fix training loss

* add device
2023-08-21 14:14:54 -07:00
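A minimal usage sketch of the loss helper this PR adds, assuming a Tensor.sparse_categorical_crossentropy method that takes integer class labels (name and exact signature assumed, not taken from the commit):

# minimal sketch, assumed API: integer class ids, no one-hot encoding needed
from tinygrad.tensor import Tensor

logits = Tensor.randn(8, 10)               # 8 samples, 10 classes
labels = Tensor([1, 0, 3, 9, 2, 2, 7, 5])  # class ids per sample (assumed accepted as a Tensor)
loss = logits.sparse_categorical_crossentropy(labels)
print(loss.numpy())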
Roelof van Dijk
109100656f refactor: no len if it is not needed (#1598)
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-21 14:06:32 -07:00
Roelof van Dijk
2c8f8ac611 perf: no ret needed (#1604)
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-21 14:05:13 -07:00
Roelof van Dijk
750714c386 perf: namedtuples are hashable, don't need a key (#1607)
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-21 14:01:10 -07:00
George Hotz
aaa6fdf347 this was unused code (#1600) 2023-08-21 12:02:58 -07:00
Roelof van Dijk
8e8724d3a8 perf: if argument order (mops) (#1599)
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-21 11:20:39 -07:00
George Hotz
2e60920317 Revert "sparse cat cross entropy (#1591)" (#1596)
This reverts commit f0ee850e98.
2023-08-21 10:04:26 -07:00
Yixiang Gao
f0ee850e98 sparse cat cross entropy (#1591)
* add sparse cat cross entropy

* minor fix

* add log_softmax into loss function

* add test

* update docs
2023-08-21 09:56:41 -07:00
Yixiang Gao
8d6662a741 .cpu().numpy() -> .numpy() (#1594)
* .cpu().numpy() -> .numpy()

* restore ops_torch

* restore test_speed_v_torch
2023-08-21 09:53:29 -07:00
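The pattern of this cleanup, sketched for illustration (assumes .numpy() already copies data back to the host, so the explicit .cpu() hop is redundant):

# before: values = some_tensor.cpu().numpy()
# after:  values = some_tensor.numpy()
from tinygrad.tensor import Tensor

x = Tensor.randn(4, 4)
values = x.numpy()   # no explicit .cpu() needed
print(values.shape)  # (4, 4)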
Umut Zengin
35bf21276f Argmax/Argmin Feature (#1576)
* implemented argmax and argmin

* lint

* lint

* match torch behaviour

* format

* removed flip
2023-08-20 18:46:46 -07:00
Roelof van Dijk
1900acda09 [READY] ci: setup venv cache (#1475)
* ci: cache installed packages

* ci: trigger jobs

* ci: fix hashfiles argument

---------

Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-20 18:43:16 -07:00
Umut Zengin
3fc7e984f0 __getitem__ refactoring (#1586)
* try

* try

* form

* form

* form

* form

* lint

* small change

* preserve old

* revert to explicit reshape
2023-08-20 18:42:30 -07:00
George Hotz
d627349af0 teeny changes (#1589)
* teeny changes

* import order
2023-08-20 13:38:38 -07:00
George Hotz
012ee7d162 not worth the speed (#1584)
* not worth the speed

* no slots

* uops comments

* bump to python 3.11 for speed

* add critical slots back
2023-08-20 10:24:58 -07:00
George Hotz
739f327d2d Shorter (#1582)
* deleting lines

* remove insert dims

* if statement is never hit

* bug fixes
2023-08-20 08:12:16 -07:00
David Hou
4fbce972d7 CSE at uop level (#1483)
* uop-level cse

* add test

* don't cache reduce alu ops

* types

* rename variable

* fix

* delete lines
2023-08-19 23:40:40 -07:00
George Hotz
b9feb1b743 fp16 support in stable diffusion 2023-08-20 05:37:21 +00:00
George Hotz
ad7d26c393 fix __launch_bounds__ and benchmark TC MATMUL (#1575)
* fix

* benchmark matmul
2023-08-19 10:54:39 -07:00
David Hou
92754e177c cache buffer loads across multiple bufs (#1482)
* cache loads across buffers (since they may share rawbufs)

* typing

* add test

* fix test

* small changes to test

* fix test

* one big cache

* whitespace

* golf a line?

* invalid is RawBuffer(0)[0], valid 1.
2023-08-19 09:09:58 -07:00
George Hotz
e464442adf WMMA for 7900XTX (#1563)
* go

* hip no LRU

* work

* works

* 16 TFLOPS

* 29 TFLOPS

* 30 TFLOPS

* never mind, it's 60 TFLOPS

* fix metal WMMA

* put hip alloc back
2023-08-19 09:07:23 -07:00
nimlgen
faa521bcab fix usage of arm64 regs according to CC (#1570) 2023-08-18 21:40:32 -07:00
corranr
68ebbd2954 for issue #1555, int64 and int8 in CI=1 ARM64=1 CLANG=1 (#1572)
* fixed for int8, int64, added dtype broadcasting test, passing all CI, ARM64, CLANG tests

* remove shifts
2023-08-18 21:40:13 -07:00
chenyu
ae39cf84ab Symbolic Shape JIT main PR (#1353)
* Symbolic Shape JIT

update tests

2 variables symbolic ops, adding more tests

test passing

cleanup

* more test cases

* single flag

* review update

* jit attention one piece

* realize

* symbolic_jit test for cuda

* old artifact

* works with cuda gpu but failed ci

* CUDACPU
2023-08-18 14:39:55 -07:00
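A rough sketch of the usage pattern this PR enables, loosely following its tests (TinyJit plus a shape Variable; module paths and the reshape-to-symbol API are assumed as of this commit):

# rough sketch, assumed API: reuse one jitted kernel while a dimension varies
from tinygrad.tensor import Tensor
from tinygrad.jit import TinyJit
from tinygrad.shape.symbolic import Variable

@TinyJit
def add(a, b): return (a + b).realize()

for i in range(1, 5):
  vi = Variable("i", 1, 10)              # symbolic dim with its bounds
  a = Tensor.rand(3, i).reshape(3, vi)   # attach the concrete size i to the symbol
  b = Tensor.rand(3, i).reshape(3, vi)
  out = add(a, b)                        # same jitted function, different i each call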
Roelof van Dijk
84e6693915 fix: apt-get to apt, no recommends, clean up (#1571)
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-18 13:48:59 -07:00
wozeparrot
50decf0d45 train cifar using multigpu (#1529)
* feat: train cifar using multigpu

* feat: split eval batch across 5

* feat: cleaner allreduce

* feat: 93.88%

* feat: cleaner batch chunking from bert

* feat: cleaner grad sync

* feat: tinygrad argmax

* feat: make it work with different gpu counts

* feat: move some stuff into the normal __init__

* feat: autodetect gpu count

* feat: move import inside
2023-08-18 09:35:44 -07:00
chenyu
be50b2fe8f more symbolic symbolic ops (#1564)
* more symbolic symbolic ops

* handle NumNode in __mul__
2023-08-18 09:21:41 -07:00
chenyu
dfec16cc83 Support arg int for CUDA kernel (#1565) 2023-08-18 09:19:40 -07:00
wozeparrot
15150d60c4 fix: small fix for lru on hip (#1567) 2023-08-18 09:18:38 -07:00
wozeparrot
c65ad43a93 cleanup ops_gpu (#1566) 2023-08-17 23:43:08 -04:00
nimlgen
bd111411bf init allocator for compiled backends (#1467)
* init allocator for compiled backends

* Update ops_webgpu.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-08-17 10:33:32 -07:00
geohotstan
a293c18d34 Gather bugfix (#1561) 2023-08-16 19:53:14 -04:00
Ethan Sorrell
cb62911f6b PTX Reintegration and Passing Tests (#1512)
* move assembly, assembly_ptx

* successful but broken rendering of ptx asm

* clear ins before render asm

* slightly less broken :')

* we needed thread syncs

* fix float16 loading, rounding modifiers and other casting stuff, passing casts_from_half

* Fix runtime_args for gpuocelot

* our casts were flipped on both ends

* more casting

* add ternary where op

* dealing with storing/loading bool

* add test for casting to bool from negative

* Fix args.valid on ConstOp

* add to CI, TODO: fix runtime_args for test_uops

* fix placement of runtime_args to work with lazy.Device

* undo ci changes so I can push

* fix lints

* start cleanup and fix things we broke fixing lints

* add checks for PTX specific asm instructions

* revert added test -- doesn't pass on llvm

* skip tests for underflow,overflow

* another fix for how we're setting runtime args

* Less broken cleanup

* add to CI

* add more env variables for ci test

* fix ci to install pycuda for ptx

* ci: copy cuda test command

* cleanup

* assert to make sure we're actually running ptx in ci

* remove test assert

* move is_ptx arg

* move assembly, assembly_ptx back to extras

* fix imports

* initial merge fixes

* clear registers, fix UOps.LOAD with invalid value

* draft merge fixes

* remove prints

* quick lint and merge fixes

* cleanup

* remove PTXProgram wrapper

* final cleanup

* temp change for ci rerun

* ci rerun

* rollback ISA version
2023-08-16 16:20:20 -07:00
geohotstan
8763037f0e Fancy indexing is fancy wow and gather thing (#1399) 2023-08-16 18:35:49 -04:00
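A small illustrative sketch of the indexing style this change is about (assumes __getitem__ accepts integer index lists; not code from the PR):

# illustrative only: numpy-style fancy indexing on a Tensor
from tinygrad.tensor import Tensor

t = Tensor.arange(12).reshape(3, 4)
print(t[[2, 0]].numpy())  # picks rows 2 and 0 (index-list form assumed supported)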
chenyu
11dd9b1741 symbolic codegen and exec (#1552)
* symbolic codegen and exec

* fix and add test

* no sketchy

* merge_dicts type

* dtypes._arg_int32
2023-08-16 14:43:41 -07:00
George Hotz
1e1d48b4e6 single model (#1560) 2023-08-16 13:22:19 -07:00
JaSpa99
491e85597a Run onnx commavq model (#1537)
* try to run commavq

* fix 0 dim, start implementing new ops

- Implement EmbedLayerNormalization
- Implement Attention

* SkipLayerNormalization and FastGelu

* use original torch model, cast inputs

* fix some ops:

- properly do Cast
- Attention: bi- and unidirectional
- FastGelu: add bias before gelu

* cleanup onnx_ops.py

* add validation option to benchmark

* cleanup imports

* add checks in case onnx2torch implements ops in the future

* run onnx instead of original torch

* just skip gpu on m1

* reactivate the other models

* check for strange params & squash whitespace

* cleanup

* fix causal mask Attention

* Range doesn't need int cast

* embedding vocab_counter same dtype as input

* no need to cast

* always validate, fix PosixPath ort

---------

Co-authored-by: George Hotz <george@comma.ai>
2023-08-16 12:24:40 -07:00
wozeparrot
55d95d1658 llama 70b (#1558)
* feat: llama 70b

* feat: llama 70b but simpler
2023-08-16 11:36:12 -07:00
nimlgen
c93e63b8b5 make TestNonFloatUOps.test_mul_bool pass on all platforms (#1557) 2023-08-16 11:34:09 -07:00