George Hotz
643cbdfd50
make embedding and GPT-2 fast (#1631)
...
* make embedding fast
* jit more, variable shape support
* print mem bw
2023-08-22 15:14:38 -07:00
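For context on "make embedding fast" in #1631 above: one common way to speed up embedding lookups in a lazy tensor library is to express the gather as a one-hot matmul, which the existing matmul path already handles well. A minimal NumPy sketch of that idea (illustrative only, with stand-in sizes; not the tinygrad code from the commit):

```python
import numpy as np

def embedding_as_matmul(weight: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Illustrative only: express an embedding lookup as a one-hot matmul."""
    vocab_size, embed_dim = weight.shape
    one_hot = (np.arange(vocab_size) == idx[..., None]).astype(weight.dtype)  # (..., vocab)
    return one_hot @ weight  # (..., embed_dim), same result as weight[idx]

weight = np.random.randn(1000, 64).astype(np.float32)  # small stand-in embedding table
tokens = np.array([17, 421])                            # arbitrary token ids
assert np.allclose(embedding_as_matmul(weight, tokens), weight[tokens])
```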
Niklas D
a7752ad65d
Fix link to state.py in quickstart (#1632)
2023-08-22 17:39:30 -04:00
c143
c9c40bb16f
Import whole math module in tensor.py (#1628)
2023-08-22 17:07:46 -04:00
Roelof van Dijk
6fcfa50b35
[ready] perf: no noop cast just to make mypy happy (#1626)
...
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-22 17:07:22 -04:00
Roelof van Dijk
f04a6d7882
perf: faster partition (#1625)
...
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-22 11:56:41 -07:00
George Hotz
d3c401ba3c
llama quantize: scale uses mul, not div
2023-08-22 11:48:56 -07:00
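On the "scale uses mul, not div" wording above: in absmax-style int8 quantization, dequantization is a multiply by the scale, and quantization can also avoid a per-element division by precomputing the reciprocal. A hedged NumPy sketch of that pattern (not the llama quantization code itself):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0          # one scalar scale (could also be per row/group)
    inv_scale = 1.0 / scale                  # precompute so quantization is a mul, not a div
    q = np.clip(np.round(x * inv_scale), -128, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize_int8(q: np.ndarray, scale: np.float32) -> np.ndarray:
    return q.astype(np.float32) * scale      # dequantization is also just a multiply

x = np.random.randn(16).astype(np.float32)
q, s = quantize_int8(x)
print(np.abs(dequantize_int8(q, s) - x).max())   # small quantization error
```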
George Hotz
696e4d20a1
fix KOPT=2 with variable shape
2023-08-22 11:34:34 -07:00
George Hotz
de1fcc418f
no more toCPU path (#1624)
2023-08-22 11:07:26 -07:00
George Hotz
463dece63e
auto arg dtypes (#1623)
2023-08-22 10:22:40 -07:00
George Hotz
db8344ab83
add noalias to llvm (#1622)
2023-08-22 09:26:01 -07:00
chenyu
89e13f2f04
support symbols in shrink (#1611)
2023-08-22 09:08:21 -07:00
George Hotz
718ced296c
move state to nn/state (#1619)
2023-08-22 07:36:24 -07:00
Umut Zengin
1e93fd5449
Readability for unreadable functions (#1610)
...
* cleaned
* typing
* typing
* if format
* if format
* mypy
* update argmax
* argmax more readable
* More stable def pad
* lint
2023-08-22 07:09:08 -07:00
George Hotz
86a32ffb1a
lt sum (#1617)
2023-08-21 21:19:16 -07:00
George Hotz
c64c47a6ae
test arange simple
2023-08-21 20:16:17 -07:00
George Hotz
4f459841bc
Symbolic JIT for GPT2 (#1613)
...
* not fast yet
* simpler
* symbolic jit
* fp16 GOPS and GB
2023-08-21 19:44:57 -07:00
Yixiang Gao
4f02491cd4
add cpu if torch tensor (#1609)
2023-08-21 16:57:59 -07:00
Umut Zengin
f720682beb
np.argmax to Tensor.argmax (#1608)
...
* to tensor argmax
* removed keepdim
* training update
2023-08-21 15:22:29 -07:00
George Hotz
4ea00bad38
track down llama bug
2023-08-21 15:14:21 -07:00
Roelof van Dijk
b02f77b354
perf: faster broadcasted (#1601)
...
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-21 14:21:46 -07:00
Yixiang Gao
4d54afb6df
sparse cat cross entropy (#1597)
...
* add sparse cat cross entropy
* minor fix
* add log_softmax into loss function
* add test
* update docs
* fix training loss
* add device
2023-08-21 14:14:54 -07:00
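The loss added in #1597 above combines log-softmax with an index lookup on integer class labels, so no one-hot tensor is materialized. A minimal NumPy sketch of sparse categorical cross entropy (illustrative, not the PR's tinygrad implementation):

```python
import numpy as np

def sparse_categorical_crossentropy(logits: np.ndarray, targets: np.ndarray) -> float:
    # numerically stable log-softmax folded into the loss
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick the log-probability of the correct class per sample, no one-hot needed
    return float(-log_probs[np.arange(targets.shape[0]), targets].mean())

logits = np.random.randn(4, 10).astype(np.float32)   # batch of 4, 10 classes
targets = np.array([3, 1, 9, 0])                      # integer labels, not one-hot
print(sparse_categorical_crossentropy(logits, targets))
```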
Roelof van Dijk
109100656f
refactor: no len if it is not needed (#1598)
...
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-21 14:06:32 -07:00
Roelof van Dijk
2c8f8ac611
perf: no ret needed (#1604)
...
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-21 14:05:13 -07:00
Roelof van Dijk
750714c386
perf: namedtuples are hashable, don't need a key (#1607)
...
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-21 14:01:10 -07:00
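The point of #1607 above: a NamedTuple of hashable fields is itself hashable, so it can be used directly as an lru_cache key with no hand-rolled key function. A small illustration with made-up names (ViewKey and render are hypothetical, not tinygrad classes):

```python
from functools import lru_cache
from typing import NamedTuple, Tuple

class ViewKey(NamedTuple):        # hypothetical example type
    shape: Tuple[int, ...]
    strides: Tuple[int, ...]

@lru_cache(maxsize=None)          # the NamedTuple itself works as the cache key
def render(v: ViewKey) -> str:
    return f"view{v.shape}@{v.strides}"

print(render(ViewKey((4, 4), (4, 1))))
print(render.cache_info())        # a second identical call would be a cache hit
```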
George Hotz
aaa6fdf347
this was unused code (#1600)
2023-08-21 12:02:58 -07:00
Roelof van Dijk
8e8724d3a8
perf: if argument order (mops) (#1599)
...
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-21 11:20:39 -07:00
George Hotz
2e60920317
Revert "sparse cat cross entropy ( #1591 )" ( #1596 )
...
This reverts commit f0ee850e98.
2023-08-21 10:04:26 -07:00
Yixiang Gao
f0ee850e98
sparse cat cross entropy (#1591)
...
* add sparse cat cross entropy
* minor fix
* add log_softmax into loss function
* add test
* update docs
2023-08-21 09:56:41 -07:00
Yixiang Gao
8d6662a741
.cpu().numpy() -> .numpy() (#1594)
...
* .cpu().numpy() -> .numpy()
* restore ops_torch
* restore test_speed_v_torch
2023-08-21 09:53:29 -07:00
Umut Zengin
35bf21276f
Argmax/Argmin Feature (#1576)
...
* implemented argmax and argmin
* lint
* lint
* match torch behaviour
* format
* removed flip
2023-08-20 18:46:46 -07:00
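Argmax can be built from reductions and elementwise ops that already exist, which is the usual route in a lazy tensor library. One common formulation, shown here as an illustrative NumPy sketch (not necessarily the exact expression used in #1576): mask the maxima, weight by a reversed index ramp, and reduce with max.

```python
import numpy as np

def argmax_from_ops(x: np.ndarray) -> np.ndarray:
    """Argmax along the last axis using only compare / arange / mul / max."""
    n = x.shape[-1]
    is_max = (x == x.max(axis=-1, keepdims=True)).astype(x.dtype)
    rev_ramp = np.arange(n - 1, -1, -1, dtype=x.dtype)   # reversed so ties pick the first index
    return (n - 1 - (is_max * rev_ramp).max(axis=-1)).astype(np.int64)

x = np.array([[3., 7., 7., 1.], [0., -1., 2., 2.]])
print(argmax_from_ops(x))          # [1 2]
print(np.argmax(x, axis=-1))       # matches
```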
Roelof van Dijk
1900acda09
[READY] ci: setup venv cache (#1475)
...
* ci: cache installed packages
* ci: trigger jobs
* ci: fix hashfiles argument
---------
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-20 18:43:16 -07:00
Umut Zengin
3fc7e984f0
__getitem__ refactoring (#1586)
...
* dene
* dene
* form
* form
* form
* form
* lint
* small change
* preserve old
* revert to explicit reshape
2023-08-20 18:42:30 -07:00
George Hotz
d627349af0
teeny changes (#1589)
...
* teeny changes
* import order
2023-08-20 13:38:38 -07:00
George Hotz
012ee7d162
not worth the speed (#1584)
...
* not worth the speed
* no slots
* uops comments
* bump to python 3.11 for speed
* add critical slots back
2023-08-20 10:24:58 -07:00
George Hotz
739f327d2d
Shorter (#1582)
...
* deleting lines
* remove insert dims
* if statement is never hit
* bug fixes
2023-08-20 08:12:16 -07:00
David Hou
4fbce972d7
CSE at uop level (#1483)
...
* uop-level cse
* add test
* don't cache reduce alu ops
* types
* rename variable
* fix
* delete lines
2023-08-19 23:40:40 -07:00
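Common subexpression elimination at the uop level amounts to keying each candidate uop by its opcode, dtype, sources, and argument, and returning the cached instance when an identical one was already emitted. A toy sketch of that cache (hypothetical ToyUOp structure, not tinygrad's UOp class; per the commit body, reduce ops are left uncached):

```python
from typing import Any, Dict, Tuple

class ToyUOp:
    def __init__(self, op: str, dtype: str, src: Tuple, arg: Any = None):
        self.op, self.dtype, self.src, self.arg = op, dtype, src, arg

cse_cache: Dict[Tuple, ToyUOp] = {}

def emit(op: str, dtype: str, src: Tuple = (), arg: Any = None) -> ToyUOp:
    key = (op, dtype, src, arg)
    cacheable = op not in ("REDUCE",)       # e.g. don't cache reduce ALU ops
    if cacheable and key in cse_cache:
        return cse_cache[key]               # reuse the identical uop instead of emitting again
    uop = ToyUOp(op, dtype, src, arg)
    if cacheable:
        cse_cache[key] = uop
    return uop

a = emit("LOAD", "float", arg=("buf0", 0))
b = emit("LOAD", "float", arg=("buf0", 0))
print(a is b)   # True: the duplicate load was eliminated
```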
George Hotz
b9feb1b743
fp16 support in stable diffusion
2023-08-20 05:37:21 +00:00
George Hotz
ad7d26c393
fix __launch_bounds__ and benchmark TC MATMUL (#1575)
...
* fix
* benchmark matmul
2023-08-19 10:54:39 -07:00
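For reference on how matmul benchmark figures like the TFLOPS numbers in these commits are derived: an M×N×K matmul performs 2·M·N·K floating-point operations (one multiply and one add per inner-product element), and throughput is that count divided by wall time. A tiny CPU-side NumPy sketch of the arithmetic (the real benchmark times the GPU kernel, of course):

```python
import time
import numpy as np

N = 1024
a = np.random.randn(N, N).astype(np.float32)
b = np.random.randn(N, N).astype(np.float32)

t0 = time.perf_counter()
c = a @ b
dt = time.perf_counter() - t0

flops = 2 * N * N * N                          # one mul + one add per inner-product element
print(f"{flops / dt / 1e9:.1f} GFLOPS in {dt*1e3:.1f} ms")
```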
David Hou
92754e177c
cache buffer loads across multiple bufs (#1482)
...
* cache loads across buffers (since they may share rawbufs)
* typing
* add test
* fix test
* small changes to test
* fix test
* one big cache
* whitespace
* golf a line?
* invalid is RawBuffer(0)[0], valid 1.
2023-08-19 09:09:58 -07:00
George Hotz
e464442adf
WMMA for 7900XTX (#1563)
...
* go
* hip no LRU
* work
* works
* 16 TFLOPS
* 29 TFLOPS
* 30 TFLOPS
* never mind, it's 60 TFLOPS
* fix metal WMMA
* put hip alloc back
2023-08-19 09:07:23 -07:00
nimlgen
faa521bcab
fix usage of arm64 regs according to CC (#1570)
2023-08-18 21:40:32 -07:00
corranr
68ebbd2954
for issue #1555, int64 and int8 in CI=1 ARM64=1 CLANG=1 (#1572)
...
* fixed for int8,int64, added dtype broadcasting test, passing all CI,ARM64,CLANG tests
* remove shifts
2023-08-18 21:40:13 -07:00
chenyu
ae39cf84ab
Symbolic Shape JIT main PR (#1353)
...
* Symbolic Shape JIT
update tests
2 variables symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
2023-08-18 14:39:55 -07:00
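The core idea behind a symbolic-shape JIT is to key the kernel cache on a shape signature in which the varying dimension is a named variable rather than a concrete number, and to pass the concrete value in at call time, so a growing sequence length reuses one compiled kernel instead of recompiling per step. A purely conceptual sketch (toy names, not tinygrad's API):

```python
from typing import Callable, Dict, List, Tuple

kernel_cache: Dict[Tuple, Callable] = {}
compile_count = 0

def compile_kernel(sig: Tuple) -> Callable:
    """Stand-in for an expensive codegen + compile step."""
    global compile_count
    compile_count += 1
    return lambda xs: sum(xs) / len(xs)          # toy "kernel": mean over the symbolic axis

def run_mean(xs: List[float], symbolic: bool = True) -> float:
    # with symbolic=True, every length shares the signature ("mean", "seqlen")
    sig = ("mean", "seqlen" if symbolic else len(xs))
    if sig not in kernel_cache:
        kernel_cache[sig] = compile_kernel(sig)
    return kernel_cache[sig](xs)

for t in range(1, 6):                            # e.g. growing context length during generation
    run_mean([1.0] * t)
print(compile_count)                             # 1: one compiled kernel served every length
```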
Roelof van Dijk
84e6693915
fix: apt-get to apt, no recommends, clean up (#1571)
...
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-18 13:48:59 -07:00
wozeparrot
50decf0d45
train cifar using multigpu (#1529)
...
* feat: train cifar using multigpu
* feat: split eval batch across 5
* feat: cleaner allreduce
* feat: 93.88%
* feat: cleaner batch chunking from bert
* feat: cleaner grad sync
* feat: tinygrad argmax
* feat: make it work with different gpu counts
* feat: move some stuff into the normal __init__
* feat: autodetect gpu count
* feat: move import inside
2023-08-18 09:35:44 -07:00
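"Cleaner grad sync" and "cleaner allreduce" in #1529 above refer to data-parallel training: every GPU computes gradients on its own shard of the batch, and those copies are averaged so all replicas take the same step. A minimal NumPy sketch of that synchronization step (the PR does this with tinygrad tensors and device-to-device copies, not NumPy):

```python
import numpy as np

def allreduce_mean(grads_per_gpu):
    """grads_per_gpu: list over GPUs, each a list of per-parameter gradient arrays."""
    averaged = [np.mean(np.stack(copies), axis=0) for copies in zip(*grads_per_gpu)]
    # every replica receives the same averaged gradients
    return [list(averaged) for _ in grads_per_gpu]

# two "GPUs", each holding gradients for two parameters
g0 = [np.array([1.0, 2.0]), np.array([[1.0]])]
g1 = [np.array([3.0, 4.0]), np.array([[3.0]])]
synced = allreduce_mean([g0, g1])
print(synced[0][0])   # [2. 3.]  -- identical on every replica
```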
chenyu
be50b2fe8f
more symbolic symbolic ops (#1564)
...
* more symbolic symbolic ops
* handle NumNode in __mul__
2023-08-18 09:21:41 -07:00
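On "handle NumNode in __mul__": a wrapped integer constant should multiply like a plain int so products of symbolic expressions fold instead of failing. A toy symbolic-node sketch, loosely modeled on the idea only (these classes are not tinygrad's symbolic module):

```python
class Node:
    def __mul__(self, other):
        if isinstance(other, NumNode): other = other.b        # unwrap the constant node
        if isinstance(other, int):
            if other == 0: return NumNode(0)
            if other == 1: return self
            return MulNode(self, other)
        return NotImplemented

class Variable(Node):
    def __init__(self, name, lo, hi): self.name, self.lo, self.hi = name, lo, hi
    def __repr__(self): return self.name

class NumNode(Node):
    def __init__(self, b): self.b = b
    def __repr__(self): return str(self.b)

class MulNode(Node):
    def __init__(self, a, b): self.a, self.b = a, b
    def __repr__(self): return f"({self.a}*{self.b})"

i = Variable("i", 1, 10)
print(i * 4)              # (i*4)
print(i * NumNode(4))     # (i*4) as well, thanks to the NumNode branch in __mul__
```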
chenyu
dfec16cc83
Support arg int for CUDA kernel (#1565)
2023-08-18 09:19:40 -07:00
wozeparrot
15150d60c4
fix: small fix for lru on hip (#1567)
2023-08-18 09:18:38 -07:00
wozeparrot
c65ad43a93
cleanup ops_gpu (#1566)
2023-08-17 23:43:08 -04:00
nimlgen
bd111411bf
init allocator for compiled backends (#1467)
...
* init allocator for compiled backends
* Update ops_webgpu.py
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-08-17 10:33:32 -07:00
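An allocator for compiled backends typically keeps freed device buffers in a size-keyed pool so the next allocation of the same size can reuse one instead of going back to the driver. A minimal sketch of that pattern (illustrative class, not the interface added in #1467):

```python
from collections import defaultdict
from typing import DefaultDict, List

class CachingAllocator:
    """Recycle freed buffers by size to avoid repeated backend allocations."""
    def __init__(self):
        self._pool: DefaultDict[int, List[bytearray]] = defaultdict(list)

    def alloc(self, size: int) -> bytearray:
        if self._pool[size]:
            return self._pool[size].pop()       # reuse a previously freed buffer
        return bytearray(size)                  # stand-in for a real device allocation

    def free(self, buf: bytearray) -> None:
        self._pool[len(buf)].append(buf)        # keep it around instead of releasing it

alloc = CachingAllocator()
a = alloc.alloc(4096)
alloc.free(a)
b = alloc.alloc(4096)
print(a is b)   # True: the second request was served from the pool
```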