Commit Graph

177 Commits

George Hotz
47f9887ce4 hip events work (#3229)
* hip events work

* event
2024-01-24 11:49:53 -08:00
George Hotz
e2e4632aea LoadOps SYNC (#3223)
* LoadOps SYNC and WAIT

* no wait, only sync

* DEBUG >= 1

* track cross device
2024-01-23 21:59:18 -08:00
chenyu
c4b5661146 fuzz length for multitensor reduce test case (#3190)
so that the uneven case is not just length 0 and can have other positive values
2024-01-20 00:44:38 -05:00
chenyu
fdb1c2b1d9 move reduce over 0 len axis logic to lazy.py (#3188)
* move reduce over 0 len axis logic to lazy.py

this fixed the uneven shard reduce case where the uneven shard has length 0

* fix interpreted backends

* fix backwards for 0 shape tensors too
2024-01-20 00:13:03 -05:00
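The two commits above hinge on how a reduce behaves over a zero-length axis, which is exactly what a length-0 shard boils down to. A minimal numpy sketch of the intended semantics (illustrative only, not the tinygrad code path):

```python
import numpy as np

# A zero-length axis reduces to the operation's identity: summing an empty
# axis yields zeros, so a length-0 shard contributes nothing to the total.
empty = np.zeros((0, 4))
print(empty.sum(axis=0))                        # [0. 0. 0. 0.]

full = np.arange(12.0).reshape(3, 4)
shards = [full[:3], full[3:]]                   # second shard has length 0
combined = sum(s.sum(axis=0) for s in shards)
np.testing.assert_allclose(combined, full.sum(axis=0))
```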
George Hotz
254a7372fe buffer copy refactor (#3187) 2024-01-19 20:21:24 -08:00
chenyu
cb4cfc078a parameterize multitensor tests for reduce (#3181)
reduce over uneven shards is currently incorrect
2024-01-19 14:03:01 -05:00
chenyu
c4faedebf3 add test cases for negative entry max allreduce (#3177) 2024-01-18 22:26:51 -05:00
chenyu
ab1b7c4d09 fix allreduce for max (#3175)
* test cases to show allreduce for max is incorrect

* oh fixed
2024-01-18 20:25:35 -05:00
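The negative-entry tests above point at a classic pitfall for max reductions. As an assumption on my part rather than a reading of the diff, one plausible failure mode is combining shards through a buffer padded with 0, which is not the identity element for max; a small numpy sketch:

```python
import numpy as np

# Hypothetical failure mode: padding the shorter shard with 0 before the
# combine. 0 is not the identity for max, so all-negative inputs break.
shard_a = np.array([-5.0, -3.0, -7.0])
shard_b = np.array([-2.0, -9.0])                      # uneven shard

true_max = np.concatenate([shard_a, shard_b]).max()   # -2.0

padded = np.concatenate([shard_b, [0.0]])             # buggy padding value
print(np.stack([shard_a, padded]).max())              # 0.0, wrong

masked = np.concatenate([shard_b, [-np.inf]])         # identity for max
print(np.stack([shard_a, masked]).max(), true_max)    # -2.0 -2.0
```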
chenyu
28dcbf0e00 test case sharded batchnorm has different ast on devices (#3172) 2024-01-18 18:12:15 -05:00
George Hotz
ee83505fcc fix test extra issue (#3159) 2024-01-17 11:58:08 -08:00
George Hotz
a72b1b6d65 sharding for llama (#3151)
* shard llama

* sharding works

* simpler

* simpler

* consume option

* disable that test

* save a line

---------

Co-authored-by: George Hotz <george@tinygrad.org>
2024-01-16 19:28:00 -08:00
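A hedged sketch of the tensor-parallel pattern this llama work relies on: activations replicated on every device, weight columns split per device. Device strings (built from Device.DEFAULT) and shapes are illustrative, and it assumes the Tensor.shard API from the multitensor commits below.

```python
from tinygrad import Tensor, Device

# Illustrative devices: two instances of the default backend.
GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))

x = Tensor.randn(1, 256).shard(GPUS)             # replicated activations
w = Tensor.randn(256, 256).shard(GPUS, axis=1)   # column-sharded weight

y = x.matmul(w)        # each device computes its slice of the output
print(y.shape)         # (1, 256), the full logical shape
```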
chenyu
14c010958b support for non-uniform sharding (#3154)
* support for non-uniform sharding

* bugfix and more tests

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2024-01-16 20:33:32 -05:00
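A hedged sketch of what non-uniform sharding means in practice: an axis whose length does not divide evenly across the devices. Device strings are illustrative and the 3/2 split is an assumption about how the shards land; the correctness of reducing over such shards is what the later commits above address.

```python
from tinygrad import Tensor, Device
import numpy as np

devices = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))

x = Tensor.arange(5).float()
xs = x.shard(devices, axis=0)   # e.g. 3 elements on one device, 2 on the other

# A reduce over the sharded axis should match the single-device result.
np.testing.assert_allclose(xs.sum().numpy(), x.numpy().sum())
```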
George Hotz
228f30b96a multitensor jit (#3149)
* initial multitensor jit support and tests

* Added graphs to multitensor jit and updated tests

* update unbind api

* fix set device, add TinyJit to resnet

* update_stats includes device

---------

Co-authored-by: ramenguy99 <ramenguy99@gmail.com>
2024-01-16 09:09:15 -08:00
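A hedged sketch of combining TinyJit with sharded inputs, as this commit enables. The function body, shapes, and device strings are illustrative; the early calls run eagerly and capture kernels, later calls replay them per device.

```python
from tinygrad import Tensor, TinyJit, Device

devices = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))

@TinyJit
def step(x: Tensor) -> Tensor:
  # a stand-in for a real model step; JIT outputs must be realized
  return (x * 2 + 1).realize()

for _ in range(4):
  x = Tensor.randn(4, 4).shard(devices, axis=0).realize()
  out = step(x)
print(out.numpy())
```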
George Hotz
cec0a7bc37 use shard api to eval resnet fast (#3136)
* use shard api to eval resnet fast

* to supports shard

* test to in multitensor
2024-01-15 16:49:38 -08:00
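A hedged sketch of the data-parallel eval pattern this commit uses: weights replicated on every device, the input batch sharded along axis 0. A small nn.Linear stands in for the actual resnet, and device strings are illustrative.

```python
from tinygrad import Tensor, Device
from tinygrad.nn import Linear

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))

model = Linear(64, 10)
model.weight = model.weight.shard(GPUS)           # replicated (no axis)
model.bias = model.bias.shard(GPUS)

batch = Tensor.randn(8, 64).shard(GPUS, axis=0)   # 4 samples per device
print(model(batch).shape)                         # (8, 10)
```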
Yixiang Gao
c13d51da1d add device options for tests in multigpu (#3121) 2024-01-14 15:17:47 -08:00
Yixiang Gao
13e872b53f add multigpu support for llama attention (#3064)
* add llama attention test for multigpu

* test fails

* kv cache trying to shrink on sharded axis

* mask None works for scaled dot product

* kv cache seems to be working but scaled dot product breaks

* scaled dot product works, but the last linear layer failed

* running into the reshape case where it could be wrong for multigpu

* making sure it was the reshape

* adding contiguous doesn't solve it

* need to shard more properly

* remove reshape test

* minor adjustment to scaled dot product attention test

* weights are sharded wrong

* continue fix new weight sharding

* clean up

* fix attention when start_pos is 0

* remove print

* add TODOs for the best multigpu interface
2024-01-11 16:31:02 -08:00
Yixiang Gao
adcc844755 cat works (#3086) 2024-01-11 08:25:20 -08:00
Yixiang Gao
6842476ca6 better test demonstration (#3077)
* a better test demonstration

* fix white space
2024-01-10 10:50:52 -08:00
George Hotz
ac3f246c11 cached size (#3060)
* cached size

* simplify simplify

* 0 doesn't have base

* fix test

* cleaner cache

* hmm, metal is flaky on this...might be real(ish) but useless as test

* short circuit reshape/expand properly

* better reshape bypass
2024-01-09 16:37:37 -08:00
Yixiang Gao
73b72b8de2 test scaled dot product attention (#3063)
* add test

* add initial test for scaled dot product attention

* test pass for scaled dot product attention
2024-01-09 14:30:57 -08:00
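A hedged sketch of the property these tests check: attention is computed independently per head, so sharding q, k and v on the head axis should reproduce the single-device result. Shapes, the tolerance, and device strings are illustrative.

```python
from tinygrad import Tensor, Device
import numpy as np

devices = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))

q, k, v = [Tensor.randn(1, 4, 8, 16) for _ in range(3)]  # (batch, heads, seq, head_dim)
single = q.scaled_dot_product_attention(k, v).numpy()

qs, ks, vs = [t.shard(devices, axis=1) for t in (q, k, v)]  # split the heads
sharded = qs.scaled_dot_product_attention(ks, vs).numpy()

np.testing.assert_allclose(single, sharded, atol=1e-4)
```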
Yixiang Gao
259bf9bffc add multigpu test for RMSNorm (#3056)
* need all gather

* add two multigpu test scenarios for RMSNorm
2024-01-09 09:52:51 -08:00
Yixiang Gao
a686663657 make Embedding device aware for multigpu (#3051)
* make Embedding device aware for multigpu

* split line instead of ignore because that's cheating

* add test incomplete

* add test complete

* remove comment

* fix white space

* remove nn.Embedding
2024-01-08 20:09:26 -08:00
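A hedged sketch of a device-aware Embedding in a multigpu setting: the table replicated on every device, the token ids sharded along the batch axis so lookups stay local. Sizes and device strings are illustrative.

```python
from tinygrad import Tensor, Device
from tinygrad.nn import Embedding

devices = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))

emb = Embedding(100, 32)                 # vocab_size=100, embed_size=32
emb.weight = emb.weight.shard(devices)   # replicate the table

tokens = Tensor([[1, 2, 3], [4, 5, 6]]).shard(devices, axis=0)
print(emb(tokens).shape)                 # (2, 3, 32)
```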
Yixiang Gao
8a63f26a0f make LR scheduler work with multigpu (#3011)
* add a failing test for LR scheduler when using multigpu

* fix calculation order and unnecessary tensor created for float

* min_lr is no longer tensor
2024-01-04 12:10:56 -08:00
chenyu
81b97cd2c6 canonicalize device in LazyBuffer constructor (#2991)
fixed the multitensor +1 then sum bug
2024-01-03 12:55:25 -05:00
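A hedged repro-style sketch of the "+1 then sum" case referenced by the commit above and the failing test recorded below: an elementwise op on a sharded tensor followed by a full reduce, checked against numpy. Device strings are illustrative.

```python
from tinygrad import Tensor, Device
import numpy as np

devices = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))

x = Tensor.arange(8).float()
xs = x.shard(devices, axis=0)

np.testing.assert_allclose((xs + 1).sum().numpy(), (x.numpy() + 1).sum())
```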
chenyu
db525cf8c2 multitensor failed test case with +1 then sum on DEVICE:0 (#2990) 2024-01-03 12:17:11 -05:00
George Hotz
5dbaaa7061 hotfix: make multitensor shard contiguous 2024-01-03 08:48:30 -08:00
George Hotz
f494b9d463 simple multitensor API (#2903)
* simple multitensor API

* test multitensor

* mt work

* new api

* copies

* all but data parallel

* allreduce there

* works, but axis sharded

* fix all mt tests

* features/multi

* work

* backprop

* fix tests

* tests passing

* mt progress

* cleanups

* less lines

* tensor cleanup

* save more lines

* mypy passes

* fix tests

* skip for cuda too

* bump download cache
2024-01-02 17:49:44 -08:00
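A hedged sketch of the basic multitensor API this commit introduces: shard with no axis replicates a tensor across devices, shard with an axis splits it, and ordinary ops then behave like ops on the original tensor. Device strings built from Device.DEFAULT are illustrative.

```python
from tinygrad import Tensor, Device
import numpy as np

devices = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))

x = Tensor.randn(4, 4)
replicated = x.shard(devices)          # full copy on every device
split = x.shard(devices, axis=0)       # rows split across the devices

# Ops on a sharded tensor should match ops on the original tensor.
np.testing.assert_allclose((split * 2).numpy(), x.numpy() * 2, atol=1e-6)
print(replicated.shape, split.shape)   # both report the logical shape (4, 4)
```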