wozeparrot
d269bc95fa
faster tinychat ( #5993 )
2024-08-08 19:16:26 -07:00
George Hotz
bf8ec23b00
hotfix: contiguous on precompute_freqs_cis
2024-08-07 14:40:56 -07:00
David Hou
9a485f36e4
shard kvcache ( #5830 )
2024-07-30 20:29:54 -07:00
George Hotz
4e89d45513
hotfix: put contiguous back in llama
2024-07-30 18:43:48 -07:00
George Hotz
21c5e8e1b7
extreme llama speed, 57.34 tok/s ( #5827 )
...
* extreme llama speed
* mergable
2024-07-30 18:32:09 -07:00
wozeparrot
fa873df9c1
bring tinychat more in line with tinyos' version ( #5358 )
2024-07-10 13:13:52 -07:00
chenyu
b2c3a28a5e
nn.RMSNorm ( #5272 )
...
the norm itself adds no significant value as a Tensor method, but we would want Tensor.normalize
2024-07-02 21:39:01 -04:00
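RMSNorm normalizes by the root mean square over the last axis and applies a learned scale. A minimal sketch of the idea in tinygrad-style Python (the eps value and shapes are illustrative, not necessarily what nn.RMSNorm uses):

```python
from tinygrad import Tensor

class RMSNorm:
  def __init__(self, dim: int, eps: float = 1e-6):
    self.eps = eps
    self.weight = Tensor.ones(dim)

  def __call__(self, x: Tensor) -> Tensor:
    # scale each vector by the reciprocal of its root mean square, then apply the learned weight
    return x * (x.square().mean(axis=-1, keepdim=True) + self.eps).rsqrt() * self.weight

print(RMSNorm(512)(Tensor.randn(2, 8, 512)).shape)  # (2, 8, 512)
```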
George Hotz
14980f79dd
hotfix: unbreak llama
2024-06-30 15:27:54 -07:00
George Hotz
3df47bc21e
OpenELM + repeat_interleave ( #5234 )
...
* start writing openelm
* progress...hit bug
* repeat_interleave support
* gqa
* add rotary embedding
* spp
* i think it runs correctly
* broken
* output is good now
* cleanups
* no io_uring on android
2024-06-30 15:18:39 -07:00
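The repeat_interleave added here is what makes grouped-query attention convenient: each kv head is repeated so the key/value tensors line up with the larger number of query heads. A small sketch with illustrative shapes (not the OpenELM code itself):

```python
from tinygrad import Tensor

bsz, seqlen, n_heads, n_kv_heads, head_dim = 1, 16, 8, 4, 64
n_rep = n_heads // n_kv_heads  # 8 query heads share 4 kv heads -> repeat each kv head twice

k = Tensor.randn(bsz, seqlen, n_kv_heads, head_dim)
# repeat every kv head n_rep times along the head axis so it matches the query heads
k_expanded = k.repeat_interleave(n_rep, dim=2)
print(k_expanded.shape)  # (1, 16, 8, 64)
```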
chenyu
e468601226
update llama attention casting ( #5096 )
...
* update llama attention casting
updated scaled_dot_product_attention middle cast and removed hard-coded half in llama attention.
* fix that
2024-06-22 10:57:17 -04:00
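The "middle cast" refers to doing the score/softmax math in float32 even when the activations are half, then casting back before the value matmul. A hedged sketch of that pattern written out by hand (tinygrad's built-in scaled_dot_product_attention handles this internally; shapes are illustrative):

```python
import math
from tinygrad import Tensor, dtypes

def attention_middle_cast(q: Tensor, k: Tensor, v: Tensor) -> Tensor:
  # compute scores and softmax in float32 for stability, cast back to the input dtype afterwards
  scores = q.cast(dtypes.float32) @ k.cast(dtypes.float32).transpose(-2, -1) / math.sqrt(q.shape[-1])
  return scores.softmax(-1).cast(q.dtype) @ v

q = k = v = Tensor.randn(1, 8, 16, 64, dtype=dtypes.float16)
print(attention_middle_cast(q, k, v).dtype)  # stays float16 at the boundary
```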
chenyu
8bd6cb9511
update llama model RMSNorm casting ( #5095 )
...
following the original implementation, cast back to the input dtype before multiplying by the weight. slightly faster
https://github.com/meta-llama/llama/blob/main/llama/model.py
2024-06-21 23:02:04 -04:00
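The change is only about where the cast back to the input dtype happens: normalize in float32, cast down, and then multiply by the weight (as in the reference model.py linked above), instead of multiplying first and casting at the end. A sketch of the before/after using an RMS-style norm (names are illustrative):

```python
from tinygrad import Tensor, dtypes

def _norm(x: Tensor, eps: float = 1e-6) -> Tensor:
  return x * (x.square().mean(axis=-1, keepdim=True) + eps).rsqrt()

x = Tensor.randn(4, 512, dtype=dtypes.float16)
weight = Tensor.ones(512, dtype=dtypes.float16)

old = (_norm(x.float()) * weight).cast(x.dtype)  # before: weight multiply still happens in float32
new = _norm(x.float()).cast(x.dtype) * weight    # after: cast back first, then scale in float16
```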
chenyu
31358cbea5
change Tensor.stack to method ( #4719 )
2024-05-24 17:04:19 -04:00
chenyu
ae861325ce
update llama sample for mac 32 input buffer limit ( #4662 )
...
set the default sampling params in the function call to 0, and top k in llama3 to 25.
2024-05-20 17:23:39 -04:00
wozeparrot
b144d4b460
new llama3 example ( #4576 )
2024-05-19 22:42:23 -07:00
chenyu
a65c8de735
move .half() llama freq_cis to the end of sin and cos ( #4587 )
...
otherwise arange overflows to inf if either dim or context length exceeds half.max
2024-05-14 15:00:18 -04:00
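The reason for the move: float16 tops out at 65504, so building the position/frequency table directly in half makes the arange (and the position-times-frequency products) overflow to inf for long contexts. A hedged sketch that computes everything in the default float32 and casts only the final sin/cos (the layout is illustrative, not the exact precompute_freqs_cis):

```python
from tinygrad import Tensor

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0) -> Tensor:
  # angles are built in the default float32; an arange in float16 would overflow to inf
  # once dim or the context length passes float16's max of 65504
  freqs = 1.0 / (theta ** (Tensor.arange(0, dim, 2)[: dim // 2] / dim))
  angles = Tensor.arange(end).unsqueeze(1) * freqs.unsqueeze(0)
  # only the finished sin/cos values are cast down to half
  return Tensor.stack(angles.cos().half(), angles.sin().half(), dim=-1)

print(precompute_freqs_cis(128, 8192).shape)  # (8192, 64, 2)
```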
George Hotz
e79a11b99c
hotfix: revert llama change
2024-04-10 20:13:15 -07:00
George Hotz
2e6c39b0b2
Do less realizes ( #4141 )
...
* less realize
* corealize jit inputs
* prints
* print before we run
2024-04-10 19:50:50 -07:00
chenyu
f8dc82a8a7
use single tensor for llama kv cache ( #4108 )
...
similar to optimization in gpt2
2024-04-08 00:38:32 -04:00
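The optimization keeps keys and values in one preallocated buffer and writes each decode step into a slice of it with shrink + assign, rather than concatenating two growing tensors. A hedged sketch of the pattern (names and shapes are illustrative, not the exact code from the PR):

```python
from tinygrad import Tensor

bsz, max_context, n_kv_heads, head_dim = 1, 1024, 8, 64
# a single buffer holds both halves of the cache: index 0 is keys, index 1 is values
cache_kv = Tensor.zeros(2, bsz, max_context, n_kv_heads, head_dim).contiguous().realize()

def update_cache(cache_kv: Tensor, xk: Tensor, xv: Tensor, start_pos: int) -> None:
  seqlen = xk.shape[1]
  # write this step's keys/values into the [start_pos, start_pos+seqlen) slice of the buffer
  cache_kv.shrink((None, None, (start_pos, start_pos + seqlen), None, None)) \
          .assign(Tensor.stack(xk, xv)).realize()

xk = Tensor.randn(bsz, 16, n_kv_heads, head_dim)
xv = Tensor.randn(bsz, 16, n_kv_heads, head_dim)
update_cache(cache_kv, xk, xv, start_pos=0)
```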
chenyu
92c0675ccf
setitem initial support ( #4093 )
...
* wip setitem
it's an eager assign to the output ShapeTracker view
* cleanups and tests
* more cleanups
2024-04-07 20:35:22 -04:00
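"Eager assign to the output ShapeTracker view" means the indexed slice of the target buffer is written in place right away, rather than a new tensor being built. A small usage sketch (using a contiguous, realized target to stay within the initial restrictions):

```python
from tinygrad import Tensor

t = Tensor.zeros(6).contiguous().realize()
# __setitem__ writes into the selected view of t's existing buffer, eagerly
t[2:4] = Tensor([1.0, 2.0])
print(t.numpy())  # [0. 0. 1. 2. 0. 0.]
```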
George Hotz
4c4d3cb3e3
restrict assignment to base ( #3809 )
...
* restrict assignment to base
* add some restrictions there
* more restrictions
2024-03-18 15:33:06 -07:00
chenyu
5ac1fa933f
apply the same fix_bf16 in llama and coder ( #3789 )
...
* apply the same fix_bf16 in llama and coder
did not realize the same logic was in llama too.
really fix #2775
* flag for native SUPPORT_BF16 cast
2024-03-17 21:25:24 -04:00
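The usual workaround when a backend can't cast bfloat16 natively: bfloat16 is just the top 16 bits of a float32, so the raw bits can be shifted into the high half of a uint32 and bitcast to float32. A hedged sketch of that trick, demonstrated as a round trip starting from float32 (this is not necessarily the repo's fix_bf16 helper):

```python
from tinygrad import Tensor, dtypes

def bf16_bits_to_float32(bits: Tensor) -> Tensor:
  # shift the 16 bfloat16 bits into the high half of a uint32 and reinterpret as float32
  return (bits.cast(dtypes.uint32) * (1 << 16)).bitcast(dtypes.float32)

# round trip: keep only the top 16 bits of some float32 values (their bfloat16 encoding), then recover them
x = Tensor([1.0, -2.5, 3.140625])
bf16_bits = (x.bitcast(dtypes.uint32) // (1 << 16)).cast(dtypes.uint16)
print(bf16_bits_to_float32(bf16_bits).numpy())  # [ 1.  -2.5  3.140625]
```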
George Hotz
641f347232
simple LoadOps.ASSIGN ( #3745 )
...
* simple LoadOps.ASSIGN
* skip that test
* don't assign in onnx ops gemm
* track cache usage
* recreate the lazybuffer to avoid the cache
* fix contigs
* skip that test
* lol
* better letters
2024-03-14 20:44:34 -07:00
George Hotz
a72b1b6d65
sharding for llama ( #3151 )
...
* shard llama
* sharding works
* simpler
* simpler
* consume option
* disable that test
* save a line
---------
Co-authored-by: George Hotz <george@tinygrad.org>
2024-01-16 19:28:00 -08:00
Yixiang Gao
13e872b53f
add multigpu support for llama attention ( #3064 )
...
* add llama attention test for multigpu
* test fails
* kv cache trying to shrink on sharded axis
* mask None works for scaled dot product
* kv cache seems to be working but scaled dot product breaks
* scaled dot product works, but the last linear layer failed
* running into the reshape case where it could be wrong for multigpu
* making sure it was the reshape
* adding contiguous doesn't solve
* need to shard more properly
* remove reshape test
* minor adjustment to scaled dot product attention test
* weights are sharded wrong
* continue fix new weight sharding
* clean up
* fix attention when start_pos is 0
* remove print
* add TODOs for the best multigpu interface
2024-01-11 16:31:02 -08:00
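For context, tinygrad's multi-device story is built around Tensor.shard: large weights are split along one axis across the devices, while small tensors like norm weights are replicated with axis=None. A hedged sketch with illustrative shapes and device strings (not the code from this PR):

```python
from tinygrad import Tensor, Device

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))

# split a projection weight across devices along its output axis,
# replicate the small norm weight everywhere (axis=None)
wq = Tensor.randn(1024, 1024).shard(GPUS, axis=0).realize()
norm_weight = Tensor.ones(1024).shard(GPUS, axis=None).realize()
print(wq.device, norm_weight.device)  # both report the tuple of devices
```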
chenyu
c9371f0d31
hotfix llama conversation mode ( #3031 )
...
without contiguous on keys and values, it runs but the update is incorrect
2024-01-06 16:57:07 -05:00
chenyu
f88506e630
move gpt2/llama sampling inside the model call ( #3013 )
...
* move gpt2/llama sampling inside the model call
* argmax uses one more kernel
2024-01-04 17:01:50 -05:00
chenyu
ad4472e6e8
cleanup llama apply_rotary_emb and other helpers ( #2950 )
...
* cleanup llama apply_rotary_emb and other helpers
used ellipsis and other higher level tensor functions.
disabled the half @ half -> half tensor core as it fails uop dtype checks
* keep hip 8x8->8 wmma
2023-12-29 11:39:15 -05:00
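The cleanup this describes: treat the last axis as (real, imag) pairs and write the complex multiply with ellipsis indexing, so one helper covers any leading batch/seq/head shape. A sketch close to that style (shapes are illustrative):

```python
from tinygrad import Tensor

def complex_mult(A: Tensor, c: Tensor, d: Tensor) -> Tensor:
  # (a + bi) * (c + di) with the real/imag parts stored as a trailing axis of size 2
  a, b = A[..., 0:1], A[..., 1:2]
  return (a * c - b * d).cat(a * d + b * c, dim=-1)

def apply_rotary_emb(xq: Tensor, xk: Tensor, freqs_cis: Tensor):
  # the ellipsis keeps this shape-agnostic: only the trailing (pairs, 2) layout matters
  xq, xk = xq.reshape(*xq.shape[:-1], -1, 2), xk.reshape(*xk.shape[:-1], -1, 2)
  c, d = freqs_cis[..., 0:1], freqs_cis[..., 1:2]
  return complex_mult(xq, c, d).flatten(3), complex_mult(xk, c, d).flatten(3)

xq, xk = Tensor.randn(1, 16, 8, 64), Tensor.randn(1, 16, 8, 64)  # (bsz, seqlen, n_heads, head_dim)
freqs_cis = Tensor.randn(1, 16, 1, 32, 2)                        # broadcasts over the head axis
q, k = apply_rotary_emb(xq, xk, freqs_cis)
print(q.shape, k.shape)  # (1, 16, 8, 64) (1, 16, 8, 64)
```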
chenyu
61e255d197
use max for gpt2 and llama ( #2949 )
...
not using argmax yet because there's a multinomial outside of the function.
2023-12-28 23:26:00 -05:00
chenyu
1fb815e77e
hotfix fix coder. RMSNorm cannot have float16 input ( #2932 )
...
* hotfix fix coder. RMSNorm cannot have float16 input
* update real world test due to new kernels
* more type casts
2023-12-25 02:28:11 -05:00
chenyu
b55b55d56e
use at least int32 and uint32 for sum output ( #2926 )
...
* use at least int32 and uint32 for sum output
* use the correct type for acc
* fix opencl
* llvm mulacc
2023-12-24 01:14:54 -05:00
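The point of the change in one line: a reduction over a small integer dtype accumulates in at least 32 bits, so the result can't silently wrap around. A tiny hedged example:

```python
from tinygrad import Tensor, dtypes

s = Tensor([255] * 300, dtype=dtypes.uint8).sum()
print(s.dtype, s.item())  # at least uint32, so this is 76500 rather than a wrapped uint8 value
```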
George Hotz
64dded27f0
pad ops broke coder ( #2881 )
...
* pad ops broke coder
* that contiguous fixes it
* Update lazy.py
2023-12-20 17:03:41 -08:00
George Hotz
1765849937
new lazy, benchmark ( #2878 )
...
* lazy rewrite, try 2
* min fix tests
* pass contig test
* put broken pads back
* move that to realize
* no contig child fixes array packing
* so wrong
* now that's correct
* base children
* fix bind issues
* disable to_image_idx
* fix tests
* that failure shouldn't break other tests
* more fixes
* fix torch
* skip failing tests in CI
* 1e-7
* half is broken
* 1e-6 margin of error
2023-12-20 14:33:21 -08:00
chenyu
c0f76ed4ea
transformer kvcache and mask have same dtype as input ( #2771 )
...
* transformer kvcache and mask have same dtype as input
* don't use `=0` in cstyle ternary where
* (bool)
* where float16 test
2023-12-14 22:41:51 -05:00
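The fix amounts to building the attention mask (and the kv cache) with dtype=x.dtype instead of letting it default to float32, so a half-precision model stays in half end-to-end. A hedged sketch of the mask side (shape and start_pos are illustrative):

```python
from tinygrad import Tensor, dtypes

x = Tensor.randn(1, 16, 512, dtype=dtypes.float16)
start_pos, seqlen = 0, x.shape[1]

# causal mask created in the input's dtype rather than the float32 default
mask = Tensor.full((1, 1, seqlen, start_pos + seqlen), float("-inf"), dtype=x.dtype).triu(start_pos + 1)
print(mask.dtype)  # matches x.dtype (float16)
```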
George Hotz
b3982187d1
Mixtral Example ( #2691 )
...
* mixtral
* simpler
* global counters
* simpler
* weights arg
2023-12-10 17:18:31 -08:00
chenyu
539b00a645
move llama getenv("JIT") from models to examples ( #2671 )
...
the Transformer class has a jit param, so we should use that in the caller
2023-12-07 12:43:22 -05:00
chenyu
6ba6349c97
JIT=0 llama.py should not jit ( #2609 )
2023-12-04 20:21:07 -05:00
Davi Silva
ddeec24fa8
Cleanup & fix llama.py ( #2524 )
...
* docs, cleanup crap
* comma AI
* fix 70B
* this is why lexical scope exists
2023-11-30 16:00:17 -05:00
George Hotz
7170a9a057
coder.py can write and run code ( #2439 )
...
* wip mistral
* coder
* touchups
* cleanups
* mistral cleanups
* clean up cache create
* download the weights, fix tests
* fix llama loading
* global fixup
* clean up all
* move llama model
* cleanups
* Revert "cleanups"
This reverts commit a71c5d59eb.
* fine, leave it
2023-11-25 12:27:54 -08:00