Commit Graph

66 Commits

Author SHA1 Message Date
George Hotz
0c978d45e6 stub attention (#13196)
* stub attention

* name the kernels
2025-11-10 13:48:38 -08:00
nimlgen
9182948951 remove llvm_bf16_cast (#12075) 2025-09-08 20:51:15 +03:00
chenyu
ef17af85c6 remove .float call in llama logit (#11598)
* remove .float call in llama logit

* bfloat item
2025-08-10 00:02:18 -04:00
chenyu
3e64467322 remove freqs_cis contiguous in llama (#11597) 2025-08-09 21:11:12 -04:00
George Hotz
8ff03806e8 add llama layers (#11460)
* add llama layers

* add contig bw for speed
2025-07-31 16:28:04 -07:00
George Hotz
474ee9daa5 hotfix: add contiguous_backward to llama 2025-07-31 15:07:12 -07:00
wozeparrot
825b6a2505 feat: llama3 dataloader (#11340) 2025-07-30 13:27:55 -07:00
George Hotz
e15754db28 remove (some) kernelize from llama and test schedule speed (#10939)
* remove kernelize from llama

* 405B

* space
2025-06-23 15:07:31 -07:00
chenyu
4a6d84c4c3 hotfix llama start_pos vmax is max_context-1 (#10659)
* hotfix llama start_pos vmax is max_context-1

fixed `IGNORE_OOB=0 python3 examples/llama3.py --size 1B --benchmark --temperature 0`

* hotfix: multitensor transformer test tests kv cache

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2025-06-06 00:41:25 -04:00
George Hotz
8864ff894b hotfix: that repeat_kv belongs outside the if 2025-05-11 18:43:01 -07:00
George Hotz
98c84a711d min rectified flow example [pr] (#10252)
* work on minrf example

* more

* jit sample

* t is tensor not const

* fixes

* more convs

* fix dropout

* don't print

* 504

* big patch

* onehot

* touch

* use embeddings

* dumb uses final layer

* act

* non fl

* match

* tp

* 3

* of

* ppsz

* normal

* add adln

* no t

* weird transformer

* weird transformer

* contig

* actual speed fix

* dumb

* cb

* 0

* t is 0

* mort-t

* args

* dumb days are over

* readable

* contig

* no more t mask

* mask_t

* init to zero

* clean

* steps

* work

* tt

* t

* solid
2025-05-11 18:36:44 -07:00
chenyu
17d4d258ea simple symbolic slice in llama [pr] (#10112)
support slice that has step None and stop > start
2025-04-30 14:36:35 -04:00
chenyu
573bbb9746 Revert "remove TransformerBlock contiguous in llama (#10104)" (#10108)
This reverts commit b8d07dcc54.
2025-04-29 15:28:38 -04:00
chenyu
b8d07dcc54 remove TransformerBlock contiguous in llama (#10104) 2025-04-29 14:15:39 -04:00
qazal
3b67f56c02 kernelize some llama realizes (#10098) 2025-04-29 18:39:56 +08:00
chenyu
3eba3d6ee9 don't pass model in convert_from_huggingface and convert_from_gguf (#10094)
it only needs n_layers
2025-04-28 20:11:19 -04:00
George Hotz
25847080f0 olmoe (from stream, wip) (#9390)
* olmoest working (but not)

* it's correct

* compare ropes

* old code wasn't wrong

* default device

* no metal

* fix permute

* working

* more minimal
2025-03-10 13:46:33 +08:00
hooved
3b9950241e tinychat in browser, Part 1: llama (#9273)
* load llama3-1B to WEBGPU device

* include compile script for loading llama3 to WEBGPU

* parametrize max_context in build_transformer fxn

* jit_model with two different args sets

* compile for webgpu, split weights

* load model weight parts in browser

* export all tensors from initialized transformer

* run transformer inference in browser

* enable tiktoken with llama bpe in browser

* count total tokens on client with tiktoken.js

* full client-side chat streaming, eliminate server

* revert change that enabled jitting with 2 argsets

* llama without Variable or cache_kv, for webgpu

* have client use mask tokens / whole context

* cleanup staged weights

* add tiktoken.js build script, README

* export CLANG for Q6_k to float32 decompression

* fix and test exported CLANG code for Q6_k to fp32

* revert changes to jit and export_model

* isolate clang export

* test Q6_K to float32 decompression in browser

* gguf_load now also returns t_infos and data_start

* prepare llama-1B Q6_K gguf chunks for browser

* cache and decompress quantized llama in browser

* enable separate deployment of large files

* fix kv cache and symbolic with llama wgpu

* eliminate browser lag during decompression

* hash metadata and weight chunks

* delete obsolete indexeddb cache to free disk

* add progress bar, track model download/decompress

* refactor progress callback

* skip buffer hash verification for speed

* Display progress for entire loading scope

* Report page load errors to user

* actually display errors

* skip prompt tokens already seen by model

* skip prefilling with last assistant message tokens

* on page load tell user if webgpu not enabled

* push deployed URL root to window.history

* make note of bug sources with TODO items

* isolate bug in CLANG with BEAM=2

* remove clang_bug.py from diff

* decompress q6k to f32 on webgpu instead of clang

* remove unused code

* inter-weight decomp with larger wgpu kernels

* parallelize decompression submissions

* refactor dequantize scheduling

* add progress bar back

* fix bug

* temp fix for loading GGUF Q6_K to fp16 not fp32

* fix rendering of exported CLANG

* remove weight casts, sketch js functions for clang

* get symbolic vars from jit_cache for model export

* include symbolic vars in exported CLANG

* render js for clang transformer

* toggle clang/webgpu deployment; refactor decomp

* compile and render clang Q6_K->fp16 and int8 quant

* fix rendered clang for abs(fp16), to work in wasm

* simplify clang js wrapping

* run compiled clang in worker

* prepare llama weights in workers, q6k to int8/fp16

* tinychat on clang in browser, f32/int8 weights

* move wasm inference to (now flexible) worker

* don't load redundant embeddings

* modest wasm perf gain with compile flags

* set default backend, enable backend choice/backup

* render symbolic vars in exported WEBGPU

* quantize webgpu llama to int8/f32

* improve UX arising from rendered WEBGPU

* clean up webgpu launch

* new weights split: smaller chunks, tinygrad quant.

* switch webgpu inference to int8 quant

* remove unneeded clang decompression

* eliminate unneeded kv cache transfer to wasm

* use 1 worker for simplified clang decompression

* display launch errors

* refactor: stream load weight chunks to WebGPU

* show loading chunk completion

* quantize embeddings to int8

* test float16 as input for quantization

* webgpu: use f16 source, int8 embed, eliminate q6k

* simplify split weights prep: all from state_dict

* revert change to nn.state.gguf_load

* remove unneeded decompression from webgpu client

* remove unneeded code

* decrease dl chunks from 47 to 16 MiB

* improve stability of webgpu loading on mobile

* autodetect mobile, improve load stability

* refactor: progress closure

* refactor: one unified progress bar

* remove unneeded code

* revert changes to tinygrad core library

* enforce ios18.3 nerfed max buf size

* BEAM=3 webgpu

* cache integrity, mobile save throttling

* improve mobile UX - no autozoom on prompt box

* clang: int8 from f16, remove q6k

* reduce concurrent dls on mobile to 2 for stability

* refactor: wasm backend with stream loading

* prevent race between wasm load and indexedb save

* split wasm kernels into separate modules

* js wrapper for multiple wasm module inference

* revert multi-module wasm to single module

* make mobile wasm load more stable/fast

* refactor: copy weights into wasm without crashes

* fix bug in download queue; increase mobile dls

* refactor exported clang wrapper, split weights

* remove unnecessary code

* greatly improve int8 quant quality with rounding

* eliminate mobile throttling

* increase webgpu context to 4096 tokens

* export webgpu js functions

* enable separate hosted weights for mobile/pc

* enable prompt-thread switching during generation

* stop generation when max_context is reached

* show progress bar for prefill

* tell user if webgpu fails, while wasm loads

* make loading messages more concise

* update font

* revert changes to tinychat python app launch

* cleanup quantization, add scale_dtype param

* cleanup kv cache code

* cleanup compile code

* link tok_embeddings with output in webgpu export

* refactor: export_model webgpu: symbolic vars

* refactor: export_model weight loading

* forgot to commit export_model.py

* change CLANG to CPU

* deal with pylint incorrectly failing tests

* simplify f-strings for older CI python version

* fix pre-python3.12 parser errors

* [Int32Array] not Int32Array

* cleanup webgpu compile after refactor export_model

* refactor WASM export into export_model

* merge WebGPU/WASM compile scripts

* simplify max_contexts for local deployment

* fix parser issues and whitespace

* deduplicate variable defs for non-wasm clang export

* cleanup code

* cleanup compile scripts

* simplify wasm inference wrapping

* simplify webgpu symbolic vars export

* refactor: unify export of symbolic variables

* simplify WASM export

* simplify clang/wasm export

* update README and build scripts

* separate files for browser/python apps

* restore original python tinychat app files

* browser and python tinychats share assets

* minor cleanup

* isolate diffs to llama files

* minor cleanup

* set default scale_dtype

* set default scale_dtype for NF4 quantization

* make quantization of tok_embeds optional

* match output with tok_embeds if not quantizing

* minor change
2025-02-27 15:57:37 -05:00
divinity76
bec4f59ce8 workaround f16 cast ambiguity (#8935)
for unknown reasons, without this, when trying to execute "Llama 3.2 1B", I get the error below. Fwiw I do not know the performance impact for this change. I can't even get exo running, but this change allows me to /get further/ (before running into a separate issue with vram allocation? story for another day i suppose)

error: 
```
Failed to fetch completions: Error processing prompt (see logs with DEBUG>=2): Nvrtc Error 6, NVRTC_ERROR_COMPILATION <null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
            function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
    *((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
                                                                                 ^

<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
            function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
    *((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
                                                                                                ^

<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
            function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
    *((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
                                                                                                               ^

<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
            function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
    *((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
                                                                                                                              ^

4 errors detected in the compilation of "<null>".
```
2025-02-11 09:38:56 +08:00
chenyu
a092b6395d Tuple -> tuple, List -> list [pr] (#8936) 2025-02-06 14:21:19 -05:00
George Hotz
80089536e5 Revert "move llvm_bf16_cast to renderer for CLANG and LLVM [pr] (#8720)" (#8786)
This reverts commit af0452f116.
2025-01-28 18:59:02 +09:00
mesozoic-egg
af0452f116 move llvm_bf16_cast to renderer for CLANG and LLVM [pr] (#8720)
* handle bf16 via bitcasting for CLANG and LLVM

* On LLVM, skip float16 cast

* float32 on llvm lite, float32 elsewhere

* code format

* trigger pr

* move to rewriter

---------

Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.mail>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-28 18:16:43 +09:00
Francis Lata
c3187087f7 QwQ-32B-Preview support (#7962)
* load weights with some debugging

* start running a prompt

* cleanup

* optionally permute layers and cleanup

* add validation for simple prompt

* small cleanup

* minor cleanup with formatting download links

* add a longer prompt

* add timing option

* some typings

* remove unused arg

* reset GlobalCounters

* minor cleanups
2024-12-04 21:46:37 -05:00
chenyu
336a9b6bf3 remove dtype from llama precompute_freqs_cis (#7930)
do the cast based on input in first forward call instead
2024-11-27 22:28:40 -05:00
chenyu
3b26e51fce Tensor.cummax (#7854)
generalized the existing cumsum and take Ops.MAX in addition to Ops.ADD
2024-11-22 15:55:02 -05:00
eliotgolding
e920f1d663 Llama 3.2 1B load from GGUF (#7295)
* gguf 1b-instruct

* not needed
2024-10-27 09:29:02 +08:00
George Hotz
f4ec39fe58 switch symbolic from old to uops, final PR (#6872)
* switch symbolic from old to uops, final PR

* two wrong answers

* not needed resolves

* symbolic ops passes

* symbolic ops passes

* progress

* tests pass (almost)

* fix last test

* fix some tests

* global binding and unbinding

* Revert "global binding and unbinding"

This reverts commit 9456725630.

* that test works now

* vars on uop doesn't recurse

* fix fuzzer

* update

* fix type

* fix gpt, it's UOp now

* ssimplify symbolics
2024-10-04 16:42:27 +08:00
chenyu
c3c93f332a symbolic bool raise ValueError when not sure [pr] (#6853) 2024-10-02 09:10:58 -04:00
wozeparrot
d269bc95fa faster tinychat (#5993) 2024-08-08 19:16:26 -07:00
George Hotz
bf8ec23b00 hotfix: contiguous on precompute_freqs_cis 2024-08-07 14:40:56 -07:00
David Hou
9a485f36e4 shard kvcache (#5830) 2024-07-30 20:29:54 -07:00
George Hotz
4e89d45513 hotfix: put contiguous back in llama 2024-07-30 18:43:48 -07:00
George Hotz
21c5e8e1b7 extreme llama speed, 57.34 tok/s (#5827)
* extreme llama speed

* mergable
2024-07-30 18:32:09 -07:00
wozeparrot
fa873df9c1 bring tinychat more inline with tinyos' version (#5358) 2024-07-10 13:13:52 -07:00
chenyu
b2c3a28a5e nn.RMSNorm (#5272)
the norm itself has no significant value to add to Tensor method, but we would want Tensor.normalize
2024-07-02 21:39:01 -04:00
George Hotz
14980f79dd hotfix: unbreak llama 2024-06-30 15:27:54 -07:00
George Hotz
3df47bc21e OpenELM + repeat_interleave (#5234)
* start writing openelm

* progress...hit bug

* repeat_interleave support

* gqa

* add rotary embedding

* spp

* i think it runs correctly

* broken

* output is good now

* cleanups

* no io_uring on android
2024-06-30 15:18:39 -07:00
chenyu
e468601226 update llama attention casting (#5096)
* update llama attention casting

updated scaled_dot_product_attention middle cast and removed hard-coded half in llama attention.

* fix that
2024-06-22 10:57:17 -04:00
chenyu
8bd6cb9511 update llama model RMSNorm casting (#5095)
following the original implementation, cast back to input dtype before multiplying weight. slightly faster
https://github.com/meta-llama/llama/blob/main/llama/model.py
2024-06-21 23:02:04 -04:00
chenyu
31358cbea5 change Tensor.stack to method (#4719) 2024-05-24 17:04:19 -04:00
chenyu
ae861325ce update llama sample for mac 32 input buffer limit (#4662)
set default sampling params to function call to 0, and top k in llama3 to 25.
2024-05-20 17:23:39 -04:00
wozeparrot
b144d4b460 new llama3 example (#4576) 2024-05-19 22:42:23 -07:00
chenyu
a65c8de735 move .half() llama freq_cis to the end of sin and cos (#4587)
otherwise arange has inf if either dim or context length exceeds half.max
2024-05-14 15:00:18 -04:00
George Hotz
e79a11b99c hotfix: revert llama change 2024-04-10 20:13:15 -07:00
George Hotz
2e6c39b0b2 Do less realizes (#4141)
* less realize

* corealize jit inputs

* prints

* print before we run
2024-04-10 19:50:50 -07:00
chenyu
f8dc82a8a7 use single tensor for llama kv chache (#4108)
similar to optimization in gpt2
2024-04-08 00:38:32 -04:00
chenyu
92c0675ccf setitem initial support (#4093)
* wip setitem

it's an eager assign to output shapetracker view

* cleanups and tests

* more cleanups
2024-04-07 20:35:22 -04:00
George Hotz
4c4d3cb3e3 restrict assignment to base (#3809)
* restrict assignment to base

* add some restrictions there

* more restrictions
2024-03-18 15:33:06 -07:00
chenyu
5ac1fa933f apply the same fix_bf16 in llama and coder (#3789)
* apply the same fix_bf16 in llama and coder

did not realize the same logic was in llama too.
really fix #2775

* flag for native SUPPORT_BF16 cast
2024-03-17 21:25:24 -04:00
George Hotz
641f347232 simple LoadOps.ASSIGN (#3745)
* simple LoadOps.ASSIGN

* skip that test

* don't assign in onnx ops gemm

* track cache usage

* recreate the lazybuffer to avoid the cache

* fix contigs

* skip that test

* lol

* better letters
2024-03-14 20:44:34 -07:00