Previously the output.weight layer would not be loaded, and would only
contain randomly initialized values. This led to junk when doing a
forward pass.
Signed-off-by: Daniel Xu <daniel@thinkingmachines.ai>
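A minimal sketch of the kind of check that would surface a silently skipped weight such as output.weight. This is not the loader touched by this change: `find_unloaded`, `expected_keys`, and `loaded` are hypothetical names, and plain dicts and lists stand in for the model parameters and the checkpoint.

```python
def find_unloaded(expected_keys, loaded):
    """Return the parameter names the model expects but the checkpoint did not provide."""
    return sorted(k for k in expected_keys if k not in loaded)

# Hypothetical parameter names; only output.weight comes from the description above.
expected_keys = ["tok_embeddings.weight", "norm.weight", "output.weight"]
loaded = {"tok_embeddings.weight": [0.1, 0.2], "norm.weight": [1.0]}

missing = find_unloaded(expected_keys, loaded)
if missing:
    # Any name listed here would keep its random initialization.
    print("not loaded:", missing)
```

Failing loudly on a non-empty `missing` list is one way to avoid a forward pass that quietly runs on random values.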
* work on minrf example
* more
* jit sample
* t is tensor not const
* fixes
* more convs
* fix dropout
* don't print
* 504
* big patch
* onehot
* touch
* use embeddings
* dumb uses final layer
* act
* non fl
* match
* tp
* 3
* of
* ppsz
* normal
* add adln
* no t
* weird transformer
* weird transformer
* contig
* actual speed fix
* dumb
* cb
* 0
* t is 0
* mort-t
* args
* dumb days are over
* readable
* contig
* no more t mask
* mask_t
* init to zero
* clean
* steps
* work
* tt
* t
* solid
sum of bool by default uses default_float as the accumulator dtype, so without an explicit cast to float it can overflow with a large BS when default_float=HALF.
This fixes clsf_accuracy being inf in mi300x bert.
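The overflow can be illustrated without any tensor library. The sketch below is an assumption-laden stand-in, not tinygrad code: `to_half` is a hypothetical helper that uses the standard library's half-precision `struct` format, the batch size is made up, and only the final count is rounded to half (a real half accumulator would also lose precision at every step).

```python
import struct

def to_half(x: float) -> float:
    """Round x to IEEE half precision, returning inf where the value does not fit."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the half range; model that as overflow to inf.
        return float("inf") if x > 0 else float("-inf")

BS = 70000                      # hypothetical batch size, larger than the half maximum (65504)
correct = [True] * BS           # every prediction correct

count_half = to_half(float(sum(correct)))   # count stored in a half accumulator
count_f32 = float(sum(correct))             # count stored in a wider float

acc_half = count_half / BS      # inf, because the count overflowed
acc_f32 = count_f32 / BS        # 1.0
print(acc_half, acc_f32)
```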
* least_upper_float is at least default_float
A step toward div rounding mode: the dtype of true int division would change from int32 to default_float, which also matches torch.
* fix bert acc
* load llama3-1B to WEBGPU device
* include compile script for loading llama3 to WEBGPU
* parametrize max_context in build_transformer fxn
* jit_model with two different args sets
* compile for webgpu, split weights
* load model weight parts in browser
* export all tensors from initialized transformer
* run transformer inference in browser
* enable tiktoken with llama bpe in browser
* count total tokens on client with tiktoken.js
* full client-side chat streaming, eliminate server
* revert change that enabled jitting with 2 argsets
* llama without Variable or cache_kv, for webgpu
* have client use mask tokens / whole context
* cleanup staged weights
* add tiktoken.js build script, README
* export CLANG for Q6_k to float32 decompression
* fix and test exported CLANG code for Q6_k to fp32
* revert changes to jit and export_model
* isolate clang export
* test Q6_K to float32 decompression in browser
* gguf_load now also returns t_infos and data_start
* prepare llama-1B Q6_K gguf chunks for browser
* cache and decompress quantized llama in browser
* enable separate deployment of large files
* fix kv cache and symbolic with llama wgpu
* eliminate browser lag during decompression
* hash metadata and weight chunks
* delete obsolete indexeddb cache to free disk
* add progress bar, track model download/decompress
* refactor progress callback
* skip buffer hash verification for speed
* Display progress for entire loading scope
* Report page load errors to user
* actually display errors
* skip prompt tokens already seen by model
* skip prefilling with last assistant message tokens
* on page load tell user if webgpu not enabled
* push deployed URL root to window.history
* make note of bug sources with TODO items
* isolate bug in CLANG with BEAM=2
* remove clang_bug.py from diff
* decompress q6k to f32 on webgpu instead of clang
* remove unused code
* inter-weight decomp with larger wgpu kernels
* parallelize decompression submissions
* refactor dequantize scheduling
* add progress bar back
* fix bug
* temp fix for loading GGUF Q6_K to fp16 not fp32
* fix rendering of exported CLANG
* remove weight casts, sketch js functions for clang
* get symbolic vars from jit_cache for model export
* include symbolic vars in exported CLANG
* render js for clang transformer
* toggle clang/webgpu deployment; refactor decomp
* compile and render clang Q6_K->fp16 and int8 quant
* fix rendered clang for abs(fp16), to work in wasm
* simplify clang js wrapping
* run compiled clang in worker
* prepare llama weights in workers, q6k to int8/fp16
* tinychat on clang in browser, f32/int8 weights
* move wasm inference to (now flexible) worker
* don't load redundant embeddings
* modest wasm perf gain with compile flags
* set default backend, enable backend choice/backup
* render symbolic vars in exported WEBGPU
* quantize webgpu llama to int8/f32
* improve UX arising from rendered WEBGPU
* clean up webgpu launch
* new weights split: smaller chunks, tinygrad quant.
* switch webgpu inference to int8 quant
* remove unneeded clang decompression
* eliminate unneeded kv cache transfer to wasm
* use 1 worker for simplified clang decompression
* display launch errors
* refactor: stream load weight chunks to WebGPU
* show loading chunk completion
* quantize embeddings to int8
* test float16 as input for quantization
* webgpu: use f16 source, int8 embed, eliminate q6k
* simplify split weights prep: all from state_dict
* revert change to nn.state.gguf_load
* remove unneeded decompression from webgpu client
* remove unneeded code
* decrease dl chunks from 47 to 16 MiB
* improve stability of webgpu loading on mobile
* autodetect mobile, improve load stability
* refactor: progress closure
* refactor: one unified progress bar
* remove unneeded code
* revert changes to tinygrad core library
* enforce ios18.3 nerfed max buf size
* BEAM=3 webgpu
* cache integrity, mobile save throttling
* improve mobile UX - no autozoom on prompt box
* clang: int8 from f16, remove q6k
* reduce concurrent dls on mobile to 2 for stability
* refactor: wasm backend with stream loading
* prevent race between wasm load and indexeddb save
* split wasm kernels into separate modules
* js wrapper for multiple wasm module inference
* revert multi-module wasm to single module
* make mobile wasm load more stable/fast
* refactor: copy weights into wasm without crashes
* fix bug in download queue; increase mobile dls
* refactor exported clang wrapper, split weights
* remove unnecessary code
* greatly improve int8 quant quality with rounding
* eliminate mobile throttling
* increase webgpu context to 4096 tokens
* export webgpu js functions
* enable separate hosted weights for mobile/pc
* enable prompt-thread switching during generation
* stop generation when max_context is reached
* show progress bar for prefill
* tell user if webgpu fails, while wasm loads
* make loading messages more concise
* update font
* revert changes to tinychat python app launch
* cleanup quantization, add scale_dtype param
* cleanup kv cache code
* cleanup compile code
* link tok_embeddings with output in webgpu export
* refactor: export_model webgpu: symbolic vars
* refactor: export_model weight loading
* forgot to commit export_model.py
* change CLANG to CPU
* deal with pylint incorrectly failing tests
* simplify f-strings for older CI python version
* fix pre-python3.12 parser errors
* [Int32Array] not Int32Array
* cleanup webgpu compile after refactor export_model
* refactor WASM export into export_model
* merge WebGPU/WASM compile scripts
* simplify max_contexts for local deployment
* fix parser issues and whitespace
* deduplicate variable defs for non-wasm clang export
* cleanup code
* cleanup compile scripts
* simplify wasm inference wrapping
* simplify webgpu symbolic vars export
* refactor: unify export of symbolic variables
* simplify WASM export
* simplify clang/wasm export
* update README and build scripts
* separate files for browser/python apps
* restore original python tinychat app files
* browser and python tinychats share assets
* minor cleanup
* isolate diffs to llama files
* minor cleanup
* set default scale_dtype
* set default scale_dtype for NF4 quantization
* make quantization of tok_embeds optional
* match output with tok_embeds if not quantizing
* minor change
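For the "greatly improve int8 quant quality with rounding" item above, a minimal sketch of why rounding helps. It assumes symmetric per-tensor quantization with a single scale, which may differ from the scheme actually used here; `quantize`, `mean_abs_error`, and the weights are hypothetical.

```python
def quantize(values, use_rounding):
    """Symmetric per-tensor int8 quantization; returns (int codes, scale)."""
    scale = max(abs(v) for v in values) / 127.0
    to_int = round if use_rounding else int   # int() truncates toward zero
    codes = [max(-127, min(127, to_int(v / scale))) for v in values]
    return codes, scale

def mean_abs_error(values, codes, scale):
    """Average absolute difference between the originals and their dequantized codes."""
    return sum(abs(v - c * scale) for v, c in zip(values, codes)) / len(values)

weights = [0.9, -0.35, 0.12, 0.77, -1.0]   # made-up weights for illustration

codes_trunc, scale = quantize(weights, use_rounding=False)
codes_round, _ = quantize(weights, use_rounding=True)

err_trunc = mean_abs_error(weights, codes_trunc, scale)
err_round = mean_abs_error(weights, codes_round, scale)
print(err_trunc, err_round)
```

Truncation can be off by almost a full quantization step per weight, while rounding to the nearest code is off by at most half a step, so the rounded error is never larger.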
For unknown reasons, without this change, trying to run "Llama 3.2 1B" fails with the error below. I do not know the performance impact of this change. I still can't get exo running, but this change lets me get further (before hitting a separate issue with VRAM allocation, which is a story for another day).
error:
```
Failed to fetch completions: Error processing prompt (see logs with DEBUG>=2): Nvrtc Error 6, NVRTC_ERROR_COMPILATION <null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
*((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
^
<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
*((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
^
<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
*((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
^
<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
*((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
^
4 errors detected in the compilation of "<null>".
```
* validate that FC exists before loading pretrained weights
* add test case for ResNet pretrained model without FC layer
* remove extra newline
* rename test case
* reraise exception if not handled by check
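A rough sketch of the validation described above, under stated assumptions: it is not the actual ResNet loader, dicts stand in for the model and the checkpoint, and `load_pretrained`, `has_fc`, and the weight names are hypothetical.

```python
def load_pretrained(model_params, pretrained, has_fc):
    """Copy pretrained weights into model_params, skipping fc.* when the model has no FC layer."""
    for name, value in pretrained.items():
        if name.startswith("fc.") and not has_fc:
            continue  # the model was built without a classifier head
        if name not in model_params:
            # Not a case the FC check covers, so surface the problem instead of hiding it.
            raise KeyError(f"unexpected weight: {name}")
        model_params[name] = value
    return model_params

pretrained = {"conv1.weight": [1.0], "fc.weight": [2.0], "fc.bias": [3.0]}
headless = load_pretrained({"conv1.weight": None}, pretrained, has_fc=False)
print(headless)
```

Only the missing-FC case is skipped; any other mismatch still raises, in the spirit of "reraise exception if not handled by check".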