* add torch inplace tests
* first set of tests passing
* wrap all inplace funcs, add more tests
* fixes and wrap more functions
* fix all uint8 tests to avoid slow tests
* fix the one test
* another test, another fix
* and one more, works for ddp now
* something on contiguous, cleanup
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
* terrible but somewhat working impl
* linux behaves differently than macos?
* slightly better impl
* small clean up; haven't figured this out yet
* better
* torch has different behavior on linux and macos for duplicated values
* add sum docs
* fix test
* add torch return_type test
* add an exception test
* wrap_fxn instead, and move op lower in order
* better repeated values test
* rerun ci
* load llama3-1B to WEBGPU device
* include compile script for loading llama3 to WEBGPU
* parametrize max_context in build_transformer fxn
* jit_model with two different args sets
* compile for webgpu, split weights
* load model weight parts in browser
* export all tensors from initialized transformer
* run transformer inference in browser
* enable tiktoken with llama bpe in browser
* count total tokens on client with tiktoken.js
* full client-side chat streaming, eliminate server
* revert change that enabled jitting with 2 argsets
* llama without Variable or cache_kv, for webgpu
* have client use mask tokens / whole context
* cleanup staged weights
* add tiktoken.js build script, README
* export CLANG for Q6_k to float32 decompression
* fix and test exported CLANG code for Q6_k to fp32
* revert changes to jit and export_model
* isolate clang export
* test Q6_K to float32 decompression in browser
* gguf_load now also returns t_infos and data_start
* prepare llama-1B Q6_K gguf chunks for browser
* cache and decompress quantized llama in browser
* enable separate deployment of large files
* fix kv cache and symbolic with llama wgpu
* eliminate browser lag during decompression
* hash metadata and weight chunks
* delete obsolete indexeddb cache to free disk
* add progress bar, track model download/decompress
* refactor progress callback
* skip buffer hash verification for speed
* Display progress for entire loading scope
* Report page load errors to user
* actually display errors
* skip prompt tokens already seen by model
* skip prefilling with last assistant message tokens
* on page load tell user if webgpu not enabled
* push deployed URL root to window.history
* make note of bug sources with TODO items
* isolate bug in CLANG with BEAM=2
* remove clang_bug.py from diff
* decompress q6k to f32 on webgpu instead of clang
* remove unused code
* inter-weight decomp with larger wgpu kernels
* parallelize decompression submissions
* refactor dequantize scheduling
* add progress bar back
* fix bug
* temp fix for loading GGUF Q6_K to fp16 not fp32
* fix rendering of exported CLANG
* remove weight casts, sketch js functions for clang
* get symbolic vars from jit_cache for model export
* include symbolic vars in exported CLANG
* render js for clang transformer
* toggle clang/webgpu deployment; refactor decomp
* compile and render clang Q6_K->fp16 and int8 quant
* fix rendered clang for abs(fp16), to work in wasm
* simplify clang js wrapping
* run compiled clang in worker
* prepare llama weights in workers, q6k to int8/fp16
* tinychat on clang in browser, f32/int8 weights
* move wasm inference to (now flexible) worker
* don't load redundant embeddings
* modest wasm perf gain with compile flags
* set default backend, enable backend choice/backup
* render symbolic vars in exported WEBGPU
* quantize webgpu llama to int8/f32
* improve UX arising from rendered WEBGPU
* clean up webgpu launch
* new weights split: smaller chunks, tinygrad quant.
* switch webgpu inference to int8 quant
* remove unneeded clang decompression
* eliminate unneeded kv cache transfer to wasm
* use 1 worker for simplified clang decompression
* display launch errors
* refactor: stream load weight chunks to WebGPU
* show loading chunk completion
* quantize embeddings to int8
* test float16 as input for quantization
* webgpu: use f16 source, int8 embed, eliminate q6k
* simplify split weights prep: all from state_dict
* revert change to nn.state.gguf_load
* remove unneeded decompression from webgpu client
* remove unneeded code
* decrease dl chunks from 47 to 16 MiB
* improve stability of webgpu loading on mobile
* autodetect mobile, improve load stability
* refactor: progress closure
* refactor: one unified progress bar
* remove unneeded code
* revert changes to tinygrad core library
* enforce ios18.3 nerfed max buf size
* BEAM=3 webgpu
* cache integrity, mobile save throttling
* improve mobile UX - no autozoom on prompt box
* clang: int8 from f16, remove q6k
* reduce concurrent dls on mobile to 2 for stability
* refactor: wasm backend with stream loading
* prevent race between wasm load and indexedb save
* split wasm kernels into separate modules
* js wrapper for multiple wasm module inference
* revert multi-module wasm to single module
* make mobile wasm load more stable/fast
* refactor: copy weights into wasm without crashes
* fix bug in download queue; increase mobile dls
* refactor exported clang wrapper, split weights
* remove unnecessary code
* greatly improve int8 quant quality with rounding
* eliminate mobile throttling
* increase webgpu context to 4096 tokens
* export webgpu js functions
* enable separate hosted weights for mobile/pc
* enable prompt-thread switching during generation
* stop generation when max_context is reached
* show progress bar for prefill
* tell user if webgpu fails, while wasm loads
* make loading messages more concise
* update font
* revert changes to tinychat python app launch
* cleanup quantization, add scale_dtype param
* cleanup kv cache code
* cleanup compile code
* link tok_embeddings with output in webgpu export
* refactor: export_model webgpu: symbolic vars
* refactor: export_model weight loading
* forgot to commit export_model.py
* change CLANG to CPU
* deal with pylint incorrectly failing tests
* simplify f-strings for older CI python version
* fix pre-python3.12 parser errors
* [Int32Array] not Int32Array
* cleanup webgpu compile after refactor export_model
* refactor WASM export into export_model
* merge WebGPU/WASM compile scripts
* simplify max_contexts for local deployment
* fix parser issues and whitespace
* deduplicate variable defs for non-wasm clang export
* cleanup code
* cleanup compile scripts
* simplify wasm inference wrapping
* simplify webgpu symbolic vars export
* refactor: unify export of symbolic variables
* simplify WASM export
* simplify clang/wasm export
* update README and build scripts
* separate files for browser/python apps
* restore original python tinychat app files
* browser and python tinychats share assets
* minor cleanup
* isolate compile/export model
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* fix some torch tests
* fixup
* small change
* fixup
* fix test
* use default function
* add todo
* bunch of small changes
* fix tests
* more tests
* fix
* fix
* test fix
* simplify
* add DynamicDequantizeLinear and corresponding tests
* wow qlinearops are round away from zero
* this passes locally...
* again
* try
* try separate test
* round to even again
* also add QLinearMul
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* least_upper_float is at least default_float
en route for div rounding mode. dtype of true int division would change from int32 to default_float, which matches torch too.
* fix bert acc
* load llama3-1B to WEBGPU device
* include compile script for loading llama3 to WEBGPU
* parametrize max_context in build_transformer fxn
* jit_model with two different args sets
* compile for webgpu, split weights
* load model weight parts in browser
* export all tensors from initialized transformer
* run transformer inference in browser
* enable tiktoken with llama bpe in browser
* count total tokens on client with tiktoken.js
* full client-side chat streaming, eliminate server
* revert change that enabled jitting with 2 argsets
* llama without Variable or cache_kv, for webgpu
* have client use mask tokens / whole context
* cleanup staged weights
* add tiktoken.js build script, README
* export CLANG for Q6_k to float32 decompression
* fix and test exported CLANG code for Q6_k to fp32
* revert changes to jit and export_model
* isolate clang export
* test Q6_K to float32 decompression in browser
* gguf_load now also returns t_infos and data_start
* prepare llama-1B Q6_K gguf chunks for browser
* cache and decompress quantized llama in browser
* enable separate deployment of large files
* fix kv cache and symbolic with llama wgpu
* eliminate browser lag during decompression
* hash metadata and weight chunks
* delete obsolete indexeddb cache to free disk
* add progress bar, track model download/decompress
* refactor progress callback
* skip buffer hash verification for speed
* Display progress for entire loading scope
* Report page load errors to user
* actually display errors
* skip prompt tokens already seen by model
* skip prefilling with last assistant message tokens
* on page load tell user if webgpu not enabled
* push deployed URL root to window.history
* make note of bug sources with TODO items
* isolate bug in CLANG with BEAM=2
* remove clang_bug.py from diff
* decompress q6k to f32 on webgpu instead of clang
* remove unused code
* inter-weight decomp with larger wgpu kernels
* parallelize decompression submissions
* refactor dequantize scheduling
* add progress bar back
* fix bug
* temp fix for loading GGUF Q6_K to fp16 not fp32
* fix rendering of exported CLANG
* remove weight casts, sketch js functions for clang
* get symbolic vars from jit_cache for model export
* include symbolic vars in exported CLANG
* render js for clang transformer
* toggle clang/webgpu deployment; refactor decomp
* compile and render clang Q6_K->fp16 and int8 quant
* fix rendered clang for abs(fp16), to work in wasm
* simplify clang js wrapping
* run compiled clang in worker
* prepare llama weights in workers, q6k to int8/fp16
* tinychat on clang in browser, f32/int8 weights
* move wasm inference to (now flexible) worker
* don't load redundant embeddings
* modest wasm perf gain with compile flags
* set default backend, enable backend choice/backup
* render symbolic vars in exported WEBGPU
* quantize webgpu llama to int8/f32
* improve UX arising from rendered WEBGPU
* clean up webgpu launch
* new weights split: smaller chunks, tinygrad quant.
* switch webgpu inference to int8 quant
* remove unneeded clang decompression
* eliminate unneeded kv cache transfer to wasm
* use 1 worker for simplified clang decompression
* display launch errors
* refactor: stream load weight chunks to WebGPU
* show loading chunk completion
* quantize embeddings to int8
* test float16 as input for quantization
* webgpu: use f16 source, int8 embed, eliminate q6k
* simplify split weights prep: all from state_dict
* revert change to nn.state.gguf_load
* remove unneeded decompression from webgpu client
* remove unneeded code
* decrease dl chunks from 47 to 16 MiB
* improve stability of webgpu loading on mobile
* autodetect mobile, improve load stability
* refactor: progress closure
* refactor: one unified progress bar
* remove unneeded code
* revert changes to tinygrad core library
* enforce ios18.3 nerfed max buf size
* BEAM=3 webgpu
* cache integrity, mobile save throttling
* improve mobile UX - no autozoom on prompt box
* clang: int8 from f16, remove q6k
* reduce concurrent dls on mobile to 2 for stability
* refactor: wasm backend with stream loading
* prevent race between wasm load and indexedb save
* split wasm kernels into separate modules
* js wrapper for multiple wasm module inference
* revert multi-module wasm to single module
* make mobile wasm load more stable/fast
* refactor: copy weights into wasm without crashes
* fix bug in download queue; increase mobile dls
* refactor exported clang wrapper, split weights
* remove unnecessary code
* greatly improve int8 quant quality with rounding
* eliminate mobile throttling
* increase webgpu context to 4096 tokens
* export webgpu js functions
* enable separate hosted weights for mobile/pc
* enable prompt-thread switching during generation
* stop generation when max_context is reached
* show progress bar for prefill
* tell user if webgpu fails, while wasm loads
* make loading messages more concise
* update font
* revert changes to tinychat python app launch
* cleanup quantization, add scale_dtype param
* cleanup kv cache code
* cleanup compile code
* link tok_embeddings with output in webgpu export
* refactor: export_model webgpu: symbolic vars
* refactor: export_model weight loading
* forgot to commit export_model.py
* change CLANG to CPU
* deal with pylint incorrectly failing tests
* simplify f-strings for older CI python version
* fix pre-python3.12 parser errors
* [Int32Array] not Int32Array
* cleanup webgpu compile after refactor export_model
* refactor WASM export into export_model
* merge WebGPU/WASM compile scripts
* simplify max_contexts for local deployment
* fix parser issues and whitespace
* deduplicate variable defs for non-wasm clang export
* cleanup code
* cleanup compile scripts
* simplify wasm inference wrapping
* simplify webgpu symbolic vars export
* refactor: unify export of symbolic variables
* simplify WASM export
* simplify clang/wasm export
* update README and build scripts
* separate files for browser/python apps
* restore original python tinychat app files
* browser and python tinychats share assets
* minor cleanup
* isolate diffs to llama files
* minor cleanup
* set default scale_dtype
* set default scale_dtype for NF4 quantization
* make quantization of tok_embeds optional
* match output with tok_embeds if not quantizing
* minor change
* yml changes
* torch backend remove meta decomps and add test
* torch backend bump timeout for tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* fixes from chargpt for torch backend
* shrink support
* add stride support
* comment cleanup
* a few more
* work
* import the stream hack
* llvm multi auto
* rig up torch's testing framework [pr]
* support more movement ops
* dec on expand
* fix tests
* work
* fix tests
* a few more
* decomps + opt hook
* installed pytest
* boom
* fix webgpu
* use exact variable names in test so that AI can read easier
* add tag for specific test name like test a specific dtype
* fix ruff
* astype everything
* dtype in array creation
* just arange
* is 67% considered fixed?
* move test up
* small cleanups
* share function
* add qgemm as well
* add qgemm too
* make sure qgemm comes out as int
* take out qgemm for now
* fixed test
* add correct qgemm
* addressing feedback here too, early naive fix for now
* simplify bias and c to be minimalistic enough to test correctness
* refactored qlinearops
* maybe these asserts aren't the best..
* fix test
* updated tests to cover new ops
* try to add to CI
* move test_onnx_ops into testextra/
* more attention tests
* qlinear_add atol=1
* attention still not fullllllly correct
* it is what it is
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* different way to write torch backend
* both backends
* more work
* simpler code
* more work
* test both
* imply unwrap/wrap
* FORWARD_ONLY=1 TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_add works
* ready to start making test_ops work in torch backend
* backward pass, TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_add works
* FORWARD_ONLY=1 TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_simple_conv2d works
* matmul backward is broken with as_strided