* minor grouper + viz fixup [pr]
* gitignore mypy_cache
* reorder create_kernels
* replace with realized
* use tensor_map + viz before spec
* lint
* add that back
* add ability to ORT=1
* test_vs_ort
* useless f
* actually have benchmark take in modelproto for more flexibility in huggingface stuff
* ok runs
* good
* oops fix benchmark_onnx __main__
* 224 as default
* add ORT=1 option to huggingface_onnx
* use Tensor to get_input
* add abilty to do single onnx model testing
* better names
* merge properly...
* copy in onnx_helpers
* better
* decent script
* need to add debug tool first
* new limit usage
* why did narrowing_error come back..
* pretty decent
* revert validate change
* more ops bug fixes
* revert unnecessary changes
* fix InstanceNorm too
* remove op from O4
* minimize diff
* address old feedback
* unsure of this, just revert
* remove that assert
* working attention
* to_python_const Attention
* cant init from np constant so just do this
* final
* fix bug in attention
* attention clean ups
* add hard TODOs and REPOPATH and TRUNCATE envvar
* fix input_ids default value
* final
* fix scatter
* cleaner _prepare_quantize
* use new attention and tempfile for huggingface script
* more stats
* update
* remove outdated code
* big refactor to something usable by CI
* booooooom
* clean up
* update to using yaml as env var input
* add dry run
* try
* valid pad
* use argparser and fix gather bug
* ignore all yaml
* tiny bit more polish
* woah ignoring all yaml was not right
* typo
* decouple huggingface_onnx_run debug run with huggingface_onnx_download
* bug fix for downloading single model
* WOOOO ok much better
* oops argparse 'required' is an invalid argument for positionals
* oops argparse 'required' is an invalid argument for positionals
* add assert
* fix types
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* load llama3-1B to WEBGPU device
* include compile script for loading llama3 to WEBGPU
* parametrize max_context in build_transformer fxn
* jit_model with two different args sets
* compile for webgpu, split weights
* load model weight parts in browser
* export all tensors from initialized transformer
* run transformer inference in browser
* enable tiktoken with llama bpe in browser
* count total tokens on client with tiktoken.js
* full client-side chat streaming, eliminate server
* revert change that enabled jitting with 2 argsets
* llama without Variable or cache_kv, for webgpu
* have client use mask tokens / whole context
* cleanup staged weights
* add tiktoken.js build script, README
* export CLANG for Q6_k to float32 decompression
* fix and test exported CLANG code for Q6_k to fp32
* revert changes to jit and export_model
* isolate clang export
* test Q6_K to float32 decompression in browser
* gguf_load now also returns t_infos and data_start
* prepare llama-1B Q6_K gguf chunks for browser
* cache and decompress quantized llama in browser
* enable separate deployment of large files
* fix kv cache and symbolic with llama wgpu
* eliminate browser lag during decompression
* hash metadata and weight chunks
* delete obsolete indexeddb cache to free disk
* add progress bar, track model download/decompress
* refactor progress callback
* skip buffer hash verification for speed
* Display progress for entire loading scope
* Report page load errors to user
* actually display errors
* skip prompt tokens already seen by model
* skip prefilling with last assistant message tokens
* on page load tell user if webgpu not enabled
* push deployed URL root to window.history
* make note of bug sources with TODO items
* isolate bug in CLANG with BEAM=2
* remove clang_bug.py from diff
* decompress q6k to f32 on webgpu instead of clang
* remove unused code
* inter-weight decomp with larger wgpu kernels
* parallelize decompression submissions
* refactor dequantize scheduling
* add progress bar back
* fix bug
* temp fix for loading GGUF Q6_K to fp16 not fp32
* fix rendering of exported CLANG
* remove weight casts, sketch js functions for clang
* get symbolic vars from jit_cache for model export
* include symbolic vars in exported CLANG
* render js for clang transformer
* toggle clang/webgpu deployment; refactor decomp
* compile and render clang Q6_K->fp16 and int8 quant
* fix rendered clang for abs(fp16), to work in wasm
* simplify clang js wrapping
* run compiled clang in worker
* prepare llama weights in workers, q6k to int8/fp16
* tinychat on clang in browser, f32/int8 weights
* move wasm inference to (now flexible) worker
* don't load redundant embeddings
* modest wasm perf gain with compile flags
* set default backend, enable backend choice/backup
* render symbolic vars in exported WEBGPU
* quantize webgpu llama to int8/f32
* improve UX arising from rendered WEBGPU
* clean up webgpu launch
* new weights split: smaller chunks, tinygrad quant.
* switch webgpu inference to int8 quant
* remove unneeded clang decompression
* eliminate unneeded kv cache transfer to wasm
* use 1 worker for simplified clang decompression
* display launch errors
* refactor: stream load weight chunks to WebGPU
* show loading chunk completion
* quantize embeddings to int8
* test float16 as input for quantization
* webgpu: use f16 source, int8 embed, eliminate q6k
* simplify split weights prep: all from state_dict
* revert change to nn.state.gguf_load
* remove unneeded decompression from webgpu client
* remove unneeded code
* decrease dl chunks from 47 to 16 MiB
* improve stability of webgpu loading on mobile
* autodetect mobile, improve load stability
* refactor: progress closure
* refactor: one unified progress bar
* remove unneeded code
* revert changes to tinygrad core library
* enforce ios18.3 nerfed max buf size
* BEAM=3 webgpu
* cache integrity, mobile save throttling
* improve mobile UX - no autozoom on prompt box
* clang: int8 from f16, remove q6k
* reduce concurrent dls on mobile to 2 for stability
* refactor: wasm backend with stream loading
* prevent race between wasm load and indexedb save
* split wasm kernels into separate modules
* js wrapper for multiple wasm module inference
* revert multi-module wasm to single module
* make mobile wasm load more stable/fast
* refactor: copy weights into wasm without crashes
* fix bug in download queue; increase mobile dls
* refactor exported clang wrapper, split weights
* remove unnecessary code
* greatly improve int8 quant quality with rounding
* eliminate mobile throttling
* increase webgpu context to 4096 tokens
* export webgpu js functions
* enable separate hosted weights for mobile/pc
* enable prompt-thread switching during generation
* stop generation when max_context is reached
* show progress bar for prefill
* tell user if webgpu fails, while wasm loads
* make loading messages more concise
* update font
* revert changes to tinychat python app launch
* cleanup quantization, add scale_dtype param
* cleanup kv cache code
* cleanup compile code
* link tok_embeddings with output in webgpu export
* refactor: export_model webgpu: symbolic vars
* refactor: export_model weight loading
* forgot to commit export_model.py
* change CLANG to CPU
* deal with pylint incorrectly failing tests
* simplify f-strings for older CI python version
* fix pre-python3.12 parser errors
* [Int32Array] not Int32Array
* cleanup webgpu compile after refactor export_model
* refactor WASM export into export_model
* merge WebGPU/WASM compile scripts
* simplify max_contexts for local deployment
* fix parser issues and whitespace
* deduplicate variable defs for non-wasm clang export
* cleanup code
* cleanup compile scripts
* simplify wasm inference wrapping
* simplify webgpu symbolic vars export
* refactor: unify export of symbolic variables
* simplify WASM export
* simplify clang/wasm export
* update README and build scripts
* separate files for browser/python apps
* restore original python tinychat app files
* browser and python tinychats share assets
* minor cleanup
* isolate app layer diff
* add .gitignore for generated files
* validate CPU/WEBGPU models in python
* prevent infinite generation if validation fails
* check if exported weight files are unique
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* search: add better default settings for fast search
not the highest possible performance, but adequate for most usage
* search: revert BEAM_MIN_PROGRESS and BEAM_UPCAST_MAX default changes
also sneak in a link to .gitignore for the unet3d dataset
* revert BEAM_MAX_TASKS_PER_CHILD change and fix uops max condition
* first commit
* state back to orig
* mamba comparisions
* rm file
* rename file
* use Tensor.einsum and mke default model 370M
* Cleaned code and made a comparision test
* Simplyfy pull request. Only has 1 mamba implementation now.
* Update prompt
* rm whitespaces
* last space
* remove Einops dependency
* rm unused code
* add tests
* rm print statement
* rm imports
* skip CLANG
* Update skipIf description
* skip model test in CI and add CLANG fix
* rm Device import
* don't be stupid
* Fix conv assign
When the prompt is too short, the logic for conv_state assign messes up. This can be fixed when padding the tokenized array to min length of 4. I padded using the empty string token, but idk if proper practice is to use the PAD token
* fix p1
* temp
* fix jit import
---------
Co-authored-by: schlimeszn <schlimeszn@gmail.com>
Co-authored-by: reddyn <nikidsniper@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* dtypes alu test
* those types don't exist in torch
* floats
* more tests
* disable those
* a couple unary tests
* skip float16 tests in CI for GPU
* fix LLVM bool add True+True=1+1=2 which truncates to False in native LLVM
* remove hardcoded float for LLVM ALU fns
* less sensitive atol for fp32, 1e-10 is flaky and sometimes failed even if you revert the merge commit for non-fp32 math, nothing has changed in our kernels for fp32.
* return on overflows
* fix CUDA exp2
* compute results of op regardless of bounds in a python backend
* skip fp16 in GPU and CUDACPU
* fuzz a smaller range in the float_midcast_int32 test
I sampled this and we overflow ~70% of the time.
because numpy behaves differently on different devices for overflows and Metal seems to do the same, I'm opting to eliminate the non-determinism here
* remove CUDA exp2 overload it's already there now
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are rly broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* WIP: Stable diffusion WebGPU port
* Load whole model: split safetensor to avoid Chrome allocation limit
* Gitignore .DS_Store, remove debug print
* Clip tokenizer in JS
* WIP: Compile model in parts (text model, diffusor, get_x_prev_and_pred_x0, decoder), and recreate forward logic in JS
* e2e stable diffusion flow
* Create initial random latent tensor in JS
* SD working e2e
* Log if some weights were not loaded properly
* Remove latent_tensor.npy used for debugging
* Cleanup, remove useless logs
* Improve UI
* Add progress bar
* Remove .npy files used for debugging
* Add clip tokenizer as external dependency
* Remove alphas_cumprod.js and load it from safetensors
* Refactor
* Simplify a lot
* Dedup base when limiting elementwise merge (webgpu)
* Add return type to safe_load_metadata
* Do not allow run when webgpu is not supported
* Add progress bar, refactor, fix special names
* Add option to chose from local vs huggingface weights
* lowercase tinygrad :)
* fp16 model dl, decompression client side
* Cache f16 model in browser, better progress
* Cache miss recovery
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* feat: world
* feat: tests
* feat: no more backwards
* feat: recv into
* feat: whoops
* feat: test in ci
* feat: some debug logging
* feat: workflow naming
* feat: need to set pythonpath
* feat: just send to same device
* need a better place for reshape and permute
* add permutation
* cuda fixed
* clean up
* enable nvidia GPU with global max
* fix order
* fix CI
* add check for global dim limit but need refactor
* refactor
* fix ignore
* initial commit
* 81 passing
* 105 passing tests
* 148 passing
* CI tests
* install dep on ci
* try opencl pkgs
* try using vulkan
* down to only 6 failing
* refactor
* cleaning up
* another test skipped due to buffer limit
* linter
* segfault
* indent fix
* another segfault found
* small touchups
* Fix max and maxpool tests
* Add constant folding
* Add javascript export script
* better asserts in codegen
* manual upcasting
* reverted token type change
* skip safetensor test due to unsupported type
* FIx efficientnet and all other model tests
* Remove np copy
* fixed indent and missing import
* manually destroy the buffer
* revert back to length
* linter errors
* removed extra val
* skip broken tests
* skipping more tests
* Make the page pretty
* Save model weights as safetensor
* Fix imagenet to c test
* Fix second imagenet to c bug
* Async and paralel kernel compilation
* workgroup support
* reversed local size
* fixed non local bug
* correct local groups
* ci experiment
* removed typo
* Fix define local by using shared memory
* Refactor
* try running on mac
* match metal tests
* add more workers
* scope down tests
* trying windows runner
* fixed windows env
* see how many it can do
* merged master
* refactor
* missed refactor
* increase test suite coverage
* missing import
* whitespace in test_efficientnet.py
* getting there
* fixed reset
* fixed bufs
* switched to cstyle
* cleanup
* min/max rename
* one more linter issue
* fixed demo
* linter
* testing ci chrome
* add unsafe webgpu arg
* add build step
* remove WEBGPU from cmd line
* use module
* try forcing directx
* trying forced metal backend
* temp disable conv2d for CI
* disable conv_trasnpose2d
---------
Co-authored-by: 0x4d - Martin Loretz <20306567+martinloretzzz@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* Rename in files
* Move files
* Moved to extra/datasets as suggested
* Changes to files
* Fixed stupid mistake
---------
Co-authored-by: terafo <terafo@protonmail.com>
* Add ResNet inference test and cannon
* Test with ResNet50
* test_car works with resnet fix
* Add KiTS19 dataset
* KiTS19: Implement iterate
* No batch load for this dataset
* Save results on iterate
* Implement dice score
* Add data prep and eval functions
* Resolve shape issue
* Conversion works but wrong values
* Segfaults when load_from_pretrained is called
* Fix segfault and assign properly
* Final result generated, though very slow
* Store and load final result to save time
* Fix typo in finalize
* Score computes
* More bug fixes, dice score is very low
* Working broken code
* Assign output values to result
* Getting a much higher score now
* Fix dataset preprocessing
* Mean DICE score of 88.5
* Ugh, typo
* Attempt to reimplement model
* Rename layers
* Tiny model works, kinda
* Accuracy? gone
* Implement InstanceNorm and match torch
* Test instance norm 2d and 3d
* Combined input block with downsample block
* Tiny model works, support strided convtranspose
* Commands to download dataset
* Clean up a bit
* unet3d_v2 -> unet3d
* Remove duplicated code
* Oops, put tests back
* feat: add mlperf bert model
* feat: switch to nn.Embedding
* clean+fix: fix formatting
* feat: add simple downloader
* feat: metrics
* feat: don't actually need exact match
* feat: doing a run
* feat: set eps on the layernorms
* clean+fix: cleaner impl + hopefully fixed
* feat: move dataset initialization into iterate
* feat: move tokenizer out of iterate
* clean+fix: cleaner + working
* clean: cleanup
* fix: fix metrics
* feat: need to use original bert gelu + download vocab
* feat: make directory if it doesn't exist yet
* feat: jit go brrr
* feat: initial rnn-t
* feat: working with BS>1
* feat: add lstm test
* feat: test passing hidden
* clean: cleanup
* feat: specify start
* feat: way faster lstm & model
* fix: default batch size
* feat: optimization
* fix: fix metrics
* fix: fix feature splicing
* feat: cleaner stacktime
* clean: remove unused import
* clean: remove extra prints
* fix: fix tests and happy llvm
* feat: have the librispeech dataset in its own dir
* clean: unused variable
* feat: no longer need numpy for the embedding + slightly more memory efficient lstm
* fix: forgot to remove something that broke tests
* feat: use relative paths
* feat: even faster
* feat: remove pointless transposes in StackTime
* fix: correct forward
* feat: switch to soundfile for loading and fix some leaks
* feat: add comment about initial dataset setup
* feat: jit more things
* feat: default batch size back to 1
larger than 1 is broken again :(
and even in the reference implementation it gives worse results