For some reason, random dropout creates a different AST on each device, and searching the embedding is slow. This workaround saved 6 minutes of setup time on MI300X (25 -> 19 minutes) and resulted in similar speed.
* add system json for mi300x mlperf
```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v5.0/tinycorp/systems/tinybox_8xMI300X.json training 4.1.0
INFO - System description checker passed for tinybox 8xMI300X
```
also removed the rocm from tinybox_red since we are not using it
* update mlperf-logging version
fix the case where only eval is run: `TRAIN=0 BERT_SIZE=tiny examples/mlperf/training_submission_v5.0/tinycorp/benchmarks/bert/implementations/tinybox_green/dev_beam.sh`
* jit the forward
* might timeout, idk just send it
* this is dumb
* naive bitonic lol
* idk if this is correct, but that squeeze before is definitely not
* vectorized bitonic sort, but still slow
* yay 1 layer is correct
* alright it's pretty good
* good enough
* rerun CI
* nit improve comment
* load llama3-1B to WEBGPU device
* include compile script for loading llama3 to WEBGPU
* parametrize max_context in build_transformer fxn
* jit_model with two different args sets
* compile for webgpu, split weights
* load model weight parts in browser
* export all tensors from initialized transformer
* run transformer inference in browser
* enable tiktoken with llama bpe in browser
* count total tokens on client with tiktoken.js
* full client-side chat streaming, eliminate server
* revert change that enabled jitting with 2 argsets
* llama without Variable or cache_kv, for webgpu
* have client use mask tokens / whole context
* cleanup staged weights
* add tiktoken.js build script, README
* export CLANG for Q6_k to float32 decompression
* fix and test exported CLANG code for Q6_k to fp32
* revert changes to jit and export_model
* isolate clang export
* test Q6_K to float32 decompression in browser
* gguf_load now also returns t_infos and data_start
* prepare llama-1B Q6_K gguf chunks for browser
* cache and decompress quantized llama in browser
* enable separate deployment of large files
* fix kv cache and symbolic with llama wgpu
* eliminate browser lag during decompression
* hash metadata and weight chunks
* delete obsolete indexeddb cache to free disk
* add progress bar, track model download/decompress
* refactor progress callback
* skip buffer hash verification for speed
* Display progress for entire loading scope
* Report page load errors to user
* actually display errors
* skip prompt tokens already seen by model
* skip prefilling with last assistant message tokens
* on page load tell user if webgpu not enabled
* push deployed URL root to window.history
* make note of bug sources with TODO items
* isolate bug in CLANG with BEAM=2
* remove clang_bug.py from diff
* decompress q6k to f32 on webgpu instead of clang
* remove unused code
* inter-weight decomp with larger wgpu kernels
* parallelize decompression submissions
* refactor dequantize scheduling
* add progress bar back
* fix bug
* temp fix for loading GGUF Q6_K to fp16 not fp32
* fix rendering of exported CLANG
* remove weight casts, sketch js functions for clang
* get symbolic vars from jit_cache for model export
* include symbolic vars in exported CLANG
* render js for clang transformer
* toggle clang/webgpu deployment; refactor decomp
* compile and render clang Q6_K->fp16 and int8 quant
* fix rendered clang for abs(fp16), to work in wasm
* simplify clang js wrapping
* run compiled clang in worker
* prepare llama weights in workers, q6k to int8/fp16
* tinychat on clang in browser, f32/int8 weights
* move wasm inference to (now flexible) worker
* don't load redundant embeddings
* modest wasm perf gain with compile flags
* set default backend, enable backend choice/backup
* render symbolic vars in exported WEBGPU
* quantize webgpu llama to int8/f32
* improve UX arising from rendered WEBGPU
* clean up webgpu launch
* new weights split: smaller chunks, tinygrad quant.
* switch webgpu inference to int8 quant
* remove unneeded clang decompression
* eliminate unneeded kv cache transfer to wasm
* use 1 worker for simplified clang decompression
* display launch errors
* refactor: stream load weight chunks to WebGPU
* show loading chunk completion
* quantize embeddings to int8
* test float16 as input for quantization
* webgpu: use f16 source, int8 embed, eliminate q6k
* simplify split weights prep: all from state_dict
* revert change to nn.state.gguf_load
* remove unneeded decompression from webgpu client
* remove unneeded code
* decrease dl chunks from 47 to 16 MiB
* improve stability of webgpu loading on mobile
* autodetect mobile, improve load stability
* refactor: progress closure
* refactor: one unified progress bar
* remove unneeded code
* revert changes to tinygrad core library
* enforce ios18.3 nerfed max buf size
* BEAM=3 webgpu
* cache integrity, mobile save throttling
* improve mobile UX - no autozoom on prompt box
* clang: int8 from f16, remove q6k
* reduce concurrent dls on mobile to 2 for stability
* refactor: wasm backend with stream loading
* prevent race between wasm load and indexeddb save
* split wasm kernels into separate modules
* js wrapper for multiple wasm module inference
* revert multi-module wasm to single module
* make mobile wasm load more stable/fast
* refactor: copy weights into wasm without crashes
* fix bug in download queue; increase mobile dls
* refactor exported clang wrapper, split weights
* remove unnecessary code
* greatly improve int8 quant quality with rounding
* eliminate mobile throttling
* increase webgpu context to 4096 tokens
* export webgpu js functions
* enable separate hosted weights for mobile/pc
* enable prompt-thread switching during generation
* stop generation when max_context is reached
* show progress bar for prefill
* tell user if webgpu fails, while wasm loads
* make loading messages more concise
* update font
* revert changes to tinychat python app launch
* cleanup quantization, add scale_dtype param
* cleanup kv cache code
* cleanup compile code
* link tok_embeddings with output in webgpu export
* refactor: export_model webgpu: symbolic vars
* refactor: export_model weight loading
* forgot to commit export_model.py
* change CLANG to CPU
* deal with pylint incorrectly failing tests
* simplify f-strings for older CI python version
* fix pre-python3.12 parser errors
* [Int32Array] not Int32Array
* cleanup webgpu compile after refactor export_model
* refactor WASM export into export_model
* merge WebGPU/WASM compile scripts
* simplify max_contexts for local deployment
* fix parser issues and whitespace
* deduplicate variable defs for non-wasm clang export
* cleanup code
* cleanup compile scripts
* simplify wasm inference wrapping
* simplify webgpu symbolic vars export
* refactor: unify export of symbolic variables
* simplify WASM export
* simplify clang/wasm export
* update README and build scripts
* separate files for browser/python apps
* restore original python tinychat app files
* browser and python tinychats share assets
* minor cleanup
* isolate app layer diff
* add .gitignore for generated files
* validate CPU/WEBGPU models in python
* prevent infinite generation if validation fails
* check if exported weight files are unique
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
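
Several entries above deal with splitting weights into download chunks (16 MiB), hashing the chunks, and checking that exported weight files are unique. A minimal sketch of that idea follows, using only the Python standard library. The function names, the manifest layout, and the choice of SHA-256 are assumptions for illustration; the log does not say how the actual export does this.

```python
import hashlib

# 16 MiB, the chunk size mentioned in the log. Everything else here is hypothetical.
CHUNK_SIZE = 16 * 1024 * 1024

def split_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split a byte string into consecutive chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def chunk_manifest(chunks: list[bytes]) -> list[dict]:
    """Describe each chunk by index, size, and SHA-256 digest (hypothetical layout)."""
    return [
        {"name": f"chunk_{i}", "size": len(c), "sha256": hashlib.sha256(c).hexdigest()}
        for i, c in enumerate(chunks)
    ]

def all_unique(manifest: list[dict]) -> bool:
    """Return True if no two chunks share the same digest."""
    digests = [entry["sha256"] for entry in manifest]
    return len(set(digests)) == len(digests)

if __name__ == "__main__":
    weights = bytes(range(256)) * 10          # stand-in for real weight bytes
    chunks = split_chunks(weights, 1000)      # small chunk size for the demo
    manifest = chunk_manifest(chunks)
    print(len(chunks), all_unique(manifest))
```

A client could compare the digest of each downloaded chunk against the manifest before caching it, which is one way to get the cache integrity the log mentions.
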
* load llama3-1B to WEBGPU device
* include compile script for loading llama3 to WEBGPU
* parametrize max_context in build_transformer fxn
* jit_model with two different args sets
* compile for webgpu, split weights
* load model weight parts in browser
* export all tensors from initialized transformer
* run transformer inference in browser
* enable tiktoken with llama bpe in browser
* count total tokens on client with tiktoken.js
* full client-side chat streaming, eliminate server
* revert change that enabled jitting with 2 argsets
* llama without Variable or cache_kv, for webgpu
* have client use mask tokens / whole context
* cleanup staged weights
* add tiktoken.js build script, README
* export CLANG for Q6_k to float32 decompression
* fix and test exported CLANG code for Q6_k to fp32
* revert changes to jit and export_model
* isolate clang export
* test Q6_K to float32 decompression in browser
* gguf_load now also returns t_infos and data_start
* prepare llama-1B Q6_K gguf chunks for browser
* cache and decompress quantized llama in browser
* enable separate deployment of large files
* fix kv cache and symbolic with llama wgpu
* eliminate browser lag during decompression
* hash metadata and weight chunks
* delete obsolete indexeddb cache to free disk
* add progress bar, track model download/decompress
* refactor progress callback
* skip buffer hash verification for speed
* Display progress for entire loading scope
* Report page load errors to user
* actually display errors
* skip prompt tokens already seen by model
* skip prefilling with last assistant message tokens
* on page load tell user if webgpu not enabled
* push deployed URL root to window.history
* make note of bug sources with TODO items
* isolate bug in CLANG with BEAM=2
* remove clang_bug.py from diff
* decompress q6k to f32 on webgpu instead of clang
* remove unused code
* inter-weight decomp with larger wgpu kernels
* parallelize decompression submissions
* refactor dequantize scheduling
* add progress bar back
* fix bug
* temp fix for loading GGUF Q6_K to fp16 not fp32
* fix rendering of exported CLANG
* remove weight casts, sketch js functions for clang
* get symbolic vars from jit_cache for model export
* include symbolic vars in exported CLANG
* render js for clang transformer
* toggle clang/webgpu deployment; refactor decomp
* compile and render clang Q6_K->fp16 and int8 quant
* fix rendered clang for abs(fp16), to work in wasm
* simplify clang js wrapping
* run compiled clang in worker
* prepare llama weights in workers, q6k to int8/fp16
* tinychat on clang in browser, f32/int8 weights
* move wasm inference to (now flexible) worker
* don't load redundant embeddings
* modest wasm perf gain with compile flags
* set default backend, enable backend choice/backup
* render symbolic vars in exported WEBGPU
* quantize webgpu llama to int8/f32
* improve UX arising from rendered WEBGPU
* clean up webgpu launch
* new weights split: smaller chunks, tinygrad quant.
* switch webgpu inference to int8 quant
* remove unneeded clang decompression
* eliminate unneeded kv cache transfer to wasm
* use 1 worker for simplified clang decompression
* display launch errors
* refactor: stream load weight chunks to WebGPU
* show loading chunk completion
* quantize embeddings to int8
* test float16 as input for quantization
* webgpu: use f16 source, int8 embed, eliminate q6k
* simplify split weights prep: all from state_dict
* revert change to nn.state.gguf_load
* remove unneeded decompression from webgpu client
* remove unneeded code
* decrease dl chunks from 47 to 16 MiB
* improve stability of webgpu loading on mobile
* autodetect mobile, improve load stability
* refactor: progress closure
* refactor: one unified progress bar
* remove unneeded code
* revert changes to tinygrad core library
* enforce ios18.3 nerfed max buf size
* BEAM=3 webgpu
* cache integrity, mobile save throttling
* improve mobile UX - no autozoom on prompt box
* clang: int8 from f16, remove q6k
* reduce concurrent dls on mobile to 2 for stability
* refactor: wasm backend with stream loading
* prevent race between wasm load and indexeddb save
* split wasm kernels into separate modules
* js wrapper for multiple wasm module inference
* revert multi-module wasm to single module
* make mobile wasm load more stable/fast
* refactor: copy weights into wasm without crashes
* fix bug in download queue; increase mobile dls
* refactor exported clang wrapper, split weights
* remove unnecessary code
* greatly improve int8 quant quality with rounding
* eliminate mobile throttling
* increase webgpu context to 4096 tokens
* export webgpu js functions
* enable separate hosted weights for mobile/pc
* enable prompt-thread switching during generation
* stop generation when max_context is reached
* show progress bar for prefill
* tell user if webgpu fails, while wasm loads
* make loading messages more concise
* update font
* revert changes to tinychat python app launch
* cleanup quantization, add scale_dtype param
* cleanup kv cache code
* cleanup compile code
* link tok_embeddings with output in webgpu export
* refactor: export_model webgpu: symbolic vars
* refactor: export_model weight loading
* forgot to commit export_model.py
* change CLANG to CPU
* deal with pylint incorrectly failing tests
* simplify f-strings for older CI python version
* fix pre-python3.12 parser errors
* [Int32Array] not Int32Array
* cleanup webgpu compile after refactor export_model
* refactor WASM export into export_model
* merge WebGPU/WASM compile scripts
* simplify max_contexts for local deployment
* fix parser issues and whitespace
* deduplicate variable defs for non-wasm clang export
* cleanup code
* cleanup compile scripts
* simplify wasm inference wrapping
* simplify webgpu symbolic vars export
* refactor: unify export of symbolic variables
* simplify WASM export
* simplify clang/wasm export
* update README and build scripts
* separate files for browser/python apps
* restore original python tinychat app files
* browser and python tinychats share assets
* minor cleanup
* isolate compile/export model
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
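
The entry "greatly improve int8 quant quality with rounding" can be illustrated with a small comparison of round-to-nearest against truncation. This is a sketch under assumptions: it uses a symmetric per-tensor scale (max absolute value mapped to 127) in plain Python, which is not necessarily the scheme used in the actual quantization code, and all function names are hypothetical.

```python
def quantize_int8(weights, use_rounding=True):
    """Symmetric int8 quantization with one scale for the whole list (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    if use_rounding:
        q = [round(w / scale) for w in weights]   # round to nearest integer
    else:
        q = [int(w / scale) for w in weights]     # int() truncates toward zero
    q = [max(-127, min(127, v)) for v in q]       # clamp to the int8 range used here
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to floats."""
    return [v * scale for v in q]

def max_error(weights, use_rounding):
    """Largest absolute difference between original and dequantized weights."""
    q, scale = quantize_int8(weights, use_rounding)
    return max(abs(w - d) for w, d in zip(weights, dequantize(q, scale)))

if __name__ == "__main__":
    w = [1.0, 0.456, -0.789, 0.0049]
    print("rounding:  ", max_error(w, True))
    print("truncation:", max_error(w, False))
```

With rounding, the error per weight is bounded by half the scale; with truncation it can approach a full scale step, which is consistent with rounding giving a visible quality gain.
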
* Revert "BUFFER_VIEW is a node in the kernel graph + delete ViewOp (#9298)"
This reverts commit 3210b656b6.
* Revert "substitute ast from kernel op [pr] (#9293)"
This reverts commit 5a9c788ae6.