* add support for a custom BASEDIR for openimages download
* make export step faster
* add focal loss
* update model_eval with new dataloader
* generate_anchors in tinygrad
* update initializers for model
* small cleanup
* revert isin enhancements
* recursively go through backbone layers to freeze them
* add optimizer
* minor cleanup
* start dataloader work with input images
* add first transform for train set
* reuse existing prepare_target
* continue with dataloader implementation
* add dataloader
* separate out KiTS19 dataset test cases
* create mock data samples for test
* add dataloader + test
* cleanup dataloader test and revert shm path
* trim dataloader related code needed from ref
* got dataloader with normalize working
* update image to be float32
* add back normalization and negate it in test
* clean up reference dataset implementation + ruff changes
* add validation set test
* add proper training loop over the training dataset
* add LambdaLR support
* add LR scheduler and the start of training step
* get forward call to model to work and set up multi-GPU
* device is already passed
* return matches from dataloader
* hotfix for dataloader typo causing some hang
* start some work on classification loss
* update focal loss to support masking
* add missing test and cleanup focal loss
* cleanup unit tests
* remove masking support for sigmoid_focal_loss
* make ClassificationHead loss work
* cleanups + fix dataloader tests
* remove sigmoid when computing loss (see the loss sketch below)
* make anchors use Tensors
* simplify anchors batching
* revert anchors to use np
* implement regression loss
* fix regression loss
* cleanup losses
* move BoxCoder to MLPerf helpers
* revert helper changes
* fixes after helper refactor cleanup
* add tests for l1_loss
* start re-enabling training step
* minor cleanup
* add pycocotools to testing dependencies
* make training work
* adjust regression loss to mask after L1 loss is calculated
* reduce img and lbl sizes by half for KiTS19 dataset tests
* Revert "reduce img and lbl sizes by half for KiTS19 dataset tests"
This reverts commit d115b0c664.
* temporarily disable openimages dataset tests to debug CI
* enable openimages dataset test and create samples once
* temporarily disable openimages validation set test
* reenable test and add some debugging to the test
* add boto3 testing dependencies
* add pandas to testing dependencies
* This reverts commit 467704fec6.
* reenable test
* move sample creation to setup
* realize boxcoder's encoding
* add wandb
* fix wandb resuming feature
* move anchors as part of dataloader
* fix dtype for anchor inside dataloader and fix horizontal flip transformation
* add support for BENCHMARK
* set seed
* debug dataset test failure
* Revert "debug dataset test failure"
This reverts commit 1b2f9d7f50.
* fix dataloader script
* do not realize when sharding model weights
* setup openimages samples differently
* create the necessary samples per test case
* enable lr scheduler and fix benchmark timing
* add jit to the training loop
* add checkpointing and training resume capabilities
* refactor training loop and start work on the val loop
* add debug logging for dataloader test
* debug test
* assert boxes again
* update validation dataloader and more cleanups
* fix validation test case
* add multi device support to retinanet eval
* fix issue with realized on dataloader
* remove optional disk tensors in dataloader
* remove verbose debugging on datasets test
* put back parallel testing and remove img_ids Tensor from dataloader
* cleanup train and validation dataloader
* return validation targets in dataloader
* cleanup boxes and labels in dataloader
* fix img_ids repeating its values
* remove unnecessary targets from validation dataloader
* add validation loop to training script
* adjust LR to be the ratio of the batch size
* minor cleanups
* remove frozen layers from optimizer's params
* hyperparameter adjustments and cleanups
* model init, hyperparam, and data preprocessing updates
* no need to return loaded keys for resnet
* fix train script
* update loss calculation for RegressionHead and some cleanups
* add JIT reset support
* add nan check during training
* Revert "add nan check during training"
This reverts commit ddf1f0d5dd.
* Revert "Revert "add nan check during training""
This reverts commit b7b2943197.
* some typing cleanups
* update seeding on dataloader and the start of training script
* undo changes
* undo more changes
* more typing fixes
* minor cleanups
* update dataloader seed
* hotfix: log metric and move target metric check outside of CKPT
* check for CKPT when target metric is reached before saving
* add TRAIN_BEAM and EVAL_BEAM
* minor cleanup
* update hyperparams and add support for EVAL_BS
* add green coloring to metric reached statement
* initial work to support f16
* update model initializers to be monkeypatched
* update layers to support float32 weight loading + float16 training
* don't return loss that's scaled
* run eval on benchmark beam
* move BEAM to their respective steps
* update layers to be compatible with fp16
* end BENCHMARK after first eval
* cleanups and adjust learning rate for fp16
* remove duplicated files from test
* revert losses changes
* Revert "revert losses changes"
This reverts commit aebccf93ac.
* go back to old LR
* cast batchnorm to float32
* set new loss scaler default value for float16 (see the loss-scaling sketch below)
* remove LambdaLRScheduler
* remove runner and use dataloader on eval
* fix retinanet eval with new dataloader
* remove unused import
* revert lr_scheduler updates
* use BS=96 with new learning rate
* rename module initializers
* more cleanups on training loop
* remove contig from optim.step
* simplify sum when computing loss
For some reason, with random dropout a different AST is created on each device, and searching the embedding is slow. This workaround saved 6 minutes of setup time on mi300x (25 -> 19) and resulted in similar speed.
* add system json for mi300x mlperf
```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v5.0/tinycorp/systems/tinybox_8xMI300X.json training 4.1.0
INFO - System description checker passed for tinybox 8xMI300X
```
also removed rocm from tinybox_red since we are not using it
* update mlperf-logging version
fix the case where only eval is run: `TRAIN=0 BERT_SIZE=tiny examples/mlperf/training_submission_v5.0/tinycorp/benchmarks/bert/implementations/tinybox_green/dev_beam.sh`
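For the loss commits earlier in this list (focal loss for the ClassificationHead, L1 for the RegressionHead, dropping the separate sigmoid), these amount to the standard RetinaNet losses. A minimal numpy sketch with illustrative names, not the training script's actual functions:

```python
import numpy as np

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
  # classification loss computed from raw logits ("remove sigmoid when computing loss")
  p = 1.0 / (1.0 + np.exp(-logits))
  ce = -(targets * np.log(p + 1e-12) + (1 - targets) * np.log(1 - p + 1e-12))
  p_t = targets * p + (1 - targets) * (1 - p)
  alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
  return alpha_t * (1 - p_t) ** gamma * ce

def masked_l1_loss(pred_deltas, target_deltas, foreground_mask):
  # per-anchor L1 first, foreground masking second
  # ("adjust regression loss to mask after L1 loss is calculated")
  per_anchor = np.abs(pred_deltas - target_deltas).sum(axis=-1)
  return (per_anchor * foreground_mask).sum() / max(foreground_mask.sum(), 1)
```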
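The float16 commits (float32 weight loading, batchnorm kept in float32, a new default loss-scaler value, not returning the scaled loss) follow the usual static loss-scaling recipe. A minimal sketch, assuming tinygrad's optimizer exposes its trainable tensors as `opt.params`; the scaler value and helper names are illustrative, not what the script settled on:

```python
from tinygrad import Tensor

LOSS_SCALER = 2.0**11  # illustrative default; the real fp16 default is set in the training script

def train_step(model, opt, loss_fn, x: Tensor, y: Tensor) -> float:
  opt.zero_grad()
  loss = loss_fn(model(x), y)
  (loss * LOSS_SCALER).backward()     # backprop in the scaled range so fp16 grads don't underflow
  for p in opt.params:                # unscale gradients before the optimizer update
    if p.grad is not None: p.grad = p.grad / LOSS_SCALER
  opt.step()
  return loss.item()                  # report the unscaled loss ("don't return loss that's scaled")
```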
* jit the forward
* might timeout, idk just send it
* this is dumb
* naive bitonic lol
* idk if this is correct, but that squeeze before is definitely not
* vectorized bitonic sort, but still slow (see the numpy sketch below)
* yay 1 layer is correct
* alright its pretty good
* good enough
* rerun CI
* nit improve comment
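For the bitonic-sort commits just above: bitonic sort is an O(n log^2 n) sorting network in which every stage applies the same compare-exchange pattern to all elements at once, which is why it vectorizes into plain tensor ops. A rough numpy sketch of the vectorized idea (power-of-two lengths only; not the actual tinygrad kernel):

```python
import numpy as np

def bitonic_sort(x: np.ndarray) -> np.ndarray:
  """Ascending bitonic sort of a 1-D array whose length is a power of two."""
  x, n = x.copy(), x.size
  assert n > 0 and (n & (n - 1)) == 0, "bitonic sort needs a power-of-two length"
  idx = np.arange(n)
  k = 2
  while k <= n:                      # size of the bitonic sequences being merged
    j = k // 2
    while j >= 1:                    # compare-exchange distance within this merge
      partner = x[idx ^ j]           # every element's partner, gathered in one shot
      ascending = (idx & k) == 0     # direction of the merge this element sits in
      lower_slot = (idx & j) == 0    # whether this element holds the lower index of its pair
      take_min = ascending == lower_slot
      x = np.where(take_min, np.minimum(x, partner), np.maximum(x, partner))
      j //= 2
    k *= 2
  return x

if __name__ == "__main__":
  a = np.random.randn(1024).astype(np.float32)
  assert np.array_equal(bitonic_sort(a), np.sort(a))
```

Each inner iteration is one gather plus one elementwise min/max over the whole array; the log^2 n stages are what keeps it "still slow" compared to a native sort.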
* load llama3-1B to WEBGPU device
* include compile script for loading llama3 to WEBGPU
* parametrize max_context in build_transformer fxn
* jit_model with two different args sets
* compile for webgpu, split weights
* load model weight parts in browser
* export all tensors from initialized transformer
* run transformer inference in browser
* enable tiktoken with llama bpe in browser
* count total tokens on client with tiktoken.js
* full client-side chat streaming, eliminate server
* revert change that enabled jitting with 2 argsets
* llama without Variable or cache_kv, for webgpu
* have client use mask tokens / whole context
* cleanup staged weights
* add tiktoken.js build script, README
* export CLANG for Q6_k to float32 decompression
* fix and test exported CLANG code for Q6_k to fp32
* revert changes to jit and export_model
* isolate clang export
* test Q6_K to float32 decompression in browser
* gguf_load now also returns t_infos and data_start
* prepare llama-1B Q6_K gguf chunks for browser
* cache and decompress quantized llama in browser
* enable separate deployment of large files
* fix kv cache and symbolic with llama wgpu
* eliminate browser lag during decompression
* hash metadata and weight chunks
* delete obsolete IndexedDB cache to free disk
* add progress bar, track model download/decompress
* refactor progress callback
* skip buffer hash verification for speed
* Display progress for entire loading scope
* Report page load errors to user
* actually display errors
* skip prompt tokens already seen by model
* skip prefilling with last assistant message tokens
* on page load tell user if webgpu not enabled
* push deployed URL root to window.history
* make note of bug sources with TODO items
* isolate bug in CLANG with BEAM=2
* remove clang_bug.py from diff
* decompress q6k to f32 on webgpu instead of clang
* remove unused code
* inter-weight decomp with larger wgpu kernels
* parallelize decompression submissions
* refactor dequantize scheduling
* add progress bar back
* fix bug
* temp fix for loading GGUF Q6_K to fp16 not fp32
* fix rendering of exported CLANG
* remove weight casts, sketch js functions for clang
* get symbolic vars from jit_cache for model export
* include symbolic vars in exported CLANG
* render js for clang transformer
* toggle clang/webgpu deployment; refactor decomp
* compile and render clang Q6_K->fp16 and int8 quant
* fix rendered clang for abs(fp16), to work in wasm
* simplify clang js wrapping
* run compiled clang in worker
* prepare llama weights in workers, q6k to int8/fp16
* tinychat on clang in browser, f32/int8 weights
* move wasm inference to (now flexible) worker
* don't load redundant embeddings
* modest wasm perf gain with compile flags
* set default backend, enable backend choice/backup
* render symbolic vars in exported WEBGPU
* quantize webgpu llama to int8/f32
* improve UX arising from rendered WEBGPU
* clean up webgpu launch
* new weights split: smaller chunks, tinygrad quant.
* switch webgpu inference to int8 quant
* remove unneeded clang decompression
* eliminate unneeded kv cache transfer to wasm
* use 1 worker for simplified clang decompression
* display launch errors
* refactor: stream load weight chunks to WebGPU
* show loading chunk completion
* quantize embeddings to int8
* test float16 as input for quantization
* webgpu: use f16 source, int8 embed, eliminate q6k
* simplify split weights prep: all from state_dict
* revert change to nn.state.gguf_load
* remove unneeded decompression from webgpu client
* remove unneeded code
* decrease dl chunks from 47 to 16 MiB (see the chunking sketch below)
* improve stability of webgpu loading on mobile
* autodetect mobile, improve load stability
* refactor: progress closure
* refactor: one unified progress bar
* remove unneeded code
* revert changes to tinygrad core library
* enforce ios18.3 nerfed max buf size
* BEAM=3 webgpu
* cache integrity, mobile save throttling
* improve mobile UX - no autozoom on prompt box
* clang: int8 from f16, remove q6k
* reduce concurrent dls on mobile to 2 for stability
* refactor: wasm backend with stream loading
* prevent race between wasm load and IndexedDB save
* split wasm kernels into separate modules
* js wrapper for multiple wasm module inference
* revert multi-module wasm to single module
* make mobile wasm load more stable/fast
* refactor: copy weights into wasm without crashes
* fix bug in download queue; increase mobile dls
* refactor exported clang wrapper, split weights
* remove unnecessary code
* greatly improve int8 quant quality with rounding (see the quantization sketch below)
* eliminate mobile throttling
* increase webgpu context to 4096 tokens
* export webgpu js functions
* enable separate hosted weights for mobile/pc
* enable prompt-thread switching during generation
* stop generation when max_context is reached
* show progress bar for prefill
* tell user if webgpu fails, while wasm loads
* make loading messages more concise
* update font
* revert changes to tinychat python app launch
* cleanup quantization, add scale_dtype param
* cleanup kv cache code
* cleanup compile code
* link tok_embeddings with output in webgpu export
* refactor: export_model webgpu: symbolic vars
* refactor: export_model weight loading
* forgot to commit export_model.py
* change CLANG to CPU
* deal with pylint incorrectly failing tests
* simplify f-strings for older CI python version
* fix pre-python3.12 parser errors
* [Int32Array] not Int32Array
* cleanup webgpu compile after refactor export_model
* refactor WASM export into export_model
* merge WebGPU/WASM compile scripts
* simplify max_contexts for local deployment
* fix parser issues and whitespace
* deduplicate variable defs for non-wasm clang export
* cleanup code
* cleanup compile scripts
* simplify wasm inference wrapping
* simplify webgpu symbolic vars export
* refactor: unify export of symbolic variables
* simplify WASM export
* simplify clang/wasm export
* update README and build scripts
* separate files for browser/python apps
* restore original python tinychat app files
* browser and python tinychats share assets
* minor cleanup
* isolate app layer diff
* add .gitignore for generated files
* validate CPU/WEBGPU models in python
* prevent infinite generation if validation fails
* check if exported weight files are unique
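On "greatly improve int8 quant quality with rounding": truncating toward zero when quantizing biases every weight toward zero, whereas round-to-nearest roughly halves the worst-case error. A minimal symmetric per-tensor sketch with illustrative names and per-tensor scaling; the actual split/scale layout is whatever the compile script emits:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
  """Symmetric per-tensor int8 quantization with round-to-nearest."""
  scale = max(np.abs(w).max() / 127.0, 1e-8)                     # avoid div-by-zero on all-zero tensors
  q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)    # round, don't truncate
  return q, np.float32(scale)

def dequantize_int8(q: np.ndarray, scale: np.float32) -> np.ndarray:
  return q.astype(np.float32) * scale

if __name__ == "__main__":
  w = np.random.randn(4096).astype(np.float32) * 0.02
  q, s = quantize_int8(w)
  print("max abs error:", np.abs(dequantize_int8(q, s) - w).max())
```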
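The chunked-hosting commits ("hash metadata and weight chunks", "decrease dl chunks from 47 to 16 MiB") come down to splitting the weights blob into fixed-size pieces the browser can download, verify, and cache independently. A hedged sketch of the packaging side; the file names and manifest layout here are made up, not what the compile script actually writes:

```python
import hashlib, json, pathlib

CHUNK_SIZE = 16 * 1024 * 1024  # 16 MiB, per the commit that shrank chunks from 47 MiB

def split_weights(blob_path: str, out_dir: str) -> None:
  """Split a weights blob into CHUNK_SIZE pieces and write a manifest with their hashes."""
  out = pathlib.Path(out_dir); out.mkdir(parents=True, exist_ok=True)
  data = pathlib.Path(blob_path).read_bytes()
  manifest = []
  for i in range(0, len(data), CHUNK_SIZE):
    chunk = data[i:i + CHUNK_SIZE]
    name = f"weights_{i // CHUNK_SIZE:04d}.bin"
    (out / name).write_bytes(chunk)
    manifest.append({"file": name, "size": len(chunk),
                     "sha256": hashlib.sha256(chunk).hexdigest()})
  # the client fetches the manifest first, then streams chunks and checks each hash before caching
  (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
```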
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* isolate compile/export model
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>