* add support for a custom BASEDIR for openimages download
* make export step faster
* add focal loss
* update model_eval with new dataloader
* generate_anchors in tinygrad
* update initializers for model
* small cleanup
* revert isin enhancements
* recursively go through backbone layers to freeze them
* add optimizer
* minor cleanup
* start dataloader work with input images
* add first transform for train set
* reuse existing prepare_target
* continue with dataloader implementation
* add dataloader
* separate out KiTS19 dataset test cases
* create mock data samples for test
* add dataloader + test
* cleanup dataloader test and revert shm path
* trim dataloader related code needed from ref
* got dataloader with normalize working
* update image to be float32
* add back normalization and negate it in test
* clean up reference dataset implementation + ruff changes
* add validation set test
* add proper training loop over the training dataset
* add LambdaLR support
* add LR scheduler and the start of training step
* get forward call to model to work and set up multi-GPU
* device is already passed
* return matches from dataloader
* hotfix for dataloader typo causing some hang
* start some work on classification loss
* update focal loss to support masking
* add missing test and cleanup focal loss
* cleanup unit tests
* remove masking support for sigmoid_focal_loss
* make ClassificationHead loss work
* cleanups + fix dataloader tests
* remove sigmoid when computing loss (see the loss sketch below)
* make anchors use Tensors
* simplify anchors batching
* revert anchors to use np
* implement regression loss
* fix regression loss
* cleanup losses
* move BoxCoder to MLPerf helpers
* revert helper changes
* fixes after helper refactor cleanup
* add tests for l1_loss
* start re-enabling training step
* minor cleanup
* add pycocotools to testing dependencies
* make training work
* adjust regression loss to mask after L1 loss is calculated
* reduce img and lbl sizes by half for KiTS19 dataset tests
* Revert "reduce img and lbl sizes by half for KiTS19 dataset tests"
This reverts commit d115b0c664.
* temporarily disable openimages dataset tests to debug CI
* enable openimages dataset test and create samples once
* temporarily disable openimages validation set test
* reenable test and add some debugging to the test
* add boto3 testing dependencies
* add pandas to testing dependencies
* This reverts commit 467704fec6.
* reenable test
* move sample creation to setup
* realize boxcoder's encoding
* add wandb
* fix wandb resuming feature
* move anchors as part of dataloader
* fix dtype for anchor inside dataloader and fix horizontal flip transformation
* add support for BENCHMARK
* set seed
* debug dataset test failure
* Revert "debug dataset test failure"
This reverts commit 1b2f9d7f50.
* fix dataloader script
* do not realize when sharding model weights
* setup openimages samples differently
* create the necessary samples per test case
* enable lr scheduler and fix benchmark timing
* add jit to the training loop
* add checkpointing and training resume capabilities
* refactor training loop and start work on the val loop
* add debug logging for dataloader test
* debug test
* assert boxes again
* update validation dataloader and more cleanups
* fix validation test case
* add multi device support to retinanet eval
* fix issue with realized on dataloader
* remove optional disk tensors in dataloader
* remove verbose debugging on datasets test
* put back parallel testing and remove img_ids Tensor from dataloader
* cleanup train and validation dataloader
* return validation targets in dataloader
* cleanup boxes and labels in dataloader
* fix img_ids repeating its values
* remove unnecessary targets from validation dataloader
* add validation loop to training script
* adjust LR to be the ratio of the batch size
* minor cleanups
* remove frozen layers from optimizer's params
* hyperparameter adjustments and cleanups
* model init, hyperparam, and data preprocessing updates
* no need to return loaded keys for resnet
* fix train script
* update loss calculation for RegressionHead and some cleanups
* add JIT reset support
* add nan check during training
* Revert "add nan check during training"
This reverts commit ddf1f0d5dd.
* Revert "Revert "add nan check during training""
This reverts commit b7b2943197.
* some typing cleanups
* update seeding on dataloader and the start of training script
* undo changes
* undo more changes
* more typing fixes
* minor cleanups
* update dataloader seed
* hotfix: log metric and move target metric check outside of CKPT
* check for CKPT when target metric is reached before saving
* add TRAIN_BEAM and EVAL_BEAM
* minor cleanup
* update hyperparams and add support for EVAL_BS
* add green coloring to metric reached statement
* initial work to support f16
* update model initializers to be monkeypatched
* update layers to support float32 weight loading + float16 training
* don't return loss that's scaled
* run eval on benchmark beam
* move BEAM to their respective steps
* update layers to be compatible with fp16
* end BENCHMARK after first eval
* cleanups and adjust learning rate for fp16
* remove duplicated files from test
* revert losses changes
* Revert "revert losses changes"
This reverts commit aebccf93ac.
* go back to old LR
* cast batchnorm to float32
* set new loss scaler default value for float16 (see the loss-scaling sketch below)
* remove LambdaLRScheduler
* remove runner and use dataloader on eval
* fix retinanet eval with new dataloader
* remove unused import
* revert lr_scheduler updates
* use BS=96 with new learning rate
* rename module initializers
* more cleanups on training loop
* remove contig from optim.step
* simplify sum when computing loss
For some reason, with random dropout a different AST is created on each device, and searching the embedding is slow. This workaround saved 6 minutes of setup time on mi300x (25 -> 19) and resulted in similar speed.
* add system json for mi300x mlperf
```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v5.0/tinycorp/systems/tinybox_8xMI300X.json training 4.1.0
INFO - System description checker passed for tinybox 8xMI300X
```
also removed rocm from tinybox_red since we are not using it
* update mlperf-logging version
fix the case where only eval is run: `TRAIN=0 BERT_SIZE=tiny examples/mlperf/training_submission_v5.0/tinycorp/benchmarks/bert/implementations/tinybox_green/dev_beam.sh`
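For the loss commits earlier in this list (focal loss for the ClassificationHead, L1 for the RegressionHead, dropping the separate sigmoid), these amount to the standard RetinaNet losses. A minimal numpy sketch with illustrative names, not the training script's actual functions:

```python
import numpy as np

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
  # classification loss computed from raw logits ("remove sigmoid when computing loss")
  p = 1.0 / (1.0 + np.exp(-logits))
  ce = -(targets * np.log(p + 1e-12) + (1 - targets) * np.log(1 - p + 1e-12))
  p_t = targets * p + (1 - targets) * (1 - p)
  alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
  return alpha_t * (1 - p_t) ** gamma * ce

def masked_l1_loss(pred_deltas, target_deltas, foreground_mask):
  # per-anchor L1 first, foreground masking second
  # ("adjust regression loss to mask after L1 loss is calculated")
  per_anchor = np.abs(pred_deltas - target_deltas).sum(axis=-1)
  return (per_anchor * foreground_mask).sum() / max(foreground_mask.sum(), 1)
```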
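The float16 commits (float32 weight loading, batchnorm kept in float32, a new default loss-scaler value, not returning the scaled loss) follow the usual static loss-scaling recipe. A minimal sketch, assuming tinygrad's optimizer exposes its trainable tensors as `opt.params`; the scaler value and helper names are illustrative, not what the script settled on:

```python
from tinygrad import Tensor

LOSS_SCALER = 2.0**11  # illustrative default; the real fp16 default is set in the training script

def train_step(model, opt, loss_fn, x: Tensor, y: Tensor) -> float:
  opt.zero_grad()
  loss = loss_fn(model(x), y)
  (loss * LOSS_SCALER).backward()     # backprop in the scaled range so fp16 grads don't underflow
  for p in opt.params:                # unscale gradients before the optimizer update
    if p.grad is not None: p.grad = p.grad / LOSS_SCALER
  opt.step()
  return loss.item()                  # report the unscaled loss ("don't return loss that's scaled")
```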
* jit the forward
* might timeout, idk just send it
* this is dumb
* naive bitonic lol
* idk if this is correct, but that squeeze before is definitely not
* vectorized bitonic sort, but still slow (see the numpy sketch below)
* yay 1 layer is correct
* alright its pretty good
* good enough
* rerun CI
* nit improve comment
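For the bitonic-sort commits just above: bitonic sort is an O(n log^2 n) sorting network in which every stage applies the same compare-exchange pattern to all elements at once, which is why it vectorizes into plain tensor ops. A rough numpy sketch of the vectorized idea (power-of-two lengths only; not the actual tinygrad kernel):

```python
import numpy as np

def bitonic_sort(x: np.ndarray) -> np.ndarray:
  """Ascending bitonic sort of a 1-D array whose length is a power of two."""
  x, n = x.copy(), x.size
  assert n > 0 and (n & (n - 1)) == 0, "bitonic sort needs a power-of-two length"
  idx = np.arange(n)
  k = 2
  while k <= n:                      # size of the bitonic sequences being merged
    j = k // 2
    while j >= 1:                    # compare-exchange distance within this merge
      partner = x[idx ^ j]           # every element's partner, gathered in one shot
      ascending = (idx & k) == 0     # direction of the merge this element sits in
      lower_slot = (idx & j) == 0    # whether this element holds the lower index of its pair
      take_min = ascending == lower_slot
      x = np.where(take_min, np.minimum(x, partner), np.maximum(x, partner))
      j //= 2
    k *= 2
  return x

if __name__ == "__main__":
  a = np.random.randn(1024).astype(np.float32)
  assert np.array_equal(bitonic_sort(a), np.sort(a))
```

Each inner iteration is one gather plus one elementwise min/max over the whole array; the log^2 n stages are what keeps it "still slow" compared to a native sort.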
* load llama3-1B to WEBGPU device
* include compile script for loading llama3 to WEBGPU
* parametrize max_context in build_transformer fxn
* jit_model with two different args sets
* compile for webgpu, split weights
* load model weight parts in browser
* export all tensors from initialized transformer
* run transformer inference in browser
* enable tiktoken with llama bpe in browser
* count total tokens on client with tiktoken.js
* full client-side chat streaming, eliminate server
* revert change that enabled jitting with 2 argsets
* llama without Variable or cache_kv, for webgpu
* have client use mask tokens / whole context
* cleanup staged weights
* add tiktoken.js build script, README
* export CLANG for Q6_k to float32 decompression
* fix and test exported CLANG code for Q6_k to fp32
* revert changes to jit and export_model
* isolate clang export
* test Q6_K to float32 decompression in browser
* gguf_load now also returns t_infos and data_start
* prepare llama-1B Q6_K gguf chunks for browser
* cache and decompress quantized llama in browser
* enable separate deployment of large files
* fix kv cache and symbolic with llama wgpu
* eliminate browser lag during decompression
* hash metadata and weight chunks
* delete obsolete IndexedDB cache to free disk
* add progress bar, track model download/decompress
* refactor progress callback
* skip buffer hash verification for speed
* Display progress for entire loading scope
* Report page load errors to user
* actually display errors
* skip prompt tokens already seen by model
* skip prefilling with last assistant message tokens
* on page load tell user if webgpu not enabled
* push deployed URL root to window.history
* make note of bug sources with TODO items
* isolate bug in CLANG with BEAM=2
* remove clang_bug.py from diff
* decompress q6k to f32 on webgpu instead of clang
* remove unused code
* inter-weight decomp with larger wgpu kernels
* parallelize decompression submissions
* refactor dequantize scheduling
* add progress bar back
* fix bug
* temp fix for loading GGUF Q6_K to fp16 not fp32
* fix rendering of exported CLANG
* remove weight casts, sketch js functions for clang
* get symbolic vars from jit_cache for model export
* include symbolic vars in exported CLANG
* render js for clang transformer
* toggle clang/webgpu deployment; refactor decomp
* compile and render clang Q6_K->fp16 and int8 quant
* fix rendered clang for abs(fp16), to work in wasm
* simplify clang js wrapping
* run compiled clang in worker
* prepare llama weights in workers, q6k to int8/fp16
* tinychat on clang in browser, f32/int8 weights
* move wasm inference to (now flexible) worker
* don't load redundant embeddings
* modest wasm perf gain with compile flags
* set default backend, enable backend choice/backup
* render symbolic vars in exported WEBGPU
* quantize webgpu llama to int8/f32
* improve UX arising from rendered WEBGPU
* clean up webgpu launch
* new weights split: smaller chunks, tinygrad quant.
* switch webgpu inference to int8 quant
* remove unneeded clang decompression
* eliminate unneeded kv cache transfer to wasm
* use 1 worker for simplified clang decompression
* display launch errors
* refactor: stream load weight chunks to WebGPU
* show loading chunk completion
* quantize embeddings to int8
* test float16 as input for quantization
* webgpu: use f16 source, int8 embed, eliminate q6k
* simplify split weights prep: all from state_dict
* revert change to nn.state.gguf_load
* remove unneeded decompression from webgpu client
* remove unneeded code
* decrease dl chunks from 47 to 16 MiB (see the chunking sketch below)
* improve stability of webgpu loading on mobile
* autodetect mobile, improve load stability
* refactor: progress closure
* refactor: one unified progress bar
* remove unneeded code
* revert changes to tinygrad core library
* enforce ios18.3 nerfed max buf size
* BEAM=3 webgpu
* cache integrity, mobile save throttling
* improve mobile UX - no autozoom on prompt box
* clang: int8 from f16, remove q6k
* reduce concurrent dls on mobile to 2 for stability
* refactor: wasm backend with stream loading
* prevent race between wasm load and IndexedDB save
* split wasm kernels into separate modules
* js wrapper for multiple wasm module inference
* revert multi-module wasm to single module
* make mobile wasm load more stable/fast
* refactor: copy weights into wasm without crashes
* fix bug in download queue; increase mobile dls
* refactor exported clang wrapper, split weights
* remove unnecessary code
* greatly improve int8 quant quality with rounding (see the quantization sketch below)
* eliminate mobile throttling
* increase webgpu context to 4096 tokens
* export webgpu js functions
* enable separate hosted weights for mobile/pc
* enable prompt-thread switching during generation
* stop generation when max_context is reached
* show progress bar for prefill
* tell user if webgpu fails, while wasm loads
* make loading messages more concise
* update font
* revert changes to tinychat python app launch
* cleanup quantization, add scale_dtype param
* cleanup kv cache code
* cleanup compile code
* link tok_embeddings with output in webgpu export
* refactor: export_model webgpu: symbolic vars
* refactor: export_model weight loading
* forgot to commit export_model.py
* change CLANG to CPU
* deal with pylint incorrectly failing tests
* simplify f-strings for older CI python version
* fix pre-python3.12 parser errors
* [Int32Array] not Int32Array
* cleanup webgpu compile after refactor export_model
* refactor WASM export into export_model
* merge WebGPU/WASM compile scripts
* simplify max_contexts for local deployment
* fix parser issues and whitespace
* deduplicate variable defs for non-wasm clang export
* cleanup code
* cleanup compile scripts
* simplify wasm inference wrapping
* simplify webgpu symbolic vars export
* refactor: unify export of symbolic variables
* simplify WASM export
* simplify clang/wasm export
* update README and build scripts
* separate files for browser/python apps
* restore original python tinychat app files
* browser and python tinychats share assets
* minor cleanup
* isolate app layer diff
* add .gitignore for generated files
* validate CPU/WEBGPU models in python
* prevent infinite generation if validation fails
* check if exported weight files are unique
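On "greatly improve int8 quant quality with rounding": truncating toward zero when quantizing biases every weight toward zero, whereas round-to-nearest roughly halves the worst-case error. A minimal symmetric per-tensor sketch with illustrative names and per-tensor scaling; the actual split/scale layout is whatever the compile script emits:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
  """Symmetric per-tensor int8 quantization with round-to-nearest."""
  scale = max(np.abs(w).max() / 127.0, 1e-8)                     # avoid div-by-zero on all-zero tensors
  q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)    # round, don't truncate
  return q, np.float32(scale)

def dequantize_int8(q: np.ndarray, scale: np.float32) -> np.ndarray:
  return q.astype(np.float32) * scale

if __name__ == "__main__":
  w = np.random.randn(4096).astype(np.float32) * 0.02
  q, s = quantize_int8(w)
  print("max abs error:", np.abs(dequantize_int8(q, s) - w).max())
```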
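The chunked-hosting commits ("hash metadata and weight chunks", "decrease dl chunks from 47 to 16 MiB") come down to splitting the weights blob into fixed-size pieces the browser can download, verify, and cache independently. A hedged sketch of the packaging side; the file names and manifest layout here are made up, not what the compile script actually writes:

```python
import hashlib, json, pathlib

CHUNK_SIZE = 16 * 1024 * 1024  # 16 MiB, per the commit that shrank chunks from 47 MiB

def split_weights(blob_path: str, out_dir: str) -> None:
  """Split a weights blob into CHUNK_SIZE pieces and write a manifest with their hashes."""
  out = pathlib.Path(out_dir); out.mkdir(parents=True, exist_ok=True)
  data = pathlib.Path(blob_path).read_bytes()
  manifest = []
  for i in range(0, len(data), CHUNK_SIZE):
    chunk = data[i:i + CHUNK_SIZE]
    name = f"weights_{i // CHUNK_SIZE:04d}.bin"
    (out / name).write_bytes(chunk)
    manifest.append({"file": name, "size": len(chunk),
                     "sha256": hashlib.sha256(chunk).hexdigest()})
  # the client fetches the manifest first, then streams chunks and checks each hash before caching
  (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
```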
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* isolate compile/export model
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>