1195 Commits

Author SHA1 Message Date
chenyu
572d77d1d9 bert script delete eval data after eval (#6790)
fits BS=60, which is 2% faster than BS=54. Also fixed wandb logging params
2024-09-27 20:54:00 -04:00
chenyu
f9c8e144ff chmod +x mlperf bert script for red (#6789)
also disabled raising the power cap in setup. wozeparrot mentioned that's unstable and might cause bert training issues on red
2024-09-27 11:27:32 -04:00
Francis Lata
d3a387be63 [MLPerf] Prepare openimages dataset script (#6747)
* prepare openimages for MLPerf

* cleanup

* fix issue when clearing jit_cache on retinanet eval

* revert pandas specific changes
2024-09-27 11:13:56 -04:00
chenyu
2fc26890c9 default BS=9 in handcode_opt bert (#6783)
using 54 for 6 GPUs now, and 2 is not a good default
2024-09-27 04:38:16 -04:00
George Hotz
9a3f6f392d llm.c tok/s 2024-09-27 00:46:18 -07:00
George Hotz
b0e70ab04f llm.c updates 2024-09-27 15:25:59 +08:00
chenyu
bea7ed5986 add RUNMLPERF=1 to bert dev_run.sh (#6775)
already set in run_and_time.sh; dev_run.sh also needs RUNMLPERF=1 to load real data
2024-09-26 11:00:49 -04:00
chenyu
12de203a43 add IGNORE_JIT_FIRST_BEAM to bert scripts (#6769)
* update bert BEAM params

copied from resnet to start with

* just IGNORE_JIT_FIRST_BEAM
2024-09-26 05:38:24 -04:00
wozeparrot
15cd42cfb9 feat: support TRACEMETA=2 in handcode_opt (#6767) 2024-09-26 16:58:29 +08:00
chenyu
5a5fbfa1eb smaller bert script change (#6768)
only changes the WANDB and RUNMLPERF order. BENCHMARK and BEAM will be handled differently
2024-09-26 04:54:28 -04:00
chenyu
0424c4967d fix handcode_opt.py for bert (#6756) 2024-09-26 00:20:24 -04:00
chenyu
396c96357b update mlperf bert scripts (#6755)
removed DISABLE_DROPOUT=1.
updated BS to 54, which works on tinyboxes with dropout.
used bert's sparse_categorical_crossentropy, which takes a Tensor ignore_index, in the accuracy method (see the sketch below)
2024-09-25 23:55:05 -04:00
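For context on that last point: the accuracy computation boils down to masking out an ignore_index (e.g. padded positions) before comparing predictions to labels. A minimal sketch of that masking idea, using NumPy as a stand-in rather than tinygrad's actual sparse_categorical_crossentropy / Tensor API (the helper name and shapes below are illustrative assumptions, not the repo's code):

```python
import numpy as np

# Hypothetical helper, not the repo's code: accuracy that skips ignore_index positions.
def masked_accuracy(logits: np.ndarray, labels: np.ndarray, ignore_index: int = -1) -> float:
  preds = logits.argmax(axis=-1)          # (N,) predicted class per position
  mask = labels != ignore_index           # drop padded / ignored positions
  return float((preds[mask] == labels[mask]).mean())

# usage: the last two positions carry ignore_index and do not count toward accuracy
logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
labels = np.array([1, 0, -1, -1])
print(masked_accuracy(logits, labels))    # -> 1.0
```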
George Hotz
7e73c7b3cc hotfix: bump stable diffusion val distance 2024-09-26 11:15:29 +08:00
wozeparrot
c100f3d406 default threefry (#6116) 2024-09-25 17:45:13 +08:00
George Hotz
f45d178a55 hotfix: support JIT_BATCH_SIZE=0, make that the default 2024-09-25 10:36:04 +08:00
wozeparrot
f932116e05 feat: small things from default_threefry (#6708) 2024-09-24 17:00:47 +08:00
Anurag Lamsal
568757e087 fix model_eval.py in the mlperf folder searching for bert vocab in the wrong directory (#6649) 2024-09-24 11:20:44 +08:00
samm393
19c11792fd Flux.1 (#6334)
* initial commit

* whitespace

* get rid of torch import

* indentation

* less hardcoding

* add flux.1-dev

* jit

* no double

* t5 tidy up

* validation image

* reuse sdxl autoencoder

* typing changes

* empty lines

* remove unneeded comments

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-09-24 10:08:04 +08:00
George Hotz
b9e6d42a1f Revert "gated native math in OpenCL (#6683)" (#6691)
This reverts commit 2fe3eeed17.
2024-09-24 08:48:10 +08:00
George Hotz
2fe3eeed17 gated native math in OpenCL (#6683)
* gated native math

* Update cstyle.py
2024-09-23 19:22:13 +08:00
Tobias Fischer
c1bbd15bd9 Sharded SDXL Inference (#6328)
* initial sharding fixes

* sigma device fix

* emptyline space fix

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-09-21 01:26:43 -04:00
chenyu
b14c1bc417 UOps.RANGE is_increasing (#6615)
* UOps.RANGE is_increasing

283 -> 47 valids

* test
2024-09-20 03:14:52 -04:00
George Hotz
d02bb270b7 add copyin copyout for image on GPU [run_process_replay] (#6580)
* add copyin copyout for image on GPU [run_process_replay]

* add timing

* enqueue vs total run

* it's failing but that's fine
2024-09-18 16:06:20 +08:00
George Hotz
d4b662c318 new openpilot compile (#6573)
* new openpilot compile

* note, copyout doesn't work for images
2024-09-18 14:22:50 +08:00
kormann
f5dd25d376 enable whisper batch for long sequences (#6458)
* long batch +test

* long batch +test

* cleanup

* rollback syntactic changes

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-09-17 00:42:10 -04:00
chenyu
798be6bb74 add gated read_image count in openpilot compile2 (#6546)
530 to go
2024-09-16 21:17:00 -04:00
Francis Lata
b7ce9a1530 UNet3D MLPerf (#3470)
* add training set transforms

* add DICE cross entropy loss

* convert pred and label to Tensor when calculating DICE score

* cleanups and allow train dataset batching

* fix DICE CE loss calculation

* jitted training step

* clean up DICE CE loss calculation

* initial support for sharding

* Revert "initial support for sharding"

This reverts commit e3670813b8.

* minor updates

* cleanup imports

* add support for sharding

* apply temp patch to try to avoid OOM

* revert cstyle changes

* add gradient acc

* hotfix

* add FP16 support

* add ability to train on smaller image sizes

* add support for saving and loading checkpoints + clean up various modes

* fix issue with using smaller patch size + update W&B logging

* disable LR_WARMUP_EPOCHS

* updates

* minor cleanups

* cleanup

* update order of transformations

* more cleanups

* realize loss

* cleanup

* more cleanup

* some cleanups

* add RAM usage

* minor cleanups

* add support for gradient accumulation

* cleanup imports

* minor updates to not use GA_STEPS

* remove FP16 option since it's available now globally

* update multi-GPU setup

* add timing logs for training loop

* go back to using existing dataloader and add ability to preprocess data to save time

* clean up optimization and re-enable JIT and multi-GPU support for training and evaluation

* free train and eval steps memory

* cleanups and scale batch size based on the number of GPUs

* fix GlobalCounters import

* fix seed

* fix W&B setup

* update batch size default size

* add back metric divergence check

* put back JIT on UNet3d eval

* move dataset preprocessing inside training code

* add test for dice_loss

* add config logging support to W&B and other cleanups

* change how default float is getting retrieved

* remove TinyJit import duplicate

* update config logging to W&B and remove JIT on eval_step

* no need for caching preprocessed data anymore

* fix how evaluation is run and how often

* add support for LR scaling

* fix issue with gaussian being moved to scipy.signal.windows

* remove DICE loss unit test

* fix issue where loss isn't compatible with multiGPU

* add individual BEAM control for train and eval steps

* fix ndimage scipy import

* add BENCHMARK

* cleanups on BENCHMARK + fix on rand_flip augmentation during training

* cleanup train and eval BEAM envs

* add checkpointing support after every eval

* cleanup model_eval

* disable grad during eval

* use new preprocessing dataset mechanism

* remove unused import

* use training and inference_mode contexts

* start eval after benchmarking

* add data fetching time

* cleanup decorators

* more cleanups on training script

* add message during benchmarking mode

* realize when reassigning LR on scheduler and update default number of epochs

* add JIT on eval step

* remove JIT on eval_step

* add train dataloader for unet3d

* move checkpointing to be done after every epoch

* revert removal of JIT on unet3d inference

* save checkpoint if metric is not successful

* Revert "add train dataloader for unet3d"

This reverts commit c166d129df.

* Revert "Revert "add train dataloader for unet3d""

This reverts commit 36366c65d2.

* hotfix: seed was defaulting to a value of 0

* fix SEED value

* remove the usage of context managers for setting BEAM and going from training to inference

* support new stack API for calculating eval loss and metric

* Revert "remove the usage of context managers for setting BEAM and going from training to inference"

This reverts commit 2c0ba8d322.

* check training and test preprocessed folders separately

* clean up imports and log FUSE_CONV_BW

* use train and val preprocessing constants

* add kits19 dataset setup script

* update to use the new test decorator for disabling grad

* update kits19 dataset setup script

* add docs on how to train the model

* set default value for BASEDIR

* add detailed instruction about BASEDIR usage

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-09-10 04:37:28 -04:00
kormann
f6f4f3222f whisper long batch (#6335)
* reset

* test

* only part refactor
2024-09-09 21:03:59 -04:00
qazal
935b6b658f delete seen from the scheduler api [run_process_replay] (#6427)
docs
2024-09-09 16:26:34 +08:00
wozeparrot
cb61cfce24 feat: example and extra tweaks (#6310) 2024-08-28 19:26:11 -07:00
Tobias Fischer
3517aa89d9 sdxl batched inference fixes (#6293) 2024-08-28 07:44:58 -04:00
qazal
552fbd5527 update llm.c with UOp ast [run_process_replay] (#6296) 2024-08-27 15:04:54 +03:00
chenyu
c9a9631818 no UnaryOps.NEG in generated UOp patterns (#6209)
* no UnaryOps.NEG in generated UOp patterns

removed the patterns `x * (-1) -> -x` and `x != True`

* those are fine because NEG became CMPNE and True

* fix sd validation L2 norm
2024-08-21 11:08:22 -04:00
George Hotz
9faf205601 CIFAR trainer + various bugfixes / improvements (#6146)
* move cifar into datasets

* support for pathlib Tensors, tar_extract, and fetch gunzip

* too early for Device.DEFAULT

* simpler hlb_cifar + .to(None) is default

* new compiler failure, start beautiful_cifar

* beautiful cifar runs but is broken

* jit train step

* cleaner

* std_mean, not mean_std

* more correct

* fast indexing

* don't print that

* torch load broken

* add eval

* nicer bar

* decorators are the way to do this

* bounds check the load

* a few ops

* batchnorm bugfix: if track_running_stats is False, use the online estimate (see the sketch after this commit)

* full timing

* fix fusion

* unneeded realize

* master tensor
2024-08-20 16:58:46 -07:00
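The batchnorm bugfix in the list above reflects the standard behavior: with track_running_stats=False there are no accumulated statistics, so normalization falls back to the current batch's mean and variance. A minimal sketch of that idea in plain NumPy, assuming a simplified interface rather than tinygrad's actual BatchNorm implementation:

```python
import numpy as np

# Illustrative only: normalize with running stats when tracked, otherwise use the
# online (per-batch) estimate, which is the fallback the bugfix enables.
def batchnorm(x: np.ndarray, running_mean=None, running_var=None,
              track_running_stats: bool = True, eps: float = 1e-5) -> np.ndarray:
  if track_running_stats and running_mean is not None:
    mean, var = running_mean, running_var        # accumulated statistics
  else:
    mean, var = x.mean(axis=0), x.var(axis=0)    # online estimate from this batch
  return (x - mean) / np.sqrt(var + eps)
```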
George Hotz
d9c62a33c3 add cifar to datasets.py (#6210) 2024-08-20 11:42:49 -07:00
George Hotz
17a043edad tensor inference (#6156)
* tensor inference

* test is even better name
2024-08-18 00:19:28 -07:00
qazal
28c75bf2a6 merge uops with ops (#6111)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-08-16 18:17:57 -04:00
qazal
c23d44c779 AST is UOp (#6030)
* most of the work from the uops2 branch

* schedule

* realize

* kernel

* lowerer

* search

* green

* merge uops with ops

* Revert "merge uops with ops"

This reverts commit 1408a59f12.

* fix benchmark

* remove extra dedup
2024-08-16 22:09:00 +03:00
George Hotz
14b613e281 add STEPS to beautiful_mnist 2024-08-10 15:23:44 -07:00
wozeparrot
d269bc95fa faster tinychat (#5993) 2024-08-08 19:16:26 -07:00
Elias Wahl
c9b4602854 no load in INITMLPERF (#5957) 2024-08-08 11:28:24 -04:00
Elias Wahl
c9862e17d4 MLPERF BERT submission scripts (#5931)
* green

* red

* fix benchmark

* log

* count train samples

* oops. 4.0 -> 4.1

* note to todo

* no pillow
2024-08-06 18:09:18 -04:00
chenyu
1dab75ae37 clean up mlperf dataloader import (#5940)
use tinygrad tqdm for dataset, and PIL Image is only needed for resnet
2024-08-06 17:10:08 -04:00
George Hotz
e077bc7baf move memory planner to realize (#5937) 2024-08-06 10:41:29 -07:00
Elias Wahl
937bf5fe12 better hparam (#5891) 2024-08-03 12:38:53 -04:00
Elias Wahl
4a114756f6 New BERT dataloader (#5881)
* One file == One topic

* update test

* new dataloader

* update train script

* get index is faster
2024-08-02 15:12:23 -04:00
David Hou
9a485f36e4 shard kvcache (#5830) 2024-07-30 20:29:54 -07:00
George Hotz
21c5e8e1b7 extreme llama speed, 57.34 tok/s (#5827)
* extreme llama speed

* mergeable
2024-07-30 18:32:09 -07:00
Francis Lata
a0baff7a3d update dataloader script example (#5818) 2024-07-30 15:18:29 -04:00
wozeparrot
eebb1b9922 feat: temperature 0 llama3 benchmark (#5806) 2024-07-30 12:05:36 -07:00