Francis Lata
b802f74cee
add dataloader
2024-10-11 23:04:21 -04:00
Francis Lata
5dbebf460e
continue with dataloader implementation
2024-10-05 23:51:16 -07:00
Francis Lata
8ca848d542
reuse existing prepare_target
2024-10-05 18:21:15 -07:00
Francis Lata
30f6f6a094
add first transform for train set
2024-10-05 18:12:38 -07:00
Francis Lata
12c4259b16
Merge branch 'master' into retinanet_mlperf
2024-10-05 16:22:58 -07:00
chenyu
16c1fa4208
use BEAM=3 for red box bert runs (#6904)
...
BEAM=4 slightly exceeded 30 minutes setup
2024-10-05 09:21:12 -04:00
chenyu
0e706227a2
add seed to bert result log filename (#6903)
...
* add seed to bert result log filename
* different name for different benchmark
2024-10-05 09:15:24 -04:00
George Hotz
f4ec39fe58
switch symbolic from old to uops, final PR (#6872)
...
* switch symbolic from old to uops, final PR
* two wrong answers
* not needed resolves
* symbolic ops passes
* symbolic ops passes
* progress
* tests pass (almost)
* fix last test
* fix some tests
* global binding and unbinding
* Revert "global binding and unbinding"
This reverts commit 9456725630.
* that test works now
* vars on uop doesn't recurse
* fix fuzzer
* update
* fix type
* fix gpt, it's UOp now
* ssimplify symbolics
2024-10-04 16:42:27 +08:00
chenyu
7391376528
update bert hparams (#6876)
...
4h32m with this: https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/q99frv1l/overview
loss scaler 2**13->2**10. matched the closest submission, no nan for ~10 runs.
increased lr and total step a bit.
`PARALLEL=0` after setup, same as resnet.
2024-10-04 00:39:06 -04:00
Francis Lata
5bc8230739
start dataloader work with input images
2024-10-02 11:47:37 -07:00
Francis Lata
f81603d983
minor cleanup
2024-10-02 06:57:23 -07:00
Francis Lata
d281e64411
add optimizer
2024-10-02 04:46:08 -07:00
Francis Lata
b8e24b4f4d
recursively go through backbone layers to freeze them
2024-10-02 04:21:44 -07:00
Francis Lata
7432cd7857
Merge branch 'master' into retinanet_mlperf
2024-10-01 19:26:35 -07:00
chenyu
5f77217772
bert default CKPT to 0 (#6840)
...
not required
2024-10-01 21:55:56 -04:00
George Hotz
547733e57c
stunning_mnist [run_process_replay] (#6828)
...
* stunning_mnist [run_process_replay]
* add loss to stunning mnist
2024-10-01 15:00:48 +08:00
chenyu
f59517754e
add RESET_STEP in bert to control reset (#6818)
...
same as resnet
2024-09-30 09:39:04 -04:00
George Hotz
2ed94e447f
gpt2: corealize opt and loss
2024-09-30 09:11:20 +08:00
George Hotz
a76c6c740c
hand pad gpt2 (#6805)
2024-09-30 09:03:07 +08:00
chenyu
494b20e886
bert BS back to 54 (#6791)
...
60 does not run end to end
2024-09-27 22:16:05 -04:00
chenyu
572d77d1d9
bert script delete eval data after eval (#6790)
...
fits BS=60 which is 2% faster than 54. also fixed wandb logging params
2024-09-27 20:54:00 -04:00
Francis Lata
35105e769b
small cleanup
2024-09-27 13:28:53 -07:00
Francis Lata
a879d19c82
update initializers for model
2024-09-27 13:25:28 -07:00
Francis Lata
52a1ba7fcb
Merge branch 'master' into retinanet_mlperf
2024-09-27 08:32:44 -07:00
chenyu
f9c8e144ff
chmod +x mlperf bert script for red (#6789)
...
also disabled raising power cap in setup. wozeparrot mentioned that's unstable and might cause bert training issue on red
2024-09-27 11:27:32 -04:00
Francis Lata
9fbc3f1fc7
Merge branch 'master' into retinanet_mlperf
2024-09-27 08:16:24 -07:00
Francis Lata
d3a387be63
[MLPerf] Prepare openimages dataset script (#6747)
...
* prepare openimages for MLPerf
* cleanup
* fix issue when clearing jit_cache on retinanet eval
* revert pandas specific changes
2024-09-27 11:13:56 -04:00
Francis Lata
23563c84d6
Merge branch 'master' into retinanet_mlperf
2024-09-27 07:12:52 -07:00
chenyu
2fc26890c9
default BS=9 in handcode_opt bert (#6783)
...
using 54 for 6 gpus now, and 2 is not a good default
2024-09-27 04:38:16 -04:00
George Hotz
9a3f6f392d
llm.c tok/s
2024-09-27 00:46:18 -07:00
George Hotz
b0e70ab04f
llm.c updates
2024-09-27 15:25:59 +08:00
Francis Lata
211b04ba2c
Merge branch 'master' into retinanet_mlperf
2024-09-26 15:03:00 -07:00
chenyu
bea7ed5986
add RUNMLPERF=1 to bert dev_run.sh (#6775)
...
already set in run_and_time.sh, need RUNMLPERF=1 for it to load real data
2024-09-26 11:00:49 -04:00
chenyu
12de203a43
add IGNORE_JIT_FIRST_BEAM to bert scripts (#6769)
...
* update bert BEAM params
copied from resnet to start with
* just IGNORE_JIT_FIRST_BEAM
2024-09-26 05:38:24 -04:00
Francis Lata
ea05de325c
Merge branch 'master' into retinanet_mlperf
2024-09-26 02:20:28 -07:00
wozeparrot
15cd42cfb9
feat: support TRACEMETA=2 in handcode_opt (#6767)
2024-09-26 16:58:29 +08:00
chenyu
5a5fbfa1eb
smaller bert script change (#6768)
...
only WANDB and RUNMLPERF order. BENCHMARK and BEAM will be done differently
2024-09-26 04:54:28 -04:00
chenyu
0424c4967d
fix handcode_opt.py for bert (#6756)
2024-09-26 00:20:24 -04:00
chenyu
396c96357b
update mlperf bert scripts (#6755)
...
removed DISABLE_DROPOUT=1.
updated BS to 54 that works on tinyboxes with dropouts.
used bert's sparse_categorical_crossentropy that takes Tensor ignore_index in accuracy method
2024-09-25 23:55:05 -04:00
George Hotz
7e73c7b3cc
hotfix: bump stable diffusion val distance
2024-09-26 11:15:29 +08:00
Francis Lata
b7a8de1a4e
Merge branch 'master' into retinanet_mlperf
2024-09-25 10:57:32 -07:00
wozeparrot
c100f3d406
default threefry (#6116)
2024-09-25 17:45:13 +08:00
George Hotz
f45d178a55
hotfix: support JIT_BATCH_SIZE=0, make that the default
2024-09-25 10:36:04 +08:00
wozeparrot
f932116e05
feat: small things from default_threefry (#6708)
2024-09-24 17:00:47 +08:00
Anurag Lamsal
568757e087
fix model_eval.py in the mlperf folder searching for bert vocab in the wrong directory (#6649)
2024-09-24 11:20:44 +08:00
samm393
19c11792fd
Flux.1 (#6334)
...
* initial commit
* whitespace
* get rid of torch import
* indentation
* less hardcoding
* add flux.1-dev
* jit
* no double
* t5 tidy up
* validation image
* reuse sdxl autoencoder
* typing changes
* empty lines
* remove unneeded comments
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-09-24 10:08:04 +08:00
George Hotz
b9e6d42a1f
Revert "gated native math in OpenCL (#6683)" (#6691)
...
This reverts commit 2fe3eeed17.
2024-09-24 08:48:10 +08:00
Francis Lata
b8925aeb62
update model_eval with new dataloader
2024-09-23 13:03:34 -07:00
Francis Lata
98adc8a40d
Merge branch 'master' into retinanet_mlperf
2024-09-23 12:14:59 -07:00
Francis Lata
b15be98d51
add focal loss
2024-09-23 12:10:59 -07:00