Francis Lata
4bebe61a9c
add dataloader + test
2024-10-16 15:38:47 -04:00
Francis Lata
3d857d758e
Merge branch 'master' into retinanet_mlperf
2024-10-16 15:36:37 -04:00
Francis Lata
90eff347e2
tinytqdm write support (#6359)
* add write support
* add test
* update test case to compare write outputs
* assert final write output
* flush when using write
* update write logic
* Revert "update write logic"
This reverts commit 5e0e611b46.
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-10-16 14:51:41 -04:00
Francis Lata
498141c579
Merge branch 'master' into retinanet_mlperf
2024-10-16 10:14:39 -04:00
George Hotz
3169cb386d
remove graph [pr] (#7085)
2024-10-16 11:40:07 +08:00
George Hotz
26df50cf43
move memory_planner to memory.py [pr] (#7079)
2024-10-16 10:04:35 +08:00
Francis Lata
d5813a3c42
Merge branch 'master' into retinanet_mlperf
2024-10-12 22:04:58 -04:00
chenyu
ed1ed9e4ff
bert use BS=72 (#7015)
memory 131 -> 138
green tflops 201 -> 209
red tflops 160 -> 169
2024-10-12 09:41:56 -04:00
George Hotz
a71bb09ec3
remove symbolic file [pr] (#7012)
2024-10-12 18:44:44 +08:00
Francis Lata
1295a3020f
Merge branch 'master' into retinanet_mlperf
2024-10-11 23:08:17 -04:00
Francis Lata
b802f74cee
add dataloader
2024-10-11 23:04:21 -04:00
George Hotz
5c9f76e274
hotfix: openpilot compile3 compare to i==1
2024-10-12 09:44:24 +08:00
chenyu
36056e0760
update mlperf systems and copy 4.1 to 5.0 (#7004)
2024-10-11 16:20:34 -04:00
chenyu
0e42662f2a
log seed at the right place for bert (#7000)
2024-10-11 10:39:40 -04:00
nimlgen
5496a36536
update red mlperf bert readme (#6969)
2024-10-11 13:08:06 +03:00
Friedrich Carl Eichenroth
859d6d0407
Fix mypy examples/beautiful_*.py (#6978)
* fix mypy examples/beautiful_*.py
* backwards
* add test
* Revert "add test"
This reverts commit 4d88845ba3.
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-10-10 11:34:29 -04:00
Kinvert
960c495755
added beautiful fashion mnist and example (#6961)
* added beautiful fashion mnist and example
* fixing whitespace
* refactor Fashion MNIST to fewer lines
* fix newline to reduce diff
* Update beautiful_mnist.py
* Update beautiful_mnist.py
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-10-10 12:01:07 +08:00
chenyu
b5546912e2
10% more TRAIN_STEPS for bert (#6971)
got two very close runs, adding more steps for buffer
2024-10-09 19:21:43 -04:00
chenyu
35cf48659b
limit beam param for bert on green (#6966)
seems to mitigate the crash
2024-10-09 11:48:18 -04:00
chenyu
1ff2c98f8a
fix logfile name for bert red (#6952)
2024-10-08 05:37:52 -04:00
chenyu
a78c96273a
update bert epoch logging (#6940)
* update bert epoch logging
epoch for bert is simply number of examples seen (which is used for RCP check)
* update total steps too
* more changes
2024-10-08 00:34:06 -04:00
chenyu
102dfe5510
back to 2**10 for bert loss scaler (#6934)
getting 2 NaN for this, revert back to 2**10
2024-10-07 10:17:21 -04:00
chenyu
0cf815a93a
bert use BS=66 and update hparams (#6932)
with dropout memory improvement, we can fit BS=66 now. revert back to the hparams in #5891 too
2024-10-07 05:08:27 -04:00
chenyu
718b959349
log epoch start and stop for bert (#6912)
2024-10-06 06:39:46 -04:00
Francis Lata
5dbebf460e
continue with dataloader implementation
2024-10-05 23:51:16 -07:00
Francis Lata
8ca848d542
reuse existing prepare_target
2024-10-05 18:21:15 -07:00
Francis Lata
30f6f6a094
add first transform for train set
2024-10-05 18:12:38 -07:00
Francis Lata
12c4259b16
Merge branch 'master' into retinanet_mlperf
2024-10-05 16:22:58 -07:00
chenyu
16c1fa4208
use BEAM=3 for red box bert runs (#6904)
BEAM=4 slightly exceeded 30 minutes setup
2024-10-05 09:21:12 -04:00
chenyu
0e706227a2
add seed to bert result log filename (#6903)
* add seed to bert result log filename
* different name for different benchmark
2024-10-05 09:15:24 -04:00
George Hotz
f4ec39fe58
switch symbolic from old to uops, final PR (#6872)
* switch symbolic from old to uops, final PR
* two wrong answers
* not needed resolves
* symbolic ops passes
* symbolic ops passes
* progress
* tests pass (almost)
* fix last test
* fix some tests
* global binding and unbinding
* Revert "global binding and unbinding"
This reverts commit 9456725630.
* that test works now
* vars on uop doesn't recurse
* fix fuzzer
* update
* fix type
* fix gpt, it's UOp now
* ssimplify symbolics
2024-10-04 16:42:27 +08:00
chenyu
7391376528
update bert hparams (#6876)
4h32m with this https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/q99frv1l/overview.
loss scaler 2**13->2**10. matched the closest submission, no nan for ~10 runs.
increased lr and total step a bit.
`PARALLEL=0` after setup, same as resnet.
2024-10-04 00:39:06 -04:00
Francis Lata
5bc8230739
start dataloader work with input images
2024-10-02 11:47:37 -07:00
Francis Lata
f81603d983
minor cleanup
2024-10-02 06:57:23 -07:00
Francis Lata
d281e64411
add optimizer
2024-10-02 04:46:08 -07:00
Francis Lata
b8e24b4f4d
recursively go through backbone layers to freeze them
2024-10-02 04:21:44 -07:00
Francis Lata
7432cd7857
Merge branch 'master' into retinanet_mlperf
2024-10-01 19:26:35 -07:00
chenyu
5f77217772
bert default CKPT to 0 (#6840)
not required
2024-10-01 21:55:56 -04:00
George Hotz
547733e57c
stunning_mnist [run_process_replay] (#6828)
* stunning_mnist [run_process_replay]
* add loss to stunning mnist
2024-10-01 15:00:48 +08:00
chenyu
f59517754e
add RESET_STEP in bert to control reset (#6818)
same as resnet
2024-09-30 09:39:04 -04:00
George Hotz
2ed94e447f
gpt2: corealize opt and loss
2024-09-30 09:11:20 +08:00
George Hotz
a76c6c740c
hand pad gpt2 (#6805)
2024-09-30 09:03:07 +08:00
chenyu
494b20e886
bert BS back to 54 (#6791)
60 does not run end to end
2024-09-27 22:16:05 -04:00
chenyu
572d77d1d9
bert script delete eval data after eval (#6790)
fits BS=60 which is 2% faster than 54. also fixed wandb logging params
2024-09-27 20:54:00 -04:00
Francis Lata
35105e769b
small cleanup
2024-09-27 13:28:53 -07:00
Francis Lata
a879d19c82
update initializers for model
2024-09-27 13:25:28 -07:00
Francis Lata
52a1ba7fcb
Merge branch 'master' into retinanet_mlperf
2024-09-27 08:32:44 -07:00
chenyu
f9c8e144ff
chmod +x mlperf bert script for red (#6789)
also disabled raising power cap in setup. wozeparrot mentioned that's unstable and might cause bert training issue on red
2024-09-27 11:27:32 -04:00
Francis Lata
9fbc3f1fc7
Merge branch 'master' into retinanet_mlperf
2024-09-27 08:16:24 -07:00
Francis Lata
d3a387be63
[MLPerf] Prepare openimages dataset script (#6747)
* prepare openimages for MLPerf
* cleanup
* fix issue when clearing jit_cache on retinanet eval
* revert pandas specific changes
2024-09-27 11:13:56 -04:00