Commit Graph

1111 Commits

Author SHA1 Message Date
Francis Lata
27ec792c19 check for CKPT when target metric is reached before saving 2025-03-02 00:41:08 -08:00
Francis Lata
3ac4ae5870 hotfix: log metric and move target metric check outside of CKPT 2025-03-01 04:31:00 -08:00
Francis Lata
974309862d update dataloader seed 2025-02-28 21:41:30 +00:00
Francis Lata
6a62ece474 minor cleanups 2025-02-28 15:43:11 +00:00
Francis Lata
074e9f742b more typing fixes 2025-02-28 15:42:11 +00:00
Francis Lata
e9d1af26b2 undo more changes 2025-02-28 15:11:17 +00:00
Francis Lata
47edcdb834 undo changes 2025-02-28 15:08:55 +00:00
Francis Lata
bdf442717c update seeding on dataloader and the start of training script 2025-02-28 14:58:28 +00:00
Francis Lata
87bfa77f4a some typing cleanups 2025-02-28 14:47:29 +00:00
Francis Lata
dc394e8214 Merge branch 'master' into retinanet_mlperf 2025-02-27 15:33:20 -05:00
George Hotz
67ba073c55 hotfix: test accuracy in beautiful_mnist_torch 2025-02-27 11:18:59 +08:00
Francis Lata
4fa62ba304 Merge branch 'master' into retinanet_mlperf 2025-02-26 13:27:35 -05:00
Francis Lata
86b737a120 leakyrelu to leaky_relu (#9270) 2025-02-26 13:22:08 -05:00
Francis Lata
7cb226d757 Revert "Revert "add nan check during training""
This reverts commit b7b2943197.
2025-02-26 15:43:20 +00:00
Francis Lata
e006ae24ea Merge branch 'master' into retinanet_mlperf 2025-02-26 07:31:32 +00:00
George Hotz
3f4eb9006a test for device mismatch [pr] (#9250)
* test for device mismatch [pr]

* fix bert
2025-02-26 13:06:33 +08:00
chenyu
979e84f30e RESET_STEP in bert setup and beam (#9248)
running dev_beam might OOM without it, but it runs fine in the real run.
2025-02-25 19:15:10 -05:00
Francis Lata
b7b2943197 Revert "add nan check during training"
This reverts commit ddf1f0d5dd.
2025-02-25 21:43:28 +00:00
chenyu
6610ad58ab hotfix bert no shard with only one device (#9243)
`LLVM=1 BERT_SIZE="tiny" DEFAULT_FLOAT=HALF BENCHMARK=5 MODEL="bert" python3 examples/mlperf/model_train.py` runs for me with this. it should not have failed with a single-device shard, though
2025-02-25 09:05:11 -05:00
Francis Lata
ddf1f0d5dd add nan check during training 2025-02-25 10:53:31 +00:00
Francis Lata
8737020d75 add JIT reset support 2025-02-25 10:52:26 +00:00
Francis Lata
30d5daa121 Merge branch 'master' into retinanet_mlperf 2025-02-25 10:32:34 +00:00
nimlgen
b4c3780df0 hotfix: interop example (#9237)
* hotfix: interop example

* rm this

* fix

* fix ci mps

* atol rtol

* no uaf
2025-02-25 10:32:00 +03:00
chenyu
8c7be428e5 update bert BS to 78 (#9236)
fits 78 now. about 215 tflops on green
2025-02-24 22:47:35 -05:00
nimlgen
56288243e6 metal PyTorch interop (#9229)
* add from_blob support to mps cuda

* objc_id

* metal pytorch interop

* fix comments

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2025-02-24 22:36:08 +03:00
nimlgen
1d06d61b16 from_blob for cuda (#9223)
* from_blob for cuda

* maybe docs?

* minor docs

* example

* waiting 9224

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-24 14:02:06 +03:00
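For context, a minimal sketch of the zero-copy direction this PR enables: wrapping an existing CUDA allocation (here a PyTorch tensor's device pointer) as a tinygrad Tensor via Tensor.from_blob. The keyword arguments and the torch-side setup here are assumptions based on the PR title, not the merged example.

```python
# Hedged sketch, not the example added in this PR: wrap a torch CUDA buffer
# as a tinygrad Tensor without copying, by handing its device pointer to
# Tensor.from_blob. The dtype/device kwargs shown here are assumptions.
import torch
from tinygrad import Tensor, dtypes

torch_buf = torch.arange(16, dtype=torch.float32, device="cuda")
tg_view = Tensor.from_blob(torch_buf.data_ptr(), (16,), dtype=dtypes.float32, device="CUDA")
print(tg_view.numpy())  # same values as torch_buf, read through the shared buffer
```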
George Hotz
24615db5f5 hotfix: torch cuda interop example 2025-02-24 09:02:48 +00:00
Francis Lata
2c3417dfce Merge branch 'master' into retinanet_mlperf 2025-02-23 21:23:28 +00:00
Francis Lata
60c13c2932 update loss calculation for regression head and some cleanups 2025-02-23 21:22:33 +00:00
ShikChen
05e3202fba remove unused memsize_to_str and minor cleanups [pr] (#9211)
* fix edge cases in memsize_to_str()

Inputs <= 1 now return "0.00 B" for 0 and "1.00 B" for 1, avoiding an
IndexError. Also, memsize_to_str(1000) now returns "1.00 KB" instead of
"1000.00 B".

Replaced the list comprehension with a next(...) generator for conciseness
and efficiency.

* simplify code using idiomatic python

- Remove the unused `memsize_to_str()` function in helpers.
- Use a tuple for checking multiple string prefixes/suffixes.
- Avoid unnecessary list construction by using iterables directly.
- Check None in @diskcache to ensure proper caching of falsy values.

* revert generators back to list comprehension

Sometimes building the list first can be faster. Keep it as is.
2025-02-23 09:58:37 -05:00
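For reference, a minimal sketch of the next(...)-style helper the first bullet describes; the unit list and formatting here are assumptions, not the code from the PR (which ultimately removed the unused helper).

```python
def memsize_to_str(b: int) -> str:
  # Pick the largest 1000-based unit that fits. max(b, 1) keeps 0 and 1 in
  # the "B" bucket instead of raising IndexError, and ">=" maps 1000 to
  # "1.00 KB" rather than "1000.00 B".
  units = [(1e12, "TB"), (1e9, "GB"), (1e6, "MB"), (1e3, "KB"), (1, "B")]
  d, suffix = next((d, s) for d, s in units if max(b, 1) >= d)
  return f"{b/d:.2f} {suffix}"

assert memsize_to_str(0) == "0.00 B" and memsize_to_str(1000) == "1.00 KB"
```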
George Hotz
4e6665bda5 different way to write torch backend (#9197)
* different way to write torch backend

* both backends

* more work

* simpler code

* more work

* test both

* imply unwrap/wrap

* FORWARD_ONLY=1 TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_add works

* ready to start making test_ops work in torch backend

* backward pass, TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_add works

* FORWARD_ONLY=1 TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_simple_conv2d works

* matmul backward is broken with as_strided
2025-02-22 14:42:26 +08:00
George Hotz
e87be0131e torch backend start (#9191)
* start torch backend

* progress

* ugh, you need cpp crap

* 1+1 works

* 1+1 works

* becoming a real backend

* ready to merge?
2025-02-21 16:57:28 +08:00
chenyu
2e7c2780a9 CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
Francis Lata
7dba815c47 fix train script 2025-02-19 20:43:02 +00:00
Francis Lata
fc36f09b1e no need to return loaded keys for resnet 2025-02-19 20:35:03 +00:00
chenyu
3b37cc898b add bert tiny config (#9177)
set with BERT_SIZE=tiny. easier to study embedding and fusion
2025-02-19 14:57:03 -05:00
Francis Lata
41378e74a6 model init, hyperparam, and data preprocessing updates 2025-02-19 18:47:06 +00:00
chenyu
975c318dbc bert use int32 for input ids (#9173)
original data was int32 for these. float might have caused precision issues
2025-02-19 08:17:27 -05:00
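The precision concern here is the general one: float32 only represents integers exactly up to 2**24, so storing integer ids as floats can silently round large values, while int32 keeps them exact. A quick numpy illustration (not code from the repo):

```python
import numpy as np

ids = np.array([7, 30521, 2**24 + 1], dtype=np.int64)
print(ids.astype(np.float32).astype(np.int64))  # [7, 30521, 16777216] -- last id rounded by float32
print(ids.astype(np.int32))                     # [7, 30521, 16777217] -- int32 stays exact
```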
chenyu
ff05bff221 put bert data shard inside jit (#9160)
python time 45ms -> 9ms; it was spending the time scheduling the shard

also init bert data on CLANG since it's from numpy, so we don't create the tensor on the default device and then shard it onto the GPUS
2025-02-18 10:36:54 -05:00
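For context, a minimal sketch of the pattern this commit describes, assuming tinygrad's TinyJit and Tensor.shard: the host-side batch starts on CLANG (it comes from numpy) and the shard onto the GPUs happens inside the jitted step, so its scheduling cost is captured by the JIT rather than paid in Python each iteration. Names and shapes here are illustrative, not the model_train.py code.

```python
import numpy as np
from tinygrad import Tensor, TinyJit, Device

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))  # assumed device list

@TinyJit
def train_step(input_ids: Tensor) -> Tensor:
  x = input_ids.shard(GPUS, axis=0)   # shard now happens inside the JIT
  return (x * 2).sum().realize()      # stand-in for the real bert step

batch = np.zeros((8, 512), dtype=np.int32)
loss = train_step(Tensor(batch, device="CLANG"))  # numpy-backed data starts on CLANG
```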
chenyu
5dc1257ce0 clean up bert fake data iterator [pr] (#9145)
reuse the same get_data_bert path in setup and real run
2025-02-17 20:03:38 -05:00
George Hotz
7eea9b639d hotfix: add replay_pkl debugging env 2025-02-17 17:34:58 +08:00
George Hotz
4672d9af73 actual tests for the dsp backend [pr] (#9102)
* actual tests for the dsp backend [pr]

* fix name
2025-02-15 15:17:56 +08:00
chenyu
81597ddd96 increase lr for bert (#9098)
had one run that converged better https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/u66tv2hh/overview
2025-02-14 19:10:35 -05:00
Francis Lata
cfa1c2d50e hyperparameter adjustments and cleanups 2025-02-14 17:53:06 +00:00
chenyu
b58e7b1898 zero out the weight in bert init run (#9076)
`DEFAULT_FLOAT=HALF BENCHMARK=10 BS=66 EVAL_BS=6 GPUS=6 MODEL=bert python3 examples/mlperf/model_train.py` no longer OOMs. I think the buffers of randomly initialized weights caused the OOM.
2025-02-14 08:40:41 -05:00
Francis Lata
caf9b2baa2 Merge branch 'master' into retinanet_mlperf 2025-02-14 06:28:37 +00:00
chenyu
9e91898941 bert eval at the end of training (#9070)
always eval at the last epoch
2025-02-13 16:29:44 -05:00
Francis Lata
3a2f126e7b Merge branch 'master' into retinanet_mlperf 2025-02-13 15:40:10 +00:00
Francis Lata
5f26692068 remove frozen layers from optimizer's params 2025-02-13 06:36:13 +00:00
chenyu
f4f56d7c15 move time_linearizer to extra.optimization.helpers [pr] (#9048)
no longer used in tinygrad
2025-02-12 15:49:58 -05:00