Francis Lata
27ec792c19
check for CKPT when target metric is reached before saving
2025-03-02 00:41:08 -08:00
Francis Lata
3ac4ae5870
hotfix: log metric and move target metric check outside of CKPT
2025-03-01 04:31:00 -08:00
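The checkpoint gating described in the two entries above can be sketched as below; `maybe_save_ckpt`, the JSON state format, and the metric logging are hypothetical stand-ins for illustration, not the actual training-script API:

```python
import json

def maybe_save_ckpt(metric: float, target: float, state: dict, path: str) -> bool:
    """Log the metric every time, but save a checkpoint only once the
    target metric is reached -- the check happens before any checkpoint I/O."""
    print(f"eval metric: {metric:.4f} (target {target:.4f})")  # log metric first
    if metric < target:
        return False  # target not reached yet: skip checkpointing entirely
    with open(path, "w") as f:
        json.dump(state, f)  # stand-in for the real checkpoint serialization
    return True
```

A restarted run can then look for the file's existence to know the target was already hit.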
Francis Lata
974309862d
update dataloader seed
2025-02-28 21:41:30 +00:00
Francis Lata
6a62ece474
minor cleanups
2025-02-28 15:43:11 +00:00
Francis Lata
074e9f742b
more typing fixes
2025-02-28 15:42:11 +00:00
Francis Lata
e9d1af26b2
undo more changes
2025-02-28 15:11:17 +00:00
Francis Lata
47edcdb834
undo changes
2025-02-28 15:08:55 +00:00
Francis Lata
bdf442717c
update seeding on dataloader and the start of training script
2025-02-28 14:58:28 +00:00
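Deterministic dataloader seeding of the kind the entries above touch can be sketched as a per-epoch index shuffle; `shuffled_indices` is a hypothetical helper for illustration, not the repo's dataloader:

```python
import random

def shuffled_indices(n: int, seed: int, epoch: int) -> list[int]:
    """Deterministic per-epoch shuffle: the same (seed, epoch) pair always
    yields the same order, so a restarted run replays identical batches."""
    rng = random.Random(seed + epoch)  # derive a per-epoch RNG from the base seed
    idxs = list(range(n))
    rng.shuffle(idxs)
    return idxs
```

Using an instance-local `random.Random` (rather than reseeding the global RNG) keeps the shuffle order independent of any other randomness in the training script.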
Francis Lata
87bfa77f4a
some typing cleanups
2025-02-28 14:47:29 +00:00
Francis Lata
dc394e8214
Merge branch 'master' into retinanet_mlperf
2025-02-27 15:33:20 -05:00
George Hotz
67ba073c55
hotfix: test accuracy in beautiful_mnist_torch
2025-02-27 11:18:59 +08:00
Francis Lata
4fa62ba304
Merge branch 'master' into retinanet_mlperf
2025-02-26 13:27:35 -05:00
Francis Lata
86b737a120
leakyrelu to leaky_relu ( #9270 )
2025-02-26 13:22:08 -05:00
Francis Lata
7cb226d757
Revert "Revert "add nan check during training""

This reverts commit b7b2943197.
2025-02-26 15:43:20 +00:00
Francis Lata
e006ae24ea
Merge branch 'master' into retinanet_mlperf
2025-02-26 07:31:32 +00:00
George Hotz
3f4eb9006a
test for device mismatch [pr] ( #9250 )
* test for device mismatch [pr]
* fix bert
2025-02-26 13:06:33 +08:00
chenyu
979e84f30e
RESET_STEP in bert setup and beam ( #9248 )
running dev_beam might OOM without it, but it runs fine in a real run.
2025-02-25 19:15:10 -05:00
Francis Lata
b7b2943197
Revert "add nan check during training"
This reverts commit ddf1f0d5dd.
2025-02-25 21:43:28 +00:00
chenyu
6610ad58ab
hotfix bert no shard with only one device ( #9243 )
`LLVM=1 BERT_SIZE="tiny" DEFAULT_FLOAT=HALF BENCHMARK=5 MODEL="bert" python3 examples/mlperf/model_train.py` runs for me with this. It should not fail with a single-device shard, though.
2025-02-25 09:05:11 -05:00
Francis Lata
ddf1f0d5dd
add nan check during training
2025-02-25 10:53:31 +00:00
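The NaN check mentioned above can be sketched as a small guard run each step; `check_loss` is a hypothetical name, and the real training loop would inspect a tensor's value rather than a plain float:

```python
import math

def check_loss(loss: float, step: int) -> float:
    """Abort training early when the loss goes non-finite, instead of
    silently optimizing on garbage for the rest of the run."""
    if math.isnan(loss) or math.isinf(loss):
        raise RuntimeError(f"non-finite loss {loss} at step {step}")
    return loss
```

Failing fast here also makes reverts like the ones above easy to test: a bad hyperparameter change surfaces at the first diverging step.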
Francis Lata
8737020d75
add JIT reset support
2025-02-25 10:52:26 +00:00
Francis Lata
30d5daa121
Merge branch 'master' into retinanet_mlperf
2025-02-25 10:32:34 +00:00
nimlgen
b4c3780df0
hotfix: interop example ( #9237 )
* hotfix: interop example
* rm this
* fix
* fix ci mps
* atol rtol
* no uaf
2025-02-25 10:32:00 +03:00
chenyu
8c7be428e5
update bert BS to 78 ( #9236 )
fits 78 now; about 215 TFLOPS on green
2025-02-24 22:47:35 -05:00
nimlgen
56288243e6
metal PyTorch interop ( #9229 )
* add from_blob support to mps cuda
* objc_id
* metal pytorch interop
* fix comments
---------
Co-authored-by: George Hotz <geohot@gmail.com>
2025-02-24 22:36:08 +03:00
nimlgen
1d06d61b16
from_blob for cuda ( #9223 )
* from_blob for cuda
* maybe docs?
* minor docs
* example
* waiting 9224
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-24 14:02:06 +03:00
George Hotz
24615db5f5
hotfix: torch cuda interop example
2025-02-24 09:02:48 +00:00
Francis Lata
2c3417dfce
Merge branch 'master' into retinanet_mlperf
2025-02-23 21:23:28 +00:00
Francis Lata
60c13c2932
update loss calculation for regression head and some cleanups
2025-02-23 21:22:33 +00:00
ShikChen
05e3202fba
remove unused memsize_to_str and minor cleanups [pr] ( #9211 )
* fix edge cases in memsize_to_str()
Inputs <= 1 now return "0.00 B" for 0 and "1.00 B" for 1, avoiding an
IndexError. Also, memsize_to_str(1000) now returns "1.00 KB" instead of
"1000.00 B".
Replaced the list comprehension with a next(...) generator for conciseness
and efficiency.
* simplify code using idiomatic python
- Remove the unused `memsize_to_str()` function in helpers.
- Use a tuple for checking multiple string prefixes/suffixes.
- Avoid unnecessary list construction by using iterables directly.
- Check None in @diskcache to ensure proper caching of falsy values.
* revert generators back to list comprehension
Sometimes building the list first can be faster; keep it as is.
2025-02-23 09:58:37 -05:00
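The edge-case behavior described in this entry ("0.00 B" for 0, "1.00 B" for 1, "1.00 KB" for 1000, a `next(...)` generator over the unit list) can be reconstructed as a sketch; this illustrates the described fix and is not the exact helper that was later removed:

```python
def memsize_to_str(b: int) -> str:
    # Decimal units, largest first; "B" acts as the fallback case so that
    # 0 and 1 return "0.00 B" / "1.00 B" instead of raising an IndexError.
    units = [("TB", 1e12), ("GB", 1e9), ("MB", 1e6), ("KB", 1e3), ("B", 1)]
    return next(f"{b/d:.2f} {u}" for u, d in units if b >= d or u == "B")
```

The `next(...)` over a generator stops at the first matching unit, avoiding the list construction the original list comprehension paid for.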
George Hotz
4e6665bda5
different way to write torch backend ( #9197 )
* different way to write torch backend
* both backends
* more work
* simpler code
* more work
* test both
* imply unwrap/wrap
* FORWARD_ONLY=1 TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_add works
* ready to start making test_ops work in torch backend
* backward pass, TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_add works
* FORWARD_ONLY=1 TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_simple_conv2d works
* matmul backward is broken with as_strided
2025-02-22 14:42:26 +08:00
George Hotz
e87be0131e
torch backend start ( #9191 )
* start torch backend
* progress
* ugh, you need cpp crap
* 1+1 works
* 1+1 works
* becoming a real backend
* ready to merge?
2025-02-21 16:57:28 +08:00
chenyu
2e7c2780a9
CLANG -> CPU ( #9189 )
2025-02-20 18:03:09 -05:00
Francis Lata
7dba815c47
fix train script
2025-02-19 20:43:02 +00:00
Francis Lata
fc36f09b1e
no need to return loaded keys for resnet
2025-02-19 20:35:03 +00:00
chenyu
3b37cc898b
add bert tiny config ( #9177 )
set with BERT_SIZE=tiny; easier to study embedding and fusion
2025-02-19 14:57:03 -05:00
Francis Lata
41378e74a6
model init, hyperparam, and data preprocessing updates
2025-02-19 18:47:06 +00:00
chenyu
975c318dbc
bert use int32 for input ids ( #9173 )
original data was int32 for these; float might have caused precision issues
2025-02-19 08:17:27 -05:00
chenyu
ff05bff221
put bert data shard inside jit ( #9160 )
python time 45ms -> 9ms; it was spending time scheduling the shard
also init bert data on CLANG since it's from numpy, so we don't create the tensor on the default device and then shard it onto GPUS
2025-02-18 10:36:54 -05:00
chenyu
5dc1257ce0
clean up bert fake data iterator [pr] ( #9145 )
reuse the same get_data_bert path in setup and real run
2025-02-17 20:03:38 -05:00
George Hotz
7eea9b639d
hotfix: add replay_pkl debugging env
2025-02-17 17:34:58 +08:00
George Hotz
4672d9af73
actual tests for the dsp backend [pr] ( #9102 )
* actual tests for the dsp backend [pr]
* fix name
2025-02-15 15:17:56 +08:00
chenyu
81597ddd96
increase lr for bert ( #9098 )
had one run that converged better: https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/u66tv2hh/overview
2025-02-14 19:10:35 -05:00
Francis Lata
cfa1c2d50e
hyperparameter adjustments and cleanups
2025-02-14 17:53:06 +00:00
chenyu
b58e7b1898
zero out the weight in bert init run ( #9076 )
`DEFAULT_FLOAT=HALF BENCHMARK=10 BS=66 EVAL_BS=6 GPUS=6 MODEL=bert python3 examples/mlperf/model_train.py` no longer OOMs. I think the buffer of randomly initialized weights caused the OOM.
2025-02-14 08:40:41 -05:00
Francis Lata
caf9b2baa2
Merge branch 'master' into retinanet_mlperf
2025-02-14 06:28:37 +00:00
chenyu
9e91898941
bert eval at the end of training ( #9070 )
always eval at the last epoch
2025-02-13 16:29:44 -05:00
Francis Lata
3a2f126e7b
Merge branch 'master' into retinanet_mlperf
2025-02-13 15:40:10 +00:00
Francis Lata
5f26692068
remove frozen layers from optimizer's params
2025-02-13 06:36:13 +00:00
chenyu
f4f56d7c15
move time_linearizer to extra.optimization.helpers [pr] ( #9048 )
...
no longer used in tinygrad
2025-02-12 15:49:58 -05:00