Commit Graph

8210 Commits

Author SHA1 Message Date
Francis Lata
27ec792c19 check for CKPT when target metric is reached before saving 2025-03-02 00:41:08 -08:00
Francis Lata
3ac4ae5870 hotfix: log metric and move target metric check outside of CKPT 2025-03-01 04:31:00 -08:00
Francis Lata
974309862d update dataloader seed 2025-02-28 21:41:30 +00:00
Francis Lata
6a62ece474 minor cleanups 2025-02-28 15:43:11 +00:00
Francis Lata
074e9f742b more typing fixes 2025-02-28 15:42:11 +00:00
Francis Lata
e9d1af26b2 undo more changes 2025-02-28 15:11:17 +00:00
Francis Lata
47edcdb834 undo changes 2025-02-28 15:08:55 +00:00
Francis Lata
bdf442717c update seeding on dataloader and the start of training script 2025-02-28 14:58:28 +00:00
Francis Lata
87bfa77f4a some typing cleanups 2025-02-28 14:47:29 +00:00
Francis Lata
dc394e8214 Merge branch 'master' into retinanet_mlperf 2025-02-27 15:33:20 -05:00
chenyu
8ee2b460ee Tensor.var_mean (#9287) 2025-02-27 15:15:31 -05:00
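The `Tensor.var_mean` commit above adds a fused variance/mean pair. A minimal numpy sketch of the likely semantics, assuming it mirrors `torch.var_mean` (the `correction` default and return order here are assumptions, not taken from the commit):

```python
import numpy as np

def var_mean(x, axis=None, correction=1):
    # hypothetical numpy analogue of a var_mean pair: one call
    # returns (variance, mean) instead of two separate reductions
    mean = np.mean(x, axis=axis)
    var = np.var(x, axis=axis, ddof=correction)
    return var, mean
```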
qazal
cdf66cc67f test: recompute expanded CAST (#9286)
* those views should merge

* diff cleanup

* gpu

* put it behind CAST_AFTER_EXPAND
2025-02-27 19:22:17 +01:00
nimlgen
43e60914f3 init torch hooking (#9284)
* smth

* mv

* prof wk

* revert and move

* fix

* nvprof

* fix and don't print much
2025-02-27 19:36:55 +03:00
George Hotz
387ea41e99 increase speed of torch mnist: use gradient api (#9282) 2025-02-27 11:57:41 +08:00
Priyank Patel
a0764f0dc0 (bounty) Make mnist training run with torch backend (#9233)
* yml changes

* torch backend remove meta decomps and add test

* torch backend bump timeout for tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-27 11:32:25 +08:00
George Hotz
67ba073c55 hotfix: test accuracy in beautiful_mnist_torch 2025-02-27 11:18:59 +08:00
George Hotz
9088125a6a a lil more torch (#9280) 2025-02-27 11:12:20 +08:00
George Hotz
b6a14911c8 start torch.compile support (#9279) 2025-02-27 10:29:51 +08:00
chenyu
4342300eff lower test_gemm_8192 amd to 70 (#9277)
flaky
2025-02-26 16:32:08 -05:00
nimlgen
c4c29c8acc nv: parse elf attrs (#9275)
* better

* hm

* hm

* fixed
2025-02-26 23:21:57 +03:00
chenyu
6350725e2d simpler leaky_relu (#9271)
rendered as `*(data0+alu0) = ((val0<0.0f)?(0.01f*val0):val0);` instead of two wheres.

possible to update rewrite rules too
2025-02-26 13:43:48 -05:00
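To illustrate the `simpler leaky_relu` commit above: the new form renders as a single select, like the quoted `((val0<0.0f)?(0.01f*val0):val0)`, rather than composing two. A numpy sketch of the two formulations (the two-where version is an assumed illustration, not the exact old tinygrad expression):

```python
import numpy as np

def leaky_relu_two_wheres(x, neg_slope=0.01):
    # assumed older-style formulation built from two selects
    return np.where(x > 0, x, 0.0) + np.where(x < 0, neg_slope * x, 0.0)

def leaky_relu_one_where(x, neg_slope=0.01):
    # single select, matching the quoted rendered ternary
    return np.where(x < 0, neg_slope * x, x)
```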
Francis Lata
4fa62ba304 Merge branch 'master' into retinanet_mlperf 2025-02-26 13:27:35 -05:00
Francis Lata
86b737a120 leakyrelu to leaky_relu (#9270) 2025-02-26 13:22:08 -05:00
chenyu
cd822bbe11 hotfix torch_grad.detach().cpu().numpy() in test_ops (#9268) 2025-02-26 12:27:35 -05:00
chenyu
49ca90df75 update test_ops backward tests (#9267)
instead of `(out+1).square().mean().backward()`, use forward.sum().gradient to test values closer to the op's raw gradients
2025-02-26 12:09:24 -05:00
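Why `forward.sum().gradient` stays closer to the op's own gradients than the old `(out+1).square().mean().backward()` loss: the old loss scales each element's gradient by an out-dependent `2*(out+1)/N` factor. A numpy sketch of that scaling, using an assumed elementwise `out = x**2` as a stand-in op:

```python
import numpy as np

# assumed stand-in op: out = x**2, so d(sum(out))/dx = 2*x
x = np.array([0.5, -1.0, 2.0])
out = x ** 2
grad_sum = 2 * x  # gradient via forward.sum(): the raw op gradient

# old-style loss mean((out+1)**2) multiplies the raw gradient
# by an extra out-dependent factor 2*(out+1)/N
N = out.size
grad_legacy = (2 * (out + 1) / N) * 2 * x
```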
Francis Lata
7cb226d757 Revert "Revert "add nan check during training""
This reverts commit b7b2943197.
2025-02-26 15:43:20 +00:00
Francis Lata
e0e50fc482 Merge branch 'master' into retinanet_mlperf 2025-02-26 15:43:05 +00:00
chenyu
aaf0a8069f xor -> bitwise_xor (#9264) 2025-02-26 10:21:14 -05:00
George Hotz
2158dc4849 full fix for as_strided in torch backend (#9257)
* fixes from chargpt for torch backend

* shrink support

* add stride support

* comment cleanup

* a few more

* work

* import the stream hack

* llvm multi auto
2025-02-26 22:34:05 +08:00
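For context on the `as_strided` fix above: `as_strided` reinterprets one buffer under a new shape and strides, so the resulting views can alias and overlap, which a torch backend has to reproduce faithfully. A numpy sketch of the kind of overlapping view involved (an illustration of the semantics, not tinygrad's implementation):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# view a flat 6-element buffer as 4 overlapping windows of length 3:
# row i starts at element i, so adjacent rows share memory
buf = np.arange(6)
windows = as_strided(buf, shape=(4, 3),
                     strides=(buf.itemsize, buf.itemsize))
```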
qazal
f60f997bf7 viz ui fixes [pr] (#9261) 2025-02-26 14:52:18 +01:00
qazal
bfd1e55bda show zoom to fit button in VIZ if graph isn't in view [pr] (#9258)
* show zoom to fit button in VIZ if graph isn't in view [pr]

* select #render
2025-02-26 14:20:39 +01:00
qazal
f70bad42ce minor becomes_map cleanup + comments [pr] (#9256)
* substitute assign source for KERNEL + comments [pr]

* minor becomes_map cleanup + comments [pr]
2025-02-26 12:36:27 +01:00
George Hotz
7780393460 rig up torch's testing framework [pr] (#9254)
* rig up torch's testing framework [pr]

* support more movement ops

* dec on expand

* fix tests

* work

* fix tests

* a few more

* decomps + opt hook

* installed pytest
2025-02-26 18:46:22 +08:00
qazal
b3755370ae substitute assign source for KERNEL + comments [pr] (#9255) 2025-02-26 11:44:29 +01:00
qazal
941559098b do not lockup VIZ when rendering big graphs [pr] (#8795)
* new viz renderer

* aesthetics

* progress message

* pruning + timeout at 2s
2025-02-26 09:15:26 +01:00
qazal
e162aa862d is_realized only if buffer is allocated (#9253)
* is_realized only if the buffer is allocated

* fix the image check too

* assert test_lil_model after ExecItems run
2025-02-26 08:58:08 +01:00
George Hotz
b603af373e run some tests from torch [pr] (#9252)
* run some tests from torch [pr]

* yml

* wrap_out

* clean up for the new people

* a lil more
2025-02-26 15:42:22 +08:00
Francis Lata
e006ae24ea Merge branch 'master' into retinanet_mlperf 2025-02-26 07:31:32 +00:00
George Hotz
3f4eb9006a test for device mismatch [pr] (#9250)
* test for device mismatch [pr]

* fix bert
2025-02-26 13:06:33 +08:00
Sieds Lykles
9c4d9d9f10 Acc first (#9232)
* put acc in front of the add chain

* handle the other case

* Make loop collapse more generic

* Remove mulacc_unrolled

* Actually remove it

---------

Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-02-25 22:10:15 -05:00
chenyu
979e84f30e RESET_STEP in bert setup and beam (#9248)
running dev_beam might OOM without it but runs fine in a real run.
2025-02-25 19:15:10 -05:00
Francis Lata
b7b2943197 Revert "add nan check during training"
This reverts commit ddf1f0d5dd.
2025-02-25 21:43:28 +00:00
nimlgen
2676c9d46e dsp: raise exec errors as RuntimeError for beam (#9246) 2025-02-25 19:22:35 +03:00
nimlgen
70db8c3003 hcq: dyn alloc signals (#9238)
* hcq: dyn alloc signals

* types and unique devs

* typing

* mypy

* mypy one more time

* test

* make fds not intersect in mockgpu between drivers
2025-02-25 17:22:24 +03:00
chenyu
6610ad58ab hotfix bert no shard with only one device (#9243)
`LLVM=1 BERT_SIZE="tiny" DEFAULT_FLOAT=HALF BENCHMARK=5 MODEL="bert" python3 examples/mlperf/model_train.py` runs for me with this. it should not fail with a single-device shard though
2025-02-25 09:05:11 -05:00
qazal
bba9c22f53 implement the new subbuffer spec for DISK [pr] (#9241) 2025-02-25 13:36:23 +01:00
qazal
48dfed064a remove const/var from the kernel graph [pr] (#9240) 2025-02-25 12:21:55 +01:00
Francis Lata
ddf1f0d5dd add nan check during training 2025-02-25 10:53:31 +00:00
Francis Lata
8737020d75 add JIT reset support 2025-02-25 10:52:26 +00:00
Francis Lata
30d5daa121 Merge branch 'master' into retinanet_mlperf 2025-02-25 10:32:34 +00:00