1234 Commits

chenyu
485e80da69 run_and_time for resnet ci (#10405) 2025-05-18 23:39:57 -04:00
George Hotz
411392dfb7 move files into uop dir (#10399)
* move files into uop dir [pr]

* tinygrad.uop is a thing

* fix uop docs, no pr

* fix viz
2025-05-18 11:38:28 -07:00
George Hotz
0b733ba75e multi device training with GPT2 [pr] (#10375)
* multi device training with GPT2 [pr]

* Update grouper.py
2025-05-17 15:33:56 -07:00
wozeparrot
12a1ccc680 clean: double import (#10345) 2025-05-15 20:15:09 -07:00
wozeparrot
1ed04f993b move benchmark stat tracking to influxdb (#10185) 2025-05-15 16:14:56 -07:00
George Hotz
568d6d96e7 small changes from new multi [pr] (#10318) 2025-05-14 20:50:59 -07:00
George Hotz
bfc30fa6ea hotfix: typo in shm_name 2025-05-14 19:34:52 -07:00
George Hotz
2bc54b3e22 manually handle OSX 2025-05-14 19:17:51 -07:00
George Hotz
ab460486d7 Revert "resnet dataloader osx (#10316)"
This reverts commit aef336930a.
2025-05-14 19:15:07 -07:00
George Hotz
aef336930a resnet dataloader osx (#10316)
* mlperf dataloader on mac

* resnet dataloader [pr]

* simple should work
2025-05-14 18:31:26 -07:00
chenyu
fbaa26247a randn_like in minrf (#10298)
tested that it trains to similar loss
2025-05-14 07:59:50 -04:00
George Hotz
98c84a711d min rectified flow example [pr] (#10252)
* work on minrf example

* more

* jit sample

* t is tensor not const

* fixes

* more convs

* fix dropout

* don't print

* 504

* big patch

* onehot

* touch

* use embeddings

* dumb uses final layer

* act

* non fl

* match

* tp

* 3

* of

* ppsz

* normal

* add adln

* no t

* weird transformer

* weird transformer

* contig

* actual speed fix

* dumb

* cb

* 0

* t is 0

* mort-t

* args

* dumb days are over

* readable

* contig

* no more t mask

* mask_t

* init to zero

* clean

* steps

* work

* tt

* t

* solid
2025-05-11 18:36:44 -07:00
Adam Van Ymeren
a28ca0680f update dead link (#10242) 2025-05-09 19:59:52 -04:00
Rory Clear
9f2931ae67 Fix yolo load failing silently (#10046)
* wait for js before loading model

* use f32

* revert html changes, try both cameras and remove f16 req

* clean
2025-05-07 11:46:09 -07:00
Kevin Buhler
363481e2fb correct mispelled words (#10165) 2025-05-05 08:12:41 -07:00
chenyu
4a04098389 fix llama3 with nf4 quantize (#10107)
also int8 outputs is wrong
2025-04-29 15:14:36 -04:00
qazal
a59d18da21 hack for VIZ=1 with examples/llama (#10103)
* hack for VIZ=1 with examples/llama

* move it alongside BEAM=0
2025-04-29 23:42:17 +08:00
chenyu
3eba3d6ee9 don't pass model in convert_from_huggingface and convert_from_gguf (#10094)
it only needs n_layers
2025-04-28 20:11:19 -04:00
chenyu
610ee79b22 cherry pick mlperf5.0 branch to master (#10089) 2025-04-28 15:36:56 -04:00
George Hotz
b341296304 hotfix: save sdxl ram 2025-04-27 12:09:45 -04:00
George Hotz
68c5f7ba80 load fast in sdxl (#10072)
* load fast in sdxl

* back to that with the ret

* no context
2025-04-27 11:58:51 -04:00
George Hotz
4b8ef6ce78 hotfix: sdxl corealize 2025-04-27 10:41:46 -04:00
George Hotz
1253819151 make beautiful indexing use a Variable (#10063)
* make beautiful indexing use a Variable

* stunning test

* better color

* training is broken

* fix tests

* fix variable indexing

* fix test

* no contiguous

* revert that

* revert that too

* indexing two bind

* skip for webgpu

* make not slow
2025-04-27 08:22:38 -04:00
Rory Clear
a13a43c4fe yolo 416 to 640 res (#10047) 2025-04-26 20:45:58 -04:00
George Hotz
ea5dddc537 reduce collapse generic (#10045)
* reduce collapse generic

* new arange folder

* new range folding

* correct with sym

* all tests pass

* indexing ops passes

* failing tests

* fix tests, remove unused

* revert that

* torch indexing is fast

* skip on webgpu

* touchups

* comments
2025-04-26 09:13:24 -04:00
Rory Clear
3a189fa561 More yolo processing in tinygrad (#9928)
* more tg less np

* update webgpu html for new compile

* resize boxes

* remove text

* add back note

* fix indentation

* fix indentation

* remove magic num

* remove now unused funcs

* back to numpy nms

* no loop

* fix iou suppression

* update test

* dont suppress other classes

* add working scale

* fix expected value, rounded up 0.24 was being counted

* add postprocess bool for onnx test

* fix indents

* clean

* clean

* fix indent

* remove print

* fix indent

* remove unused import

* remove hardcoded 0.25

* space

* spacing

* clean label_predictions func

* remove single item lists

* space

* use postprocess output in test

* space

* clean

* clean

* remove redundant threshold

* remove redundant threshold

* clean

* rename var

* move loop into func

* unhardcode iou_threshold

* remove unused values

* clean

* add note

* clean

* keep const

* move back funcs

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-04-24 16:21:46 -04:00
chenyu
74c6cf8be3 lint mlperf model_train (#10038) 2025-04-24 16:19:44 -04:00
chenyu
a25abf55e3 retinanet only call postprocess_detections with RUNMLPERF (#10017)
during setup only need to compile `_eval_step().numpy()`
2025-04-23 20:45:38 -04:00
chenyu
65faa1d94b explicit device in mlperf scripts (#10015) 2025-04-23 17:11:52 -04:00
chenyu
a3f938dbee remove retinanet INITMLPERF from beam script (#10011)
it only controls logging, loading real data or not is solely controlled by RUNMLPERF
2025-04-23 14:32:54 -04:00
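The flag split described in #10011 (INITMLPERF controls only logging, RUNMLPERF alone decides whether real data is loaded) follows a common env-var gating pattern. A minimal generic sketch, assuming int-style flags like `RUNMLPERF=1`; the function bodies and return strings are illustrative placeholders, not the actual mlperf scripts:

```python
import os

def flag(name: str, default: int = 0) -> bool:
    # read an int-style env flag, e.g. RUNMLPERF=1
    return bool(int(os.environ.get(name, default)))

def load_batch() -> str:
    # RUNMLPERF alone decides real vs. fake data
    if flag("RUNMLPERF"):
        return "real openimages batch"   # placeholder for the real loader
    return "synthetic benchmark batch"   # fake/empty data for BEAM runs

if flag("INITMLPERF"):
    # INITMLPERF only controls MLPerf-style logging, never data loading
    print(":::MLLOG init_start")

batch = load_batch()
```

Keeping the two concerns on separate flags means a BEAM/compile pass can run with fake data regardless of whether MLPerf logging is on.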
Francis Lata
5542aeb0e4 RetinaNet MLPerf flag updates (#10009)
* add RUNMLPERF and update INITMLPERF usage

* update scripts to use RUNMLPERF
2025-04-23 13:00:34 -04:00
George Hotz
de0504276b pop 0 is slow [pr] (#10007) 2025-04-23 17:00:59 +01:00
chenyu
d3a8d5c128 print postprocess_detections time in retinanet eval (#10005)
`BS=96 BASEDIR="/raid/datasets/openimages" MODEL=retinanet python examples/mlperf/model_eval.py`

```
...
loaded dataset             @  8.64s
loaded initial data        @ 12.57s
******  619.97 ms to enqueue, 46042.13 ms to realize ( 116.22 ms fetching, 45399.58 ms postprocess_detections).     0.09 examples/sec.  0.83 TFLOPS  @ 59.23s
******  147.49 ms to enqueue, 37362.16 ms to realize ( 146.96 ms fetching, 36618.84 ms postprocess_detections).     0.11 examples/sec.  1.03 TFLOPS  @ 96.74s
******  152.85 ms to enqueue, 37244.08 ms to realize ( 120.67 ms fetching, 36235.19 ms postprocess_detections).     0.11 examples/sec.  1.04 TFLOPS  @ 134.14s
******  146.39 ms to enqueue, 37279.85 ms to realize (  65.07 ms fetching, 36233.56 ms postprocess_detections).     0.11 examples/sec.  1.04 TFLOPS  @ 171.56s
******  152.41 ms to enqueue, 37264.04 ms to realize ( 127.08 ms fetching, 36196.10 ms postprocess_detections).     0.11 examples/sec.  1.04 TFLOPS  @ 208.98s
******  151.29 ms to enqueue, 36868.08 ms to realize ( 142.73 ms fetching, 36153.07 ms postprocess_detections).     0.11 examples/sec.  1.05 TFLOPS  @ 246.00s
******  136.41 ms to enqueue, 37325.04 ms to realize (  90.29 ms fetching, 36573.38 ms postprocess_detections).     0.11 examples/sec.  1.04 TFLOPS  @ 283.46s
```
2025-04-23 11:39:56 -04:00
chenyu
c39128133c retinanet green scripts (#9996)
also removed realize in data_get and used empty for fake data. slightly bigger lr. https://wandb.ai/chenyuxyz/MLPerf-RetinaNet/runs/8skid0e8?nw=nwuserchenyuxyz
2025-04-23 08:28:03 -04:00
chenyu
fb89d9a584 retinanet eval combine output on GPUS[0] (#9966)
eval 35 sec -> 20 sec. it was spending 13 seconds assembling output tensor on CPU backend. GPUS[0] seems to have enough memory, otherwise we can lower EVAL_BS
2025-04-22 07:43:51 -04:00
chenyu
5294c32279 dev scripts for retinanet (#9968)
also BASE_DIR -> BASEDIR for consistency, and move wandb up a bit for more accurate timing
2025-04-21 17:54:56 -04:00
Francis Lata
defa1e77f6 get the proper dataset count (#9962) 2025-04-21 12:11:37 -04:00
Francis Lata
d7e247f329 RetinaNet INITMLPERF support (#9950)
* fixes to make fake data work

* fix eval beam

* fix merge issue
2025-04-21 10:32:05 -04:00
Francis Lata
ea4cb2c715 small cleanups (#9947) 2025-04-20 20:33:20 -04:00
chenyu
6c30948df6 hand_coded_optimizations returns list[Opt] [pr] (#9938)
new api looks like `k.apply_opts(hand_coded_optimizations(k))`
2025-04-19 20:26:59 -04:00
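The API change in #9938 can be sketched with toy stand-ins. The `Kernel`/`Opt` classes below are illustrative assumptions modeled on the commit message, not the actual tinygrad classes; the point is the shape of the new call, `k.apply_opts(hand_coded_optimizations(k))`:

```python
from dataclasses import dataclass, field

# Toy stand-ins for the pattern in #9938: the heuristic now *returns*
# a list of Opt objects instead of mutating the kernel, and the caller
# applies them in a separate, explicit step. Names are illustrative only.
@dataclass(frozen=True)
class Opt:
    op: str
    axis: int

@dataclass
class Kernel:
    shape: tuple
    applied: list = field(default_factory=list)
    def apply_opts(self, opts: list) -> "Kernel":
        # applying opts is the only mutating step
        self.applied.extend(opts)
        return self

def hand_coded_optimizations(k: Kernel) -> list:
    # pure function: inspect the kernel, suggest opts, mutate nothing
    return [Opt("UPCAST", ax) for ax, dim in enumerate(k.shape) if dim % 4 == 0]

k = Kernel(shape=(8, 3, 16))
k.apply_opts(hand_coded_optimizations(k))  # the new two-step call
```

Returning the opts instead of applying them in place lets a caller inspect, filter, or log the suggested opts before committing to them.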
chenyu
3fdba48fc7 update bert green and README (#9934)
submission candidate
2025-04-18 21:21:28 -04:00
chenyu
617b45748f fuse embedding for bert on red (#9925)
also updated BEAM param and use AMD driver for actual run. 535ms step
2025-04-18 07:20:25 -04:00
chenyu
e2ed673c94 FUSE_ARANGE_UINT to not fuse uint (#9915)
hack to bypass rand, can FUSE_ARANGE on green for 6ms per step
2025-04-16 18:49:38 -04:00
chenyu
e8024c8281 faster bert global_norm (#9901)
tinyamd 2% faster.  also updated beam params that's 2-3% faster.

update mlperf doc and steps too
2025-04-15 18:24:44 -04:00
Sieds Lykles
91ccf1c343 Off by one error in start_pos (#9792)
Variable upper bound is inclusive
2025-04-15 15:07:13 -04:00
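The bug class behind #9792 is easy to reproduce with a toy bound check (a hedged illustration; tinygrad's actual `Variable` API is not reproduced here): when a symbolic variable's upper bound is inclusive, a `start_pos` that can take values `0..limit-1` must be declared with `max=limit-1`, not `max=limit`:

```python
# Minimal illustration of an inclusive-upper-bound off-by-one.
# This toy Variable(name, vmin, vmax) admits every integer v with
# vmin <= v <= vmax -- i.e. the upper bound is INCLUSIVE.
class Variable:
    def __init__(self, name: str, vmin: int, vmax: int):
        self.name, self.vmin, self.vmax = name, vmin, vmax
    def valid_values(self) -> range:
        return range(self.vmin, self.vmax + 1)  # +1: max is inclusive

MAX_CONTEXT = 128  # hypothetical context length for illustration

# Buggy: max=MAX_CONTEXT admits start_pos == 128, one past the last
# valid position (0..127).
buggy = Variable("start_pos", 0, MAX_CONTEXT)
# Fixed: the inclusive bound must be MAX_CONTEXT - 1.
fixed = Variable("start_pos", 0, MAX_CONTEXT - 1)

assert len(list(fixed.valid_values())) == MAX_CONTEXT
assert len(list(buggy.valid_values())) == MAX_CONTEXT + 1
```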
Francis Lata
31483050c0 add eval_freq flag (#9894) 2025-04-15 06:42:40 -04:00
chenyu
43d3a75d6c increase bert max train_steps (#9883) 2025-04-14 08:53:44 -04:00
Nishant Rajadhyaksha
32ed128598 fixing transformer training bug (#9877) 2025-04-13 19:34:20 -04:00
chenyu
e2a40fb523 update bert mi300x script (#9872)
2 runs failed to converge in 10 back to back runs, increase total train steps and some beam params (2% faster step)
2025-04-13 10:07:36 -04:00
Francis Lata
2793cca9a6 RetinaNet MLPerf (#8385)
* add support for a custom BASEDIR for openimages download

* make export step faster

* add focal loss

* update model_eval with new dataloader

* generate_anchors in tinygrad

* update initializers for model

* small cleanup

* revert isin enhancements

* recursively go through backbone layers to freeze them

* add optimizer

* minor cleanup

* start dataloader work with input images

* add first transform for train set

* reuse existing prepare_target

* continue with dataloader implementation

* add dataloader

* separate out KiTS19 dataset test cases

* create mock data samples for test

* add dataloader + test

* cleanup dataloader test and revert shm path

* trim dataloader related code needed from ref

* got dataloader with normalize working

* update image to be float32

* add back normalization and negate it in test

* clean up reference dataset implementation + ruff changes

* add validation set test

* add proper training loop over the training dataset

* add LambdaLR support

* add LR scheduler and the start of training step

* get forward call to model work and setup multi-GPU

* already passed device

* return matches from dataloader

* hotfix for dataloader typo causing some hang

* start some work on classification loss

* update focal loss to support masking

* add missing test and cleanup focal loss

* cleanup unit tests

* remove masking support for sigmoid_focal_loss

* make ClassificationHead loss work

* cleanups + fix dataloader tests

* remove sigmoid when computing loss

* make anchors use Tensors

* simplify anchors batching

* revert anchors to use np

* implement regression loss

* fix regression loss

* cleanup losses

* move BoxCoder to MLPerf helpers

* revert helper changes

* fixes after helper refactor cleanup

* add tests for l1_loss

* start re-enabling training step

* minor cleanup

* add pycocotools to testing dependencies

* make training work

* adjust regression loss to mask after L1 loss is calculated

* reduce img and lbl sizes by half for KiTS19 dataset tests

* Revert "reduce img and lbl sizes by half for KiTS19 dataset tests"

This reverts commit d115b0c664.

* temporarily disable openimages dataset tests to debug CI

* enable openimages dataset test and create samples once

* temporarily disable openimages validation set test

* reenable test and add some debugging to the test

* add boto3 testing dependencies

* add pandas to testing dependencies

* This reverts commit 467704fec6.

* reenable test

* move sample creation to setup

* realize boxcoder's encoding

* add wandb

* fix wandb resuming feature

* move anchors as part of dataloader

* fix dtype for anchor inside dataloader and fix horizontal flip transformation

* add support for BENCHMARK

* set seed

* debug dataset test failuire

* Revert "debug dataset test failuire"

This reverts commit 1b2f9d7f50.

* fix dataloader script

* do not realize when sharding model weights

* setup openimages samples differently

* create the necessary samples per test case

* enable lr scheduler and fix benchmark timing

* add jit to the training loop

* add checkpointing and training resume capabilities

* refactor on training loop and start the work on val looop

* add debug logging for dataloader test

* debug test

* assert boxes again

* update validation dataloader and more cleanups

* fix validation test case

* add multi device support to retinanet eval

* fix issue with realized on dataloader

* remove optional disk tensors in dataloader

* remove verbose debugging on datasets test

* put back parallel testing and remove img_ids Tensor from dataloader

* cleanup train and validation dataloader

* return validation targets in dataloader

* cleanup boxes and labels in dataloader

* fix img_ids repeating its values

* remove unnecessary targets from validation dataloader

* add validation loop to training script

* adjust LR to be the ratio of the batch size

* minor cleanups

* remove frozen layers from optimizer's params

* hyperparameter adjustments and cleanups

* model init, hyperparam, and data preprocessing updates

* no need to return loaded keys for resnet

* fix train script

* update loss calculation for regresionhead and some cleanups

* add JIT reset support

* add nan check during training

* Revert "add nan check during training"

This reverts commit ddf1f0d5dd.

* Revert "Revert "add nan check during training""

This reverts commit b7b2943197.

* some typing cleanups

* update seeding on dataloader and the start of training script

* undo changse

* undo more changes

* more typing fixes

* minor cleanups

* update dataloader seed

* hotfix: log metric and move target metric check outside of CKPT

* check for CKPT when target metric is reached before saving

* add TRAIN_BEAM and EVAL_BEAM

* minor cleanup

* update hyperparams and add support for EVAL_BS

* add green coloring to metric reached statement

* initial work to support f16

* update model initializers to be monkeypatched

* update layers to support float32 weight loading + float16 training

* don't return loss that's scaled

* run eval on benchmark beam

* move BEAM to their respective steps

* update layers to be compatible with fp16

* end BENCHMARK after first eval

* cleanups and adjust learning rate for fp16

* remove duplicated files from test

* revert losses changes

* Revert "revert losses changes"

This reverts commit aebccf93ac.

* go back to old LR

* cast batchnorm to float32

* set new loss scaler default value for float16

* remove LambdaLRScheduler

* remove runner and use dataloader on eval

* fix retinanet eval with new dataloader

* remove unused import

* revert lr_scheduler updates

* use BS=96 with new learning rate

* rename module initializers

* more cleanups on training loop

* remove contig from optim.step

* simplify sum when computing loss
2025-04-12 22:11:51 -04:00