Commit Graph

10417 Commits

Author SHA1 Message Date
Sieds Lykles
91ccf1c343 Off by one error in start_pos (#9792)
A Variable's upper bound is inclusive
2025-04-15 15:07:13 -04:00
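
The fix above hinges on tinygrad's Variable taking an inclusive [min, max] range. A minimal sketch of the bug class (names like max_context are illustrative, not the PR's code):

```python
# tinygrad's Variable range is inclusive on both ends, so for a KV cache of
# length max_context the last valid start_pos is max_context - 1.
from tinygrad import Variable

max_context = 4096
# off by one: this allows start_pos == max_context, one past the end of the cache
# start_pos = Variable("start_pos", 0, max_context)
# correct, since the upper bound is inclusive:
start_pos = Variable("start_pos", 0, max_context - 1)
v = start_pos.bind(0)  # bind a concrete value at runtime
```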
pkotzbach
5849c43382 FP8s part 1 (#9887)
* fp8s part 1

* prettier

* fixes

* fixes

* remove stuff that should be in next pr

* revert

* add creation

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
2025-04-15 11:20:02 -04:00
Francis Lata
31483050c0 add eval_freq flag (#9894) 2025-04-15 06:42:40 -04:00
nimlgen
83ae83d871 compare amd and am to cpu as well (#9896) 2025-04-15 13:32:18 +03:00
nimlgen
23a95dd84d script to compare amd and am kerns (#9889)
* script to compare amd and am kerns

* tool

* is it used???
2025-04-15 00:11:22 +03:00
chenyu
ce454793e6 support specifying dtype for Tensor.linear (#9886) 2025-04-14 13:55:11 -04:00
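A sketch of the new option, assuming it landed as a dtype= keyword on Tensor.linear (the keyword name is inferred from the title, not checked against the diff):

```python
from tinygrad import Tensor, dtypes

x = Tensor.randn(4, 8)
w = Tensor.randn(8, 16)
b = Tensor.zeros(16)
y = x.linear(w, b, dtype=dtypes.half)  # assumed keyword: run the matmul in the given dtype
print(y.dtype)
```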
b1tg
e8a0aee88d add arch to AMDLLVMRenderer (#9884)
* add arch to AMDLLVMRenderer

* __reduce__ to match others

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2025-04-14 19:59:22 +03:00
George Hotz
44e4934167 fast pattern matcher [pr] (#9737)
* FastPatternMatcher

* works without that

* fix test pickle

* strict len

* compile match function

* dynamic compile

* fast

* faster

* compile

* track

* a lot faster

* clean up

* dup or

* faster and simpler

* fast match doesn't support store

* plane

* minor refactor

* real speed

* don't imply return None

* upat

* fix test

* heard you wanted more speed

* no generator

* split cf

* early fixup

* fxn fixup

* reconstruct_function

* Revert "reconstruct_function"

This reverts commit 37dac010ab.

* simpler stuff

* too big

* upat compile error

* cleanups

* don't cache that

* cleanups

* 10 -> 15
2025-04-14 15:24:41 +01:00
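
The bullets above ("compile match function", "dynamic compile") describe replacing an interpreted pattern walk with per-pattern generated code. A generic, self-contained sketch of that technique (not tinygrad's actual matcher):

```python
# Instead of walking a pattern tree for every node, generate Python source for a
# specialized match function once and exec it, so the hot path is straight-line code.

def compile_pattern(op: str, nargs: int):
  # build source for: does node n match (op, nargs)? if so, return its sources
  src  = "def _match(n):\n"
  src += f"  if n[0] != {op!r} or len(n[1]) != {nargs}: return None\n"
  src += "  return n[1]\n"
  ns: dict = {}
  exec(src, ns)
  return ns["_match"]

match_add2 = compile_pattern("ADD", 2)
print(match_add2(("ADD", ("x", "y"))))  # ('x', 'y')
print(match_add2(("MUL", ("x", "y"))))  # None
```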
chenyu
43d3a75d6c increase bert max train_steps (#9883) 2025-04-14 08:53:44 -04:00
qazal
bf099520a4 add names to grouper rewrites + cleanups [pr] (#9881)
* add names to grouper rewrites + cleanups [pr]

* assign_targets
2025-04-14 19:47:36 +08:00
George Hotz
ca8aaadd00 clean up some patterns [pr] (#9880)
* clean up some patterns [pr]

* cleanest

* move that into upat_interpret
2025-04-14 11:33:22 +01:00
George Hotz
355739fc94 switch to universal match [pr] (#9879)
* switch to universal match [pr]

* 10 -> 15
2025-04-14 09:15:37 +01:00
Nishant Rajadhyaksha
32ed128598 fixing transformer training bug (#9877) 2025-04-13 19:34:20 -04:00
George Hotz
bd5939514d clean up a few patterns [pr] (#9873) 2025-04-13 20:19:37 +01:00
Alexey Zaytsev
78a6af3da7 Use $CUDA_PATH/include for CUDA headers (#9858) 2025-04-13 16:20:19 +01:00
chenyu
e2a40fb523 update bert mi300x script (#9872)
2 of 10 back-to-back runs failed to converge; increase total train steps and adjust some beam params (2% faster step)
2025-04-13 10:07:36 -04:00
qazal
e201bc3e93 process replay kernel asts in toposort order [pr] (#9869)
* process replay kernel asts in toposort order [pr]

* use HEAD replay
2025-04-13 17:20:34 +08:00
qazal
7191f88551 add asserts for KERNEL op ast [pr] (#9868) 2025-04-13 16:50:18 +08:00
qazal
5ee9c343e6 add device to NullRenderer [pr] (#9867) 2025-04-13 13:17:16 +08:00
Francis Lata
2793cca9a6 RetinaNet MLPerf (#8385)
* add support for a custom BASEDIR for openimages download

* make export step faster

* add focal loss

* update model_eval with new dataloader

* generate_anchors in tinygrad

* update initializers for model

* small cleanup

* revert isin enhancements

* recursively go through backbone layers to freeze them

* add optimizer

* minor cleanup

* start dataloader work with input images

* add first transform for train set

* reuse existing prepare_target

* continue with dataloader implementation

* add dataloader

* separate out KiTS19 dataset test cases

* create mock data samples for test

* add dataloader + test

* cleanup dataloader test and revert shm path

* trim dataloader related code needed from ref

* got dataloader with normalize working

* update image to be float32

* add back normalization and negate it in test

* clean up reference dataset implementation + ruff changes

* add validation set test

* add proper training loop over the training dataset

* add LambdaLR support

* add LR scheduler and the start of training step

* get forward call to model work and setup multi-GPU

* already passed device

* return matches from dataloader

* hotfix for dataloader typo causing some hang

* start some work on classification loss

* update focal loss to support masking

* add missing test and cleanup focal loss

* cleanup unit tests

* remove masking support for sigmoid_focal_loss

* make ClassificationHead loss work

* cleanups + fix dataloader tests

* remove sigmoid when computing loss

* make anchors use Tensors

* simplify anchors batching

* revert anchors to use np

* implement regression loss

* fix regression loss

* cleanup losses

* move BoxCoder to MLPerf helpers

* revert helper changes

* fixes after helper refactor cleanup

* add tests for l1_loss

* start re-enabling training step

* minor cleanup

* add pycocotools to testing dependencies

* make training work

* adjust regression loss to mask after L1 loss is calculated

* reduce img and lbl sizes by half for KiTS19 dataset tests

* Revert "reduce img and lbl sizes by half for KiTS19 dataset tests"

This reverts commit d115b0c664.

* temporarily disable openimages dataset tests to debug CI

* enable openimages dataset test and create samples once

* temporarily disable openimages validation set test

* reenable test and add some debugging to the test

* add boto3 testing dependencies

* add pandas to testing dependencies

* This reverts commit 467704fec6.

* reenable test

* move sample creation to setup

* realize boxcoder's encoding

* add wandb

* fix wandb resuming feature

* move anchors as part of dataloader

* fix dtype for anchor inside dataloader and fix horizontal flip transformation

* add support for BENCHMARK

* set seed

* debug dataset test failure

* Revert "debug dataset test failuire"

This reverts commit 1b2f9d7f50.

* fix dataloader script

* do not realize when sharding model weights

* setup openimages samples differently

* create the necessary samples per test case

* enable lr scheduler and fix benchmark timing

* add jit to the training loop

* add checkpointing and training resume capabilities

* refactor the training loop and start the work on the val loop

* add debug logging for dataloader test

* debug test

* assert boxes again

* update validation dataloader and more cleanups

* fix validation test case

* add multi device support to retinanet eval

* fix issue with realized on dataloader

* remove optional disk tensors in dataloader

* remove verbose debugging on datasets test

* put back parallel testing and remove img_ids Tensor from dataloader

* cleanup train and validation dataloader

* return validation targets in dataloader

* cleanup boxes and labels in dataloader

* fix img_ids repeating its values

* remove unnecessary targets from validation dataloader

* add validation loop to training script

* adjust LR to be the ratio of the batch size

* minor cleanups

* remove frozen layers from optimizer's params

* hyperparameter adjustments and cleanups

* model init, hyperparam, and data preprocessing updates

* no need to return loaded keys for resnet

* fix train script

* update loss calculation for RegressionHead and some cleanups

* add JIT reset support

* add nan check during training

* Revert "add nan check during training"

This reverts commit ddf1f0d5dd.

* Revert "Revert "add nan check during training""

This reverts commit b7b2943197.

* some typing cleanups

* update seeding on dataloader and the start of training script

* undo changes

* undo more changes

* more typing fixes

* minor cleanups

* update dataloader seed

* hotfix: log metric and move target metric check outside of CKPT

* check for CKPT when target metric is reached before saving

* add TRAIN_BEAM and EVAL_BEAM

* minor cleanup

* update hyperparams and add support for EVAL_BS

* add green coloring to metric reached statement

* initial work to support f16

* update model initializers to be monkeypatched

* update layers to support float32 weight loading + float16 training

* don't return loss that's scaled

* run eval on benchmark beam

* move BEAM to their respective steps

* update layers to be compatible with fp16

* end BENCHMARK after first eval

* cleanups and adjust learning rate for fp16

* remove duplicated files from test

* revert losses changes

* Revert "revert losses changes"

This reverts commit aebccf93ac.

* go back to old LR

* cast batchnorm to float32

* set new loss scaler default value for float16

* remove LambdaLRScheduler

* remove runner and use dataloader on eval

* fix retinanet eval with new dataloader

* remove unused import

* revert lr_scheduler updates

* use BS=96 with new learning rate

* rename module initializers

* more cleanups on training loop

* remove contig from optim.step

* simplify sum when computing loss
2025-04-12 22:11:51 -04:00
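
The RetinaNet entry above adds a sigmoid focal loss. A sketch of the standard formulation (Lin et al. 2017) in tinygrad ops; not necessarily the exact code this PR landed:

```python
from tinygrad import Tensor

def sigmoid_focal_loss(logits: Tensor, targets: Tensor, alpha=0.25, gamma=2.0) -> Tensor:
  # numerically stable binary cross-entropy with logits
  ce = logits.maximum(0) - logits * targets + (1 + (-logits.abs()).exp()).log()
  p = logits.sigmoid()
  p_t = p * targets + (1 - p) * (1 - targets)   # probability of the true class
  loss = ce * (1 - p_t) ** gamma                # down-weight easy examples
  if alpha >= 0:
    loss = (alpha * targets + (1 - alpha) * (1 - targets)) * loss
  return loss.sum()
```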
nimlgen
23b67f532c amd: minor comments and readme updates (#9865) 2025-04-12 23:24:05 +03:00
nimlgen
7c466c24f7 am_smi: refactor to support arches (#9864)
* am_smi: refactor to support arches

* shorter
2025-04-12 20:37:01 +03:00
nimlgen
a9430b4118 am: fix metrics table for smu14_0_2 (#9863) 2025-04-12 19:07:22 +03:00
Alexey Zaytsev
3bce5ad2b4 clang should not emit the .comment section (#9859)
This section gets included in the final image, and we get a lot of garbage with DEBUG=7.
2025-04-12 10:59:11 +08:00
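
The usual way to keep clang from emitting the .comment section is the -fno-ident flag. A sketch of the kind of compile invocation this implies (the flags and file names here are illustrative, not the PR's diff):

```python
import subprocess

# -fno-ident suppresses the compiler identification string that clang would
# otherwise place in a .comment section of the output object
subprocess.check_call(["clang", "-shared", "-O2", "-fno-ident", "kernel.c", "-o", "kernel.so"])
```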
Alexey Zaytsev
7dda6aae7d Skip CLOUD in external_test_example (#9857)
Closes #9814
2025-04-12 10:17:44 +08:00
nimlgen
7919bb4f8a amd: do not use log2 (#9852) 2025-04-11 19:53:06 +03:00
nimlgen
ada0f67d3d am: fix speed of ring copies (#9854) 2025-04-11 17:28:06 +03:00
chenyu
4aab16ca6a bert script cleanup and assert nan loss (#9851) 2025-04-11 05:41:49 -04:00
qazal
ad677f8e55 create_ast cleanups from kernelize [pr] (#9849) 2025-04-11 16:10:21 +08:00
qazal
cbc5e7ed45 unbind variables when creating ScheduleItems [pr] (#9846) 2025-04-11 15:23:53 +08:00
chenyu
6896197978 relax ATOL for TC half tests more (#9847) 2025-04-11 03:20:22 -04:00
George Hotz
dd52951dd0 fix single kernel softmax with cast (#9842)
* fix single kernel softmax with cast

* tolerate none

* 3e-4

* skip on dtype
2025-04-11 12:12:02 +08:00
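
For reference, the fused kernel computes the standard numerically stable softmax. A sketch of that formulation in tinygrad ops; the scheduler change itself is not shown:

```python
from tinygrad import Tensor

def softmax(x: Tensor, axis=-1) -> Tensor:
  m = x.max(axis=axis, keepdim=True)    # subtract the row max so exp() can't overflow
  e = (x - m).exp()
  return e / e.sum(axis=axis, keepdim=True)

print(softmax(Tensor([[1.0, 2.0, 3.0]])).numpy())
```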
chenyu
8c6299bced move hand_coded_optimizations to heuristic.py [pr] (#9844)
* move hand_coded_optimizations to heuristic.py [pr]

also folded all long lines

* make a copy and rename self -> k

* fix test
2025-04-10 23:40:16 -04:00
chenyu
e0ec8be37d use CPU for test_schedule_ring (#9843)
* use CPU for test_schedule_ring

* why pre-commit is good
2025-04-10 23:20:53 -04:00
qazal
7045920786 give _apply_map_to_tensors substitutes name [pr] (#9840) 2025-04-11 10:38:57 +08:00
qazal
40ef2f2857 add ast fixup stage to tensor_map [pr] (#9839) 2025-04-11 09:24:01 +08:00
qazal
fbc6aa53d4 script for local process_replay + fix viz name [pr] (#9837) 2025-04-11 00:39:18 +08:00
b1tg
a35b475d18 fix am driver for gfx1201 (#9836) 2025-04-10 19:33:02 +03:00
qazal
16956b79de canonicalize Device.DEFAULT (#9835) 2025-04-10 23:02:11 +08:00
George Hotz
f666dd14eb fix get reduce contraction with test (#9834) 2025-04-10 22:24:21 +08:00
George Hotz
c3fa470852 hotfix: remove tracebacklimit, it persists if you catch the exception and made webgpu flaky 2025-04-10 20:29:25 +08:00
chenyu
7fa5f29582 add test_embedding to test_softmax_fusion (#9832) 2025-04-10 08:25:34 -04:00
chenyu
995d20673a increase bert TRAIN_STEPS for mi300x (#9833)
Got a few non-converged runs, so try increasing the steps; we need >= 90% of runs to converge.
2025-04-10 08:25:09 -04:00
George Hotz
25e2a3cf5d hotfix: fix get_contraction_with_reduce 2025-04-10 20:18:19 +08:00
George Hotz
53f0b2aad7 fix infinite loop in flash attention (#9827)
* fix infinite loop in flash attention

* get_contraction_with_reduce

* skip that test

* SINGLE_KERNEL_SOFTMAX + fix multi

* default IGNORE_OOB

* print change
2025-04-10 20:06:44 +08:00
qazal
16afe04f45 move process replay to grouper (#9830)
* simpler

* sched
2025-04-10 18:27:42 +08:00
chenyu
c8f47c1d07 not_support_multi_device helper (#9831)
Unify the test helper that skips CI devices that do not support multi-device.
2025-04-10 05:25:29 -04:00
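
A hypothetical shape for such a helper (the skipped device set below is an assumption for illustration, not taken from the PR):

```python
from tinygrad import Device
from tinygrad.helpers import CI

def not_support_multi_device() -> bool:
  # in CI, some backends cannot allocate more than one device (assumed set)
  return CI and Device.DEFAULT in {"WEBGPU", "CLOUD"}
```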
chenyu
817746b30e add contiguous to EmbeddingBert output (#9829)
For some reason, with random dropout it creates a different AST on each device, and beam-searching the embedding kernel is slow. This workaround saved 6 minutes of setup time on mi300x (25->19) and resulted in similar speed.
2025-04-10 04:31:19 -04:00
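
A minimal illustration of the workaround (module and shape names here are assumed):

```python
from tinygrad import Tensor, nn

class EmbeddingBert:
  def __init__(self, vocab_size=30522, dim=1024):
    self.tok = nn.Embedding(vocab_size, dim)
  def __call__(self, ids: Tensor) -> Tensor:
    # contiguous() cuts the embedding off from the downstream random dropout,
    # so every device sees the same AST and the slow kernel is searched once
    return self.tok(ids).contiguous()
```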
qazal
fd4f06e623 kernelize prereqs [pr] (#9811)
* kernelize prereqs [pr]

* work

* tensor maps to assign

* unwrap st

* process replay

* grouper changes

* replay
2025-04-10 15:22:20 +08:00
chenyu
c462162db8 update benchmark bert scripts with BS and ACC_DTYPE (#9826)
BS=16, ACC_DTYPE=half for tinybox; BS=128, ACC_DTYPE=float for mi300x
2025-04-10 02:06:02 -04:00
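
How a script might read these knobs with tinygrad's getenv helper; a sketch of the pattern, the actual bert scripts may differ:

```python
from tinygrad.helpers import getenv

BS = getenv("BS", 16)                    # 16 on tinybox, 128 on mi300x
ACC_DTYPE = getenv("ACC_DTYPE", "half")  # "half" on tinybox, "float" on mi300x
print(f"training with BS={BS} ACC_DTYPE={ACC_DTYPE}")
```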