Commit Graph

10633 Commits

Author SHA1 Message Date
George Hotz
8919370c76 hotfix: fix test_save_all_dtypes on METAL 2025-04-18 08:42:31 +01:00
qazal
16dfe0a902 upstream remu (#9921) 2025-04-18 01:57:36 +03:00
qazal
d287afe3b1 remove shapeless const check in full_shape [pr] (#9911)
* remove shapeless const check in full_shape [pr]

* those can go too
2025-04-18 00:00:26 +03:00
chenyu
fe6a482f1d pin hypothesis version to 6.131.0 (#9920)
6.131.1 seems to cause timeouts in CI
2025-04-17 16:34:10 -04:00
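A pin like this normally lives in the test dependencies; a minimal sketch assuming it goes in setup.py's testing extras (the file and extras name are assumptions, not taken from the commit):

```python
# setup.py (sketch): keep hypothesis on the last known-good version
extras_require = {
    "testing": [
        "hypothesis==6.131.0",  # 6.131.1 appears to cause CI timeouts
    ],
}
```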
chenyu
f5256e0020 Kernel.apply_opts [pr] (#9917)
* Kernel.apply_opts [pr]

updated all `for opt in`. also updated a few test_linearizer tests to not implicitly depend on hand_coded_optimizations

* not you yet
2025-04-17 08:00:56 -04:00
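A minimal sketch of what `Kernel.apply_opts` presumably wraps, given the description above (the exact signature and return value are assumptions):

```python
from typing import Sequence

class Kernel:  # stub standing in for tinygrad's Kernel
    def apply_opt(self, opt): ...
    def apply_opts(self, opts: Sequence):
        # apply each optimization in order, replacing ad-hoc `for opt in opts:` loops at call sites
        for opt in opts: self.apply_opt(opt)
        return self
```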
chenyu
e2ed673c94 FUSE_ARANGE_UINT to not fuse uint (#9915)
hack to bypass rand, can FUSE_ARANGE on green for 6ms per step
2025-04-16 18:49:38 -04:00
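A hedged illustration of how a flag pair like this is typically consulted in tinygrad; the guard location, defaults, and polarity here are assumptions, not the PR's actual logic:

```python
from tinygrad import dtypes
from tinygrad.helpers import getenv

def should_fuse_arange(dt) -> bool:
    if not getenv("FUSE_ARANGE", 0): return False  # arange fusion off by default
    # uint aranges (e.g. the ones rand produces) are assumed to stay unfused
    # unless FUSE_ARANGE_UINT is set as well
    if dtypes.is_unsigned(dt) and not getenv("FUSE_ARANGE_UINT", 0): return False
    return True
```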
qazal
497daa658a hotfix: edge-labels go above the overlay (#9910) 2025-04-16 23:38:12 +08:00
qazal
e8e43c6dad ensure edge labels are always on top (#9908) 2025-04-16 21:08:06 +08:00
qazal
5265f25088 add counter for incoming edges in viz (#9907) 2025-04-16 20:14:14 +08:00
Eitan Turok
2c7c205bc5 Fix dtype comparisons in vectorized transcendental + tests (#9794)
* init test

* cleanup

* init

* update

* fix

* fix python runtime for vectorized code

* awesome helper

* update

* update

* cleanup

* more cleaning

* cleanup more

* fix tests

* more cleaning

* cleanup more

* fix

* even cleaner

* failing tests is sad

* cleanup

* better name

* make tests pass

* remove vec from python runtime

* remove vec from eval_uop

* remove expected failures

* better name
2025-04-16 08:06:12 -04:00
qazal
929e5a9905 do not construct GrouperContext [pr] (#9906) 2025-04-16 18:26:31 +08:00
Xingyu
047c8fd70d Add amax support to Tensor operations in Torch Backend (#9905)
* Add amax support to Tensor operations
- Implemented amax function in backend.py for tensor max operations.
- Added unit tests for amax in test.py to ensure correct functionality.

* Fix formatting in amax output function
- Adjusted spacing in the amax output lambda function in backend.py
- Improved code readability for better maintenance
2025-04-16 10:35:50 +01:00
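A hedged sketch of the mapping this adds: torch's `amax(dim, keepdim)` is a pure max-reduction, so it translates directly to tinygrad's `Tensor.max` (the registration mechanism in backend.py is not shown and the wrapper name is an assumption):

```python
from tinygrad import Tensor

def amax(x: Tensor, dim=None, keepdim: bool = False) -> Tensor:
    # unlike torch.max, amax returns only values (no indices), so Tensor.max suffices
    return x.max(axis=dim, keepdim=keepdim)
```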
uuuvn
d7f623dac2 Use Buffer in cloud server instead of opaques (#9875)
Not strictly required, but it makes the cloud graph a *lot* cleaner: unlike
raw compiled programs, `GraphRunner` takes `Buffer`s like other runners do.

Otherwise one of the following would be required: adding a new option to not
free on `__del__`, (ab)using `external_ptr` to prevent the free, or making
something like a `FakeBuffer`.
2025-04-16 10:17:32 +01:00
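Purely as an illustration of the alternative argued against above (not code from the PR): keeping opaques would force a non-owning wrapper roughly like this, just to stop `__del__` from freeing server-owned memory.

```python
class FakeBuffer:
    """Hypothetical non-owning stand-in for a Buffer around a raw opaque."""
    def __init__(self, opaque, size: int):
        self._opaque, self.size = opaque, size
    def __del__(self):
        pass  # deliberately never free: the cloud server still owns this memory
```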
qazal
05334e0f3f construct children from UOp.toposort [pr] (#9882)
* construct children from UOp.toposort [pr]

* only for bases
2025-04-16 16:55:59 +08:00
geohotstan
4e8f25109a Revert "ONNX add output shape validation (#9720)" (#9904)
This reverts commit ac713e04db.
2025-04-16 03:15:56 -04:00
chenyu
e8024c8281 faster bert global_norm (#9901)
tinyamd 2% faster. also updated beam params, which are 2-3% faster.

update mlperf doc and steps too
2025-04-15 18:24:44 -04:00
Sieds Lykles
91ccf1c343 Off by one error in start_pos (#9792)
Variable upper bound is inclusive
2025-04-15 15:07:13 -04:00
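A worked illustration of the note above, with the names and context length assumed rather than taken from the fix: since a `Variable`'s bounds are inclusive, a start position indexing a length-N context must be bounded by N-1.

```python
from tinygrad import Variable

MAX_CONTEXT = 1024  # hypothetical context length
# both bounds are inclusive, so the largest legal start position is MAX_CONTEXT-1
start_pos = Variable("start_pos", 0, MAX_CONTEXT - 1)
```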
pkotzbach
5849c43382 FP8s part 1 (#9887)
* fp8s part 1

* prettier

* fixes

* fixes

* remove stuff that should be in next pr

* revert

* add creation

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
2025-04-15 11:20:02 -04:00
Francis Lata
31483050c0 add eval_freq flag (#9894) 2025-04-15 06:42:40 -04:00
nimlgen
83ae83d871 compare amd and am to cpu as well (#9896) 2025-04-15 13:32:18 +03:00
nimlgen
23a95dd84d script to compare amd and am kerns (#9889)
* script to compare amd and am kerns

* tool

* is it used???
2025-04-15 00:11:22 +03:00
chenyu
ce454793e6 support specifying dtype for Tensor.linear (#9886) 2025-04-14 13:55:11 -04:00
b1tg
e8a0aee88d add arch to AMDLLVMRenderer (#9884)
* add arch to AMDLLVMRenderer

* __reduce__ to match others

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2025-04-14 19:59:22 +03:00
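A hedged sketch of the `__reduce__` pattern the second bullet points at, written against a stub class rather than the real renderer:

```python
class AMDLLVMRenderer:  # stub; the real class lives in tinygrad's renderer code
    def __init__(self, arch: str):
        self.arch = arch
    def __reduce__(self):
        # pickle by re-calling the constructor with arch, matching the other renderers
        return self.__class__, (self.arch,)
```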
George Hotz
44e4934167 fast pattern matcher [pr] (#9737)
* FastPatternMatcher

* works without that

* fix test pickle

* strict len

* compile match function

* dynamic compile

* fast

* faster

* compile

* track

* a lot faster

* clean up

* dup or

* faster and simpler

* fast match doesn't support store

* plane

* minor refactor

* real speed

* don't imply return None

* upat

* fix test

* heard you wanted more speed

* no generator

* split cf

* early fixup

* fxn fixup

* reconstruct_function

* Revert "reconstruct_function"

This reverts commit 37dac010ab.

* simpler stuff

* too big

* upat compile error

* cleanups

* don't cache that

* cleanups

* 10 -> 15
2025-04-14 15:24:41 +01:00
chenyu
43d3a75d6c increase bert max train_steps (#9883) 2025-04-14 08:53:44 -04:00
qazal
bf099520a4 add names to grouper rewrites + cleanups [pr] (#9881)
* add names to grouper rewrites + cleanups [pr]

* assign_targets
2025-04-14 19:47:36 +08:00
George Hotz
ca8aaadd00 clean up some patterns [pr] (#9880)
* clean up some patterns [pr]

* cleanest

* move that into upat_interpret
2025-04-14 11:33:22 +01:00
George Hotz
355739fc94 switch to universal match [pr] (#9879)
* switch to universal match [pr]

* 10 -> 15
2025-04-14 09:15:37 +01:00
Nishant Rajadhyaksha
32ed128598 fixing transformer training bug (#9877) 2025-04-13 19:34:20 -04:00
George Hotz
bd5939514d clean up a few patterns [pr] (#9873) 2025-04-13 20:19:37 +01:00
Alexey Zaytsev
78a6af3da7 Use $CUDA_PATH/include for CUDA headers (#9858) 2025-04-13 16:20:19 +01:00
chenyu
e2a40fb523 update bert mi300x script (#9872)
2 of 10 back-to-back runs failed to converge; increase total train steps and adjust some beam params (2% faster step)
2025-04-13 10:07:36 -04:00
qazal
e201bc3e93 process replay kernel asts in toposort order [pr] (#9869)
* process replay kernel asts in toposort order [pr]

* use HEAD replay
2025-04-13 17:20:34 +08:00
qazal
7191f88551 add asserts for KERNEL op ast [pr] (#9868) 2025-04-13 16:50:18 +08:00
qazal
5ee9c343e6 add device to NullRenderer [pr] (#9867) 2025-04-13 13:17:16 +08:00
Francis Lata
2793cca9a6 RetinaNet MLPerf (#8385)
* add support for a custom BASEDIR for openimages download

* make export step faster

* add focal loss

* update model_eval with new dataloader

* generate_anchors in tinygrad

* update initializers for model

* small cleanup

* revert isin enhancements

* recursively go through backbone layers to freeze them

* add optimizer

* minor cleanup

* start dataloader work with input images

* add first transform for train set

* reuse existing prepare_target

* continue with dataloader implementation

* add dataloader

* separate out KiTS19 dataset test cases

* create mock data samples for test

* add dataloader + test

* cleanup dataloader test and revert shm path

* trim dataloader related code needed from ref

* got dataloader with normalize working

* update image to be float32

* add back normalization and negate it in test

* clean up reference dataset implementation + ruff changes

* add validation set test

* add proper training loop over the training dataset

* add LambdaLR support

* add LR scheduler and the start of training step

* get forward call to model work and setup multi-GPU

* already passed device

* return matches from dataloader

* hotfix for dataloader typo causing some hang

* start some work on classification loss

* update focal loss to support masking

* add missing test and cleanup focal loss

* cleanup unit tests

* remove masking support for sigmoid_focal_loss

* make ClassificationHead loss work

* cleanups + fix dataloader tests

* remove sigmoid when computing loss

* make anchors use Tensors

* simplify anchors batching

* revert anchors to use np

* implement regression loss

* fix regression loss

* cleanup losses

* move BoxCoder to MLPerf helpers

* revert helper changes

* fixes after helper refactor cleanup

* add tests for l1_loss

* start re-enabling training step

* minor cleanup

* add pycocotools to testing dependencies

* make training work

* adjust regression loss to mask after L1 loss is calculated

* reduce img and lbl sizes by half for KiTS19 dataset tests

* Revert "reduce img and lbl sizes by half for KiTS19 dataset tests"

This reverts commit d115b0c664.

* temporarily disable openimages dataset tests to debug CI

* enable openimages dataset test and create samples once

* temporarily disable openimages validation set test

* reenable test and add some debugging to the test

* add boto3 testing dependencies

* add pandas to testing dependencies

* This reverts commit 467704fec6.

* reenable test

* move sample creation to setup

* realize boxcoder's encoding

* add wandb

* fix wandb resuming feature

* move anchors as part of dataloader

* fix dtype for anchor inside dataloader and fix horizontal flip transformation

* add support for BENCHMARK

* set seed

* debug dataset test failure

* Revert "debug dataset test failuire"

This reverts commit 1b2f9d7f50.

* fix dataloader script

* do not realize when sharding model weights

* setup openimages samples differently

* create the necessary samples per test case

* enable lr scheduler and fix benchmark timing

* add jit to the training loop

* add checkpointing and training resume capabilities

* refactor training loop and start the work on val loop

* add debug logging for dataloader test

* debug test

* assert boxes again

* update validation dataloader and more cleanups

* fix validation test case

* add multi device support to retinanet eval

* fix issue with realized on dataloader

* remove optional disk tensors in dataloader

* remove verbose debugging on datasets test

* put back parallel testing and remove img_ids Tensor from dataloader

* cleanup train and validation dataloader

* return validation targets in dataloader

* cleanup boxes and labels in dataloader

* fix img_ids repeating its values

* remove unnecessary targets from validation dataloader

* add validation loop to training script

* adjust LR to be the ratio of the batch size

* minor cleanups

* remove frozen layers from optimizer's params

* hyperparameter adjustments and cleanups

* model init, hyperparam, and data preprocessing updates

* no need to return loaded keys for resnet

* fix train script

* update loss calculation for RegressionHead and some cleanups

* add JIT reset support

* add nan check during training

* Revert "add nan check during training"

This reverts commit ddf1f0d5dd.

* Revert "Revert "add nan check during training""

This reverts commit b7b2943197.

* some typing cleanups

* update seeding on dataloader and the start of training script

* undo changes

* undo more changes

* more typing fixes

* minor cleanups

* update dataloader seed

* hotfix: log metric and move target metric check outside of CKPT

* check for CKPT when target metric is reached before saving

* add TRAIN_BEAM and EVAL_BEAM

* minor cleanup

* update hyperparams and add support for EVAL_BS

* add green coloring to metric reached statement

* initial work to support f16

* update model initializers to be monkeypatched

* update layers to support float32 weight loading + float16 training

* don't return loss that's scaled

* run eval on benchmark beam

* move BEAM to their respective steps

* update layers to be compatible with fp16

* end BENCHMARK after first eval

* cleanups and adjust learning rate for fp16

* remove duplicated files from test

* revert losses changes

* Revert "revert losses changes"

This reverts commit aebccf93ac.

* go back to old LR

* cast batchnorm to float32

* set new loss scaler default value for float16

* remove LambdaLRScheduler

* remove runner and use dataloader on eval

* fix retinanet eval with new dataloader

* remove unused import

* revert lr_scheduler updates

* use BS=96 with new learning rate

* rename module initializers

* more cleanups on training loop

* remove contig from optim.step

* simplify sum when computing loss
2025-04-12 22:11:51 -04:00
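The loss work in this PR centers on sigmoid focal loss plus L1 box regression; below is a generic hedged sketch of the standard RetinaNet focal loss in tinygrad terms, not the PR's actual implementation (alpha/gamma defaults follow the RetinaNet paper):

```python
from tinygrad import Tensor

def sigmoid_focal_loss(logits: Tensor, targets: Tensor, alpha: float = 0.25, gamma: float = 2.0) -> Tensor:
    # standard formulation: per-element BCE, down-weighted by (1 - p_t)**gamma,
    # with optional alpha class balancing
    p = logits.sigmoid()
    ce = -(targets * p.log() + (1 - targets) * (1 - p).log())
    p_t = targets * p + (1 - targets) * (1 - p)
    loss = ce * (1 - p_t) ** gamma
    if alpha >= 0:
        loss = loss * (targets * alpha + (1 - targets) * (1 - alpha))
    return loss.mean()
```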
nimlgen
23b67f532c amd: minor comments and readme updates (#9865) 2025-04-12 23:24:05 +03:00
nimlgen
7c466c24f7 am_smi: refactor to support arches (#9864)
* am_smi: refactor to support arches

* shorter
2025-04-12 20:37:01 +03:00
nimlgen
a9430b4118 am: fix metrics table for smu14_0_2 (#9863) 2025-04-12 19:07:22 +03:00
Alexey Zaytsev
3bce5ad2b4 clang should not emit the .comment section (#9859)
This section gets included in the final image, and we get a lot of garbage with DEBUG=7
2025-04-12 10:59:11 +08:00
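One common way to get this (whether it is what the commit actually does is an assumption) is `-fno-ident`, which stops clang from writing its version string into `.comment`; a minimal sketch of a compile invocation with hypothetical file names:

```python
import subprocess

# -fno-ident: do not emit the compiler ident string, so no .comment section
# ends up in the object that gets packed into the raw image
subprocess.run(["clang", "-O2", "-fno-ident", "-c", "prog.c", "-o", "prog.o"], check=True)
```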
Alexey Zaytsev
7dda6aae7d Skip CLOUD in external_test_example (#9857)
Closes #9814
2025-04-12 10:17:44 +08:00
nimlgen
7919bb4f8a amd: do not use log2 (#9852) 2025-04-11 19:53:06 +03:00
nimlgen
ada0f67d3d am: fix speed of ring copies (#9854) 2025-04-11 17:28:06 +03:00
chenyu
4aab16ca6a bert script cleanup and assert nan loss (#9851) 2025-04-11 05:41:49 -04:00
qazal
ad677f8e55 create_ast cleanups from kernelize [pr] (#9849) 2025-04-11 16:10:21 +08:00
qazal
cbc5e7ed45 unbind variables when creating ScheduleItems [pr] (#9846) 2025-04-11 15:23:53 +08:00
chenyu
6896197978 relax ATOL for TC half tests more (#9847) 2025-04-11 03:20:22 -04:00
George Hotz
dd52951dd0 fix single kernel softmax with cast (#9842)
* fix single kernel softmax with cast

* tolerate none

* 3e-4

* skip on dtype
2025-04-11 12:12:02 +08:00
chenyu
8c6299bced move hand_coded_optimizations to heuristic.py [pr] (#9844)
* move hand_coded_optimizations to heuristic.py [pr]

also folded all long lines

* make a copy and rename self -> k

* fix test
2025-04-10 23:40:16 -04:00
chenyu
e0ec8be37d use CPU for test_schedule_ring (#9843)
* use CPU for test_schedule_ring

* why pre-commit is good
2025-04-10 23:20:53 -04:00