Commit Graph

4249 Commits

Author SHA1 Message Date
Francis Lam
a9a1fa6bbf wmma: add reduce axis choice to TC action space (#4328)
* wmma: add reduce axis choice to TC action space

* add test for TC multi-reduce axis choice
2024-04-29 19:15:39 -04:00
chenyu
93abcd3113 fix function.py sum backward without downcast_half (#4353)
without downcast_half, sum output dtype can be different from input dtype. cast back to input dtype in function.py
2024-04-29 17:53:02 -04:00
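The dtype mismatch described here has a familiar analogue in plain numpy (an illustration only, not tinygrad's code): a reduction's output dtype can differ from its input dtype, so a cast back is needed if the input dtype must be preserved.

```python
import numpy as np

x = np.array([True, False, True])
s = x.sum()                                 # reduction promotes the dtype
assert s.dtype != x.dtype                   # e.g. bool input, integer output
assert s.astype(x.dtype).dtype == x.dtype   # cast back to the input dtype
```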
Francis Lam
18c61ce077 test/fuzz_linearizer: add --atol/rtol and change half distribution (#4352) 2024-04-29 15:53:59 -04:00
Elias Wahl
71ff68b445 dropout after eval step (#4351) 2024-04-29 15:47:21 -04:00
Elias Wahl
27613dd881 MLPerf BERT: Main training loop (#4288)
* BERT language modeling head + trunc normal initializers

* add train loop + helpers

* shuffle in dataloaders + slight changes in main loop

* beam change

* Minor changes

* random.shuffle

* HParam update

* Use deque for dataloader

* wandb bert project name

* half fixes

* BENCHMARK + remove epoch

* cast + print()

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-29 14:35:27 -04:00
Sohaib
61c97d5305 refactor ops_gpu ctypes (#4331)
* refactor ops_gpu ctypes

- remove redundant byref as ctypes automatically handles passing `type` as
  `POINTER(type)`
- use walrus operator instead of init_c_var when possible

* clSetKernelArg argtype is POINTER(None)
2024-04-30 01:33:34 +08:00
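The byref point in this refactor reflects documented ctypes behavior: when a function's `argtypes` declares `POINTER(type)`, passing a `type` instance directly is converted automatically, so an explicit `byref` is redundant. A minimal sketch, using a `CFUNCTYPE` callback as a stand-in for a real GPU API call:

```python
import ctypes
from ctypes import POINTER, c_int, byref

# A callable with declared argtypes, standing in for a real C API function.
FUNC = ctypes.CFUNCTYPE(c_int, POINTER(c_int))

@FUNC
def deref(p):
    return p.contents.value   # dereference the int*

x = c_int(42)
assert deref(byref(x)) == 42  # explicit byref works...
assert deref(x) == 42         # ...but c_int -> POINTER(c_int) is automatic
```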
qazal
cc1797673e all fusion opportunities (#4348) 2024-04-29 19:32:23 +03:00
chenyu
f363f39e83 fix dtype of const folded sum (#4349)
const folding sum should return the same dtype as regular sum, which can be different from the input dtype
2024-04-29 11:40:45 -04:00
geohotstan
bf412aeb80 use tolist instead of numpy for extracting parameters in onnx (#4333)
* still some numpy left

* all pass

* oops indent

* fix up safe_python

* to_python_const
2024-04-29 10:48:20 -04:00
qazal
774a9b0bca override assign_target in fuzz_schedule (#4342)
* store assign_targets

* cleanup

* override target
2024-04-29 11:04:04 +03:00
Francis Lata
bb849a57d1 [MLPerf] UNet3D dataloader (#4343)
* add support for train/val datasets for kits19

* split dataset into train and val sets

* add tests for kits19 dataloader

* add MLPerf dataset tests to CI

* update unet3d model_eval script

* fix linting

* add nibabel

* fix how mock dataset gets created

* update ref implementation with permalink and no edits

* clean up test and update rand_flip implementation

* cleanups
2024-04-28 22:34:18 -04:00
chenyu
82d0ed3cf3 cap default dataset wikipedia max_workers to 32 (#4345)
64 workers OOMed on tinybox
2024-04-28 21:55:21 -04:00
chenyu
c1d8d425eb fix mean of half tensor if sum is greater than half.max (#4327)
sum of half already accumulates in float32; add an arg to not downcast to half and use that in mean
2024-04-28 18:04:54 -04:00
qazal
e027879475 hotfix: remove double assignment (#4340) 2024-04-28 13:41:31 -04:00
qazal
23445db2b9 no skipped tests in RHIP (#4337)
* delete skip

* delete split skip

* remu dev

* compiler fails here

* Revert "remu dev"

This reverts commit 28b933d4eb.
2024-04-28 12:23:05 -04:00
Obada Khalili
e4befa41d7 Fix in _reshape_mask (#4332)
* handle reshape with remainder in _reshape_mask

* remove trailing whitespace

* use helper_test_op to generate tensors from shapes

* test in shapetracker too

* remove whitespace

* revert property name in other class tests
2024-04-28 11:57:39 -04:00
Timmy
664b563c91 Add insert_before to Linearizer Functions (#4320)
* adding insert_before to linearizer functions

* uop insert_before test case

* formatting

* more formatting

* more formatting

* syntax

* removing self.cast

* addressing err

* removing noqa s
2024-04-28 11:38:36 -04:00
qazal
3372bea322 reduce children fusion tests (#4321)
* base tests

* real-world tests
2024-04-28 11:14:02 -04:00
Arnav Mehta
f3de17912f added the missing download-if-not-present function (#4318) 2024-04-28 16:31:08 +08:00
geohotstan
bc36940c28 fix (#4319) 2024-04-28 16:29:04 +08:00
nimlgen
8d1649d8c2 raise error when too many resources requested in nv (#4324) 2024-04-27 23:48:51 +03:00
qazal
c6c12ba94a save schedule graph pre validation (#4317) 2024-04-27 12:06:15 +03:00
Victor Ziliang Peng
40264c7d1e Update index.md (#4315) 2024-04-27 15:12:44 +08:00
chenyu
24a6342950 add mem/s to external_benchmark_resnet (#4309) 2024-04-26 20:07:17 -04:00
Francis Lam
1f2642c73b kernel: fix calculation of smem size to ignore UNROLL (#4308)
* kernel: fix calculation of smem size to ignore UNROLL

* simplify prod array
2024-04-26 14:34:56 -04:00
Szymon Ożóg
de832d26c6 disable bfloat16 from ptx tests (#4305) 2024-04-26 01:20:10 -04:00
chenyu
ec65aea32f resnet stop the script once hit target (#4303)
* resnet stop the script once hit target

* comment
2024-04-25 23:54:56 -04:00
chenyu
1891ebb655 make ring allreduce chunks a multiple of 2^n if possible (#4302)
in resnet, instead of chunking as [43691, 43691, 43691, 43691, 43690, 43690], chunk as [43712, 43712, 43680, 43680, 43680, 43680] and those can have 32 local.

more than 2X faster for the applicable kernels and overall 1% for resnet
2024-04-25 23:45:28 -04:00
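The chunking idea above can be sketched as follows (the function name and exact rounding strategy are illustrative, not tinygrad's implementation): round each chunk down to a multiple of 32 and hand the remainder out in 32-element steps, so most chunks keep a size compatible with 32 locals.

```python
# Hypothetical helper: split n elements into k chunks whose sizes are
# multiples of `mult` where possible, instead of maximally-even chunks.
def chunk_multiple(n: int, k: int, mult: int = 32) -> list[int]:
    base = (n // k) // mult * mult   # largest multiple of mult <= n // k
    chunks = [base] * k
    extra = n - base * k             # leftover elements to distribute
    i = 0
    while extra > 0:                 # extra < k * mult, so i stays < k
        step = min(mult, extra)
        chunks[i] += step
        extra -= step
        i += 1
    return chunks

# Reproduces the sizes from the commit message: 2^18 elements in 6 chunks.
assert chunk_multiple(262144, 6) == [43712, 43712, 43680, 43680, 43680, 43680]
```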
George Hotz
1e37c4a7a1 minor llm.c improvements 2024-04-26 11:15:31 +08:00
chenyu
3ec4b745d6 JIT=2 for mac cifar benchmark (#4300)
also double BS for resnet training benchmark to match submission target
2024-04-25 18:33:40 -04:00
David Hou
c2dbe2a78b new split reduce heuristic try 2 (#4294)
* new split reduce heuristic

* update comment

* rename

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-25 18:14:15 -04:00
Szymon Ożóg
f1ebcffb87 Ptx beam fix (#4296)
* Fix beam search for PTX

* fix ptr arm test
2024-04-25 15:39:39 -04:00
chenyu
f9a7badace use LR=7 for resnet with BS=1536 (#4299)
had 3 runs after the lr float32 change; seems quite stable and converges at epochs 34 and 35
2024-04-25 15:23:10 -04:00
qazal
9a47ed0705 test crossing diamond assigns (#4298) 2024-04-25 21:52:05 +03:00
chenyu
5ae252ae83 use at least float32 for optim.lr (#4297)
* use at least float32 for optim.lr

when doing mixed precision training (float32 weight, default_float=half), still use float32 to store lr.
it would have been upcast later in the actual weight update, but precision would already have been lost.
this improved resnet convergence significantly

* undo type annotation
2024-04-25 14:42:28 -04:00
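The precision loss this commit avoids is easy to demonstrate with numpy scalars (an illustration, not the optimizer code): once the learning rate is stored as float16, upcasting later cannot recover the discarded mantissa bits.

```python
import numpy as np

lr = 7e-4                   # a typical small learning rate
lr_half = np.float16(lr)    # stored in half: only 10 mantissa bits
lr_full = np.float32(lr)    # stored in float32

# Upcasting the half value later does not recover the lost bits.
assert np.float32(lr_half) != lr_full
```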
David Hou
6f792b727b More improvements for resnet layer bench (#4272)
* fix first layer size, new schedule stuff

* estimates

* get different conv layers

* \r for estimated times

* E501

* space after comma
2024-04-25 12:40:49 -04:00
David Hou
ac9464f47a allow specify number of beam workers (#4292) 2024-04-25 10:44:43 -04:00
qazal
74a1be88f5 test reduce graph permutations (#4291) 2024-04-25 11:34:44 +03:00
George Hotz
0f0627bc60 add mnist tutorial 2024-04-25 16:08:32 +08:00
chenyu
d31e220cbf add mlperf-logging to setup.py mlperf (#4289) 2024-04-24 23:34:34 -04:00
nimlgen
6b8a85939d fix lds size for amd (#4287) 2024-04-24 22:54:42 +03:00
chenyu
c11bad766d prepare mlperf submission (#4270)
* prepare mlperf submission

* 28min compile and 3h53m

* red 30 minute compile and 56 TFLOPS
2024-04-24 13:19:31 -04:00
Szymon Ożóg
c606a0ba6f Docs link fix (#4286)
* Update quickstart.md

* Update README.md

* Update quickstart.md

* Update README.md
2024-04-24 12:54:43 -04:00
chenyu
c1fbacb182 resnet benchmarks use DEFAULT_FLOAT=HALF (#4285)
also update LR default to scaled based on 1536 (the BS we are submitting)
2024-04-24 12:10:57 -04:00
Szymon Ożóg
002a14088e Ptx store gate cast to bool (#4284)
* Cast gate to bool

* Update

* Add PTX fuzzing to benchmark
2024-04-24 11:43:44 -04:00
George Hotz
dbe3e1d548 or true fixes ci (#4283)
* or true fixes ci

* all with two pipes
2024-04-24 20:48:26 +08:00
qazal
53853e6d08 save the schedule graph in SAVE_SCHEDULE (#4248)
* save the schedule graph with assigns

* extend graph
2024-04-24 12:08:51 +03:00
George Hotz
acb32e1766 hotfix: PM4 supports timing 2024-04-24 08:38:59 +00:00
George Hotz
ad28fdecb1 si.inputs+outputs -> bufs (#4279) 2024-04-24 15:12:34 +08:00
chenyu
8401de9922 resnet benchmark return early in eval (#4278)
only do a few eval steps to compile, and skip the second epoch when doing beam + benchmark. saves 2 minutes
2024-04-24 00:55:01 -04:00