Timmy
664b563c91
Add insert_before to Linearizer Functions ( #4320 )
* adding insert_before to linearizer functions
* uop insert_before test case
* formatting
* more formatting
* more formatting
* syntax
* removing self.cast
* addressing err
* removing noqa s
2024-04-28 11:38:36 -04:00
qazal
3372bea322
reduce children fusion tests ( #4321 )
* base tests
* real-world tests
2024-04-28 11:14:02 -04:00
Arnav Mehta
f3de17912f
added the missing download-if-not-present function ( #4318 )
2024-04-28 16:31:08 +08:00
geohotstan
bc36940c28
fix ( #4319 )
2024-04-28 16:29:04 +08:00
nimlgen
8d1649d8c2
raise error when too many resources requested in nv ( #4324 )
2024-04-27 23:48:51 +03:00
qazal
c6c12ba94a
save schedule graph pre validation ( #4317 )
2024-04-27 12:06:15 +03:00
Victor Ziliang Peng
40264c7d1e
Update index.md ( #4315 )
2024-04-27 15:12:44 +08:00
chenyu
24a6342950
add mem/s to external_benchmark_resnet ( #4309 )
2024-04-26 20:07:17 -04:00
Francis Lam
1f2642c73b
kernel: fix calculation of smem size to ignore UNROLL ( #4308 )
* kernel: fix calculation of smem size to ignore UNROLL
* simplify prod array
2024-04-26 14:34:56 -04:00
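A minimal sketch of the idea behind this fix, with made-up dims rather than tinygrad's actual Kernel fields: the shared-memory estimate should be the product of the local dims only, because UNROLL dims are expanded into registers and must not inflate it.

```python
from math import prod

# hypothetical dims for illustration; not tinygrad's internal names
local_dims = [16, 16]   # dims that actually occupy shared memory
unroll_dims = [4, 4]    # UNROLL dims: expanded in registers, so excluded
itemsize = 2            # bytes per element (e.g. half)

# the smem size counts only the local dims, ignoring UNROLL
smem_bytes = prod(local_dims) * itemsize
assert smem_bytes == 512   # not prod(local_dims + unroll_dims) * itemsize
```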
Szymon Ożóg
de832d26c6
disable bfloat16 from ptx tests ( #4305 )
2024-04-26 01:20:10 -04:00
chenyu
ec65aea32f
resnet stop the script once hit target ( #4303 )
* resnet stop the script once hit target
* comment
2024-04-25 23:54:56 -04:00
chenyu
1891ebb655
make ring allreduce chunks a multiple of 2^n if possible ( #4302 )
In resnet, instead of chunking as [43691, 43691, 43691, 43691, 43690, 43690], chunk as [43712, 43712, 43680, 43680, 43680, 43680], so every chunk is a multiple of 32 and the corresponding kernels can use a local size of 32 (sketched below).
More than 2x faster for the applicable kernels, and 1% faster overall for resnet.
2024-04-25 23:45:28 -04:00
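A minimal sketch of the rounding idea in the commit above (not tinygrad's implementation): split the total into n chunks whose sizes are multiples of `mult` where possible, handing out the leftover in `mult`-sized steps.

```python
def round_chunks(total: int, n: int, mult: int = 32) -> list[int]:
    # even split, rounded down to a multiple of `mult`
    base = (total // n) // mult * mult
    if base == 0:  # total too small to round; fall back to a plain split
        return [total // n + (i < total % n) for i in range(n)]
    extra = total - base * n  # leftover, distributed in `mult`-sized steps
    chunks = []
    for _ in range(n):
        take = mult if extra >= mult else extra
        chunks.append(base + take)
        extra -= take
    return chunks

# 262144 elements over 6 chunks reproduces the example from the commit message
assert round_chunks(43691*4 + 43690*2, 6) == [43712, 43712, 43680, 43680, 43680, 43680]
```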
George Hotz
1e37c4a7a1
minor llm.c improvements
2024-04-26 11:15:31 +08:00
chenyu
3ec4b745d6
JIT=2 for mac cifar benchmark ( #4300 )
also double BS for resnet training benchmark to match submission target
2024-04-25 18:33:40 -04:00
David Hou
c2dbe2a78b
new split reduce heuristic try 2 ( #4294 )
* new split reduce heuristic
* update comment
* rename
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-25 18:14:15 -04:00
Szymon Ożóg
f1ebcffb87
Ptx beam fix ( #4296 )
* Fix beam search for PTX
* fix ptr arm test
2024-04-25 15:39:39 -04:00
chenyu
f9a7badace
use LR=7 for resnet with BS=1536 ( #4299 )
Had 3 runs after the float32 lr change; it seems quite stable and converges at epochs 34 and 35.
2024-04-25 15:23:10 -04:00
qazal
9a47ed0705
test crossing diamond assigns ( #4298 )
2024-04-25 21:52:05 +03:00
chenyu
5ae252ae83
use at least float32 for optim.lr ( #4297 )
* use at least float32 for optim.lr
When doing mixed precision training (float32 weights, default_float=half), still use float32 to store lr.
It would have been upcast later in the actual weight update anyway, but storing it in half would already have lost precision.
This improved resnet convergence significantly (see the sketch below).
* undo type annotation
2024-04-25 14:42:28 -04:00
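A small numpy illustration of the precision point above (not tinygrad code): an lr stored in half is rounded at storage time, before the float32 weight update ever sees it.

```python
import numpy as np

lr = 7.0 / 1536                    # illustrative small lr (LR=7, BS=1536)
print(float(np.float16(lr)) - lr)  # half storage: visible rounding error
print(float(np.float32(lr)) - lr)  # float32 storage: orders smaller error
# the update w -= lr * grad may be computed in float32 either way, but an
# lr stored in half has already lost its low mantissa bits
```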
David Hou
6f792b727b
More improvements for resnet layer bench ( #4272 )
* fix first layer size, new schedule stuff
* estimates
* get different conv layers
* \r for estimated times
* E501
* space after comma
2024-04-25 12:40:49 -04:00
David Hou
ac9464f47a
allow specify number of beam workers ( #4292 )
2024-04-25 10:44:43 -04:00
qazal
74a1be88f5
test reduce graph permutations ( #4291 )
2024-04-25 11:34:44 +03:00
George Hotz
0f0627bc60
add mnist tutorial
2024-04-25 16:08:32 +08:00
chenyu
d31e220cbf
add mlperf-logging to setup.py mlperf ( #4289 )
2024-04-24 23:34:34 -04:00
nimlgen
6b8a85939d
fix lds size for amd ( #4287 )
2024-04-24 22:54:42 +03:00
chenyu
c11bad766d
prepare mlperf submission ( #4270 )
* prepare mlperf submission
* 28min compile and 3h53m
* red 30 minute compile and 56 TFLOPS
2024-04-24 13:19:31 -04:00
Szymon Ożóg
c606a0ba6f
Docs link fix ( #4286 )
* Update quickstart.md
* Update README.md
* Update quickstart.md
* Update README.md
2024-04-24 12:54:43 -04:00
chenyu
c1fbacb182
resnet benchmarks use DEFAULT_FLOAT=HALF ( #4285 )
also update the LR default to scale based on BS=1536 (the batch size we are submitting); a scaling sketch follows below
2024-04-24 12:10:57 -04:00
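A hedged sketch of the linear scaling implied here, anchored to the LR=7 at BS=1536 point from the commit further up; the exact rule in the training script may differ.

```python
BASE_LR, BASE_BS = 7.0, 1536   # from "use LR=7 for resnet with BS=1536"

def scaled_lr(bs: int) -> float:
    # linear LR scaling: keep lr proportional to batch size
    return BASE_LR * bs / BASE_BS

assert scaled_lr(1536) == 7.0
assert scaled_lr(768) == 3.5
```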
Szymon Ożóg
002a14088e
Ptx store gate cast to bool ( #4284 )
* Cast gate to bool
* Update
* Add PTX fuzzing to benchmark
2024-04-24 11:43:44 -04:00
George Hotz
dbe3e1d548
or true fixes ci ( #4283 )
* or true fixes ci
* all with two pipes
2024-04-24 20:48:26 +08:00
qazal
53853e6d08
save the schedule graph in SAVE_SCHEDULE ( #4248 )
* save the schedule graph with assigns
* extend graph
2024-04-24 12:08:51 +03:00
George Hotz
acb32e1766
hotfix: PM4 supports timing
2024-04-24 08:38:59 +00:00
George Hotz
ad28fdecb1
si.inputs+outputs -> bufs ( #4279 )
2024-04-24 15:12:34 +08:00
chenyu
8401de9922
resnet benchmark return early in eval ( #4278 )
Only do a few eval steps to compile, and skip the second epoch when doing beam + benchmark. Saves 2 minutes.
2024-04-24 00:55:01 -04:00
George Hotz
38f97aa0fe
rename rawbufs to bufs in ExecItem ( #4274 )
2024-04-24 11:27:27 +08:00
George Hotz
60e3aa5cb1
more docs ( #4271 )
* more work on docs
* CompilerOptions is dataclass
2024-04-24 10:52:42 +08:00
chenyu
6637ecc5fe
use IGNORE_JIT_FIRST_BEAM to not BEAM in jit cnt=0 ( #4269 )
We want different BEAM values for resnet train and eval, and the global JITBEAM cannot do this. Added the flag to change beam behavior at cnt=0 (so by default it behaves the same with or without TinyJit), while cnt=1 uses the existing BEAM.value (see the usage sketch below).
Also moved the context var BEAM in resnet outside of TinyJit; saves about 3 minutes of compile time.
2024-04-23 18:59:43 -04:00
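A usage sketch of the pattern described above. Context and BEAM exist in tinygrad.helpers; treating IGNORE_JIT_FIRST_BEAM as an env var read when tinygrad loads, and the training step itself, are assumptions here.

```python
import os
os.environ["IGNORE_JIT_FIRST_BEAM"] = "1"  # assumed: read at tinygrad import

from tinygrad import Tensor, TinyJit
from tinygrad.helpers import Context

@TinyJit
def train_step(x: Tensor) -> Tensor:
    # stand-in for a real resnet training step
    return (x * 2).sum().realize()

# BEAM is set outside TinyJit so the capturing call (cnt=1) sees it; with
# the flag above, the tracing call at cnt=0 does not trigger beam search
with Context(BEAM=2):
    for _ in range(3):
        train_step(Tensor.rand(16, 16))
```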
nimlgen
f3b4dff7c9
KFDProgram -> AMDProgram ( #4268 )
2024-04-24 00:29:50 +03:00
geohotstan
17328ded7d
setitem no return value ( #4266 )
* no ret value and just force contiguous
* ok revert contiguous stuff
* actually do force it contiguous
* revert again lol
* add simple regression test
* add assert for MLB
* guess we're contiguous everything from now on
* lol ugly af empty return...
* don't change order cuz i don't get disk
2024-04-23 16:28:14 -04:00
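A hedged sketch of the user-visible change in the commit above (the index and value forms are illustrative and may need adjusting per version): setitem mutates the tensor in place, forces it contiguous, and returns nothing.

```python
from tinygrad import Tensor

t = Tensor.zeros(4, 4).contiguous().realize()
ret = t.__setitem__((1, 1), 5.0)  # same effect as: t[1, 1] = 5.0
assert ret is None                # setitem no longer returns a value
```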
Elias Wahl
3a48773f1a
BERT dataloader ( #4252 )
* add dataloader
* comment
2024-04-23 13:44:49 -04:00
Elias Wahl
69341144ba
Wikipedia preprocessing script ( #4229 )
* Preprocessing script
* short seq prob
* comments + env vars
* Add preprocessing reference. Add test
* lint fix + add eval test support
* whitespaces
* point to commit
* comment
* rename
* better comments
2024-04-23 10:28:01 -04:00
chenyu
759b4f41c3
few more KFD -> AMD ( #4262 )
benchmark gemm and default_parallel
2024-04-23 10:15:37 -04:00
Szymon Ożóg
6c25f1abf7
Optimize ptx loops ( #4263 )
* Optimize PTX loops
* Update assembly.py
2024-04-23 12:20:14 +04:00
George Hotz
967638f0d5
update docs, remove corealize ( #4264 )
* update docs, remove corealize
* handle 0 line count
* tensor schedule
2024-04-23 12:05:29 +04:00
George Hotz
9b7efa72ea
hotfix: skip 0 line count files in sz.py
2024-04-23 11:56:03 +04:00
George Hotz
acf4ba5c9f
method cache respects beam option ( #4261 )
* method cache respects beam option
* cleanup get_runner
2024-04-23 09:00:41 +04:00
George Hotz
9a95781d51
renamed ( #4260 )
2024-04-23 09:00:28 +04:00
George Hotz
2ae4f45272
WIP PM4 Support ( #4110 )
* pm4 kernel launch works
* disable USE_THREAD_DIMENSIONS
* add kernel code
* work on real pm4
* pm4 signal
* same
* gate pm4
* hcq tests pass
* ops passes
* pm4 is closer
* pm4 debug (#4165 )
* start debug tests passing
* prg
* smth
* hdp flush
* cleaner 1
* do not need this
* logs not need
* small things
* linter
* remove AQL
* test hcq
* fix tests
* it's subtracting, it shouldn't be -1
* pm4 changes (#4251 )
* not need this anymore
* sdma signal with non atomic
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-04-23 08:31:27 +04:00
Francis Lam
3f6c7ca8bf
test: fix test_tensor_core_padded on CUDA and add to benchmarks ( #4258 )
* test: fix test_tensor_core_padded on CUDA and add to benchmarks
* fix linter
* run both tests in one call
2024-04-22 23:22:11 -04:00
Francis Lam
a90de3b574
search: add additional 7 factors to the action space ( #4256 )
also bump the DB version after the padded TC merge
2024-04-22 19:14:23 -04:00