Commit Graph

4201 Commits

George Hotz
ad28fdecb1 si.inputs+outputs -> bufs (#4279) 2024-04-24 15:12:34 +08:00
chenyu
8401de9922 resnet benchmark return early in eval (#4278)
only do a few eval steps to compile, and skip the second epoch when doing BEAM + benchmark. saves 2 minutes
2024-04-24 00:55:01 -04:00
George Hotz
38f97aa0fe rename rawbufs to bufs in ExecItem (#4274) 2024-04-24 11:27:27 +08:00
George Hotz
60e3aa5cb1 more docs (#4271)
* more work on docs

* CompilerOptions is dataclass
2024-04-24 10:52:42 +08:00
chenyu
6637ecc5fe use IGNORE_JIT_FIRST_BEAM to not BEAM in jit cnt=0 (#4269)
we want different BEAM values for resnet train and eval; the global JITBEAM cannot do this. Added the flag to change BEAM behavior at cnt=0 (so by default it behaves the same with or without TinyJit), while cnt=1 keeps using the existing BEAM.value.

Also updated the context var BEAM in resnet to be outside of TinyJit. saves about 3 minutes of compile time
2024-04-23 18:59:43 -04:00
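A minimal sketch of the cnt=0/cnt=1 dispatch rule this commit describes; the function and its names are illustrative, not tinygrad's actual API:

```python
# illustrative model of the new rule: at JIT cnt=0 (the compile invocation),
# IGNORE_JIT_FIRST_BEAM suppresses BEAM; from cnt=1 on, BEAM.value applies
def beam_for(cnt: int, beam_value: int, ignore_jit_first_beam: bool) -> int:
    if cnt == 0 and ignore_jit_first_beam:
        return 0          # no BEAM search on the first (compile) invocation
    return beam_value     # existing BEAM.value otherwise

assert beam_for(0, 4, True) == 0
assert beam_for(1, 4, True) == 4
```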
nimlgen
f3b4dff7c9 KFDProgram -> AMDProgram (#4268) 2024-04-24 00:29:50 +03:00
geohotstan
17328ded7d setitem no return value (#4266)
* no ret value and just force contiguous

* ok revert contiguous stuff

* actually do force it contiguous

* revert again lol

* add simple regression test

* add assert for MLB

* guess we're making everything contiguous from now on

* lol ugly af empty return...

* don't change the order because I don't understand disk
2024-04-23 16:28:14 -04:00
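For context on the "empty return" bullet: Python itself discards whatever `__setitem__` returns, so an assignment statement can never hand back a value. A tiny standalone illustration:

```python
class Buf:
    def __init__(self): self.data = [0] * 4
    def __setitem__(self, i, v):
        self.data[i] = v
        # no return value: `b[i] = v` is a statement, and Python ignores
        # anything __setitem__ would return anyway

b = Buf()
b[1] = 7
assert b.data == [0, 7, 0, 0]
```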
Elias Wahl
3a48773f1a BERT dataloader (#4252)
* add dataloader

* comment
2024-04-23 13:44:49 -04:00
Elias Wahl
69341144ba Wikipedia preprocessing script (#4229)
* Preprocessing script

* short sequence probability

* comments + env vars

* Add preprocessing reference. Add test

* lint fix + add eval test support

* whitespaces

* point to commit

* comment

* rename

* better comments
2024-04-23 10:28:01 -04:00
chenyu
759b4f41c3 few more KFD -> AMD (#4262)
benchmark gemm and default_parallel
2024-04-23 10:15:37 -04:00
Szymon Ożóg
6c25f1abf7 Optimize ptx loops (#4263)
* Optimize PTX loops

* Update assembly.py
2024-04-23 12:20:14 +04:00
George Hotz
967638f0d5 update docs, remove corealize (#4264)
* update docs, remove corealize

* handle 0 line count

* tensor schedule
2024-04-23 12:05:29 +04:00
George Hotz
9b7efa72ea hotfix: skip 0 line count files in sz.py 2024-04-23 11:56:03 +04:00
George Hotz
acf4ba5c9f method cache respects beam option (#4261)
* method cache respects beam option

* cleanup get_runner
2024-04-23 09:00:41 +04:00
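A hypothetical sketch of the fix's shape: make the BEAM setting part of the cache key, so a kernel compiled without search is not reused when BEAM is on (`compile_fn` is a stand-in, not tinygrad's function):

```python
method_cache: dict = {}

def get_runner(device: str, ast, beam: int, compile_fn):
    # compile_fn stands in for the real compiler entry point
    key = (device, ast, beam)            # BEAM setting is now part of the key
    if key not in method_cache:
        method_cache[key] = compile_fn(device, ast, beam)
    return method_cache[key]
```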
George Hotz
9a95781d51 renamed (#4260) 2024-04-23 09:00:28 +04:00
George Hotz
2ae4f45272 WIP PM4 Support (#4110)
* pm4 kernel launch works

* disable USE_THREAD_DIMENSIONS

* add kernel code

* work on real pm4

* pm4 signal

* same

* gate pm4

* hcq tests pass

* ops passes

* pm4 is closer

* pm4 debug (#4165)

* start debug tests passing

* prg

* smth

* hdp flush

* cleaner 1

* do not need this

* logs not needed

* small things

* linter

* remove AQL

* test hcq

* fix tests

* it's subtracting, it shouldn't be -1

* pm4 changes (#4251)

* don't need this anymore

* sdma signal with non atomic

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-04-23 08:31:27 +04:00
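For background, PM4 is the packet format AMD's command processor consumes. A sketch of a type-3 packet header following the public PM4 layout (packet type in bits 31:30, dword count minus one in 29:16, IT opcode in 15:8); tinygrad's actual helpers may differ:

```python
def pm4_type3_header(opcode: int, ndwords: int) -> int:
    # type-3 packet: 2-bit type, 14-bit (count - 1), 8-bit IT opcode
    return (3 << 30) | (((ndwords - 1) & 0x3FFF) << 16) | ((opcode & 0xFF) << 8)

PACKET3_NOP = 0x10   # opcode value as in the public AMD PM4 headers
hdr = pm4_type3_header(PACKET3_NOP, 1)
```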
Francis Lam
3f6c7ca8bf test: fix test_tensor_core_padded on CUDA and add to benchmarks (#4258)
* test: fix test_tensor_core_padded on CUDA and add to benchmarks

* fix linter

* run both tests in one call
2024-04-22 23:22:11 -04:00
Francis Lam
a90de3b574 search: add additional 7 factors to the action space (#4256)
also bump the DB version after the padded TC merge
2024-04-22 19:14:23 -04:00
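For context, the "action space" is the fixed set of kernel-optimization candidates BEAM search may try; adding factors extends that set. A hypothetical sketch (axis counts, names, and values below are illustrative):

```python
from itertools import product

AXES = range(5)              # illustrative axis count
AMOUNTS = [2, 3, 4, 8, 16]   # adding factors means appending to this list
actions = [("UPCAST", ax, amt) for ax, amt in product(AXES, AMOUNTS)]
```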
chenyu
de2b1fb468 update adding_new_accelerators doc (#4255)
mlops -> function, and removed some old ops
2024-04-22 18:50:19 -04:00
Francis Lam
bbb0ad4800 wmma: widen TC usage in search by using PADTO on TC axes when possible (#4216)
* wmma: widen TC usage in search by using PADTO on TC axes when possible

* test: start tests for the new padding TC behavior

* search: upgrade padded TC search to TC_OPT >= 2

* test: add behavior and correctness test for padded TC

added optional argument to apply_tensor_core to set TC_OPT level

* linearizer: add tests for the PADTO behavior and docs
2024-04-22 16:50:31 -04:00
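The core idea of padding a TC axis, as a sketch: round the axis up to the next multiple of the tensor-core tile dimension so the WMMA shape always divides it (pure illustration, not the linearizer's code):

```python
def pad_to(dim: int, tc_dim: int) -> int:
    # round dim up to the next multiple of tc_dim
    return ((dim + tc_dim - 1) // tc_dim) * tc_dim

assert pad_to(100, 16) == 112   # a 100-wide axis padded for a 16-wide tile
```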
George Hotz
9e53d6cffa hotfix: 8000 lines 2024-04-22 20:58:16 +04:00
nimlgen
e6227bdb15 nv driver (#4044)
* start

* fix err 93

* gpu

* ioctl mappings

* alloc like cuda

* semaphores

* wait for semaphores value

* start ops_nv

* very simple kernels work

* init several gpus

* qmd dumper

* dirty, but most kernels work

* always run all test_ops

* progress, more tests, stable

* test_ops passes, gpt2 works

but with a big FIFO, wrap-around of the FIFO doesn't work; I think it's something coherency related

* need better sync

* fix sync

* alloc2

* all tests pass!

* cleanup 1

* cleanup

* multigpu, simple transfer

* fix sync

* correct init

* nv_gpu autogen + sync bug fix

* clean extra/nv_gpu_driver

* p2p

* clean up

* remove old gen

* small fixes

* cleanup

* cleanup 2

* small fixes

* bigger queue size

* cleanups

* wait

* fixed signals for devs

* fix hang + parallel beam

* small fixes

* detect when local memory is big in kernel

* correct assert

* small fixes

* correct TLS size estimate

* one va space

* less lines

* shorter

* save 2 lines

* save some lines

* remove type ignores

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-22 19:50:20 +04:00
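The pattern behind this PR, sketched: a userspace driver talks to the NVIDIA kernel module directly via ioctls on the control device instead of going through CUDA. The request number and parameter struct below are placeholders, not the real interface:

```python
import ctypes, fcntl, os

NV_REQUEST = 0xC0104600                 # placeholder ioctl request number

fd = os.open("/dev/nvidiactl", os.O_RDWR)
arg = ctypes.create_string_buffer(16)   # placeholder parameter struct
fcntl.ioctl(fd, NV_REQUEST, arg)        # kernel module fills in the result
os.close(fd)
```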
qazal
77a3780005 assert reduce recompute (#4250) 2024-04-22 16:12:39 +03:00
qazal
a9bc7c1c49 unify assign tests (#4247) 2024-04-22 11:01:15 +03:00
chenyu
37f8be6450 resnet print epoch ops and mem in benchmark (#4244)
* resnet print epoch ops and mem in benchmark

also added a flag to optionally disable resetting jitted steps

* real per epoch stats
2024-04-21 18:32:31 -04:00
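A sketch of the per-epoch accounting, assuming tinygrad's `GlobalCounters`; the step function and loop are stand-ins:

```python
from tinygrad.helpers import GlobalCounters

def run_epoch(train_step, steps_per_epoch: int) -> None:
    # train_step stands in for the jitted training step
    epoch_ops = epoch_mem = 0
    for _ in range(steps_per_epoch):
        GlobalCounters.reset()
        train_step()
        epoch_ops += GlobalCounters.global_ops
        epoch_mem += GlobalCounters.global_mem
    print(f"epoch: {epoch_ops / 1e9:.0f} GOPS, {epoch_mem / 1e9:.2f} GB")
```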
Micah Zoltu
7bc862767c Improves error message when CUDA module fails to load. (#4243) 2024-04-21 11:10:14 -04:00
wozeparrot
4c99d49c4d some docstrings (#4201)
* feat: create and data access docstrings

* fix: linter

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-21 16:34:08 +04:00
chenyu
30fc1ad415 remove TODO: remove explicit dtypes after broadcast fix in stable_diffusion (#4241)
this is done
2024-04-21 00:31:24 -04:00
chenyu
a1940ced77 remove the assign hack in whisper (#4240)
no longer needed, the commented test case was removed too
2024-04-20 23:56:44 -04:00
chenyu
3f126c7664 fix examples vits / conversation.py (#4239)
it was passing a const numpy array into Tensor.arange
2024-04-20 23:29:12 -04:00
chenyu
31c9d9a228 fix test_linearizer tc opt tests for bf16 (#4237)
bf16 TC needs a larger rtol
2024-04-20 11:51:50 -04:00
chenyu
f1d9d0a151 cleanup external_test_opt (#4234)
no more OPT=2 or OPT=3; check the exact number of kernels; enabled tests now that fusion works
2024-04-20 04:00:08 -04:00
David Hou
dc4b1af09c more realistic edge behavior for resnet benchmark (#4231)
* more realistic edge behavior for resnet benchmark

* schedule_step

* realize all parameters ahead of time

* don't save setup and misc schedules
2024-04-19 20:07:46 -04:00
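"realize all parameters ahead of time", as a minimal sketch using tinygrad's `nn.state.get_parameters`:

```python
from tinygrad.nn.state import get_parameters

def realize_params(model) -> None:
    # force every weight to materialize before the timed region starts,
    # so setup work doesn't leak into the benchmark numbers
    for p in get_parameters(model):
        p.realize()
```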
David Hou
f6eea03749 SAVE_SCHEDULE as contextvar (#4230) 2024-04-19 18:51:57 -04:00
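The ContextVar pattern this uses, assuming tinygrad.helpers (the definition is shown here for illustration; the real one lives in the library):

```python
from tinygrad.helpers import Context, ContextVar

SAVE_SCHEDULE = ContextVar("SAVE_SCHEDULE", 0)  # also settable via env var

with Context(SAVE_SCHEDULE=1):
    pass  # schedules built in this scope are kept for later inspection
```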
qazal
2094b3b327 graph ScheduleItems (#4224)
* graph schedules

* add logging

* inplace

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-19 16:17:11 +04:00
George Hotz
cd88afc98b datasets isn't a feature + filter docstrings (#4228)
* datasets isn't a feature

* filter docstrings in sz
2024-04-19 16:16:10 +04:00
George Hotz
b9570d6100 clean up update stats (#4226)
* WIP: clean up update stats

* line savings now

* fix graphs

* fix tests

* tighter prints

* remove extra jit=false

* debug=2 means wait

* that won't update stats

* still wait
2024-04-19 15:41:30 +04:00
qazal
1c87e5dbf6 fuzz schedule context vars (#4223)
* fuzz schedule context vars

* fuzz unique toposorts

* merge ground truth with the rest

* Revert "merge ground truth with the rest"

This reverts commit 1f3463bb57.

* readability

* can override
2024-04-19 13:16:25 +03:00
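"fuzz unique toposorts", sketched: run the same schedule DAG in several valid topological orders and check the outputs agree. A randomized Kahn's algorithm over a dependency dict (illustrative, not the fuzzer's code):

```python
import random

def random_toposort(deps: dict[str, set[str]]) -> list[str]:
    # deps maps node -> set of nodes it depends on
    indeg = {n: len(d) for n, d in deps.items()}
    ready = [n for n, d in indeg.items() if d == 0]
    order = []
    while ready:
        n = ready.pop(random.randrange(len(ready)))   # random valid choice
        order.append(n)
        for m, d in deps.items():
            if n in d:
                indeg[m] -= 1
                if indeg[m] == 0:
                    ready.append(m)
    return order

deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
assert random_toposort(deps)[0] == "a" and random_toposort(deps)[-1] == "d"
```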
George Hotz
d99b512084 llm.c timing (#4219)
* add timing info

* fix malloc

* 8s with beam
2024-04-19 12:43:21 +04:00
qazal
43841a32b7 Merge pull request #4222 from Qazalin/fuzz-multi0
Tunable multi output fusion
2024-04-19 08:07:45 +03:00
qazal
b2fe3884fc Merge branch 'master' into fuzz-multi0 2024-04-19 07:56:26 +03:00
qazal
abb10c83cd tunable multi output fusion 2024-04-19 07:44:31 +03:00
chenyu
a1133beb80 KFD GEMM (#4221)
added to benchmark CI and fixed duplicated filenames between cuda and ptx
2024-04-19 00:43:18 -04:00
chenyu
3f3af0fb85 test_linearizer_failures 29 passes now (#4215)
TC + PADTO fixed
2024-04-18 19:49:23 -04:00
Elias Wahl
2ecd61e3e2 monkey patching (#4214) 2024-04-18 19:20:52 -04:00
Francis Lam
126826afc8 linearizer: refactor to define accs with potentially TC-modified idxs (#4211) 2024-04-18 15:31:06 -04:00
George Hotz
39b60a25f0 more llm c work (#4207)
* more llm c work

* print nicely

* fake load pretrained

* select warmups

* output c code
2024-04-18 22:20:44 +04:00
chenyu
f7416916df update resnet hparams based on BS=1632 RCP (#4210)
https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/training_4.0.0/rcps_resnet.json
2024-04-18 12:01:46 -04:00
George Hotz
fa57c3e7ce continue llm.c (#4190)
* continue llm.c

* export more

* progress on llm.c

* simpler optim, names work
2024-04-18 10:57:54 +04:00
geohotstan
269a58d5fa tolist to return multidimensional list (#4192)
* lol does this work

* some more changes

* a tiny note

* rename a variable

* add test for data const and add TODO comment

* make type correct
2024-04-18 07:43:10 +04:00
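The new behavior, as a minimal check (assuming a standard tinygrad import):

```python
from tinygrad import Tensor

t = Tensor([[1, 2], [3, 4]])
assert t.tolist() == [[1, 2], [3, 4]]   # nested lists, matching numpy's tolist
```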