Commit Graph

104 Commits

Author SHA1 Message Date
George Hotz
fd02ab1e8b move disassemblers and openpilot (#4592)
* move disassemblers and openpilot

* delete junk

* put that in pre-commit

* fixup readme
2024-05-14 19:30:02 -07:00
chenyu
5de4a46f10 re-enable gpt2 half/beam mac benchmark (#4496)
* re-enable gpt2 half/beam mac benchmark

from the fuzzer it seems to be flaky due to a numerical issue, not a kernel bug. we used to have half in the split reduce.

ran this on an M1 Max for 20 loops and it's fine

* that should be jitted
2024-05-09 19:15:32 -04:00
chenyu
c508eb7425 revert the removal of CAST_BEFORE_VIEW (#4471)
this brings most of the memory gain for resnet back.
2024-05-08 00:14:29 -04:00
chenyu
d4062cb6fc NV tensor_cores in kernel.py (#4399) 2024-05-02 22:33:08 -04:00
chenyu
dce7ac0160 NOCLANG=1 for tinybox green ci. (#4378)
CLANG was disabled for tinybox red for speed
2024-05-01 13:31:01 -04:00
wozeparrot
4a26718ca9 feat: tinyboxgreen (#4365) 2024-04-30 19:05:37 -04:00
chenyu
fdc8fabae5 disable flaky mac gpt2 beam benchmark and add back cifar mac with JIT=2 (#4358)
* debug flaky mac gpt2 beam run

* disable for now
2024-04-30 10:41:37 -04:00
chenyu
3ec4b745d6 JIT=2 for mac cifar benchmark (#4300)
also double BS for resnet training benchmark to match submission target
2024-04-25 18:33:40 -04:00
chenyu
c1fbacb182 resnet benchmarks use DEFAULT_FLOAT=HALF (#4285)
also update the LR default to scale based on BS 1536 (the batch size we are submitting)
2024-04-24 12:10:57 -04:00
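The commit above scales the default learning rate with the training batch size relative to the BS=1536 submission target. A minimal sketch of that kind of linear scaling (the base LR value and function name are assumptions, not taken from the actual script):

```python
# Linear LR scaling relative to the submission batch size (values are illustrative).
BASE_BS = 1536          # batch size the default LR is tuned for
BASE_LR = 7.2           # assumed base LR at BASE_BS

def scaled_lr(bs: int, base_lr: float = BASE_LR, base_bs: int = BASE_BS) -> float:
  # larger batches get a proportionally larger LR, smaller batches a smaller one
  return base_lr * bs / base_bs

print(scaled_lr(768))   # half the base LR for half the batch size
```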
Szymon Ożóg
002a14088e Ptx store gate cast to bool (#4284)
* Cast gate to bool

* Update

* Add PTX fuzzing to benchmark
2024-04-24 11:43:44 -04:00
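The fix above casts the store gate to bool before it is used as a predicate, since PTX guards stores with a `.pred` register. A rough sketch of the idea (the render function and register names are hypothetical, not the actual PTX renderer):

```python
# Hypothetical gated-store emission: a non-bool gate is compared against zero
# to produce a predicate register before guarding the store.
def render_gated_store(addr: str, val: str, gate: str, gate_is_bool: bool) -> list[str]:
  out = []
  pred = gate
  if not gate_is_bool:
    pred = "%p0"
    out.append(f"setp.ne.s32 {pred}, {gate}, 0;")   # cast the gate to a predicate
  out.append(f"@{pred} st.global.f32 [{addr}], {val};")
  return out

print("\n".join(render_gated_store("%rd1", "%f1", "%r2", gate_is_bool=False)))
```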
chenyu
759b4f41c3 few more KFD -> AMD (#4262)
benchmark gemm and default_parallel
2024-04-23 10:15:37 -04:00
Francis Lam
3f6c7ca8bf test: fix test_tensor_core_padded on CUDA and add to benchmarks (#4258)
* test: fix test_tensor_core_padded on CUDA and add to benchmarks

* fix linter

* run both tests in one call
2024-04-22 23:22:11 -04:00
Francis Lam
bbb0ad4800 wmma: widen TC usage in search by using PADTO on TC axes when possible (#4216)
* wmma: widen TC usage in search by using PADTO on TC axes when possible

* test: start tests for the new padding TC behavior

* search: upgrade padded TC search to TC_OPT >= 2

* test: add behavior and correctness test for padded TC

added optional argument to apply_tensor_core to set TC_OPT level

* linearizer: add tests for the PADTO behavior and docs
2024-04-22 16:50:31 -04:00
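The PADTO change above lets the search use tensor cores on shapes that are not an exact multiple of the TC tile by padding the affected axes, gated behind TC_OPT >= 2. A minimal sketch of that gating (helper names are assumptions; TC_OPT semantics as described in the commit):

```python
# Pad an axis up to the next multiple of the tensor-core tile dimension.
def pad_to(size: int, multiple: int) -> int:
  return (size + multiple - 1) // multiple * multiple

def can_use_tc(axis_size: int, tc_dim: int, tc_opt: int) -> bool:
  if axis_size % tc_dim == 0: return True     # exact fit: always allowed
  return tc_opt >= 2                          # padded TC only at TC_OPT >= 2

assert pad_to(100, 16) == 112
assert can_use_tc(100, 16, tc_opt=1) is False
assert can_use_tc(100, 16, tc_opt=2) is True
```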
chenyu
a1133beb80 KFD GEMM (#4221)
added to benchmark CI and fixed duplicated filenames between cuda and ptx
2024-04-19 00:43:18 -04:00
chenyu
a7c6864260 remove CAST_BEFORE_VIEW (#4152)
* remove CAST_BEFORE_VIEW

testing perf; also this might have an issue with assign?

* remove all
2024-04-13 01:05:08 -04:00
George Hotz
0f16709c00 hotfix: remove test speed vs torch 2024-04-11 08:37:57 -07:00
chenyu
9a95d87366 metal CI run llama with 4 shards (#4103)
this can catch multi-tensor issues on mac.
2024-04-07 11:04:08 -04:00
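The CI run above shards llama across 4 metal devices to exercise the multi-tensor path on mac. A minimal sketch of that kind of sharding, assuming the `Tensor.shard` API and `DEVICE:n` naming used around this time (shapes are illustrative):

```python
from tinygrad import Tensor, Device

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(4))
x = Tensor.rand(8, 1024).shard(GPUS, axis=0)        # batch axis split across 4 devices
w = Tensor.rand(1024, 1024).shard(GPUS, axis=None)  # weight replicated on every device
y = (x @ w).realize()                               # runs as a multi-tensor op
```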
Szymon Ożóg
68fe3527f1 Tensor core ptx (#3894)
* tensor cores

* Merge from master

* faster program start in llvm (#3897)

* Fix the result permutation in einsum (#3895)

* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>

* touchup einsum (#3900)

don't need rhs_letters

* hotfix check ckpts before writing achieved model (#3901)

this killed tinybox green run

* replace dtype.name str with render_dtype (#3903)

fixed some bf16 cast issues since it does not have `.name`.
also more robust if there are language-specific type overrides

* add --minimal flag to nvrtc (#3899)

* wmma: fix the AMD TC threads to split the first 16 threads (#3904)

previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias.  now it splits it properly into 8 and the
remaining 2 into the correct local stride

* training cifar with BF16 on CUDA (#3905)

* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls in dataset preprocessing, which convert to float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda

* include negative float in test_dtype (#3884)

* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow

* add to benchmark

* change var name to satisfy mypy

* spacing

* Update to new TensorCore format

* Spacing

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Alejandro F Queiruga <33233447+afqueiruga@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: sekstini <127142660+sekstini@users.noreply.github.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-04 07:32:31 -07:00
Francis Lam
7c5729a3bd wmma: refactor to remove wmma_func and create TC funcs as needed (#3945)
* wmma: refactor to remove wmma_func and create TC funcs as needed

* test_linearizer: disable bf16 CUDA during emulation testing

* cstyle: clean up creation of CUDA vec dtypes

* extra/gemm: add option to accumulate to bfloat16

* cleanups

* benchmark: add CUDA bfloat16 matmul

* more cleanups
2024-03-27 16:43:09 -04:00
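One bullet above adds an option to the extra/gemm benchmark to accumulate in bfloat16. A minimal sketch of the difference, assuming the `acc_dtype` argument to `Tensor.matmul` (not the actual benchmark script):

```python
from tinygrad import Tensor, dtypes

N = 256
a = Tensor.rand(N, N, dtype=dtypes.bfloat16)
b = Tensor.rand(N, N, dtype=dtypes.bfloat16)
c_f32  = a.matmul(b, acc_dtype=dtypes.float32)   # accumulate in fp32, the usual choice
c_bf16 = a.matmul(b, acc_dtype=dtypes.bfloat16)  # accumulate in bf16: faster on some HW, less precise
```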
George Hotz
da07f31fd4 hotfix: remove bf16 test entirely 2024-03-26 20:50:27 -07:00
George Hotz
0d5845fb5b hotfix: jit is flaky on mac 2024-03-26 20:44:05 -07:00
chenyu
8df6587c41 hotfix 97.3 for beautiful_mnist (#3941) 2024-03-26 15:02:53 -04:00
chenyu
ef537672bf bf16 support in metal (#3929)
it runs if the device GPU supports bfloat. updated the CI benchmark too
2024-03-25 23:17:36 -04:00
chenyu
d651835ef5 verify beautiful_mnist.py eval acc and put into benchmark ci (#3926)
* verify beautiful_mnist and put in ci

* 97.5 for eval verification
2024-03-25 16:47:49 -04:00
chenyu
83f39a8ceb env var to change default float (#3902)
* env var to change default float to fp16 or bf16

looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.

working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
    __bf16 cast0 = (nv_bfloat16)(val0);
```

remove that in cifar

* DEFAULT_FLOAT

* default of default

* unit test

* don't check default

* tests work on linux
2024-03-24 20:33:57 -04:00
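The commit above adds the DEFAULT_FLOAT environment variable to pick the default float dtype. A minimal usage sketch, assuming it is exposed as `dtypes.default_float` (the invocation path is hypothetical):

```python
# hypothetical invocation:  DEFAULT_FLOAT=HALF python train.py   (or BFLOAT16 where supported)
from tinygrad import Tensor, dtypes

print(dtypes.default_float)            # dtypes.half when DEFAULT_FLOAT=HALF is set
x = Tensor.rand(4, 4)                  # new float tensors pick up the default dtype
assert x.dtype == dtypes.default_float
```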
chenyu
e22d78b3d2 training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls in dataset preprocessing, which convert to float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda
2024-03-24 01:37:47 -04:00
chenyu
82ce60e172 use JIT_BATCH_SIZE=4 for GPT2 3090 benchmark (#3870)
smaller first batch saves about 0.05 ms per token. 1.75ms / tok on local 3090
2024-03-22 00:40:06 -04:00
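The commit above suggests JIT_BATCH_SIZE controls how many kernels go into the first batched JIT graph, so a smaller first batch starts producing tokens sooner. A hedged sketch of reading that knob (the default value and invocation are assumptions):

```python
# hypothetical invocation:  JIT_BATCH_SIZE=4 python examples/gpt2.py
from tinygrad.helpers import getenv

JIT_BATCH_SIZE = getenv("JIT_BATCH_SIZE", 32)   # the default of 32 here is an assumption
print(f"grouping up to {JIT_BATCH_SIZE} kernels into the first JIT graph batch")
```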
chenyu
bc482729d0 lower hlb_cifar acc to 93.3 (#3865)
ran 30 runs and the lowest i see is 93.35. lowered to 93.3 for now.

maybe reenable ema later if it reduces variance
2024-03-21 17:58:53 -04:00
chenyu
7ff47e45a1 cifar TARGET_EVAL_ACC_PCT=93.5 (#3843) 2024-03-20 16:56:51 -04:00
chenyu
727de5ba1e llama 7B on 3090 benchmark (#3837)
* llama 7B on 3090 benchmark

* symlink llama
2024-03-20 12:48:22 -04:00
chenyu
e12bc85014 use BS=128 and BS=768 for resnet benchmark (#3815)
50% more hcopt perf with this one weird trick
2024-03-18 23:49:55 -04:00
chenyu
1711274654 7B llama on 4 gpus on benchmark (#3804) 2024-03-18 14:32:37 -04:00
chenyu
77febb44e6 llama 7B on 6 gpus benchmark (#3773) 2024-03-16 11:38:52 -04:00
George Hotz
0870dd5b3b hotfix: switch resnet training from HIP -> HSA in CI 2024-03-15 13:35:52 -07:00
George Hotz
5b3d8a886e split tinybox benchmark into two (#3741)
* split tinybox benchmark into two

* symlinks
2024-03-14 14:12:32 -07:00
David Hou
199f7c4342 MLPerf Resnet (cleaned up) (#3573)
* this is a lot of stuff

TEST_TRAIN env for less data

don't diskcache get_train_files

debug message

no lr_scaler for fp32

comment, typo

type stuff

don't destructure proc

make batchnorm parameters float

make batchnorm parameters float

resnet18, checkpointing

hack up checkpointing to keep the names in there

oops

wandb_resume

lower lr

eval/ckpt use e+1

lars

report top_1_acc

some wandb stuff

split fw and bw steps to save memory

oops

save model when reach target

formatting

make sgd hparams consistent

just always write the cats tag...

pass X and Y into backward_step to trigger input replace

shuffle eval set to fix batchnorm eval

dataset is sorted by class, so the means and variances are all wrong

small cleanup

hack restore only one copy of each tensor

do bufs from lin after cache check (lru should handle it fine)

record epoch in wandb

more digits for topk in eval

more env vars

small cleanup

cleanup hack tricks

cleanup hack tricks

don't save ckpt for testeval

cleanup

diskcache train file glob

clean up a little

device_str

SCE into tensor

small

small

log_softmax out of resnet.py

oops

hack :(

comments

HeNormal, track gradient norm

oops

log SYNCBN to wandb

real truncnorm

less samples for truncated normal

custom init for Linear

log layer stats

small

Revert "small"

This reverts commit 988f4c1cf3.

Revert "log layer stats"

This reverts commit 9d98224585.

rename BNSYNC to SYNCBN to be consistent with cifar

optional TRACK_NORMS

fix label smoothing :/

lars skip list

only weight decay if not in skip list

comment

default 0 TRACK_NORMS

don't allocate beam scratch buffers if in cache

clean up data pipeline, unsplit train/test, put back a hack

remove print

run test_indexing on remu (#3404)

* emulated ops_hip infra

* add int4

* include test_indexing in remu

* Revert "Merge branch 'remu-dev-mac'"

This reverts commit 6870457e57, reversing
changes made to 3c4c8c9e16.

fix bad seeding

UnsyncBatchNorm2d but with synced trainable weights

label downsample batchnorm in Bottleneck

:/

:/

i mean... it runs... it hits the acc... it's fast...

new unsyncbatchnorm for resnet

small fix

don't do assign buffer reuse for axis change

* remove changes

* remove changes

* move LARS out of tinygrad/

* rand_truncn rename

* whitespace

* stray whitespace

* no more gnorms

* delete some dataloading stuff

* remove comment

* clean up train script

* small comments

* move checkpointing stuff to mlperf helpers

* if WANDB

* small comments

* remove whitespace change

* new unsynced bn

* clean up prints / loop vars

* whitespace

* undo nn changes

* clean up loops

* rearrange getenvs

* cpu_count()

* PolynomialLR whitespace

* move he_normal out

* cap warmup in polylr

* rearrange wandb log

* realize both x and y in data_get

* use double quotes

* combine prints in ckpts resume

* take UBN from cifar

* running_var

* whitespace

* whitespace

* typo

* if instead of ternary for resnet downsample

* clean up dataloader cleanup a little?

* separate rng for shuffle

* clean up imports in model_train

* clean up imports

* don't realize copyin in data_get

* remove TESTEVAL (train dataloader didn't get freed every loop)

* adjust wandb_config entries a little

* clean up wandb config dict

* reduce lines

* whitespace

* shorter lines

* put shm unlink back, but it doesn't seem to do anything

* don't pass seed per task

* monkeypatch batchnorm

* the reseed was wrong

* add epoch number to desc

* don't use unsyncedbatchnorm if syncbn=1

* put back downsample name

* eval every epoch

* Revert "the reseed was wrong"

This reverts commit 3440a07dff3f40e8a8d156ca3f1938558a59249f.

* cast lr in onecycle

* support fp16

* cut off kernel if expand after reduce

* test polynomial lr

* move polynomiallr to examples/mlperf

* working PolynomialDecayWithWarmup + tests.......

add lars_util.py, oops

* keep lars_util.py as intact as possible, simplify our interface

* no more half

* polylr and lars were merged

* undo search change

* override Linear init

* remove half stuff from model_train

* update scheduler init with new args

* don't divide by input mean

* mistake in resnet.py

* restore whitespace in resnet.py

* add test_data_parallel_resnet_train_step

* move initializers out of resnet.py

* unused imports

* log_softmax to model output in test to fix precision flakiness

* log_softmax to model output in test to fix precision flakiness

* oops, don't realize here

* is None

* realize initializations in order for determinism

* BENCHMARK flag for number of steps

* add resnet to bechmark.yml

* return instead of break

* missing return

* cpu_count, rearrange benchmark.yml

* unused variable

* disable tqdm if BENCHMARK

* getenv WARMUP_EPOCHS

* unlink disktensor shm file if exists

* terminate instead of join

* properly shut down queues

* use hip in benchmark for now

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-14 00:53:41 -04:00
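The MLPerf ResNet commit above lands a PolynomialDecayWithWarmup schedule (with a capped warmup) alongside LARS. A minimal standalone sketch of that schedule shape; parameter names, the power, and the test values are assumptions rather than taken from the actual lars_util.py:

```python
# Polynomial LR decay with a linear warmup (all hyperparameters illustrative).
def poly_decay_with_warmup(step: int, base_lr: float, end_lr: float,
                           warmup_steps: int, total_steps: int, power: float = 2.0) -> float:
  if step < warmup_steps:                       # linear warmup from 0 up to base_lr
    return base_lr * (step + 1) / warmup_steps
  decay_steps = max(total_steps - warmup_steps, 1)
  t = min(step - warmup_steps, decay_steps) / decay_steps
  return (base_lr - end_lr) * (1.0 - t) ** power + end_lr

lrs = [poly_decay_with_warmup(s, 8.0, 1e-4, warmup_steps=100, total_steps=1000) for s in range(1000)]
assert max(lrs) <= 8.0 and lrs[-1] < 0.01
```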
chenyu
f30fb192b7 resnet eval on tinybox ci (#3714) 2024-03-13 13:26:30 -04:00
chenyu
d69170e27e add llama 2 70B in ci and verify output (#3682)
* add llama 2 70B in ci and verify output

* ln -s llama2 dir
2024-03-11 12:48:22 -04:00
chenyu
e10ee2ed3f llama beam tinybox ci (#3680) 2024-03-11 01:35:39 -04:00
chenyu
bad6adaf8c add mixtral and 6 gpus cifar to tinybox ci (#3676)
* add mixtral and 6 gpus cifar to tinybox ci

* print total ram used at the end of loading
2024-03-10 18:25:31 -04:00
chenyu
3c3f846c45 tinybox benchmark with HSA (#3603)
* tinybox benchmark with HSA

* torch cuda init can fail

* no TORCHCUDA

* print torch version

* LD_PRELOAD="/opt/rocm/lib/libhsa-runtime64.so"
2024-03-05 11:03:52 -05:00
chenyu
957e9800f1 llama + beam to mac benchmark, full cifar to nvidia benchmark (#3612)
would merge if it's also ~1 minute. btw why is gpt2 beam not slower in the first beam run?
2024-03-04 21:35:57 -05:00
chenyu
8e5d60a322 add more gpt2 variant in mac/nvidia benchmark (#3599) 2024-03-03 17:55:30 -05:00
Francis Lam
e17f1821a7 wmma: add CUDA tensor core and fix test_speed_v_torch failure (#3544) 2024-03-01 17:51:02 -08:00
chenyu
978a997d1f print nvidia-smi in CI benchmark (#3546) 2024-02-29 17:31:37 -05:00
George Hotz
e7cda40d52 Revert "hotfix: disable metal graph"
This reverts commit 3541602877.
2024-02-28 16:25:12 -08:00
George Hotz
3541602877 hotfix: disable metal graph 2024-02-28 10:33:34 -08:00
wozeparrot
57678012e1 Upload correct benchmark artifact (#3471)
* fix: correct filename

* fix: why is this .py?
2024-02-22 01:14:16 -05:00
chenyu
02683a8659 gate the cast before movements in lazy (#3452)
it made gpt2 slower (2ms -> 2.5ms on 3090, 7ms -> 8ms on M1 Max with BEAM=2).
disabled it in gpt2 benchmark before understanding the full issue
2024-02-20 09:36:22 -05:00
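Several entries in this log touch CAST_BEFORE_VIEW (removed in #4152, reverted in #4471, gated here): the idea is to apply a narrowing cast before the movement ops so the base buffer is realized in the smaller dtype, at the cost of behavior that can hurt small workloads like gpt2. A rough sketch of the two orderings using plain Tensor ops; the real change is a rewrite in the lazy graph, not user code:

```python
from tinygrad import Tensor, dtypes

x = Tensor.rand(1024, 1024)                          # fp32 source

# cast after the movement ops: the expanded view is built from the fp32 buffer
a = x.reshape(1024, 1024, 1).expand(1024, 1024, 4).cast(dtypes.half)

# cast before the movement ops (the CAST_BEFORE_VIEW ordering): the base buffer is
# narrowed to fp16 first, which is where the resnet memory gain in #4471 comes from
b = x.cast(dtypes.half).reshape(1024, 1024, 1).expand(1024, 1024, 4)
```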
chenyu
d8ad9e5660 verify eval acc for hlb_cifar training (#3344)
set to 93% to reduce flakiness for now
2024-02-07 19:19:59 -05:00
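Several benchmark entries here (hlb_cifar, beautiful_mnist, resnet) verify eval accuracy against a target in CI. A minimal sketch of that kind of gate, using the TARGET_EVAL_ACC_PCT knob mentioned above; the surrounding training/eval code is assumed and the measured value is a placeholder:

```python
from tinygrad.helpers import getenv

target = getenv("TARGET_EVAL_ACC_PCT", 0.0)     # e.g. 93.0 for hlb_cifar in this CI
eval_acc_pct = 93.7                             # placeholder for the measured eval accuracy
if target > 0:
  assert eval_acc_pct >= target, f"eval acc {eval_acc_pct} below target {target}"
```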