Commit Graph

120 Commits

chenyu
e9c6a36894 remove CACHELEVEL=0 in llama3 benchmark (#5025) 2024-06-17 22:43:16 -04:00
George Hotz
bee8fc29ee add GPT2 half/half+beam to AMD (#5000)
* add GPT2 half/half+beam to AMD

* winograd in training. half and half/beam file upload
2024-06-16 14:07:14 -07:00
chenyu
44dfa37c70 use threefry in stable diffusion benchmark (#4988)
also updated default steps to 10. easier to tell the image is following the prompt.
2024-06-15 20:25:29 -04:00
wozeparrot
ce1ed374c9 more tinychat fixes (#4971) 2024-06-15 16:29:39 -07:00
qazal
ff8e9eefc3 hotfix: don't use ASSERT_COMPILE for benchmarks process replay (#4981)
* use replay_codegen [run_process_replay]

* disable for now [run_process_replay]
2024-06-15 16:57:47 +03:00
uuuvn
92f49efd06 Trigger process replay from pull request title [run_process_replay] (#4980)
* Trigger process replay from pull request title

* idk how this thing works btw

* test if it will work

* try 2

* Revert "idk how this thing works btw"

This reverts commit 580da51b07.

* Revert "try 2"

This reverts commit 7ff1e86d5d.

* test if it works

* meh

* Reapply "idk how this thing works btw"

This reverts commit dd33ad7c14.

* revert
2024-06-15 16:21:00 +03:00
George Hotz
f42183ba28 hotfix: relax cifar to 93.2 2024-06-09 13:09:21 +02:00
nimlgen
6327b50e51 amd in benchmarks (#4861)
* amd in benchmarks

* remove all hsa
2024-06-08 23:24:46 +03:00
qazal
240d6b5bc0 process replay benchmarks (#4668) 2024-06-01 14:36:21 +03:00
chenyu
38bc38cdff fix llama example quantize (#4699)
* fix llama example quantize

import quantize layers from new example llama3

add to mac benchmark

* fix that

* save the files
2024-05-23 15:35:26 -04:00
chenyu
72560e30fe add CACHELEVEL=0 to tinybox green GEMM BEAM (#4693)
* add CACHELEVEL=0 to tinybox green GEMM BEAM

* BEAM=4 is more stable
2024-05-22 23:59:50 -04:00
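For readers unfamiliar with these flags, a minimal sketch follows (not the benchmark script itself; the matmul size is illustrative). BEAM sets the width of tinygrad's kernel beam search, and CACHELEVEL=0 is assumed here to skip the on-disk cache so the search reruns from scratch; both are read early, so they must be in the environment before tinygrad is imported.

```
# Sketch: run a small matmul with BEAM search enabled and the on-disk cache disabled.
# Assumes BEAM/CACHELEVEL are read when tinygrad is imported, so set them first.
import os
os.environ["BEAM"] = "4"        # beam width for the kernel search (the commit settled on 4)
os.environ["CACHELEVEL"] = "0"  # assumption: 0 disables reuse of cached search/compile results

from tinygrad import Tensor

a, b = Tensor.rand(1024, 1024), Tensor.rand(1024, 1024)
print((a @ b).numpy().shape)    # the matmul kernel behind this line is what gets BEAM-searched
```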
wozeparrot
00432496d7 feat: tinyboxgreen (#4366)
* feat: tinyboxgreen

* feat: tinyboxgreenv2

* fix symlink weights

* fix: remove llama 2 70b for now

* feat: naming

* fix: remove extra cifar steps

* feat: disable mixtral on nvidia
2024-05-20 22:39:34 -04:00
chenyu
8a0d1ca7bb CI test timeout 20 min -> 10 min (#4645)
if it takes more than 10 minutes, setup usually fails anyway. also updated matmul_kfd -> matmul_amd in benchmark
2024-05-18 13:58:28 -04:00
George Hotz
07b350a8f4 new uops is an actual graph (#4560)
* new uops is an actual graph

* it's way slower

* simpler

* fix define acc

* render_loop unique

* ops test pass

* add pattern matcher back, there's bugs

* rewrite

* use priority queue

* recursive children

* fix tests

* fix tests with SINK

* fix abstractions

* fix assembly

* simpler

* link define_acc

* fix DEFINE_ACC placement

* type verify

* full cmp

* fix cmp

* ACCESS_ACC

* insert DEFINE_ACC

* fix PHI

* recursive rewrite

* fix many tests

* sum collapse

* more patterns

* correct change

* fold arange

* fix that lin test

* space

* big folding rule works

* close

* has more maxes, meh

* cached node replace

* set changed

* simplest folding yet

* works

* works

* DIV

* all tests pass

* del

* fuzz linearizer fails

* sum_collapse

* test depth 2 cf

* fix lin test 14

* fix clang depth

* disable that

* failure 14 is fixed

* fix ptx

* failure 27 is fixed

* fix llama

* run_cnt

* Revert "Optimize PTX gated loads index calculation (#4304)"

This reverts commit d97d5a7689.

* fix uops loop

* fix ptx bugs

* add barrier

* print

* mem_type in ptx direct

* bypass tests that fail in CI but pass locally

* ptx remove ptr_ar

* more ptx passing

* fix ptx tests

* assert compile support

* remove  model inference benchmark from red
2024-05-17 18:00:18 -07:00
chenyu
ca1df20fa9 benchmark name fix - resnet eval is on eval data (#4628) 2024-05-17 12:56:12 -04:00
chenyu
e5d4e6a8aa BEAM=2 in green CI for 100 TFLOPS (#4624) 2024-05-16 23:28:28 -04:00
George Hotz
fd02ab1e8b move disassemblers and openpilot (#4592)
* move disassemblers and openpilot

* delete junk

* put that in pre-commit

* fixup readme
2024-05-14 19:30:02 -07:00
chenyu
5de4a46f10 re-enable gpt2 half/beam mac benchmark (#4496)
* re-enable gpt2 half/beam mac benchmark

from the fuzzer it seems to be flaky due to a numerical issue, not a kernel bug. we used to have half in the split reduce.

ran this on an M1 Max for 20 loops and it's fine

* that should be jitted
2024-05-09 19:15:32 -04:00
chenyu
c508eb7425 revert the removal of CAST_BEFORE_VIEW (#4471)
this brings most of the memory gain for resnet back.
2024-05-08 00:14:29 -04:00
chenyu
d4062cb6fc NV tensor_cores in kernel.py (#4399) 2024-05-02 22:33:08 -04:00
chenyu
dce7ac0160 NOCLANG=1 for tinybox green ci. (#4378)
CLANG was disabled for tinybox red for speed
2024-05-01 13:31:01 -04:00
wozeparrot
4a26718ca9 feat: tinyboxgreen (#4365) 2024-04-30 19:05:37 -04:00
chenyu
fdc8fabae5 disable flaky mac gpt2 beam benchmark and add back cifar mac with JIT=2 (#4358)
* debug flaky mac gpt2 beam run

* disable for now
2024-04-30 10:41:37 -04:00
chenyu
3ec4b745d6 JIT=2 for mac cifar benchmark (#4300)
also double BS for resnet training benchmark to match submission target
2024-04-25 18:33:40 -04:00
chenyu
c1fbacb182 resnet benchmarks use DEFAULT_FLOAT=HALF (#4285)
also updated the LR default to be scaled based on 1536 (the BS we are submitting)
2024-04-24 12:10:57 -04:00
Szymon Ożóg
002a14088e Ptx store gate cast to bool (#4284)
* Cast gate to bool

* Update

* Add PTX fuzzing to benchmark
2024-04-24 11:43:44 -04:00
chenyu
759b4f41c3 few more KFD -> AMD (#4262)
benchmark gemm and default_parallel
2024-04-23 10:15:37 -04:00
Francis Lam
3f6c7ca8bf test: fix test_tensor_core_padded on CUDA and add to benchmarks (#4258)
* test: fix test_tensor_core_padded on CUDA and add to benchmarks

* fix linter

* run both tests in one call
2024-04-22 23:22:11 -04:00
Francis Lam
bbb0ad4800 wmma: widen TC usage in search by using PADTO on TC axes when possible (#4216)
* wmma: widen TC usage in search by using PADTO on TC axes when possible

* test: start tests for the new padding TC behavior

* search: upgrade padded TC search to TC_OPT >= 2

* test: add behavior and correctness test for padded TC

added optional argument to apply_tensor_core to set TC_OPT level

* linearizer: add tests for the PADTO behavior and docs
2024-04-22 16:50:31 -04:00
chenyu
a1133beb80 KFD GEMM (#4221)
added to benchmark CI and fixed duplicated filenames between cuda and ptx
2024-04-19 00:43:18 -04:00
chenyu
a7c6864260 remove CAST_BEFORE_VIEW (#4152)
* remove CAST_BEFORE_VIEW

testing perf, also this might have an issue with assign?

* remove all
2024-04-13 01:05:08 -04:00
George Hotz
0f16709c00 hotfix: remove test speed vs torch 2024-04-11 08:37:57 -07:00
chenyu
9a95d87366 metal CI run llama with 4 shards (#4103)
this can catch multi tensor issues on mac.
2024-04-07 11:04:08 -04:00
Szymon Ożóg
68fe3527f1 Tensor core ptx (#3894)
* tensor cores

* Merge from master

* faster program start in llvm (#3897)

* Fix the result permutation in einsum (#3895)

* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>

* touchup einsum (#3900)

don't need rhs_letters

* hotfix check ckpts before writing achieved model (#3901)

this killed tinybox green run

* replace dtype.name str with render_dtype (#3903)

fixed some bf16 cast issues since it does not have `.name`.
also more robust if there are lang-specific type overrides

* add --minimal flag to nvrtc (#3899)

* wmma: fix the AMD TC threads to split the first 16 threads (#3904)

previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias.  now it splits it properly into 8 and the
remaining 2 into the correct local stride

* training cifar with BF16 on CUDA (#3905)

* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls in dataset preprocessing, which convert to float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda

* include negative float in test_dtype (#3884)

* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow

* add to benchmark

* change var name to satisfy mypy

* spacing

* Update to new TensorCore format

* Spacing

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Alejandro F Queiruga <33233447+afqueiruga@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: sekstini <127142660+sekstini@users.noreply.github.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-04 07:32:31 -07:00
Francis Lam
7c5729a3bd wmma: refactor to remove wmma_func and create TC funcs as needed (#3945)
* wmma: refactor to remove wmma_func and create TC funcs as needed

* test_linearizer: disable bf16 CUDA during emulation testing

* cstyle: clean up creation of CUDA vec dtypes

* extra/gemm: add option to accumulate to bfloat16

* cleanups

* benchmark: add CUDA bfloat16 matmul

* more cleanups
2024-03-27 16:43:09 -04:00
George Hotz
da07f31fd4 hotfix: remove bf16 test entirely 2024-03-26 20:50:27 -07:00
George Hotz
0d5845fb5b hotfix: jit is flaky on mac 2024-03-26 20:44:05 -07:00
chenyu
8df6587c41 hotfix 97.3 for beautiful_mnist (#3941) 2024-03-26 15:02:53 -04:00
chenyu
ef537672bf bf16 support in metal (#3929)
it runs if the device gpu supports bfloat. updated the ci benchmark too
2024-03-25 23:17:36 -04:00
chenyu
d651835ef5 verify beautiful_mnist.py eval acc and put into benchmark ci (#3926)
* verify beautiful_mnist and put in ci

* 97.5 for eval verification
2024-03-25 16:47:49 -04:00
chenyu
83f39a8ceb env var to change default float (#3902)
* env var to change default float to fp16 or bf16

looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.

working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
    __bf16 cast0 = (nv_bfloat16)(val0);
```

remove that in cifar

* DEFAULT_FLOAT

* default of default

* unit test

* don't check default

* tests work on linux
2024-03-24 20:33:57 -04:00
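As a rough illustration of the knob this commit adds, here is a sketch assuming DEFAULT_FLOAT is read at import time and surfaces as dtypes.default_float:

```
# Sketch: pick half as the default float, then check that new float tensors follow it.
# Normally the variable is set on the command line, e.g. DEFAULT_FLOAT=HALF python script.py.
import os
os.environ.setdefault("DEFAULT_FLOAT", "HALF")  # must be set before tinygrad is imported

from tinygrad import Tensor, dtypes

print(dtypes.default_float)   # expected: dtypes.half
print(Tensor.ones(4).dtype)   # newly created float tensors use the default float
```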
chenyu
e22d78b3d2 training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls in dataset preprocessing, which convert to float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda
2024-03-24 01:37:47 -04:00
chenyu
82ce60e172 use JIT_BATCH_SIZE=4 for GPT2 3090 benchmark (#3870)
smaller first batch saves about 0.05 ms per token. 1.75ms / tok on local 3090
2024-03-22 00:40:06 -04:00
chenyu
bc482729d0 lower hlb_cifar acc to 93.3 (#3865)
ran 30 runs and the lowest i saw was 93.35. lowered to 93.3 for now.

maybe re-enable ema later if it reduces variance
2024-03-21 17:58:53 -04:00
chenyu
7ff47e45a1 cifar TARGET_EVAL_ACC_PCT=93.5 (#3843) 2024-03-20 16:56:51 -04:00
chenyu
727de5ba1e llama 7B on 3090 benchmark (#3837)
* llama 7B on 3090 benchmark

* symlink llama
2024-03-20 12:48:22 -04:00
chenyu
e12bc85014 use BS=128 and BS=768 for resnet benchmark (#3815)
50% more hcopt perf with this one weird trick
2024-03-18 23:49:55 -04:00
chenyu
1711274654 7B llama on 4 gpus on benchmark (#3804) 2024-03-18 14:32:37 -04:00
chenyu
77febb44e6 llama 7B on 6 gpus benchmark (#3773) 2024-03-16 11:38:52 -04:00
George Hotz
0870dd5b3b hotfix: switch resnet training from HIP -> HSA in CI 2024-03-15 13:35:52 -07:00