chenyu
dc942bf1f6
jit sampling function in test_randomness.test_multinomial ( #5034 )
...
* jit sampling function in test_randomness.test_multinomial
`THREEFRY=1 python3 -m pytest test/test_randomness.py::TestRandomness::test_multinomial --durations 1` 7 sec -> 1.2 sec
* skip that
2024-06-18 14:21:05 -04:00
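Jitting the sampling function pays off because the expensive kernel compilation happens once and every later call reuses the cached result, which is how the test above went from 7 sec to 1.2 sec. A minimal pure-Python sketch of that compile-once-call-many pattern (the `compile_fn` helper is hypothetical, not tinygrad's TinyJit):

```python
import functools

@functools.lru_cache(maxsize=None)
def compile_fn(src: str):
    # hypothetical stand-in for an expensive kernel-compile step;
    # in a real JIT, compilation dominates the first call only
    return eval(f"lambda x: {src}")

def sample(x: int) -> int:
    # every call after the first is a cache hit: no recompilation
    fn = compile_fn("x * 2 + 1")
    return fn(x)

print(sample(3), sample(10))  # 7 21
```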
chenyu
e9c6a36894
remove CACHELEVEL=0 in llama3 benchmark ( #5025 )
2024-06-17 22:43:16 -04:00
chenyu
acaf9a490d
RECIP(-0.0) should be -inf ( #5024 )
...
* RECIP(-0.0) should be -inf
added test_dtype_alu for PYTHON backend
* catch that
* fix those two
2024-06-17 22:26:58 -04:00
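The commit above touches a genuine IEEE 754 corner case: the reciprocal of negative zero is negative infinity, but plain Python float division raises instead, which is why a Python-based backend has to special-case it. A hedged sketch of the required semantics (the `ieee_recip` helper is illustrative, not tinygrad's code):

```python
import math

def ieee_recip(x: float) -> float:
    # IEEE 754: 1/+0.0 -> +inf and 1/-0.0 -> -inf, but Python's `/`
    # raises ZeroDivisionError on zero, so emulate the spec explicitly
    if x == 0.0:  # true for both +0.0 and -0.0
        return math.copysign(math.inf, x)
    return 1.0 / x

print(ieee_recip(-0.0))  # -inf
print(ieee_recip(0.0))   # inf
```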
George Hotz
bee8fc29ee
add GPT2 half/half+beam to AMD ( #5000 )
...
* add GPT2 half/half+beam to AMD
* winograd in training. half and half/beam file upload
2024-06-16 14:07:14 -07:00
chenyu
44dfa37c70
use threefry in stable diffusion benchmark ( #4988 )
...
also updated default steps to 10, making it easier to tell the image is following the prompt.
2024-06-15 20:25:29 -04:00
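Threefry is a counter-based PRNG (Salmon et al., "Parallel Random Numbers: As Easy as 1, 2, 3"): the same (key, counter) pair always yields the same bits, which makes benchmark runs reproducible across devices. A minimal pure-Python Threefry-2x32 sketch using the published rotation and key-schedule constants (an illustration of the construction, not tinygrad's implementation):

```python
def threefry2x32(counter, key, rounds=20):
    # counter-based PRNG: identical (key, counter) -> identical bits
    R = (13, 15, 26, 6, 17, 29, 16, 24)  # per-round rotation schedule
    M = 0xFFFFFFFF
    x0, x1 = counter
    k0, k1 = key
    ks = (k0, k1, k0 ^ k1 ^ 0x1BD11BDA)  # key schedule + parity constant
    x0, x1 = (x0 + k0) & M, (x1 + k1) & M  # initial key injection
    for i in range(rounds):
        x0 = (x0 + x1) & M
        r = R[i % 8]
        x1 = (((x1 << r) | (x1 >> (32 - r))) & M) ^ x0  # rotate-xor mix
        if i % 4 == 3:  # inject the next subkey every 4 rounds
            j = i // 4 + 1
            x0 = (x0 + ks[j % 3]) & M
            x1 = (x1 + ks[(j + 1) % 3] + j) & M
    return x0, x1

print(threefry2x32((0, 0), (1, 2)))
```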
wozeparrot
ce1ed374c9
more tinychat fixes ( #4971 )
2024-06-15 16:29:39 -07:00
qazal
ff8e9eefc3
hotfix: don't use ASSERT_COMPILE for benchmarks process replay ( #4981 )
...
* use replay_codegen [run_process_replay]
* disable for now [run_process_replay]
2024-06-15 16:57:47 +03:00
uuuvn
92f49efd06
Trigger process replay from pull request title [run_process_replay] ( #4980 )
...
* Trigger process replay from pull request title
* idk how this thing works btw
* test if it will work
* try 2
* Revert "idk how this thing works btw"
This reverts commit 580da51b07 .
* Revert "try 2"
This reverts commit 7ff1e86d5d .
* test if it works
* meh
* Reapply "idk how this thing works btw"
This reverts commit dd33ad7c14 .
* revert
2024-06-15 16:21:00 +03:00
wozeparrot
62dc36d371
autogen _try_dlopen ( #4949 )
2024-06-14 12:12:18 -07:00
chenyu
f902af4f0b
increase metal ci test timeout to 20 minutes ( #4920 )
...
make it less annoying for now
2024-06-11 18:45:51 -04:00
qazal
7f3d9e6d94
revert hsa autogen removal ( #4914 )
...
* Revert "only install comgr in AMD CI (#4909 )"
This reverts commit 7f03420d05 .
* rocm-llvm only removal
2024-06-11 12:55:45 -04:00
qazal
7f03420d05
only install comgr in AMD CI ( #4909 )
...
* test
* delete hsa autogen
2024-06-11 06:19:33 -04:00
qazal
8b5bcf309a
process replay in all of CI ( #4884 )
2024-06-10 14:49:29 -04:00
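"Process replay" in these commits means regenerating every kernel for a PR and diffing the generated code against what master produces, so a pure refactor can prove it changes nothing. A toy sketch of the diffing step only (the kernel names and data shapes here are hypothetical):

```python
def diff_kernels(master: dict, pr: dict) -> list:
    # compare generated source kernel-by-kernel; a pure refactor
    # should produce an empty diff against master
    changed = [n for n in master if pr.get(n) != master[n]]
    new = [n for n in pr if n not in master]
    return sorted(changed + new)

master = {"E_4": "kernel a", "r_16": "kernel b"}
pr = {"E_4": "kernel a", "r_16": "kernel b (tweaked)"}
print(diff_kernels(master, pr))  # ['r_16']
```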
George Hotz
f42183ba28
hotfix: relax cifar to 93.2
2024-06-09 13:09:21 +02:00
nimlgen
654a8b9ef7
retire hsa ( #4885 )
...
* retire hsa
* EMULATE_AMD
2024-06-09 11:33:03 +03:00
nimlgen
6327b50e51
amd in benchmarks ( #4861 )
...
* amd in benchmarks
* remove all hsa
2024-06-08 23:24:46 +03:00
qazal
66dfd5e7bf
faster codegen process replay ( #4858 )
...
* faster codegen process replay
* use self.copy
* regenerate
* delete copy
* test a real error [run_process_replay]
* revert the error change
2024-06-07 16:20:57 +03:00
qazal
0db9674dea
skip process replay on master ( #4808 )
2024-06-03 12:29:28 +03:00
qazal
f64fa51a64
process replay for test/* ( #4799 )
...
* add input to unit tests [run_process_replay]
* add setup [run_process_replay]
* run tests [run_process_replay]
* add cuda and amd [run_process_replay]
* run everything but BEAM=2 [run_process_replay]
* skip export_model [run_process_replay]
* fix amd CI
* add concurrency back
2024-06-03 12:01:58 +03:00
qazal
240d6b5bc0
process replay benchmarks ( #4668 )
2024-06-01 14:36:21 +03:00
nimlgen
bd2e7c8b31
amd registers from file ( #4778 )
...
* amd registers from file
* remove comments
* linter
* no off
2024-05-31 18:48:57 +03:00
Szymon Ożóg
a4de81e9a6
Update ocelot version ( #4715 )
2024-05-24 14:32:53 -04:00
chenyu
38bc38cdff
fix llama example quantize ( #4699 )
...
* fix llama example quantize
import quantize layers from new example llama3
add to mac benchmark
* fix that
* save the files
2024-05-23 15:35:26 -04:00
chenyu
72560e30fe
add CACHELEVEL=0 to tinybox green GEMM BEAM ( #4693 )
...
* add CACHELEVEL=0 to tinybox green GEMM BEAM
* BEAM=4 is more stable
2024-05-22 23:59:50 -04:00
Yury Zhuravlev
af56f0e68a
fix HSA/KFD load for system-wide installation ( #4218 )
...
Co-authored-by: wozeparrot <wozeparrot@gmail.com >
2024-05-22 20:33:21 -07:00
nimlgen
12339f6564
disable cuda test in ci ( #4630 )
...
Co-authored-by: chenyu <chenyu@fastmail.com >
2024-05-22 23:23:32 -04:00
qazal
498cf3e7e0
fuzzer path search for DEFINE_ACC ( #4656 )
...
* insert acc
* add test_ops
* find toposorts
* todo - not yet ready
* remove the import
* atol and childless children
2024-05-23 00:50:01 +03:00
qazal
458a3961eb
catch compile errors in uops tests ( #4672 )
...
* use helper and compile
* llama beam=2
* ast length
* skip float4, fix hsa
* use empty tensors
2024-05-21 12:20:35 +03:00
wozeparrot
00432496d7
feat: tinyboxgreen ( #4366 )
...
* feat: tinyboxgreen
* feat: tinyboxgreenv2
* fix symlink weights
* fix: remove llama 2 70b for now
* feat: naming
* fix: remove extra cifar steps
* feat: disable mixtral on nvidia
2024-05-20 22:39:34 -04:00
chenyu
8a0d1ca7bb
CI test timeout 20 min -> 10 min ( #4645 )
...
if it takes more than 10 minutes, setup usually failed anyway. also updated matmul_kfd -> matmul_amd in the benchmark
2024-05-18 13:58:28 -04:00
George Hotz
b74cc1d01a
uops cleanup ( #4634 )
...
* def add cleanup
* minor speedup
* add back ptx speed
* a little faster
* merge that
* only linearize once for ptx
* two graph rewrites for ptx, bug?
2024-05-17 20:02:38 -07:00
George Hotz
07b350a8f4
new uops is an actual graph ( #4560 )
...
* new uops is an actual graph
* it's way slower
* simpler
* fix define acc
* render_loop unique
* ops test pass
* add pattern matcher back, there's bugs
* rewrite
* use priority queue
* recursive children
* fix tests
* fix tests with SINK
* fix abstractions
* fix assembly
* simpler
* link define_acc
* fix DEFINE_ACC placement
* type verify
* full cmp
* fix cmp
* ACCESS_ACC
* insert DEFINE_ACC
* fix PHI
* recursive rewrite
* fix many tests
* sum collapse
* more patterns
* correct change
* fold arange
* fix that lin test
* space
* big folding rule works
* close
* has more maxes, meh
* cached node replace
* set changed
* simplest folding yet
* works
* works
* DIV
* all tests pass
* del
* fuzz linearizer fails
* sum_collapse
* test depth 2 cf
* fix lin test 14
* fix clang depth
* disable that
* failure 14 is fixed
* fix ptx
* failure 27 is fixed
* fix llama
* run_cnt
* Revert "Optimize PTX gated loads index calculation (#4304 )"
This reverts commit d97d5a7689 .
* fix uops loop
* fix ptx bugs
* add barrier
* print
* mem_type in ptx direct
* bypass tests that fail in CI but pass locally
* ptx remove ptr_ar
* more ptx passing
* fix ptx tests
* assert compile support
* remove model inference benchmark from red
2024-05-17 18:00:18 -07:00
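The "new uops is an actual graph" rework above replaces a linear uop list with a DAG plus a pattern matcher that rewrites subgraphs recursively (the "recursive rewrite" and folding bullets). A toy sketch of that bottom-up rewrite idea, with a hypothetical `UOp` node and patterns, not tinygrad's actual classes:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UOp:
    # toy op node: an op name, child nodes, and an optional argument
    op: str
    src: tuple = ()
    arg: object = None

def rewrite(u: UOp) -> UOp:
    # bottom-up graph rewrite: simplify children first, then try
    # local algebraic patterns at this node
    src = tuple(rewrite(s) for s in u.src)
    u = UOp(u.op, src, u.arg)
    if u.op == "MUL" and src[1] == UOp("CONST", arg=1): return src[0]  # x*1 -> x
    if u.op == "ADD" and src[1] == UOp("CONST", arg=0): return src[0]  # x+0 -> x
    return u

x = UOp("LOAD", arg="buf0")
expr = UOp("ADD", (UOp("MUL", (x, UOp("CONST", arg=1))), UOp("CONST", arg=0)))
print(rewrite(expr) == x)  # True
```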
chenyu
ca1df20fa9
benchmark name fix - resnet eval is on eval data ( #4628 )
2024-05-17 12:56:12 -04:00
chenyu
e5d4e6a8aa
BEAM=2 in green CI for 100 TFLOPS ( #4624 )
2024-05-16 23:28:28 -04:00
nimlgen
eb9689336e
nv mockgpu ( #4600 )
...
* mockgpu nv
* works
* comment that out
* fix merge
* setup gpuocelot
* install packages
* not run all of them
* passes
* fix ci
* almost
* should pass
* linter
* linter 2
* try this?
* ugh, not supported
* ci
* remove ticket from description
* better descs
2024-05-15 23:46:08 +03:00
George Hotz
5ba611787d
move image into tensor.py. delete features ( #4603 )
...
* move image into tensor.py
* change setup.py
* openpilot tests need pythonpath now
2024-05-15 10:50:25 -07:00
George Hotz
afa9753d39
ruff cleanup ( #4594 )
...
* check editor config
* no editorconfig, it doesn't work
* ruff cleanups
2024-05-14 21:16:14 -07:00
George Hotz
9425973bc7
docs cleanup and move ( #4593 )
...
* cleanup and move
* docs-legacy is gone
* don't update setup.py
2024-05-14 20:44:59 -07:00
George Hotz
fd02ab1e8b
move disassemblers and openpilot ( #4592 )
...
* move disassemblers and openpilot
* delete junk
* put that in pre-commit
* fixup readme
2024-05-14 19:30:02 -07:00
nimlgen
9b02aef45a
remove rhip ( #4579 )
...
* remove rhip
* remove hip runner
2024-05-14 17:58:19 +03:00
nimlgen
2131556c2c
amd mockgpu ( #4535 )
...
* start mock amd gpu
* virt files
* cleaner
* init ci
* small fixes
* linter
* better?
* ugh
* linter
* fix
* disable some
* run shorter
* fixes
* add hcq test
* fix
* fix cmd revert
2024-05-14 14:28:04 +03:00
chenyu
5de4a46f10
re-enable gpt2 half/beam mac benchmark ( #4496 )
...
* re-enable gpt2 half/beam mac benchmark
from the fuzzer it seems to be flaky due to a numerical issue, not a kernel bug. we used to have half in the split reduce.
ran this on an M1 Max for 20 loops and it's fine
* that should be jitted
2024-05-09 19:15:32 -04:00
chenyu
c508eb7425
revert the removal of CAST_BEFORE_VIEW ( #4471 )
...
this brings most of the memory gain for resnet back.
2024-05-08 00:14:29 -04:00
qazal
760776c59d
merge EfficientNet to C with clang job ( #4426 )
...
* merge ImageNet to C with linters
* add to clang
* delete from linter
2024-05-05 20:33:12 +03:00
chenyu
d4062cb6fc
NV tensor_cores in kernel.py ( #4399 )
2024-05-02 22:33:08 -04:00
chenyu
dce7ac0160
NOCLANG=1 for tinybox green ci. ( #4378 )
...
CLANG was disabled for tinybox red for speed
2024-05-01 13:31:01 -04:00
wozeparrot
4a26718ca9
feat: tinyboxgreen ( #4365 )
2024-04-30 19:05:37 -04:00
chenyu
fdc8fabae5
disable flaky mac gpt2 beam benchmark and add back cifar mac with JIT=2 ( #4358 )
...
* debug flaky mac gpt2 beam run
* disable for now
2024-04-30 10:41:37 -04:00
Francis Lata
bb849a57d1
[MLPerf] UNet3D dataloader ( #4343 )
...
* add support for train/val datasets for kits19
* split dataset into train and val sets
* add tests for kits19 dataloader
* add MLPerf dataset tests to CI
* update unet3d model_eval script
* fix linting
* add nibabel
* fix how mock dataset gets created
* update ref implementation with permalink and no edits
* clean up test and update rand_flip implementation
* cleanups
2024-04-28 22:34:18 -04:00
chenyu
3ec4b745d6
JIT=2 for mac cifar benchmark ( #4300 )
...
also double BS for resnet training benchmark to match submission target
2024-04-25 18:33:40 -04:00