David Hou
7fb220a567
touchup resnet_layer_bench ( #4191 )
2024-04-16 14:43:00 -04:00
David Hou
1dbf3b2b19
Benchmarks for individual resnet layers ( #4182 )
* resnet individual layer benchmarks!
* small
* 1 and 2
* mem_used
* no ci
* better conv print
* defaults
* prints
* adjust
* adjust
* adjust
* benchmark only one layer example
* tensor.training, zero_grad, sum instead of mean, last mem, last kernel count
* default jitcnt=1
* scale flops/kernels with jitcnt
* add note about jitcnt memory
* touchup
2024-04-16 13:53:18 -04:00
George Hotz
50e780a588
multitensor shouldn't recompile ( #4164 )
* multitensor shouldn't recompile
* type annotations
* fix tests
* outcount in reduce
2024-04-13 00:03:48 -07:00
chenyu
a7c6864260
remove CAST_BEFORE_VIEW ( #4152 )
* remove CAST_BEFORE_VIEW
testing perf; also, this might have an issue with assign?
* remove all
2024-04-13 01:05:08 -04:00
George Hotz
ebc94c9d6c
rewrite the jit in the context of new schedule ( #4162 )
* rewrite the jit in the context of new schedule
* mypy better
* fix placeholder
* tests
* all functionality should work
* fix tests
* no CacheCollector
2024-04-12 21:54:36 -07:00
George Hotz
bbda20c0db
CompiledASTRunner -> CompiledRunner ( #4148 )
2024-04-11 08:49:52 -07:00
George Hotz
b7e281cf10
JitItem -> ExecItem ( #4146 )
* JitItem -> ExecItem
* execitem in realize
* cleaner
* JITRunner -> Runner
2024-04-11 08:24:57 -07:00
terafo
5e6d2155e4
Add driving monitoring model to benchmarks ( #4134 )
* add driving monitoring model to benchmarks
* handle crash
2024-04-10 14:27:03 -04:00
geohotstan
fe88591890
update onnx to 1.16.0 ( #4127 )
* update
* pass tests and skip tests
2024-04-10 11:19:13 -04:00
George Hotz
ae849d12d7
numpy device + pickle it ( #4120 )
2024-04-09 13:19:30 -07:00
George Hotz
164329a8ea
address kfd feedback ( #4087 )
* address kfd feedback
* signals cleanup
* signals cleanup
* handle 2 doorbell pages correctly
* signal reset cleanup
* signals cleanup
* more GTT
* cleanups
* minor cleanups
2024-04-05 15:24:41 -07:00
George Hotz
a337922c44
more work on kfd ( #4079 )
* more work on kfd
* fix multitensor test on kfd
* stuff
2024-04-05 08:36:36 -07:00
George Hotz
3de855ea50
don't use SVM memory in KFD ( #4072 )
* don't use SVM memory in KFD
* copy from fd
* cleanups
* transfer
* hacks
* ops_hsa
* tighter API
2024-04-04 17:33:21 -07:00
George Hotz
7181ffd630
HWCopyQueue in KFD ( #4042 )
* HWCopyQueue in KFD
* hw compute queue
* test
* move test
* more tests
* fix wait
* fix multimap
* mes crash
* tests pass but slow
* stuff is working
* one more test
2024-04-03 20:14:24 -07:00
chenyu
e3c0ac9fbf
remove old envvar "OPT" ( #4060 )
2024-04-03 14:55:21 -04:00
chenyu
406cb5fd90
const fold ReduceOps ( #4059 )
2024-04-03 14:39:28 -04:00
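The idea behind const folding a ReduceOp can be sketched as follows (a standalone toy model, not tinygrad's actual implementation; `fold_reduce` and its string op names are hypothetical): a reduce over an input known to be a broadcast constant can be computed at build time instead of emitting a kernel.

```python
# hypothetical sketch: fold a reduce over n copies of a known constant
def fold_reduce(op: str, const_val: float, n_elements: int):
  if op == "SUM": return const_val * n_elements  # sum of n copies of c is n*c
  if op == "MAX": return const_val               # max of identical values is the value
  return None                                    # not foldable in this sketch

assert fold_reduce("SUM", 2.0, 8) == 16.0
assert fold_reduce("MAX", 3.5, 8) == 3.5
```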
chenyu
c71627fee6
move GlobalCounter to helpers ( #4002 )
breaks the circular import between ops and buffer
2024-03-30 00:30:30 -04:00
George Hotz
f916aadaea
external that test
2024-03-29 19:35:50 -07:00
chenyu
d9ff636cf5
use is to compare with enum ( #3993 )
* use is to compare with enum
currently it's a mix of `==` and `is`; moved all to `is`
* more
2024-03-29 13:02:56 -04:00
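A minimal illustration of why `is` is safe for enum comparisons in Python (the `Ops` name here is illustrative): enum members are singletons, so identity comparison is valid, and unlike `==` it cannot be affected by a custom `__eq__`.

```python
from enum import Enum, auto

class Ops(Enum):  # illustrative enum, not tinygrad's
  ADD = auto()
  MUL = auto()

op = Ops.ADD
assert op is Ops.ADD       # members are singletons: identity holds
assert op is not Ops.MUL
assert op == Ops.ADD       # `==` also works, but `is` is the stricter check
```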
chenyu
b47f6cebb2
LinearizerOptions -> CompilerOptions ( #3978 )
2024-03-28 17:50:23 -04:00
George Hotz
42b9d999ea
Buffer isn't always allocated ( #3974 )
* buffer alloc
* allocate
* missing allocates
* last one
2024-03-28 13:33:47 -07:00
geohotstan
bd3a7d068c
correct device for validation test in model benchmark CI ( #3960 )
* fix tests
* add clang back for only metal
* change the name to reflect CLANG being ran
* add back cuda
2024-03-27 13:40:06 -04:00
George Hotz
68ca4d4276
split to schedule.py ( #3949 )
* split to schedule.py
* split
2024-03-26 21:02:46 -07:00
George Hotz
150ea2eb76
create engine folder and move code ( #3948 )
* retry
* older tf
* that
2024-03-26 20:38:03 -07:00
Francis Lam
5530b0cbed
fuzz_linearizer: reduce debug verbosity and make easier for CI usage ( #3942 )
* fuzz_linearizer: reduce debug verbosity and make easier for CI usage
* rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset)
* skip simple ASTs (easier to use with LOGOPS output)
* don't fuzz a previously seen AST
* add options to allow non-zero --expected-failures
* clean up naming and use set
2024-03-26 16:25:24 -04:00
nimlgen
e2d6f76723
_alloc and _free with options ( #3934 )
* _alloc has options
* linter
* fix hsa
2024-03-26 09:11:41 -07:00
wozeparrot
9a9cac58f9
add lars to nn ( #3750 )
* feat: add lars
* feat: don't remove this comment
* clean: smaller diff
* clean: shorter line
* feat: remove mlperf lars, switch resnet
* fix: fully remove mlperf lars
* clean: comment
* feat: contiguous
* feat: no weight decay on skip params
* feat: optimizergroup
* feat: classic momentum
* fix: pylint
* clean: move comment
* fix: correct algo
* feat: lrschedulergroup
* feat: skip list tests
* feat: :| forgot that params are a thing
* feat: remove skip_list params from main params
* feat: set moment
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-24 11:43:12 -04:00
chenyu
a2b2597fc2
replace dtype.name str with render_dtype ( #3903 )
fixed some bf16 cast issues, since bf16 does not have `.name`.
also more robust if there are language-specific type overrides
2024-03-23 19:25:48 -04:00
Francis Lam
8db7a6bbcc
debug: add optional detailed BEAM_LOG logging ( #3883 )
* debug: add optional detailed BEAM_LOG logging
show uop count, compile and run times for each candidate in search.
also add --timing to verify_kernel.py to make it easier to explore hand-crafted applied opts.
* fix linter
2024-03-22 19:23:31 -04:00
Francis Lam
5587594a00
fuzz_linearizer: add --ast and --file params to read kernels ( #3877 )
also fix up ast_str_to_str to support the new tuple of LazyOps
2024-03-22 14:27:40 -04:00
uuuvn
6729f20aab
Ring allreduce try 2 ( #3852 )
* Ring allreduce v3
* Configurable size, number of gpus and jit in benchmark
* ScheduleBarrier v0
* GB/s that make sense
* ScheduleBarrier v0.1
* Fallback on 2 GPUs
* ScheduleBarrier v0.2
* ScheduleBarrier v0.3
* ScheduleBarrier v0.3.1
* ScheduleBarrier v0.3.2
* Replace ScheduleBarrier with automatic optimization
* unused import
* fix comment
* typing
* better fallback
* python 3.8
* RING=2 and use ContextVar
* DEBUG >= 2 and change name
* linter
* type
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-21 19:17:51 -04:00
Francis Lam
3c0478bfab
fuzz_linearizer: add additional DEBUG info for comparison errors ( #3866 )
2024-03-21 18:58:10 -04:00
chenyu
e50b7abe4f
diversed buf inputs based on dtype in fuzz_linearizer ( #3863 )
2024-03-21 16:23:11 -04:00
chenyu
30fa03243e
reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures ( #3861 )
2024-03-21 14:12:27 -04:00
chenyu
6bf0b82267
alloc new output in fuzz_linearizer between baseline and real one ( #3859 )
if the kernel is an assign (`a += 1`), rawbufs[0] is updated twice and gives a false compare_error
2024-03-21 11:36:05 -04:00
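The failure mode described above can be reproduced with a toy in-place kernel (a plain-Python stand-in, not the real fuzz_linearizer harness): when the baseline and the candidate share one output buffer, the second run sees the first run's mutation, so two correct runs of an assign kernel still compare unequal.

```python
def assign_kernel(buf):
  # models an assign kernel `a += 1`: the output aliases the input
  for i in range(len(buf)): buf[i] += 1
  return buf

# WRONG: baseline and candidate share the same rawbufs[0]
shared = [0.0] * 4
baseline = list(assign_kernel(shared))
candidate = list(assign_kernel(shared))
assert baseline != candidate          # false compare error: [1,1,1,1] vs [2,2,2,2]

# FIX: allocate a fresh output buffer between the two runs
buf_a, buf_b = [0.0] * 4, [0.0] * 4
assert assign_kernel(buf_a) == assign_kernel(buf_b)
```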
nimlgen
85691c8e20
fix hsa sync issue ( #3847 )
* fix hsa sync issue
* linter
2024-03-21 04:00:30 +03:00
Francis Lam
6d5dec2fef
log optimized kernels and a script to compare with non-optimized ones ( #3829 )
* search: add BEAM_VERIFY option to validate search results
refactor fuzz_linearizer comparison to allow it to be used for BEAM_VERIFY in device.py
* search: fix to verify the beam_search result and not the fastest
* search: fix typing and clean up
* device: remove imports from test and add LOGKERN options
LOGKERN output can be used with test/external/verify_kernel.py
to validate correctness
* fix example in verify_kernel.py
* cleanup fixes
* fix to use f-strings
2024-03-20 19:22:08 -04:00
George Hotz
8cb5215885
Revert "Ring allreduce in multitensor ( #3000 )" ( #3840 )
This reverts commit c5bf9e4c96.
2024-03-20 11:41:49 -07:00
uuuvn
c5bf9e4c96
Ring allreduce in multitensor ( #3000 )
* Ring allreduce v3
* Configurable size, number of gpus and jit in benchmark
* ScheduleBarrier v0
* GB/s that make sense
* ScheduleBarrier v0.1
* Fallback on 2 GPUs
* ScheduleBarrier v0.2
* ScheduleBarrier v0.3
* ScheduleBarrier v0.3.1
* ScheduleBarrier v0.3.2
* Replace ScheduleBarrier with automatic optimization
* unused import
* fix comment
* typing
* better fallback
* python 3.8
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-20 11:20:01 -07:00
chenyu
20681d5c4a
remove old dist multigpu ( #3811 )
2024-03-18 18:31:05 -04:00
George Hotz
bf3e1c4df2
support pickling tensors and others ( #3787 )
* test pickle tensors
* pickle unrealized tensor
* pickle jit, don't save Device in every CompiledASTRunner
* real test of pickle, move delete
2024-03-17 18:29:14 -07:00
qazal
e3e89c244b
multioutput uoping infra ( #3706 )
* linearize multioutput
* add vars to copy
2024-03-15 21:56:59 -07:00
chenyu
a2d3cf64a5
move is_dtype_supported to test.helpers ( #3762 )
* move is_dtype_supported to test.helpers
updated all places that check if float16 is supported
* fix tests
2024-03-15 14:33:26 -04:00
nimlgen
ba79a3c09a
some hsa lines saving + fixes ( #3752 )
* fix write to ring + some lines
* hsa driver test
2024-03-15 18:12:18 +03:00
chenyu
0ead0bdb65
script to benchmark beam v hcopt ( #3737 )
the goal is that a big enough beam should be faster than hcopt/tc.
also, this failed on tc opt:
NUM=2 FILTER_REDUCE=1 TEST_N=20 BEAM=4 DEBUG=2 python test/external/speed_beam_v_hcopt.py
2024-03-14 15:04:03 -04:00
qazal
337cd53444
multioutput ScheduleItem ( #3699 )
* refactor realize.py
* update docs
* update test_sched
* update runners and devices
* update openpilot and unit tests
* cleanup runner lowering
* update more tests
2024-03-13 08:59:38 -07:00
nimlgen
08064a0e29
add SEED env to fuzz_linearizer ( #3713 )
* add SEED env to test/external/fuzz_linearizer.py
* found some
* more platforms
2024-03-13 18:08:42 +03:00
George Hotz
ac02e7347d
ptx timing vs cuda timing ( #3659 )
2024-03-08 10:17:49 -08:00
chenyu
e25879d50e
don't get new var_val for the same ast in fuzz_linearizer ( #3657 )
fixed result comparison for kernels with variables
2024-03-08 09:49:24 -05:00
chenyu
1130c73844
add FUZZ_NTH to fuzz_linearizer ( #3656 )
* add FUZZ_NTH to fuzz_linearizer
also update tests in test_linearizer_failures to not just run on METAL
* update failures for HIP/HSA
* test_failure_21 LLVM PADTO
2024-03-08 09:16:49 -05:00