George Hotz
164329a8ea
address kfd feedback ( #4087 )
* address kfd feedback
* signals cleanup
* signals cleanup
* handle 2 doorbell pages correctly
* signal reset cleanup
* signals cleanup
* more GTT
* cleanups
* minor cleanups
2024-04-05 15:24:41 -07:00
George Hotz
a337922c44
more work on kfd ( #4079 )
* more work on kfd
* fix multitensor test on kfd
* stuff
2024-04-05 08:36:36 -07:00
George Hotz
3de855ea50
don't use SVM memory in KFD ( #4072 )
* don't use SVM memory in KFD
* copy from fd
* cleanups
* transfer
* hacks
* ops_hsa
* tighter API
2024-04-04 17:33:21 -07:00
George Hotz
7181ffd630
HWCopyQueue in KFD ( #4042 )
* HWCopyQueue in KFD
* hw compute queue
* test
* move test
* more tests
* fix wait
* fix multimap
* mes crash
* tests pass but slow
* stuff is working
* one more test
2024-04-03 20:14:24 -07:00
chenyu
e3c0ac9fbf
remove old envvar "OPT" ( #4060 )
2024-04-03 14:55:21 -04:00
chenyu
406cb5fd90
const fold ReduceOps ( #4059 )
2024-04-03 14:39:28 -04:00
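For context on const-folding ReduceOps above: when a reduce's input is a broadcast constant, the result is known at schedule time and no kernel needs to launch. A minimal sketch of the folding rule, with illustrative op names rather than tinygrad's actual enums:

```python
import math

# Hypothetical sketch: a reduce over a tensor that is a broadcast
# constant c of `shape` can be folded to a single constant.
def const_fold_reduce(op: str, c: float, shape: tuple) -> float:
    n = math.prod(shape)          # number of elements being reduced
    if op == "SUM": return c * n  # summing n copies of c
    if op == "MAX": return c      # max of identical values is the value
    raise NotImplementedError(op)
```

e.g. summing a (3, 4) tensor filled with 2.0 folds to 24.0 without any device work.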
chenyu
c71627fee6
move GlobalCounter to helpers ( #4002 )
break circular import between ops and buffer
2024-03-30 00:30:30 -04:00
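The fix pattern behind the entry above: move shared state into a dependency-free leaf module (helpers) so that both ops and buffer can import it without importing each other. A sketch of the shape, with attribute names illustrative rather than copied from tinygrad:

```python
# helpers-style leaf module: holds shared counters, imports nothing from
# ops or buffer, so both can depend on it without a cycle.
class GlobalCounters:
    global_ops: int = 0   # ops executed so far (illustrative fields)
    global_mem: int = 0   # bytes moved so far
    @classmethod
    def reset(cls):
        cls.global_ops, cls.global_mem = 0, 0
```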
George Hotz
f916aadaea
external that test
2024-03-29 19:35:50 -07:00
chenyu
d9ff636cf5
use is to compare with enum ( #3993 )
* use is to compare with enum
currently it's a mix of `==` and `is`; moved all to `is`
* more
2024-03-29 13:02:56 -04:00
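The commit above standardizes enum comparison on `is`. Since Python Enum members are singletons, identity comparison is safe and slightly stricter than `==` (it can never be fooled by a custom `__eq__`). A quick illustration:

```python
from enum import Enum, auto

class Ops(Enum):  # illustrative enum, not tinygrad's
    ADD = auto()
    MUL = auto()

op = Ops.ADD
assert op is Ops.ADD       # identity holds: members are singletons
assert op == Ops.ADD       # equality also works
assert op is not Ops.MUL
```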
chenyu
b47f6cebb2
LinearizerOptions -> CompilerOptions ( #3978 )
2024-03-28 17:50:23 -04:00
George Hotz
42b9d999ea
Buffer isn't always allocated ( #3974 )
* buffer alloc
* allocate
* missing allocates
* last one
2024-03-28 13:33:47 -07:00
geohotstan
bd3a7d068c
correct device for validation test in model benchmark CI ( #3960 )
* fix tests
* add clang back for only metal
* change the name to reflect CLANG being run
* add back cuda
2024-03-27 13:40:06 -04:00
George Hotz
68ca4d4276
split to schedule.py ( #3949 )
* split to schedule.py
* split
2024-03-26 21:02:46 -07:00
George Hotz
150ea2eb76
create engine folder and move code ( #3948 )
* retry
* older tf
* that
2024-03-26 20:38:03 -07:00
Francis Lam
5530b0cbed
fuzz_linearizer: reduce debug verbosity and make easier for CI usage ( #3942 )
* fuzz_linearizer: reduce debug verbosity and make easier for CI usage
* rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset)
* skip simple ASTs (easier to use with LOGOPS output)
* don't fuzz a previously seen AST
* add options to allow non-zero --expected-failures
* clean up naming and use set
2024-03-26 16:25:24 -04:00
nimlgen
e2d6f76723
_alloc and _free with options ( #3934 )
* _alloc has options
* linter
* fix hsa
2024-03-26 09:11:41 -07:00
wozeparrot
9a9cac58f9
add lars to nn ( #3750 )
* feat: add lars
* feat: don't remove this comment
* clean: smaller diff
* clean: shorter line
* feat: remove mlperf lars, switch resnet
* fix: fully remove mlperf lars
* clean: comment
* feat: contiguous
* feat: no weight decay on skip params
* feat: optimizergroup
* feat: classic momentum
* fix: pylint
* clean: move comment
* fix: correct algo
* feat: lrschedulergroup
* feat: skip list tests
* feat: :| forgot that params are a thing
* feat: remove skip_list params from main params
* feat: set moment
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-24 11:43:12 -04:00
chenyu
a2b2597fc2
replace dtype.name str with render_dtype ( #3903 )
fixed some bf16 cast issues since bf16 does not have `.name`.
also more robust if there are language-specific type overrides
2024-03-23 19:25:48 -04:00
Francis Lam
8db7a6bbcc
debug: add optional detailed BEAM_LOG logging ( #3883 )
* debug: add optional detailed BEAM_LOG logging
show uop count, compile and run times for each candidate in search
also add --timing to verify_kernel.py to make it easier to explore
hand-crafted applied opts
* fix linter
2024-03-22 19:23:31 -04:00
Francis Lam
5587594a00
fuzz_linearizer: add --ast and --file params to read kernels ( #3877 )
also fix up ast_str_to_str to support the new tuple of LazyOps
2024-03-22 14:27:40 -04:00
uuuvn
6729f20aab
Ring allreduce try 2 ( #3852 )
* Ring allreduce v3
* Configurable size, number of gpus and jit in benchmark
* ScheduleBarrier v0
* GB/s that make sense
* ScheduleBarrier v0.1
* Fallback on 2 GPUs
* ScheduleBarrier v0.2
* ScheduleBarrier v0.3
* ScheduleBarrier v0.3.1
* ScheduleBarrier v0.3.2
* Replace ScheduleBarrier with automatic optimization
* unused import
* fix comment
* typing
* better fallback
* python 3.8
* RING=2 and use ContextVar
* DEBUG >= 2 and change name
* linter
* type
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-21 19:17:51 -04:00
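For context on the ring allreduce above: each tensor is split into one chunk per GPU; a reduce-scatter pass accumulates each chunk as it travels around the ring, then an allgather pass circulates the finished chunks, so each link carries only 2*(N-1)/N of the data. A toy single-process simulation of the algorithm (not tinygrad code):

```python
def ring_allreduce(data):
    # data: one equal-length list of numbers per simulated GPU
    n, length = len(data), len(data[0])
    assert length % n == 0, "toy version: length must divide evenly"
    cs = length // n
    chunks = [[list(d[i*cs:(i+1)*cs]) for i in range(n)] for d in data]
    for step in range(n - 1):            # reduce-scatter phase
        for r in range(n):
            c = (r - step) % n           # chunk node r passes on this step
            dst = (r + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], chunks[r][c])]
    # now node r holds the fully reduced chunk (r + 1) % n
    for step in range(n - 1):            # allgather phase
        for r in range(n):
            c = (r + 1 - step) % n       # finished chunk circulates onward
            chunks[(r + 1) % n][c] = list(chunks[r][c])
    return [[x for ch in node for x in ch] for node in chunks]
```

After the call, every simulated GPU holds the elementwise sum of all inputs.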
Francis Lam
3c0478bfab
fuzz_linearizer: add additional DEBUG info for comparison errors ( #3866 )
2024-03-21 18:58:10 -04:00
chenyu
e50b7abe4f
diversed buf inputs based on dtype in fuzz_linearizer ( #3863 )
2024-03-21 16:23:11 -04:00
chenyu
30fa03243e
reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures ( #3861 )
2024-03-21 14:12:27 -04:00
chenyu
6bf0b82267
alloc new output in fuzz_linearizer between baseline and real one ( #3859 )
if the kernel is an assign `a += 1`, rawbufs[0] is updated twice, which gives a false compare_error
2024-03-21 11:36:05 -04:00
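The note above is the whole bug: running an assign kernel twice against one shared output buffer double-applies the update, so baseline and candidate disagree even when both are correct. A toy illustration (stand-in code, not fuzz_linearizer itself):

```python
# stands in for an `a += 1` assign kernel writing to rawbufs[0]
def run_kernel(bufs): bufs[0] = [x + 1 for x in bufs[0]]

rawbufs = [[0, 0, 0]]
run_kernel(rawbufs); baseline = list(rawbufs[0])   # baseline run: [1, 1, 1]
run_kernel(rawbufs)                                # second run bumps it again
assert rawbufs[0] != baseline                      # false compare_error

# fix: give the second run its own freshly allocated output
rawbufs = [[0, 0, 0]]
run_kernel(rawbufs)
assert rawbufs[0] == baseline                      # now they agree
```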
nimlgen
85691c8e20
fix hsa sync issue ( #3847 )
* fix hsa sync issue
* linter
2024-03-21 04:00:30 +03:00
Francis Lam
6d5dec2fef
log optimized kernels and a script to compare with non-optimized ones ( #3829 )
* search: add BEAM_VERIFY option to validate search results
refactor fuzz_linearizer comparison to allow it to be used for
BEAM_VERIFY in device.py
* search: fix to verify the beam_search result and not the fastest
* search: fix typing and clean up
* device: remove imports from test and add LOGKERN options
LOGKERN output can be used with test/external/verify_kernel.py
to validate correctness
* fix example in verify_kernel.py
* cleanup fixes
* fix to use f-strings
2024-03-20 19:22:08 -04:00
George Hotz
8cb5215885
Revert "Ring allreduce in multitensor ( #3000 )" ( #3840 )
This reverts commit c5bf9e4c96.
2024-03-20 11:41:49 -07:00
uuuvn
c5bf9e4c96
Ring allreduce in multitensor ( #3000 )
* Ring allreduce v3
* Configurable size, number of gpus and jit in benchmark
* ScheduleBarrier v0
* GB/s that make sense
* ScheduleBarrier v0.1
* Fallback on 2 GPUs
* ScheduleBarrier v0.2
* ScheduleBarrier v0.3
* ScheduleBarrier v0.3.1
* ScheduleBarrier v0.3.2
* Replace ScheduleBarrier with automatic optimization
* unused import
* fix comment
* typing
* better fallback
* python 3.8
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-20 11:20:01 -07:00
chenyu
20681d5c4a
remove old dist multigpu ( #3811 )
2024-03-18 18:31:05 -04:00
George Hotz
bf3e1c4df2
support pickling tensors and others ( #3787 )
* test pickle tensors
* pickle unrealized tensor
* pickle jit, don't save Device in every CompiledASTRunner
* real test of pickle, move delete
2024-03-17 18:29:14 -07:00
qazal
e3e89c244b
multioutput uoping infra ( #3706 )
* linearize multioutput
* add vars to copy
2024-03-15 21:56:59 -07:00
chenyu
a2d3cf64a5
move is_dtype_supported to test.helpers ( #3762 )
* move is_dtype_supported to test.helpers
updated all places that check if float16 is supported
* fix tests
2024-03-15 14:33:26 -04:00
nimlgen
ba79a3c09a
some hsa lines saving + fixes ( #3752 )
* fix write to ring + some lines
* hsa driver test
2024-03-15 18:12:18 +03:00
chenyu
0ead0bdb65
script to benchmark beam v hcopt ( #3737 )
the goal is that a big enough beam should be faster than hcopt/tc
also this failed on tc opt
NUM=2 FILTER_REDUCE=1 TEST_N=20 BEAM=4 DEBUG=2 python test/external/speed_beam_v_hcopt.py
2024-03-14 15:04:03 -04:00
qazal
337cd53444
multioutput ScheduleItem ( #3699 )
* refactor realize.py
* update docs
* update test_sched
* update runners and devices
* update openpilot and unit tests
* cleanup runner lowering
* update more tests
2024-03-13 08:59:38 -07:00
nimlgen
08064a0e29
add SEED env to fuzz_linearizer ( #3713 )
* add SEED env to test/external/fuzz_linearizer.py
* found some
* more platforms
2024-03-13 18:08:42 +03:00
George Hotz
ac02e7347d
ptx timing vs cuda timing ( #3659 )
2024-03-08 10:17:49 -08:00
chenyu
e25879d50e
don't get new var_val for the same ast in fuzz_linearizer ( #3657 )
fixed result comparison for kernels with variables
2024-03-08 09:49:24 -05:00
chenyu
1130c73844
add FUZZ_NTH to fuzz_linearizer ( #3656 )
* add FUZZ_NTH to fuzz_linearizer
also update tests in test_linearizer_failures to not just run on METAL
* update failures for HIP/HSA
* test_failure_21 LLVM PADTO
2024-03-08 09:16:49 -05:00
David Hou
9f66dcf718
PolynomialDecayWithWarmup + tests ( #3649 )
* working PolynomialDecayWithWarmup + tests.......
add lars_util.py, oops
* keep lars_util.py as intact as possible, simplify our interface
* whitespace
* clean up
* clean up
* asserts
* test polylr for full resnet training run
* add comment
* rename
* fix do_optim
* don't cast lr
* info
* calculate from train_files
* skip it
2024-03-07 18:53:36 -05:00
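For context on PolynomialDecayWithWarmup above: the schedule is a linear warmup to the base LR followed by polynomial decay toward an end LR, the form used for MLPerf ResNet training. A hedged formula sketch (constants and edge-case handling are illustrative, not tinygrad's exact code):

```python
def poly_lr(step, base_lr, end_lr, warmup, total, power=2.0):
    if step < warmup:                           # linear warmup from 0 to base_lr
        return base_lr * (step + 1) / warmup
    frac = (step - warmup) / (total - warmup)   # decay progress in [0, 1]
    return (base_lr - end_lr) * (1 - frac) ** power + end_lr
```

e.g. with warmup=10 and total=110, the LR reaches base_lr at step 9 and decays quadratically to end_lr by step 110.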
chenyu
57df8e8d82
update fuzz_linearizer ( #3648 )
included non-reduce kernels and kernels with variables. green msg when everything passes
it's possible that creating rawbufs fails due to a memory error; included that in the failure cases
2024-03-07 18:41:22 -05:00
chenyu
906cc3a69b
cleanup tests Device[Device.DEFAULT] is always Compiled ( #3645 )
2024-03-07 11:15:42 -05:00
qazal
bdd62c7fd8
make the bf16 include dynamic ( #3642 )
* dynamic prefix
* add common ones above
these are common dtypes
aesthetics
* regression test
fuzz it
test
* run in CI
* use .append
* faster
2024-03-07 10:31:35 -05:00
David Hou
0afaf70d57
lars optimizer + tests ( #3631 )
* lars optimizer + tests
* fix skip list!
* use id to compare in skip list
* go back to using set
* Tensor(bool) * Tensor(bool) is and
* don't lint external/mlperf_resnet
* whitespace
* add external_test_optim to opencl tests
* give mlperf task a name
* mlperf under onnx
* remove track_gnorm
* contiguous instead of realize
* assert momentum and weight decay positive
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-06 18:11:01 -05:00
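For context on the LARS optimizer above: LARS scales each layer's step by a trust ratio trust * ||w|| / (||g|| + wd * ||w||), then applies classic momentum to the weight-decayed gradient. A plain-Python sketch of one step for a single parameter tensor (illustrative, not tinygrad's implementation; defaults are assumptions):

```python
import math

def lars_step(w, g, v, lr, momentum=0.9, wd=1e-4, trust=0.001):
    wn = math.sqrt(sum(x * x for x in w))        # ||w||
    gn = math.sqrt(sum(x * x for x in g))        # ||g||
    # layer-wise trust ratio; fall back to 1.0 when a norm is zero
    local = trust * wn / (gn + wd * wn) if wn > 0 and gn > 0 else 1.0
    new_w, new_v = [], []
    for wi, gi, vi in zip(w, g, v):
        upd = gi + wd * wi                       # gradient plus weight decay
        vi = momentum * vi + lr * local * upd    # classic momentum buffer
        new_v.append(vi); new_w.append(wi - vi)
    return new_w, new_v
```

The per-layer scaling is what lets very large batch sizes train stably; the "skip params" bullets above refer to exempting some parameters (e.g. biases, norms) from the trust-ratio scaling and weight decay.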
George Hotz
8500265561
this mem fault still happening ( #3620 )
* this mem fault still happening
* smaller
* that print doesn't work
* overflows test
* hip doesn't uses_ptr_arithmetic
* only with locals
* test overflow new name
* it's not ptr arith
* simpler
* simple repro
* old compiler
* simpler
* put that back
2024-03-05 10:39:32 -08:00
George Hotz
f500be1313
out of bounds access caused by launch bounds ( #3615 )
* lin overflow
* remove launch bounds
* remove launch bounds infra
* oops, fix bufs type
2024-03-05 06:34:00 -08:00
Francis Lam
162dfb07d9
fuzz_linearizer: fix uops and add to test.yml ( #3588 )
2024-03-02 15:03:42 -08:00
George Hotz
83530a585f
add quick external data select test
2024-03-02 05:38:32 -08:00
chenyu
d89e3c4e08
enable METAL tests now runner is M1 and no fast-math ( #3523 )
2024-02-28 14:14:23 -05:00