chenyu
18e0cef14d
cheap: fewer lines in ptx ( #3890 )
...
enough to merge lars
2024-03-23 01:12:31 -04:00
George Hotz
f0c4e06ffd
fix cuda sync ( #3888 )
2024-03-22 19:02:30 -07:00
chenyu
2d3ce53348
touchup test_dtype.test_gradient_dtype ( #3887 )
...
add back bad merge from #3613 and add float.double and float.bfloat16 to test
2024-03-22 20:56:45 -04:00
David Hou
fc11808a79
initialize Tensor grad same type as self ( #3613 )
...
* initialize Tensor grad same type as self
* also test different default float
* check dtype + try/finally
* don't test_gradient_dtype if f16 is not supported
* fix bad merge
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-22 20:33:18 -04:00
Francis Lam
8db7a6bbcc
debug: add optional detailed BEAM_LOG logging ( #3883 )
...
* debug: add optional detailed BEAM_LOG logging
show uop count, compile and run times for each candidate in search
also add --timing to verify_kernel.py to make it easier to explore
hand-crafted applied opts
* fix linter
2024-03-22 19:23:31 -04:00
chenyu
f7f67e0cc5
simple fix llama shard with quantize ( #3882 )
...
copy scale on all devices for now. naive sharding does not work because the scale needs to expand to really save memory.
70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES.
`python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing --quantize`
13B on 6 GPUs uses 47 GB vs. 34 GB quantized
2024-03-22 18:15:37 -04:00
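The interesting bit in the commit above is "copy scale on all devices": the large int8 weight is sharded, while the small per-channel scale is replicated. A minimal numpy sketch of that idea (the function names below are made up for illustration and are not tinygrad's API):

```python
import numpy as np

def quantize_int8(w):
  # per-output-channel scale so int8 values span each row's range
  scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
  return np.round(w / scale).astype(np.int8), scale.astype(np.float32)

def shard_with_copied_scale(w_q, scale, n_devices):
  # shard the heavy int8 weight along the output dimension...
  shards = np.array_split(w_q, n_devices, axis=0)
  # ...but copy the tiny scale tensor to every "device" (cheap next to the weight)
  return [(s, scale.copy()) for s in shards]

def dequant_matmul(shard, scale_copy, x, row_offset):
  rows = shard.shape[0]
  return (shard.astype(np.float32) * scale_copy[row_offset:row_offset + rows]) @ x

w, x = np.random.randn(8, 4).astype(np.float32), np.random.randn(4).astype(np.float32)
w_q, scale = quantize_int8(w)
pieces = shard_with_copied_scale(w_q, scale, n_devices=2)
offsets = np.cumsum([0] + [p[0].shape[0] for p in pieces])[:-1]
y = np.concatenate([dequant_matmul(s, sc, x, o) for (s, sc), o in zip(pieces, offsets)])
assert np.allclose(y, w @ x, atol=0.5)   # matches up to quantization error
```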
chenyu
ee502c8055
fixup to_movement_ops and add back to CI ( #3881 )
2024-03-22 18:14:49 -04:00
nimlgen
16e31f7f0d
init multidevice cuda graph ( #3858 )
...
* init multidevice cuda graph
* cuda just works!
* clean
* linter happier
* linters happy
* update transfer inputs
* do not change free
* useless check for cuda
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 13:49:48 -07:00
George Hotz
0c197b9cf3
hotfix: hip bfloat formatting
2024-03-22 11:52:05 -07:00
George Hotz
54dc48aa47
fix assign ( #3878 )
...
* fix assign
* remove terrible optimizer hack
* oops, not realized assigns
2024-03-22 11:48:48 -07:00
Francis Lam
5587594a00
fuzz_linearizer: add --ast and --file params to read kernels ( #3877 )
...
also fix up ast_str_to_str to support the new tuple of LazyOps
2024-03-22 14:27:40 -04:00
chenyu
c5467e5bd6
diverse test value in test_dtype DATA based on dtype ( #3864 )
...
* diverse test value in test_dtype DATA based on dtype
* eh fix typo
* that too?
* PTX does not support i8 and s8
* skip that
* unused line
* put the hack back
* remove that
2024-03-22 14:22:06 -04:00
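A rough sketch of what "test value based on dtype" can look like (the helper and ranges below are invented for illustration, not the actual test_dtype code): unsigned ints get non-negative values, signed ints get values on both sides of zero within range, and floats get negatives and fractions.

```python
import numpy as np

def test_data_for(dtype, n=10, seed=0):
  rng = np.random.default_rng(seed)
  if np.issubdtype(dtype, np.unsignedinteger):
    hi = min(np.iinfo(dtype).max, 255)
    return rng.integers(0, hi + 1, size=n, dtype=dtype)
  if np.issubdtype(dtype, np.integer):
    info = np.iinfo(dtype)
    return rng.integers(max(info.min, -128), min(info.max, 127) + 1, size=n, dtype=dtype)
  # floats: negatives, fractions and zero all show up
  return rng.uniform(-10, 10, size=n).astype(dtype)

for dt in (np.uint8, np.int8, np.int32, np.float16, np.float32):
  print(np.dtype(dt).name, test_data_for(dt))
```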
George Hotz
86ee36e697
preschedule all ( #3875 )
2024-03-22 11:20:06 -07:00
Szymon Ożóg
d8c3f1894a
Use UOpGraph in test ( #3876 )
2024-03-22 14:12:38 -04:00
chenyu
1c51d586ea
replace raise Exception with specific errors ( #3874 )
2024-03-22 12:32:21 -04:00
nimlgen
8ef5490ec8
cuda transfer + async copyin ( #3873 )
2024-03-22 09:01:37 -07:00
Szymon Ożóg
624bc89910
PTX - implement float4, ptr arithmetic and other speed improvements ( #3775 )
...
* ptx float4 implementation
* remove from cache when trimming uops
* Gate for float4
* Linting fix
* disable test reasonable time for ptx
* import getenv
* Update uops.py
* linter
* Add div test for half
* upcast if op does not support operation
* fix offset
* Run only if dtype supported
* zero out registers when accessing by pred + cleanup
* Remove trailing whitespace
* revert
* spacing fix
* move cache clearing outside loop
* did this suddenly start working?
* unused import removed
* Remove cast
* Use pattern matching
* linting
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 08:54:02 -07:00
George Hotz
f4055439dc
don't include hip common ( #3851 )
...
* don't install hip common
* only that
* Revert "only that"
This reverts commit 85f22015d9.
* less
* needed
* sep comgr
* header file
* 6.0.2
* update hsa
* hsakmt
* Revert "hsakmt"
This reverts commit d3a118078e.
2024-03-22 08:50:50 -07:00
qazal
4a27ce6ec9
tiny version of amd_hip_bfloat16 ( #3868 )
...
* add src_dtype
* add maker
* add bfloat16
* simpler
2024-03-22 08:37:30 -07:00
chenyu
82ce60e172
use JIT_BATCH_SIZE=4 for GPT2 3090 benchmark ( #3870 )
...
smaller first batch saves about 0.05 ms per token; 1.75 ms/tok on a local 3090
2024-03-22 00:40:06 -04:00
qazal
fe6ceff15f
proposal: multioutput JIT spec ( #3856 )
...
* corealize JIT
* requirements
2024-03-21 21:28:30 -07:00
Francis Lam
a26090d404
search: change to use "spawn" and limit the number of tasks per child ( #3862 )
...
also clean up some examples to use __main__ and not initialize
resources outside of main
2024-03-21 21:23:36 -07:00
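A minimal sketch of the multiprocessing pattern the commit above describes (the worker function and numbers are stand-ins, not the actual search code): use the "spawn" start method, cap tasks per worker, and keep setup under the __main__ guard so spawned children don't re-run it.

```python
import multiprocessing as mp

def compile_and_time(candidate: int) -> float:
  # stand-in for compiling and benchmarking one kernel candidate
  return candidate * 0.001

if __name__ == "__main__":
  ctx = mp.get_context("spawn")                               # fresh interpreter per worker
  with ctx.Pool(processes=4, maxtasksperchild=16) as pool:    # recycle workers periodically
    times = pool.map(compile_and_time, range(64))
  print(min(times))
```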
chenyu
dca69df197
hot fix use DEBUG >= 3 for allreduce message ( #3869 )
2024-03-21 23:40:44 -04:00
uuuvn
6729f20aab
Ring allreduce try 2 ( #3852 )
...
* Ring allreduce v3
* Configurable size, number of gpus and jit in benchmark
* ScheduleBarrier v0
* GB/s that make sense
* ScheduleBarrier v0.1
* Fallback on 2 GPUs
* ScheduleBarrier v0.2
* ScheduleBarrier v0.3
* ScheduleBarrier v0.3.1
* ScheduleBarrier v0.3.2
* Replace ScheduleBarrier with automatic optimization
* unused import
* fix comment
* typing
* better fallback
* python 3.8
* RING=2 and use ContextVar
* DEBUG >= 2 and change name
* linter
* type
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-21 19:17:51 -04:00
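For reference, ring allreduce itself is a reduce-scatter followed by an allgather around a ring of devices. Below is a self-contained toy that simulates it over numpy chunks; it is a sketch of the algorithm only, not tinygrad's multitensor implementation.

```python
import numpy as np

def ring_allreduce(tensors):
  n = len(tensors)
  chunks = [np.array_split(t.copy(), n) for t in tensors]  # each "device" splits its copy into n chunks
  # reduce-scatter: after n-1 steps, device i holds the full sum of chunk (i+1) % n
  for step in range(n - 1):
    for i in range(n):
      c = (i - step) % n
      chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]
  # allgather: circulate the reduced chunks for another n-1 steps
  for step in range(n - 1):
    for i in range(n):
      c = (i + 1 - step) % n
      chunks[(i + 1) % n][c] = chunks[i][c]
  return [np.concatenate(ch) for ch in chunks]

devices = [np.arange(8, dtype=np.float32) + i for i in range(4)]
out = ring_allreduce(devices)
assert all(np.allclose(o, sum(devices)) for o in out)
print(out[0])
```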
Francis Lam
3c0478bfab
fuzz_linearizer: add additional DEBUG info for comparison errors ( #3866 )
2024-03-21 18:58:10 -04:00
chenyu
bc482729d0
lower hlb_cifar acc to 93.3 ( #3865 )
...
ran 30 runs and the lowest I see is 93.35; lowered to 93.3 for now.
maybe re-enable EMA later if it reduces variance
2024-03-21 17:58:53 -04:00
chenyu
e50b7abe4f
diversed buf inputs based on dtype in fuzz_linearizer ( #3863 )
2024-03-21 16:23:11 -04:00
chenyu
c40f78499f
reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures ( #3861 )
2024-03-21 14:23:37 -04:00
chenyu
30fa03243e
reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures ( #3861 )
2024-03-21 14:12:27 -04:00
chenyu
33dd99acf4
remove helper_add_store from test_linearizer_failures ( #3860 )
2024-03-21 12:53:31 -04:00
chenyu
6bf0b82267
alloc new output in fuzz_linearizer between baseline and real one ( #3859 )
...
if the kernel is an assign like `a += 1`, rawbufs[0] is updated twice and gives a false compare error
2024-03-21 11:36:05 -04:00
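A toy illustration of the bug being fixed, in plain numpy rather than the fuzzer itself: if the baseline run and the candidate run share the same output buffer, an assign kernel applies its update twice, so the comparison fails even when both runs were correct.

```python
import numpy as np

def assign_kernel(buf):
  buf += 1            # the kernel's output buffer is also its input

a = np.zeros(4)
assign_kernel(a)                      # baseline run: a == 1
baseline = a.copy()
assign_kernel(a)                      # second run reuses the same buffer: a == 2
print(np.array_equal(baseline, a))    # False -> spurious "compare error"

# the commit's fix: give the second run a fresh copy of the original output
b = np.zeros(4)
assign_kernel(b)
print(np.array_equal(baseline, b))    # True
```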
nimlgen
b78352b423
do not create structs every call in CUDAProgram ( #3855 )
...
* do not create structs in cuda
* fix graph
* linter
* do not exec twice
* fix graph
2024-03-21 17:51:40 +03:00
nimlgen
e5745c1a0d
fix nan on multigpus cuda ( #3854 )
2024-03-21 15:21:55 +03:00
Anurag Lamsal
4e0819e40b
fixing the benchmark not printing in handcode resnet50 opt example ( #3850 )
2024-03-21 00:55:31 -04:00
nimlgen
85691c8e20
fix hsa sync issue ( #3847 )
...
* fix hsa sync issue
* linter
2024-03-21 04:00:30 +03:00
chenyu
f271cd682b
use _resolve_dim in argmax ( #3846 )
...
also added a comment on the behavior when there are multiple maxima, and more tests
2024-03-20 20:17:30 -04:00
chenyu
5c4cf62d2c
fix View.pad arg type ( #3845 )
...
close #3779
2024-03-20 19:36:02 -04:00
Francis Lam
6d5dec2fef
log optimized kernels and a script to compare with non-optimized ones ( #3829 )
...
* search: add BEAM_VERIFY option to validate search results
refactor fuzz_linearizer comparison to allow it to be used for
BEAM_VERIFY in device.py (see the sketch after this entry)
* search: fix to verify the beam_search result and not the fastest
* search: fix typing and clean up
* device: remove imports from test and add LOGKERN options
LOGKERN output can be used with test/external/verify_kernel.py
to validate correctness
* fix example in verify_kernel.py
* cleanup fixes
* fix to use f-strings
2024-03-20 19:22:08 -04:00
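A rough sketch of what verifying a beam-search result amounts to (the helper below is invented for illustration; the real logic lives in the search and fuzz_linearizer code): run the unoptimized and the optimized kernel on the same inputs, and only trust the faster one if the outputs match.

```python
import numpy as np

def verify(run_baseline, run_optimized, inputs, atol=1e-4):
  expected = run_baseline(*inputs)
  got = run_optimized(*inputs)
  return np.allclose(expected, got, atol=atol)

# toy stand-ins for a kernel before and after applied opts
def baseline(x, y): return x @ y
def optimized(x, y): return (x.reshape(4, 2, 8) @ y).reshape(8, 8)  # same math, different shape handling

x, y = np.random.randn(8, 8), np.random.randn(8, 8)
print(verify(baseline, optimized, (x, y)))   # True only if the optimization preserved the result
```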
chenyu
9d1d08fbb0
show llama bandwidth with timing ( #3844 )
2024-03-20 17:19:15 -04:00
chenyu
7ff47e45a1
cifar TARGET_EVAL_ACC_PCT=93.5 ( #3843 )
2024-03-20 16:56:51 -04:00
qazal
92c5067439
conceptual small refactor ( #3842 )
2024-03-20 16:46:14 -04:00
chenyu
519336cfea
factor out partial in SumNode div int ( #3841 )
...
* factor out partial in SumNode div int
* div not rem
* space
2024-03-20 16:34:33 -04:00
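The simplification boils down to ordinary integer arithmetic: terms whose coefficients are divisible by the divisor can be pulled out of a floor division, leaving only the remainder part inside. A tiny numeric check of one such case (SumNode itself is a tinygrad symbolic class; this only demonstrates the arithmetic):

```python
# (4*x + 8*y + 3) // 4 == x + 2*y for any integers x, y, since 4*x + 8*y is
# divisible by 4 and the leftover constant 3 stays within [0, 4)
for x in range(-5, 6):
  for y in range(-5, 6):
    assert (4*x + 8*y + 3) // 4 == x + 2*y
print("ok")
```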
George Hotz
8cb5215885
Revert "Ring allreduce in multitensor ( #3000 )" ( #3840 )
...
This reverts commit c5bf9e4c96.
2024-03-20 11:41:49 -07:00
uuuvn
c5bf9e4c96
Ring allreduce in multitensor ( #3000 )
...
* Ring allreduce v3
* Configurable size, number of gpus and jit in benchmark
* ScheduleBarrier v0
* GB/s that make sense
* ScheduleBarrier v0.1
* Fallback on 2 GPUs
* ScheduleBarrier v0.2
* ScheduleBarrier v0.3
* ScheduleBarrier v0.3.1
* ScheduleBarrier v0.3.2
* Replace ScheduleBarrier with automatic optimization
* unused import
* fix comment
* typing
* better fallback
* python 3.8
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-20 11:20:01 -07:00
chenyu
455f7bea9b
test example from half resnet where idx has a value outside of int32 ( #3838 )
...
* test example from half resnet where idx has a value outside of int32
* ruff
2024-03-20 13:44:20 -04:00
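For context on why an index can fall outside int32: with a big enough buffer, the flattened element index exceeds 2**31 - 1 and wraps if the index math stays in 32 bits. The shapes below are hypothetical, not the actual resnet kernel.

```python
import numpy as np

batch, channels, h, w = 256, 2048, 72, 72        # hypothetical large activation
last_idx = batch * channels * h * w - 1          # largest flat index into the buffer
print(last_idx > 2**31 - 1)                      # True: does not fit in int32

idx64 = np.int64(last_idx)
print(idx64, "->", idx64.astype(np.int32))       # the 32-bit cast wraps to a negative offset
```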
chenyu
727de5ba1e
llama 7B on 3090 benchmark ( #3837 )
...
* llama 7B on 3090 benchmark
* symlink llama
2024-03-20 12:48:22 -04:00
qazal
9452994201
add a better error message for resnet training ( #3836 )
...
* add a better error message
* assert
* use FileNotFoundError
2024-03-20 09:22:15 -07:00
chenyu
47b9cc2dfe
use float32 for rand buffer in test_beam_search and test in metal ( #3831 )
2024-03-19 23:22:58 -04:00
chenyu
d17900bc45
use int32 instead of default_int in simplify_phi_loops ( #3828 )
...
* use int32 instead of default_int in simplify_phi_loops
indices are in int32 now and are separated from the buffer dtype. fix #3823
* return early if not supported
* it's not that
* why is it failing for RHIP
2024-03-19 17:49:58 -04:00
nimlgen
2d54e4d747
clean up hsa driver ( #3818 )
...
* clean up driver
* remove returns
2024-03-20 00:17:41 +03:00