Commit Graph

10633 Commits

Author SHA1 Message Date
Szymon Ożóg
2d0bfdf01c ptx cleanup (#3893) 2024-03-24 14:54:45 -07:00
chenyu
2e39f57594 move lines around in ops_python wmma (#3911) 2024-03-24 17:14:26 -04:00
Patrick Tsai
e27129a798 Fix linearizer failure 26 test (#3906)
* Adjust adds between WHERE and PHI

* Not much better

* undo recursive change

* hm

* iterate over where, not factored op

* oo

* consts only for loop

* UNdo var name change

* update

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-24 16:34:13 -04:00
chenyu
10673d1447 tiny search cleanup (#3910)
* tiny search cleanup

removed some `assert isinstance(dev, Compiled)` checks and a few lines

* remove import
2024-03-24 14:20:55 -04:00
wozeparrot
9a9cac58f9 add lars to nn (#3750)
* feat: add lars

* feat: don't remove this comment

* clean: smaller diff

* clean: shorter line

* feat: remove mlperf lars, switch resnet

* fix: fully remove mlperf lars

* clean: comment

* feat: contiguous

* feat: no weight decay on skip params

* feat: optimizergroup

* feat: classic momentum

* fix: pylint

* clean: move comment

* fix: correct algo

* feat: lrschedulergroup

* feat: skip list tests

* feat: :| forgot that params are a thing

* feat: remove skip_list params from main params

* feat: set moment

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-24 11:43:12 -04:00
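For reference, a minimal sketch of the LARS update rule this PR adds (the generic algorithm with classic momentum and a weight-decay skip list; not the tinygrad `nn.optim` code, and all names below are illustrative):

```python
import numpy as np

def lars_step(w, g, v, lr=1e-3, momentum=0.9, wd=1e-4, trust_coeff=1e-3, skip=False):
  # params on the skip list get no weight decay and no trust-ratio scaling
  w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
  ratio = 1.0 if skip or w_norm == 0 or g_norm == 0 else trust_coeff * w_norm / (g_norm + wd * w_norm)
  if not skip: g = g + wd * w            # weight decay folded into the gradient (none for skip-list params)
  v = momentum * v + lr * ratio * g      # classic (non-Nesterov) momentum
  return w - v, v

w, v = np.ones(10), np.zeros(10)
w, v = lars_step(w, np.full(10, 0.5), v)
```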
chenyu
8c8b57fd5f cleanup ops python (#3908)
I just want to merge lars!
2024-03-24 11:36:31 -04:00
chenyu
2c69888654 include negative float in test_dtype (#3884)
* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow
2024-03-24 02:39:15 -04:00
chenyu
e22d78b3d2 training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA

Memory usage is between float and half due to numpy calls in dataset preprocessing, which convert to float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda
2024-03-24 01:37:47 -04:00
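For context, the usual float32 <-> bfloat16 trick behind "simpler bf16 functions" looks roughly like this (an assumed numpy sketch, not the actual kernel code; plain truncation, no rounding):

```python
import numpy as np

def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
  # bfloat16 keeps float32's exponent, so the cast is just dropping the low 16 mantissa bits
  return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(b: np.ndarray) -> np.ndarray:
  # restore by shifting the stored bits back into the high half of a float32
  return (b.astype(np.uint32) << 16).view(np.float32)

print(bf16_bits_to_f32(f32_to_bf16_bits(np.array([3.14159], dtype=np.float32))))  # ~3.140625
```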
Francis Lam
0145366323 wmma: fix the AMD TC threads to split the first 16 threads (#3904)
Previously it was incorrectly aliasing 16 into the size-8 upcast
on the store alias. Now it splits it properly into 8 and the
remaining 2 into the correct local stride.
2024-03-23 21:17:42 -04:00
sekstini
7c3632fd1e add --minimal flag to nvrtc (#3899) 2024-03-23 16:38:31 -07:00
chenyu
a2b2597fc2 replace dtype.name str with render_dtype (#3903)
Fixed some bf16 cast issues since bf16 does not have `.name`.
Also more robust if there are language-specific type overrides.
2024-03-23 19:25:48 -04:00
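The idea behind a render_dtype-style helper, as a hypothetical sketch (the function name, keys, and type table here are illustrative, not tinygrad's): the renderer owns a dtype-to-source-name mapping with optional language-specific overrides, instead of trusting `dtype.name`.

```python
# illustrative mapping from dtype keys to rendered source-language type names
TYPE_MAP = {"float": "float", "half": "half", "bfloat16": "__bf16", "int": "int", "uchar": "unsigned char"}

def render_dtype(dtype_key: str, overrides=None) -> str:
  # overrides let a specific backend swap in its own spelling of a type
  table = {**TYPE_MAP, **(overrides or {})}
  return table[dtype_key]

print(render_dtype("bfloat16"))                  # __bf16
print(render_dtype("half", {"half": "__fp16"}))  # language-specific override
```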
chenyu
24d004a89b hotfix check ckpts before writing achieved model (#3901)
This killed the tinybox green run.
2024-03-23 17:16:38 -04:00
chenyu
4d566f12b1 touchup einsum (#3900)
don't need rhs_letters
2024-03-23 16:46:39 -04:00
Alejandro F Queiruga
556dcfb8f2 Fix the result permutation in einsum (#3895)
* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-23 15:48:19 -04:00
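A toy illustration of the bug class fixed here (not the tinygrad implementation): after contraction the surviving axes come out in (lhs-remaining, rhs-remaining) order, so the result still has to be permuted into the order the output subscript requests.

```python
import numpy as np

def tiny_einsum_2in(formula: str, a: np.ndarray, b: np.ndarray) -> np.ndarray:
  ins, out = formula.replace(" ", "").split("->")
  a_sub, b_sub = ins.split(",")
  # contract over the shared letters that do not appear in the output
  shared = [c for c in a_sub if c in b_sub and c not in out]
  res = np.tensordot(a, b, axes=([a_sub.index(c) for c in shared], [b_sub.index(c) for c in shared]))
  # tensordot leaves axes in (a-remaining, b-remaining) order; permute into the requested output order
  current = [c for c in a_sub if c not in shared] + [c for c in b_sub if c not in shared]
  return res.transpose([current.index(c) for c in out])

a, b = np.random.rand(2, 3), np.random.rand(3, 4)
assert np.allclose(tiny_einsum_2in("ij,jk->ki", a, b), np.einsum("ij,jk->ki", a, b))
```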
nimlgen
4e18dd78d3 faster program start in llvm (#3897) 2024-03-23 15:20:15 +03:00
George Hotz
46a3501cec nv ioctl sniffer (#3892)
* nv ioctl sniffer

* unused import

* Update __init__.py

* that work

* that fix it
2024-03-23 00:29:30 -07:00
chenyu
18e0cef14d cheap less lines in ptx (#3890)
enough to merge lars
2024-03-23 01:12:31 -04:00
George Hotz
f0c4e06ffd fix cuda sync (#3888) 2024-03-22 19:02:30 -07:00
chenyu
2d3ce53348 touchup test_dtype.test_gradient_dtype (#3887)
add back bad merge from #3613 and add float.double and float.bfloat16 to test
2024-03-22 20:56:45 -04:00
David Hou
fc11808a79 initialize Tensor grad same type as self (#3613)
* initialize Tensor grad same type as self

* also test different default float

* check dtype + try/finally

* don't test_gradient_dtype if f16 is not supported

* fix bad merge

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-22 20:33:18 -04:00
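In usage terms, a small sketch of what the change means (assuming the top-level tinygrad imports and a backend with float16 support): the gradient created during backward carries the tensor's own dtype rather than the default float.

```python
from tinygrad import Tensor, dtypes

x = Tensor([1.0, 2.0, 3.0], dtype=dtypes.float16, requires_grad=True)
x.sum().backward()
print(x.grad.dtype)  # expected: dtypes.float16 after this change (previously the default float32)
```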
Francis Lam
8db7a6bbcc debug: add optional detailed BEAM_LOG logging (#3883)
* debug: add optional detailed BEAM_LOG logging

show uop count, compile and run times for each candidate in search

also add --timing to verify_kernel.py to make it easier to explore
hand-crafted applied opts

* fix linter
2024-03-22 19:23:31 -04:00
chenyu
f7f67e0cc5 simple fix llama shard with quantize (#3882)
Copy scale on all devices for now. Naive sharding does not work because scale needs expand to really save memory.

70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES.

`python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing --quantize`

13B on 6 GPUs uses 47 GB vs. 34 GB quantized
2024-03-22 18:15:37 -04:00
chenyu
ee502c8055 fixup to_movement_ops and add back to CI (#3881) 2024-03-22 18:14:49 -04:00
nimlgen
16e31f7f0d init multidevice cuda graph (#3858)
* init multidevice cuda graph

* cuda just works!

* clean

* linter happier

* liners happy

* update transfer inputs

* do not change free

* useless check for cuda

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 13:49:48 -07:00
George Hotz
0c197b9cf3 hotfix: hip bfloat formatting 2024-03-22 11:52:05 -07:00
George Hotz
54dc48aa47 fix assign (#3878)
* fix assign

* remove terrible optimizer hack

* oops, not realized assigns
2024-03-22 11:48:48 -07:00
Francis Lam
5587594a00 fuzz_linearizer: add --ast and --file params to read kernels (#3877)
also fix up ast_str_to_str to support the new tuple of LazyOps
2024-03-22 14:27:40 -04:00
chenyu
c5467e5bd6 diverse test value in test_dtype DATA based on dtype (#3864)
* diverse test value in test_dtype DATA based on dtype

* eh fix typo

* that too?

* PTX does not support i8 and s8

* skip that

* unused line

* pus the hack back

* remove that
2024-03-22 14:22:06 -04:00
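A hypothetical sketch of the data-generation idea (illustrative numpy code, not the test_dtype change itself): pick test values from ranges that make sense for the dtype under test, so signed types get negatives, unsigned types stay non-negative, and integers stay inside a packable range.

```python
import numpy as np

def test_data_for(np_dtype, n=16, seed=0):
  rng = np.random.default_rng(seed)
  if np.issubdtype(np_dtype, np.floating):
    return rng.uniform(-10.0, 10.0, n).astype(np_dtype)   # include negative floats
  info = np.iinfo(np_dtype)
  lo = max(info.min, -100) if info.min < 0 else 0          # negatives only for signed ints
  return rng.integers(lo, min(info.max, 100) + 1, n).astype(np_dtype)

print(test_data_for(np.uint8)[:4], test_data_for(np.float32)[:4])
```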
George Hotz
86ee36e697 preschedule all (#3875) 2024-03-22 11:20:06 -07:00
Szymon Ożóg
d8c3f1894a Use UOpGraph in test (#3876) 2024-03-22 14:12:38 -04:00
chenyu
1c51d586ea replace raise Exception with specific errors (#3874) 2024-03-22 12:32:21 -04:00
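The pattern being applied, in miniature (purely illustrative names, not the actual diff): raise the most specific built-in exception rather than the bare Exception base class.

```python
def load_kernel(name: str, registry: dict):
  # before: raise Exception(f"kernel {name} not found")
  if name not in registry: raise KeyError(f"kernel {name} not found")
  if not callable(registry[name]): raise TypeError(f"{name} is not callable")
  return registry[name]
```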
nimlgen
8ef5490ec8 cuda transfer + async copyin (#3873) 2024-03-22 09:01:37 -07:00
Szymon Ożóg
624bc89910 PTX - implement float 4, ptr arithmetics and other speed improvements (#3775)
* ptx float4 implementation

* remove from cache when trimming uops

* Gate for float4

* Linting fix

* disable test reasonable time for ptx

* import getenv

* Update uops.py

* linter

* Add div test for half

* upcast if op does not support operation

* fix offset

* Run only if dtype supported

* zero out registers when accessing by pred + cleanup

* Remove trailing whitespace

* revert

* spacing fix

* move cache clearing outside loop

* did this suddenly start working?

* unused import removed

* Remove cast

* Use pattern matching

* linting

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 08:54:02 -07:00
George Hotz
f4055439dc don't include hip common (#3851)
* don't install hip common

* only that

* Revert "only that"

This reverts commit 85f22015d9.

* less

* needed

* sep comgr

* header file

* 6.0.2

* update hsa

* hsakmt

* Revert "hsakmt"

This reverts commit d3a118078e.
2024-03-22 08:50:50 -07:00
qazal
4a27ce6ec9 tiny version of amd_hip_bfloat16 (#3868)
* add src_dtype

* add maker

* add bfloat16

* simpler
2024-03-22 08:37:30 -07:00
chenyu
82ce60e172 use JIT_BATCH_SIZE=4 for GPT2 3090 benchmark (#3870)
Smaller first batch saves about 0.05 ms per token: 1.75 ms/tok on a local 3090.
2024-03-22 00:40:06 -04:00
qazal
fe6ceff15f proposal: multioutput JIT spec (#3856)
* corealize JIT

* requirements
2024-03-21 21:28:30 -07:00
Francis Lam
a26090d404 search: change to use "spawn" and limit the number of tasks per child (#3862)
also clean up some examples to use __main__ and not initialize
resources outside of main
2024-03-21 21:23:36 -07:00
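Roughly, the stdlib pattern described above (a sketch, not the tinygrad search code): the "spawn" start method avoids inheriting GPU/driver state from the parent process, and maxtasksperchild recycles workers so long-running children cannot accumulate resources.

```python
import multiprocessing as mp

def candidate_cost(i: int) -> float:
  return float(i) ** 0.5   # stand-in for compiling/timing one kernel candidate

if __name__ == "__main__":  # required with the spawn start method
  ctx = mp.get_context("spawn")
  with ctx.Pool(processes=4, maxtasksperchild=16) as pool:
    print(sum(pool.map(candidate_cost, range(64))))
```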
chenyu
dca69df197 hot fix use DEBUG >= 3 for allreduce message (#3869) 2024-03-21 23:40:44 -04:00
uuuvn
6729f20aab Ring allreduce try 2 (#3852)
* Ring allreduce v3

* Configurable size, number of gpus and jit in benchmark

* ScheduleBarrier v0

* GB/s that make sense

* ScheduleBarrier v0.1

* Fallback on 2 GPUs

* ScheduleBarrier v0.2

* ScheduleBarrier v0.3

* ScheduleBarrier v0.3.1

* ScheduleBarrier v0.3.2

* Replace ScheduleBarrier with automatic optimization

* unused import

* fix comment

* typing

* better fallback

* python 3.8

* RING=2 and use ContextVar

* DEBUG >= 2 and change name

* linter

* type

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-21 19:17:51 -04:00
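For reference, a toy single-process sketch of the ring all-reduce schedule itself (the textbook algorithm, not tinygrad's multi-GPU implementation): N-1 reduce-scatter steps followed by N-1 all-gather steps, each moving one chunk per rank around the ring.

```python
import numpy as np

def ring_allreduce(bufs):
  n = len(bufs)
  chunks = [np.array_split(b.copy(), n) for b in bufs]
  for step in range(n - 1):                       # reduce-scatter: accumulate partial sums around the ring
    for r in range(n):
      send = (r - step) % n
      chunks[(r + 1) % n][send] += chunks[r][send]
  for step in range(n - 1):                       # all-gather: circulate the fully reduced chunks
    for r in range(n):
      send = (r + 1 - step) % n
      chunks[(r + 1) % n][send] = chunks[r][send].copy()
  return [np.concatenate(c) for c in chunks]

bufs = [np.arange(8, dtype=np.float32) + r for r in range(4)]
out = ring_allreduce(bufs)
assert all(np.allclose(o, sum(bufs)) for o in out)
```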
Francis Lam
3c0478bfab fuzz_linearizer: add additional DEBUG info for comparison errors (#3866) 2024-03-21 18:58:10 -04:00
chenyu
bc482729d0 lower hlb_cifar acc to 93.3 (#3865)
Ran 30 runs and the lowest I saw was 93.35; lowered to 93.3 for now.

Maybe re-enable EMA later if it reduces variance.
2024-03-21 17:58:53 -04:00
chenyu
e50b7abe4f diversed buf inputs based on dtype in fuzz_linearizer (#3863) 2024-03-21 16:23:11 -04:00
chenyu
c40f78499f reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures (#3861) 2024-03-21 14:23:37 -04:00
chenyu
30fa03243e reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures (#3861) 2024-03-21 14:12:27 -04:00
chenyu
33dd99acf4 remove helper_add_store from test_linearizer_failures (#3860) 2024-03-21 12:53:31 -04:00
chenyu
6bf0b82267 alloc new output in fuzz_linearizer between baseline and real one (#3859)
If the kernel is an assign `a += 1`, rawbufs[0] is updated twice, which gives a false compare_error.
2024-03-21 11:36:05 -04:00
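A toy illustration of the failure mode and the fix (hypothetical stand-in code, not fuzz_linearizer itself): an assign kernel mutates its output buffer, so the baseline and the candidate each need their own copy of the starting output.

```python
import numpy as np

def assign_kernel(bufs): bufs[0] += 1            # stand-in for a generated `a += 1` kernel

a0 = np.zeros(4, dtype=np.float32)

# buggy comparison: the same output buffer is reused for both runs -> spurious mismatch
bufs = [a0.copy()]
assign_kernel(bufs); baseline = bufs[0].copy()
assign_kernel(bufs)                              # second run sees the already-updated buffer
print(np.array_equal(baseline, bufs[0]))         # False: false compare error

# fixed comparison: allocate a fresh output buffer for the second run
bufs_a, bufs_b = [a0.copy()], [a0.copy()]
assign_kernel(bufs_a); assign_kernel(bufs_b)
print(np.array_equal(bufs_a[0], bufs_b[0]))      # True
```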
nimlgen
b78352b423 do not create structs every call in CUDAProgram (#3855)
* do not create structs in cuda

* fix graph

* linter

* do not exec twice

* fix graph
2024-03-21 17:51:40 +03:00
nimlgen
e5745c1a0d fix nan on multigpus cuda (#3854) 2024-03-21 15:21:55 +03:00
Anurag Lamsal
4e0819e40b fixing the benchmark not printing in handcode resnet50 opt example (#3850) 2024-03-21 00:55:31 -04:00