Commit Graph

4009 Commits

Author SHA1 Message Date
chenyu
aa76d566c2 cleanup mamba (#4004)
make it read nicer, clean up some movement methods, and simplify some math.
the 790m, 1.4b, and 2.8b models do not really run.
sampling is not implemented.
jit is incorrect.
some dead code / wrong code paths and leftover code copied from torch.
2024-03-30 02:50:13 -04:00
George Hotz
f35f9d32f2 rename mlops to function (#4003) 2024-03-29 21:49:00 -07:00
chenyu
c71627fee6 move GlobalCounter to helpers (#4002)
break circular import between ops and buffer
2024-03-30 00:30:30 -04:00
George Hotz
9eef44521b ScheduleItem uses Buffer (#3995)
* schedule Buffer

* update

* update tests

* master

* works

* remove LoadOps.WAIT

* fix compile2

* bad test

* rename and note
2024-03-29 20:50:27 -07:00
George Hotz
1bd4f01da2 size instead of st.size (#4001) 2024-03-29 19:59:02 -07:00
George Hotz
8f1e34a2a0 early src delete (#3996)
* early src delete

* fix bad test

* fix test_linearizer
2024-03-29 19:46:07 -07:00
Szymon Ożóg
31c8ba8b84 Move transformations to PatternMatcher + clean up existing patterns (#3997) 2024-03-29 19:42:39 -07:00
George Hotz
f916aadaea external that test 2024-03-29 19:35:50 -07:00
George Hotz
c42ed8e99c don't reschedule 2024-03-29 19:17:37 -07:00
chenyu
ecf38f498e beam search resnet eval too in BENCHMARK (#4000) 2024-03-29 21:07:23 -04:00
chenyu
b43e470f80 always use f32 for rand source of randn (#3998)
* always use f32 for source of randn

fixed bfloat16 randn to not have inf.
don't really care about float64; threefry is float32-based too

* HSA is broken
2024-03-29 17:04:34 -04:00
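
A minimal numpy sketch of the idea (illustrative only, not tinygrad's implementation): sample the uniform source and run Box-Muller in float32, casting to the requested dtype only at the end, so low-precision dtypes like float16 never feed log() a value that rounds to zero.

```python
import numpy as np

def randn(n, dtype=np.float16):
    # uniform source stays in float32; these samples are multiples of 2**-24,
    # so 1 - u1 is never exactly zero
    rng = np.random.default_rng()
    u1 = rng.random(n, dtype=np.float32)
    u2 = rng.random(n, dtype=np.float32)
    # Box-Muller transform, computed entirely in float32
    z = np.sqrt(-2 * np.log(1 - u1)) * np.cos(2 * np.pi * u2)
    return z.astype(dtype)  # cast to the target dtype only at the end

assert np.isfinite(randn(10_000)).all()
```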
chenyu
6b6461122e test case Tensor.randn should be finite (#3994)
* test case Tensor.randn should be finite

there's a hack to fix float16; a generic solution that works with bf16 and threefry is needed

* skip not supported

* bfloat16 local is wrong

* skip RHIP
2024-03-29 14:51:02 -04:00
chenyu
d9ff636cf5 use is to compare with enum (#3993)
* use is to compare with enum

currently it's mixed between `==` and `is`; moved all to `is`

* more
2024-03-29 13:02:56 -04:00
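
A small illustration of why `is` works for enum comparisons (generic Python, not tinygrad code): enum members are singletons, so identity comparison is correct and stricter than `==`, which a subclass can override.

```python
from enum import Enum, auto

class Ops(Enum):  # hypothetical enum for illustration
    ADD = auto()
    MUL = auto()

op = Ops.ADD
assert op is Ops.ADD        # members are singletons, so identity holds
assert op is not Ops.MUL
assert op == Ops.ADD        # == also works, but is overridable by __eq__
```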
Akshit Talwar
0affbbf81c update amx gemm (#3991) 2024-03-29 11:45:03 -04:00
chenyu
4abb8245a6 rhs_order in einsum is argsort twice (#3990)
* rhs_order in einsum is argsort twice

* comment
2024-03-29 11:42:04 -04:00
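
What "argsort twice" buys, in a pure-Python illustration with a made-up subscript list (not the einsum code itself): one argsort gives the order that sorts a sequence, and argsorting that result gives each element's rank in sorted order.

```python
def argsort(seq):
    # indices that would sort seq
    return sorted(range(len(seq)), key=seq.__getitem__)

subscripts = list("kij")       # hypothetical einsum output subscripts
order = argsort(subscripts)    # [1, 2, 0]: the order that sorts them
ranks = argsort(order)         # [2, 0, 1]: each subscript's sorted position
assert [subscripts[i] for i in order] == sorted(subscripts)
```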
chenyu
7bc560ec49 remove outdated bf16 comments in test_dtype (#3987) 2024-03-29 00:56:18 -04:00
uuuvn
8a40d7d423 Shape changing bitcast and assert bitcast in disk (#3973)
* Shape changing bitcast

* only support it on disk

* basic test

* more tests

* RuntimeError instead of assert

* create unique temp files

* move tests that use disk to test_disk_tensor

* linter

* remove assert on error messages

* that's RuntimeError now

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-28 21:49:10 -07:00
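
A numpy analogue of a shape-changing bitcast (tinygrad's own API may differ): reinterpreting a buffer as a narrower dtype grows the last dimension by the width ratio, while the underlying bytes stay identical.

```python
import numpy as np

a = np.arange(4, dtype=np.float32)  # 16 bytes
b = a.view(np.uint8)                # same bytes, 4x the elements
assert b.shape == (16,)
assert a.tobytes() == b.tobytes()   # no data movement, only reinterpretation
```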
chenyu
793ab0512e use ctypes to truncate float64 and float32 in uops (#3986)
this fixed the softmax.argmax bug for ops_python, as the float is now truncated to float32
2024-03-28 23:56:50 -04:00
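
The ctypes trick in isolation (a sketch, not the uops code): round-tripping a Python float through a 4-byte C float drops the extra float64 precision, matching what a device computing in float32 would produce.

```python
import ctypes

def truncate_float32(x: float) -> float:
    # store into a 4-byte C float and read it back as a Python float
    return ctypes.c_float(x).value

print(1 / 3)                    # 0.3333333333333333 (float64)
print(truncate_float32(1 / 3))  # 0.3333333432674408 (float32 precision)
```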
chenyu
101a0c683d use ctypes for uops truncate (#3985) 2024-03-28 23:31:34 -04:00
George Hotz
1bf0a7a2d1 move assign logic into lazy.py (#3984)
* move assign logic into lazy.py

* don't check the buffer
2024-03-28 20:26:38 -07:00
chenyu
3fee689ded fix ops_python for test_uops (#3982) 2024-03-28 22:48:55 -04:00
George Hotz
9a6ac2a50a create the buffer with the LazyBuffer (#3977)
* create the buffer with the LazyBuffer

* fixes

* hack underlying buffer when we change dtype

* we only care about allocated buffers

* asserts
2024-03-28 19:31:28 -07:00
chenyu
c4c243f79d update test_uops _equal to use assert_allclose (#3981)
it handles nan
2024-03-28 22:14:45 -04:00
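
Why assert_allclose helps here (plain numpy, not the test file itself): elementwise `==` is false for nan, while np.testing.assert_allclose treats matching nans as equal by default.

```python
import numpy as np

a = np.array([1.0, np.nan])
b = np.array([1.0, np.nan])

assert not (a == b).all()         # nan != nan, so plain equality fails
np.testing.assert_allclose(a, b)  # passes: equal_nan defaults to True
```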
reddyn12
9b5e15db6e Mamba Implementation (#3456)
* first commit

* state back to orig

* mamba comparisons

* rm file

* rename file

* use Tensor.einsum and make default model 370M

* Cleaned code and made a comparison test

* Simplify pull request. Only has one mamba implementation now.

* Update prompt

* rm whitespaces

* last space

* remove Einops dependency

* rm unused code

* add tests

* rm print statement

* rm imports

* skip CLANG

* Update skipIf description

* skip model test in CI and add CLANG fix

* rm Device import

* don't be stupid

* Fix conv assign

When the prompt is too short, the logic for the conv_state assign breaks. This is fixed by padding the tokenized array to a minimum length of 4. I padded using the empty-string token, but I don't know if proper practice is to use the PAD token

* fix p1

* temp

* fix jit import

---------

Co-authored-by: schlimeszn <schlimeszn@gmail.com>
Co-authored-by: reddyn <nikidsniper@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-28 17:49:12 -07:00
George Hotz
d085837179 hotfix: that mem_used was in the wrong place 2024-03-28 17:09:04 -07:00
chenyu
1fa0351acb fix DEFINE_ACC invalid_value to have same type as localtype (#3980) 2024-03-28 19:21:17 -04:00
chenyu
b47f6cebb2 LinearizerOptions -> CompilerOptions (#3978) 2024-03-28 17:50:23 -04:00
qazal
2bfb1d3e39 dynamic assign idx (#3975) 2024-03-28 13:59:32 -07:00
George Hotz
2cfcb5623a hotfix: d was removed from buffer 2024-03-28 13:39:02 -07:00
George Hotz
42b9d999ea Buffer isn't always allocated (#3974)
* buffer alloc

* allocate

* missing allocates

* last one
2024-03-28 13:33:47 -07:00
George Hotz
9c03fe3e5d hotfix: ShapeTracker no longer has import cycle 2024-03-28 10:34:23 -07:00
chenyu
bfcaa2f70e assert __setitem__ if used other than disk (#3972)
* assert `__setitem__` if used other than disk

* that is not implemented
2024-03-28 12:16:38 -04:00
David Hou
4b95350c41 fp16 resnet (without expand backwards sum in float, doesn't work) (#3816)
* fp16 resnet

* cast running mean and var back to default float

* extra cast

* check symbolic no overflow

* add linearizer failure

* loss scaler after grad contig

* oops

* i think this works

* don't loss scale fp32

* remove overflow test case

* remove symbolic bounds check

* loss scaler should be float

* temporarily disable padto cuz bug

shruggie

* make running stats in batchnorm float32?

* calculate lars stuff in fp32?

* oops

* remove most changes

* move loss scaler out of optimizer

* no more FP16 var

* oops

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-28 01:25:37 -04:00
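
The loss-scaler idea in miniature (illustrative numbers only, not the resnet code): gradients too small for fp16 are scaled up before the cast and unscaled in float32 afterwards, which is also why the commit keeps the scaler itself in float.

```python
import numpy as np

grad = np.float32(1e-8)            # a true gradient value
assert np.float16(grad) == 0       # underflows if cast to fp16 directly

SCALE = np.float32(2.0 ** 15)      # loss scale, kept in float32
scaled = np.float16(grad * SCALE)  # survives the fp16 cast
recovered = np.float32(scaled) / SCALE
assert recovered > 0               # magnitude recovered after unscaling
```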
George Hotz
607b4a7d70 remove buffer read, save lines (#3969) 2024-03-27 22:02:47 -07:00
chenyu
80116be9a5 for loop to generate hip math functions for different floats (#3967)
* for loop to generate hip math functions for different floats

* slightly nicer
2024-03-27 23:24:29 -04:00
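
An illustrative string-templating loop in the spirit of this change (the type list and function body are made up, not the actual HIP runtime source): render one definition per float type instead of maintaining near-identical copies.

```python
FLOAT_TYPES = ["float", "double", "half"]  # hypothetical type list
TEMPLATE = "__device__ {t} custom_exp2_{t}({t} a) {{ return exp2((double)a); }}"

# one source-level definition per float type, from a single template
hip_src = "\n".join(TEMPLATE.format(t=t) for t in FLOAT_TYPES)
print(hip_src)
```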
qazal
03d129baa8 inputs -> membufs (#3964) 2024-03-27 17:34:39 -07:00
Francis Lam
16a1d43f6f llama: prevent device initialization outside of __main__ (#3966)
* llama: prevent device initialization outside of __main__

causes HSA resource leaks in child compile processes

* llama: fix loading with multiple devices
2024-03-27 19:19:38 -04:00
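
The general shape of the fix (a sketch with a hypothetical worker, not the llama.py diff): device initialization stays behind the `__main__` guard so multiprocessing workers never touch, and therefore never leak, HSA resources.

```python
import multiprocessing

def compile_kernel(src: str) -> int:
    # hypothetical worker: must not initialize any device
    return len(src)

if __name__ == "__main__":
    # device init belongs here, behind the guard, e.g.:
    # device = Device["HSA"]   # tinygrad-style lookup, for illustration
    with multiprocessing.Pool(2) as pool:
        print(pool.map(compile_kernel, ["k1", "kern2", "kernel3"]))
```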
Francis Lam
7c5729a3bd wmma: refactor to remove wmma_func and create TC funcs as needed (#3945)
* wmma: refactor to remove wmma_func and create TC funcs as needed

* test_linearizer: disable bf16 CUDA during emulation testing

* cstyle: clean up creation of CUDA vec dtypes

* extra/gemm: add option to accumulate to bfloat16

* cleanups

* benchmark: add CUDA bfloat16 matmul

* more cleanups
2024-03-27 16:43:09 -04:00
chenyu
88b24df40a touchup remove float() in cstyle render_const for float64 (#3962) 2024-03-27 16:08:28 -04:00
qazal
27af37f2ad misc: remove unused env vars (#3963)
* remove unused env vars

* delete CPU
2024-03-27 16:08:15 -04:00
George Hotz
60639cccac hotfix: RuntimeError for assign 2024-03-27 11:18:48 -07:00
qazal
9fb573d73c DAG cycle asserts (#3955)
* assert cycles

* these are cycle errors

* flip to positive
2024-03-27 11:09:59 -07:00
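
What a cycle assert boils down to (a generic DFS sketch, not the scheduler's code): a back edge to a node still on the current DFS path means the "DAG" has a cycle.

```python
def has_cycle(graph):
    # graph: dict mapping node -> list of dependency nodes
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on path / done
    color = {}
    def visit(n):
        color[n] = GRAY
        for m in graph.get(n, []):
            c = color.get(m, WHITE)
            if c == GRAY: return True     # back edge: cycle found
            if c == WHITE and visit(m): return True
        color[n] = BLACK
        return False
    return any(visit(n) for n in graph if color.get(n, WHITE) == WHITE)

assert has_cycle({"a": ["b"], "b": ["a"]})
assert not has_cycle({"a": ["b"], "b": []})
```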
geohotstan
bd3a7d068c correct device for validation test in model benchmark CI (#3960)
* fix tests

* add clang back for only metal

* change the name to reflect CLANG being ran

* add back cuda
2024-03-27 13:40:06 -04:00
George Hotz
eec2b00edc change kernel name if it's multioutput (#3958) 2024-03-27 08:42:57 -07:00
George Hotz
d1c957a471 copy back to clang (#3951)
* copy back to clang

* force the copy for CLANG device
2024-03-27 08:13:01 -07:00
P4ssenger
332c82893a Remove redundant check on device (#3957)
* call self.nbytes

* device is canonicalized, therefore it cannot be None
2024-03-27 07:54:33 -07:00
chenyu
6c7df1445b enforce UOps.CONST arg has python type based on dtype (#3952)
added an assert in uops, removed the cast in the renderer
2024-03-27 01:41:38 -04:00
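
A sketch of the invariant being enforced (names are illustrative, not tinygrad's exact check): the python value a CONST carries must match its dtype family, so the renderer no longer needs to cast.

```python
def const_arg_ok(dtype_name: str, arg) -> bool:
    # hypothetical check: the dtype family determines the python type of arg
    if dtype_name == "bool": return type(arg) is bool
    if dtype_name.startswith("float"): return type(arg) is float
    return type(arg) is int              # the int dtypes

assert const_arg_ok("float32", 1.0)
assert not const_arg_ok("float32", 1)    # previously fixed up by a renderer cast
assert const_arg_ok("int32", 1)
```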
George Hotz
91f3326c0b hotfix: increase recursion limit 2024-03-26 21:26:54 -07:00
George Hotz
68ca4d4276 split to schedule.py (#3949)
* split to schedule.py

* split
2024-03-26 21:02:46 -07:00
George Hotz
da07f31fd4 hotfix: remove bf16 test entirely 2024-03-26 20:50:27 -07:00