Commit Graph

3994 Commits

chenyu
7bc560ec49 remove outdated bf16 comments in test_dtype (#3987) 2024-03-29 00:56:18 -04:00
uuuvn
8a40d7d423 Shape changing bitcast and assert bitcast in disk (#3973)
* Shape changing bitcast

* only support it on disk

* basic test

* more tests

* RuntimeError instead of assert

* create unique temp files

* move tests that use disk to test_disk_tensor

* linter

* remove assert on error messages

* that's RuntimeError now

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-28 21:49:10 -07:00
chenyu
793ab0512e use ctypes to truncate float64 and float32 in uops (#3986)
this fixed the softmax.argmax bug for ops_python, as the float is now truncated to float32
2024-03-28 23:56:50 -04:00
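As a rough illustration of the technique this commit describes (a minimal sketch, not the tinygrad uops code itself), a Python float can be truncated to float32 precision by round-tripping it through a ctypes C float:

```python
import ctypes

def truncate_f32(x: float) -> float:
    # round-trip through a C float so the value carries only float32 precision
    return ctypes.c_float(x).value

print(truncate_f32(1.1))  # 1.100000023841858, the float32-rounded value
```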
chenyu
101a0c683d use ctypes for uops truncate (#3985) 2024-03-28 23:31:34 -04:00
George Hotz
1bf0a7a2d1 move assign logic into lazy.py (#3984)
* move assign logic into lazy.py

* don't check the buffer
2024-03-28 20:26:38 -07:00
chenyu
3fee689ded fix ops_python for test_uops (#3982) 2024-03-28 22:48:55 -04:00
George Hotz
9a6ac2a50a create the buffer with the LazyBuffer (#3977)
* create the buffer with the LazyBuffer

* fixes

* hack underlying buffer when we change dtype

* we only care about allocated buffers

* asserts
2024-03-28 19:31:28 -07:00
chenyu
c4c243f79d update test_uops _equal to use assert_allclose (#3981)
it handles nan
2024-03-28 22:14:45 -04:00
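For context on the NaN note, a generic numpy example (not the test code itself): np.testing.assert_allclose treats NaNs in matching positions as equal by default, which a plain == comparison does not:

```python
import numpy as np

a = np.array([1.0, np.nan])
b = np.array([1.0, np.nan])
np.testing.assert_allclose(a, b)  # passes: NaNs in the same positions compare equal
# (a == b).all() would be False, since nan != nan
```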
reddyn12
9b5e15db6e Mamba Implementation (#3456)
* first commit

* state back to orig

* mamba comparisons

* rm file

* rename file

* use Tensor.einsum and make default model 370M

* Cleaned code and made a comparison test

* Simplify pull request. Only has 1 mamba implementation now.

* Update prompt

* rm whitespaces

* last space

* remove Einops dependency

* rm unused code

* add tests

* rm print statement

* rm imports

* skip CLANG

* Update skipIf description

* skip model test in CI and add CLANG fix

* rm Device import

* don't be stupid

* Fix conv assign

When the prompt is too short, the conv_state assign logic breaks. This is fixed by padding the tokenized array to a minimum length of 4 (see the sketch after this entry). I padded using the empty string token, but I'm not sure whether proper practice is to use the PAD token

* fix p1

* temp

* fix jit import

---------

Co-authored-by: schlimeszn <schlimeszn@gmail.com>
Co-authored-by: reddyn <nikidsniper@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-28 17:49:12 -07:00
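A minimal sketch of the padding fix mentioned in the conv_state note above, assuming a hypothetical pad token id (illustrative only, not the PR's code):

```python
MIN_PROMPT_LEN = 4  # conv_state assign assumes at least this many tokens

def pad_tokens(tokens: list[int], pad_id: int) -> list[int]:
    # left-pad short prompts so the conv_state assign always sees enough tokens
    if len(tokens) < MIN_PROMPT_LEN:
        tokens = [pad_id] * (MIN_PROMPT_LEN - len(tokens)) + tokens
    return tokens
```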
George Hotz
d085837179 hotfix: that mem_used was in the wrong place 2024-03-28 17:09:04 -07:00
chenyu
1fa0351acb fix DEFINE_ACC invalid_value to have same type as localtype (#3980) 2024-03-28 19:21:17 -04:00
chenyu
b47f6cebb2 LinearizerOptions -> CompilerOptions (#3978) 2024-03-28 17:50:23 -04:00
qazal
2bfb1d3e39 dynamic assign idx (#3975) 2024-03-28 13:59:32 -07:00
George Hotz
2cfcb5623a hotfix: d was removed from buffer 2024-03-28 13:39:02 -07:00
George Hotz
42b9d999ea Buffer isn't always allocated (#3974)
* buffer alloc

* allocate

* missing allocates

* last one
2024-03-28 13:33:47 -07:00
George Hotz
9c03fe3e5d hotfix: ShapeTracker no longer has import cycle 2024-03-28 10:34:23 -07:00
chenyu
bfcaa2f70e assert __setitem__ if used other than disk (#3972)
* assert `__setitem__` if used other than disk

* that is not implemented
2024-03-28 12:16:38 -04:00
David Hou
4b95350c41 fp16 resnet (without expand backwards sum in float, doesn't work) (#3816)
* fp16 resnet

* cast running mean and var back to default float

* extra cast

* check symbolic no overflow

* add linearizer failure

* loss scaler after grad contig

* oops

* i think this works

* don't loss scale fp32

* remove overflow test case

* remove symbolic bounds check

* loss scaler should be float

* temporarily disable padto because of a bug

shruggie

* make running stats in batchnorm float32?

* calculate lars stuff in fp32?

* oops

* remove most changes

* move loss scaler out of optimizer

* no more FP16 var

* oops

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-28 01:25:37 -04:00
George Hotz
607b4a7d70 remove buffer read, save lines (#3969) 2024-03-27 22:02:47 -07:00
chenyu
80116be9a5 for loop to generate hip math functions for different floats (#3967)
* for loop to generate hip math functions for different floats

* slightly nicer
2024-03-27 23:24:29 -04:00
qazal
03d129baa8 inputs -> membufs (#3964) 2024-03-27 17:34:39 -07:00
Francis Lam
16a1d43f6f llama: prevent device initialization outside of __main__ (#3966)
* llama: prevent device initialization outside of __main__

causes HSA resource leaks in child compile processes

* llama: fix loading with multiple devices
2024-03-27 19:19:38 -04:00
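The general pattern being enforced here, sketched generically (not the example's actual code): keep device initialization under the __main__ guard so child processes that re-import the module for compilation never touch HSA resources at import time:

```python
# sketch of the pattern: nothing device-related runs at import time,
# so spawned compile workers only import pure-Python code.
def load_model():
    ...  # device initialization, weight loading, etc.

if __name__ == "__main__":
    model = load_model()
```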
Francis Lam
7c5729a3bd wmma: refactor to remove wmma_func and create TC funcs as needed (#3945)
* wmma: refactor to remove wmma_func and create TC funcs as needed

* test_linearizer: disable bf16 CUDA during emulation testing

* cstyle: clean up creation of CUDA vec dtypes

* extra/gemm: add option to accumulate to bfloat16

* cleanups

* benchmark: add CUDA bfloat16 matmul

* more cleanups
2024-03-27 16:43:09 -04:00
chenyu
88b24df40a touchup remove float() in cstyle render_const for float64 (#3962) 2024-03-27 16:08:28 -04:00
qazal
27af37f2ad misc: remove unused env vars (#3963)
* remove unused env vars

* delete CPU
2024-03-27 16:08:15 -04:00
George Hotz
60639cccac hotfix: RuntimeError for assign 2024-03-27 11:18:48 -07:00
qazal
9fb573d73c DAG cycle asserts (#3955)
* assert cycles

* these are cycle errors

* flip to positive
2024-03-27 11:09:59 -07:00
geohotstan
bd3a7d068c correct device for validation test in model benchmark CI (#3960)
* fix tests

* add clang back for only metal

* change the name to reflect that CLANG is being run

* add back cuda
2024-03-27 13:40:06 -04:00
George Hotz
eec2b00edc change kernel name if it's multioutput (#3958) 2024-03-27 08:42:57 -07:00
George Hotz
d1c957a471 copy back to clang (#3951)
* copy back to clang

* force the copy for CLANG device
2024-03-27 08:13:01 -07:00
P4ssenger
332c82893a Remove redundant check on device (#3957)
* call self.nbytes

* device is canonicalized, therefore, it cannot be None
2024-03-27 07:54:33 -07:00
chenyu
6c7df1445b enforce UOps.CONST arg has python type based on dtype (#3952)
added an assert in uops, removed the cast in the renderer
2024-03-27 01:41:38 -04:00
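A rough sketch of the kind of check this adds (the names and mapping here are hypothetical, not the actual uops code): assert that a constant's Python value type already matches its dtype instead of casting it later in the renderer:

```python
# hypothetical mapping from dtype name to the Python type a CONST arg must have
EXPECTED = {"bool": bool, "int32": int, "float32": float}

def check_const_arg(dtype_name: str, arg) -> None:
    # fail early if the const arg was created with the wrong Python type
    assert isinstance(arg, EXPECTED[dtype_name]), \
        f"CONST arg {arg!r} should be {EXPECTED[dtype_name].__name__} for {dtype_name}"
```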
George Hotz
91f3326c0b hotfix: increase recursion limit 2024-03-26 21:26:54 -07:00
George Hotz
68ca4d4276 split to schedule.py (#3949)
* split to schedule.py

* split
2024-03-26 21:02:46 -07:00
George Hotz
da07f31fd4 hotfix: remove bf16 test entirely 2024-03-26 20:50:27 -07:00
George Hotz
0d5845fb5b hotfix: jit is flaky on mac 2024-03-26 20:44:05 -07:00
George Hotz
150ea2eb76 create engine folder and move code (#3948)
* retry

* older tf

* that
2024-03-26 20:38:03 -07:00
George Hotz
629cbc5587 only abstractions 2 (#3947) 2024-03-26 20:02:18 -07:00
chenyu
77589bc7a5 rename Scalar to ConstType and cast_scalar to as_const (#3946)
prereq cleanup to make const arg same python type as dtype
2024-03-26 22:39:58 -04:00
uuuvn
d6d902afe9 wtf (#3944) 2024-03-26 17:49:28 -07:00
Francis Lam
5530b0cbed fuzz_linearizer: reduce debug verbosity and make easier for CI usage (#3942)
* fuzz_linearizer: reduce debug verbosity and make easier for CI usage

* rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset)
* skip simple ASTs (easier to use with LOGOPS output)
* don't fuzz a previously seen AST
* add options to allow non-zero --expected-failures

* clean up naming and use set
2024-03-26 16:25:24 -04:00
chenyu
8df6587c41 hotfix 97.3 for beautiful_mnist (#3941) 2024-03-26 15:02:53 -04:00
chenyu
b1e3817e18 correctly handle Tensor.rand when default_float = bf16 (#3940)
always casting to float32 makes the default half path slow
2024-03-26 14:56:16 -04:00
chenyu
f6ff76be21 check only upcast int amount in upcasted_axis (#3938)
fixed typing and fixed #3932
2024-03-26 12:54:57 -04:00
nimlgen
e2d6f76723 _alloc and _free with options (#3934)
* _alloc has options

* linter

* fix hsa
2024-03-26 09:11:41 -07:00
nimlgen
739f47eb0f check on cuEventSynchronize (#3933) 2024-03-26 16:14:38 +03:00
George Hotz
778d17fbd3 intel matmul (#3830)
* almost right

* intel xmx
2024-03-25 22:37:20 -07:00
chenyu
ef537672bf bf16 support in metal (#3929)
it runs if the device GPU supports bfloat. updated the CI benchmark too
2024-03-25 23:17:36 -04:00
chenyu
72d617a37d opencl on OSX does not support fp16 extension (#3931)
running `GPU=1 python -m pytest -rA test/test_dtype.py::TestHalfDtype::test_casts_from` on mac would fail.
2024-03-25 19:50:17 -04:00
Arseny Kapoulkine
cb6e7b57a6 examples: Fix parameter bandwidth accounting for quantized LLama (#3930)
Instead of assuming every parameter is 2 bytes, just add up tensor sizes
in bytes
2024-03-25 18:41:05 -04:00
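The accounting change described here amounts to something like the following generic sketch (assuming tensors expose their size in bytes, as numpy's nbytes does; not the example's exact code):

```python
import numpy as np

def model_bytes(state_dict: dict) -> int:
    # sum each tensor's real byte size instead of assuming 2 bytes per parameter,
    # which is wrong once some weights are quantized to other widths
    return sum(np.asarray(t).nbytes for t in state_dict.values())
```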