Commit Graph

1606 Commits

Author SHA1 Message Date
chenyu
85edc493b0 uops const fold rules to prevent tautological compare warnings (#4041)
* uops const fold rules to prevent tautological compare warnings

`bool < false` is false, `true < bool` is false, `a == a` is true, `a != a` is false

* not true for nan

* and nan does not work with llvm

* full truth table test

* revert a==a

* comments and indents
2024-04-02 16:45:58 -04:00
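For the fold rules listed in the commit above, a minimal Python sketch (hypothetical `fold_cmp` helper, not tinygrad's actual uop code) of why the boolean comparisons fold but `a != a` cannot be folded for floats:
```python
# Hypothetical sketch of the comparison fold rules; not tinygrad's real code.
def fold_cmp(op, a, b, is_float=False):
  if op == "<" and b is False: return False    # nothing is less than False
  if op == "<" and a is True: return False     # nothing is greater than True
  if op == "!=" and a is b and not is_float:
    return False                               # a != a is False, but NaN != NaN is True
  return None                                  # no fold applies; keep the compare

x = object()                                   # stands in for a non-constant operand
assert fold_cmp("<", x, False) is False
assert fold_cmp("!=", x, x) is False
assert fold_cmp("!=", x, x, is_float=True) is None   # can't fold: x could be NaN
```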
Patrick Tsai
0147174ad6 Embedding in one kernel (#4036)
* Embedding is in one kernel

* embedding is one kernel

* rm extra line

* newline

* bert test counts state vars?

* add a test?

* move items around

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-04-02 11:38:21 -04:00
Dan Hoffman
5311b45053 re-enable has_local check for linearizer test (#4034)
Co-authored-by: Dan Hoffman <daniel.hoffman@intel.com>
2024-04-02 00:02:03 -04:00
George Hotz
7425a0c646 CommandQueue is the future (#3950)
* start of command queue

* cq work

* runs

* cleanup

* outs set

* read is gone

* future buffer work

* command queue is better

* command queue works

* loadops

* delete unneeded

* command queue works

* upd

* fix tests

* use CommandQueue in compile

* delay sync
2024-04-01 17:35:48 -07:00
chenyu
82440d3416 don't call contiguous for unpadded const into multi tensor (#4032)
* don't call contiguous for unpadded const into multi tensor

fixed multi const folding for sharded const.
still WIP; we need to be careful that this does not break the multi-device cache somewhere

* ehh need a memory test for that

* simple sharded memory test
2024-04-01 19:22:14 -04:00
chenyu
77a68fc52f test examples for multi tensor const folding (#4031)
works with a literal const operand now because it's copied to each shard and handled by lazy.
does not work for a sharded const
2024-04-01 16:53:43 -04:00
chenyu
379d52548d const fold left const operand for ADD and MUL (#4029)
* const fold left const operand for ADD and MUL

* neg has a dtype issue
2024-04-01 15:09:04 -04:00
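A hedged sketch of the idea in the commit above (hypothetical `commute_const` helper, not tinygrad's lazy.py code): for commutative ops like ADD and MUL, a constant on the left can be swapped to the right, so the existing right-hand const-folding paths (x+0, x*1, ...) also fire for 0+x and 1*x:
```python
COMMUTATIVE = {"ADD", "MUL"}

def commute_const(op, lhs, rhs):
  # move a lone constant operand to the right for commutative ops
  if op in COMMUTATIVE and isinstance(lhs, (int, float)) and not isinstance(rhs, (int, float)):
    lhs, rhs = rhs, lhs
  return op, lhs, rhs

assert commute_const("MUL", 1, "x") == ("MUL", "x", 1)   # now the x*1 fold can apply
assert commute_const("ADD", "x", 0) == ("ADD", "x", 0)   # already in normal form
```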
chenyu
0e02d074bd fix Tensor.pow folding for exponent 0 and 1 (#4025) 2024-03-31 19:57:23 -04:00
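The exponent special cases being fixed, as a plain-float illustration (not the Tensor.pow code itself): `x**1` should fold to `x` and `x**0` to `1`, including `0**0 == 1` by convention:
```python
def pow_fold(x: float, exponent: int) -> float:
  if exponent == 0: return 1.0    # x**0 folds to 1, including 0**0
  if exponent == 1: return x      # x**1 folds to x unchanged
  return x ** exponent            # otherwise no fold

assert pow_fold(0.0, 0) == 1.0
assert pow_fold(-3.0, 1) == -3.0
```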
mmmkkaaayy
a4ae9352bd delete irrelevant JIT regression test (#4024) 2024-03-31 19:35:35 -04:00
chenyu
d3f27761b0 move const folding of ADD/SUB/MUL from tensor to lazy (#4020)
* move const folding of ADD/SUB/MUL from tensor to lazy

will do div and pow separately.

* fix onnx adding with None
2024-03-31 16:35:36 -04:00
chenyu
7f859593b8 fix _to_const_val and const folding around it (#4017)
* fix _to_const_val and const folding around it

is_unrealized_contiguous_const is too strict and almost never hit if the const is expanded.
it suffices to check that there's no pad

* that test is folded

* test_const_folding
2024-03-31 13:09:23 -04:00
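A hedged sketch of the relaxed check described above (hypothetical stand-in classes, not the actual LazyBuffer/ShapeTracker API): a const stays foldable even after an expand, as long as no view introduces padding:
```python
from dataclasses import dataclass, field

@dataclass
class FakeView:
  mask: tuple | None = None          # padding shows up as a mask on a view

@dataclass
class FakeConstBuffer:               # hypothetical stand-in, not LazyBuffer
  views: list = field(default_factory=lambda: [FakeView()])
  def is_unrealized_const(self): return True

def is_foldable_const(lb) -> bool:
  # expanded consts are fine; only padding (a mask on some view) blocks folding
  return lb.is_unrealized_const() and all(v.mask is None for v in lb.views)

assert is_foldable_const(FakeConstBuffer())                                      # plain or expanded const
assert not is_foldable_const(FakeConstBuffer(views=[FakeView(mask=((1, 3),))]))  # padded const
```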
chenyu
c71627fee6 move GlobalCounter to helpers (#4002)
break circular import between ops and buffer
2024-03-30 00:30:30 -04:00
George Hotz
9eef44521b ScheduleItem uses Buffer (#3995)
* schedule Buffer

* update

* update tests

* master

* works

* remove LoadOps.WAIT

* fix compile2

* bad test

* rename and note
2024-03-29 20:50:27 -07:00
George Hotz
8f1e34a2a0 early src delete (#3996)
* early src delete

* fix bad test

* fix test_linearizer
2024-03-29 19:46:07 -07:00
George Hotz
f916aadaea external that test 2024-03-29 19:35:50 -07:00
George Hotz
c42ed8e99c don't reschedule 2024-03-29 19:17:37 -07:00
chenyu
b43e470f80 always use f32 for rand source of randn (#3998)
* always use f32 for source of randn

fixed bfloat16 randn to not have inf.
don't really care about float64. threefry is float32-based too

* HSA is broken
2024-03-29 17:04:34 -04:00
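Why the source dtype matters, as a hedged plain-Python Box-Muller sketch (not the actual Tensor.randn implementation): if the uniform source is produced directly in a low-precision dtype like bfloat16, a sample can round to exactly 0 and `log(0)` turns into -inf, so the uniforms are kept in float32 and only the final result is cast:
```python
import math, random

def randn_pair(cast=float):
  # keep the uniform source at full precision: in bfloat16, (1 - rand) can
  # round to exactly 0.0 and sqrt(-2*log(0)) blows up to inf
  u1, u2 = 1.0 - random.random(), random.random()      # u1 in (0, 1]
  r, theta = math.sqrt(-2.0 * math.log(u1)), 2.0 * math.pi * u2
  # cast only the finished normal samples to the requested (possibly bf16) dtype
  return cast(r * math.cos(theta)), cast(r * math.sin(theta))

print(randn_pair())
```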
chenyu
6b6461122e test case Tensor.randn should be finite (#3994)
* test case Tensor.randn should be finite

there's a hack to fix float16; we need a generic solution that works with bf16 and threefry

* skip not supported

* bfloat16 local is wrong

* skip RHIP
2024-03-29 14:51:02 -04:00
chenyu
d9ff636cf5 use is to compare with enum (#3993)
* use is to compare with enum

currently usage is mixed between `==` and `is`; moved everything to `is`

* more
2024-03-29 13:02:56 -04:00
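For context, a small example of the convention the commit above standardizes on: enum members are singletons, so identity comparison is both correct and states the intent more precisely than `==`:
```python
from enum import Enum, auto

class BinaryOps(Enum):
  ADD = auto()
  MUL = auto()

op = BinaryOps.ADD
assert op is BinaryOps.ADD        # enum members are singletons, identity works
assert op == BinaryOps.ADD        # equality also works, but `is` can't be fooled by __eq__
assert op is not BinaryOps.MUL
```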
chenyu
7bc560ec49 remove outdated bf16 comments in test_dtype (#3987) 2024-03-29 00:56:18 -04:00
uuuvn
8a40d7d423 Shape changing bitcast and assert bitcast in disk (#3973)
* Shape changing bitcast

* only support it on disk

* basic test

* more tests

* RuntimeError instead of assert

* create unique temp files

* move tests that use disk to test_disk_tensor

* linter

* remove assert on error messages

* that's RuntimeError now

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-28 21:49:10 -07:00
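The shape rule the bitcast has to follow, shown with a hedged numpy analogy (numpy's `view`, not the tinygrad API): reinterpreting the bytes scales the last dimension by the ratio of element sizes, and per the commit this is only allowed for disk tensors, raising RuntimeError elsewhere:
```python
import numpy as np

a = np.zeros((4, 2), dtype=np.float32)       # 4*2 elements, 4 bytes each
assert a.view(np.uint8).shape == (4, 8)      # last dim scales by 4/1
assert a.view(np.float16).shape == (4, 4)    # and by 4/2 here
```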
chenyu
793ab0512e use ctypes to truncate float64 and float32 in uops (#3986)
this fixed the softmax.argmax bug for ops_python as the float is truncated to float32
2024-03-28 23:56:50 -04:00
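The truncation trick itself, for reference (standard-library ctypes, as in the commit title): round-tripping a Python float through `ctypes.c_float` keeps only the precision a float32 can hold, which is what the ops_python uop evaluator needs for float32 consts:
```python
import ctypes

def truncate_f32(x: float) -> float:
  # Python floats are 64-bit; this drops the bits a float32 cannot represent
  return ctypes.c_float(x).value

print(0.1, truncate_f32(0.1))   # 0.1 vs 0.10000000149011612
```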
chenyu
c4c243f79d update test_uops _equal to use assert_allclose (#3981)
it handles nan
2024-03-28 22:14:45 -04:00
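Why `assert_allclose` is the right tool here, as a small example: unlike plain array equality it treats matching NaNs as equal (`equal_nan=True` is its default):
```python
import numpy as np

a = np.array([1.0, float("nan")])
b = np.array([1.0, float("nan")])
assert not np.array_equal(a, b)     # plain equality: NaN never equals NaN
np.testing.assert_allclose(a, b)    # passes: matching NaNs count as equal
```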
reddyn12
9b5e15db6e Mamba Implementation (#3456)
* first commit

* state back to orig

* mamba comparisons

* rm file

* rename file

* use Tensor.einsum and make the default model 370M

* Cleaned code and made a comparison test

* Simplify pull request. Only has 1 mamba implementation now.

* Update prompt

* rm whitespaces

* last space

* remove Einops dependency

* rm unused code

* add tests

* rm print statement

* rm imports

* skip CLANG

* Update skipIf description

* skip model test in CI and add CLANG fix

* rm Device import

* don't be stupid

* Fix conv assign

When the prompt is too short, the logic for the conv_state assign messes up. This can be fixed by padding the tokenized array to a minimum length of 4. I padded using the empty-string token, but I don't know if proper practice is to use the PAD token.

* fix p1

* temp

* fix jit import

---------

Co-authored-by: schlimeszn <schlimeszn@gmail.com>
Co-authored-by: reddyn <nikidsniper@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-28 17:49:12 -07:00
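A hedged sketch of the prompt-padding workaround mentioned in the "Fix conv assign" note above (hypothetical token ids and pad token, not the actual tokenizer): pad short prompts up to the conv window so the conv_state assign sees enough tokens:
```python
MIN_PROMPT_LEN = 4      # minimum length the conv_state assign logic assumes
PAD_TOKEN = 0           # hypothetical id for the empty-string / PAD token

def pad_prompt(tokens: list[int]) -> list[int]:
  # left-pad so the real tokens stay at the end of the window
  return [PAD_TOKEN] * max(0, MIN_PROMPT_LEN - len(tokens)) + tokens

assert pad_prompt([17, 42]) == [0, 0, 17, 42]
```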
chenyu
1fa0351acb fix DEFINE_ACC invalid_value to have same type as localtype (#3980) 2024-03-28 19:21:17 -04:00
chenyu
b47f6cebb2 LinearizerOptions -> CompilerOptions (#3978) 2024-03-28 17:50:23 -04:00
George Hotz
42b9d999ea Buffer isn't always allocated (#3974)
* buffer alloc

* allocate

* missing allocates

* last one
2024-03-28 13:33:47 -07:00
chenyu
bfcaa2f70e assert __setitem__ if used other than disk (#3972)
* assert `__setitem__` if used other than disk

* that is not implemented
2024-03-28 12:16:38 -04:00
Francis Lam
7c5729a3bd wmma: refactor to remove wmma_func and create TC funcs as needed (#3945)
* wmma: refactor to remove wmma_func and create TC funcs as needed

* test_linearizer: disable bf16 CUDA during emulation testing

* cstyle: clean up creation of CUDA vec dtypes

* extra/gemm: add option to accumulate to bfloat16

* cleanups

* benchmark: add CUDA bfloat16 matmul

* more cleanups
2024-03-27 16:43:09 -04:00
George Hotz
60639cccac hotfix: RuntimeError for assign 2024-03-27 11:18:48 -07:00
qazal
9fb573d73c DAG cycle asserts (#3955)
* assert cycles

* these are cycle errors

* flip to positive
2024-03-27 11:09:59 -07:00
geohotstan
bd3a7d068c correct device for validation test in model benchmark CI (#3960)
* fix tests

* add clang back for only metal

* change the name to reflect CLANG being ran

* add back cuda
2024-03-27 13:40:06 -04:00
chenyu
6c7df1445b enforce UOps.CONST arg has python type based on dtype (#3952)
added an assert in uops and removed the cast in the renderer
2024-03-27 01:41:38 -04:00
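A hedged sketch of the kind of assert this adds (hypothetical dtype-name mapping, not the exact UOp code): the CONST arg must already be the matching Python type for its dtype, so renderers no longer need to cast it:
```python
# Hypothetical dtype-name -> Python type mapping, for illustration only.
EXPECTED_TYPE = {"bool": bool, "int32": int, "int64": int, "float32": float, "float64": float}

def check_const_arg(dtype_name: str, arg) -> None:
  expected = EXPECTED_TYPE[dtype_name]
  assert type(arg) is expected, f"CONST arg {arg!r} should be {expected.__name__} for {dtype_name}"

check_const_arg("float32", 1.0)   # ok
check_const_arg("bool", True)     # ok: bools must be real bools, not 0/1 ints
```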
George Hotz
68ca4d4276 split to schedule.py (#3949)
* split to schedule.py

* split
2024-03-26 21:02:46 -07:00
George Hotz
150ea2eb76 create engine folder and move code (#3948)
* retry

* older tf

* that
2024-03-26 20:38:03 -07:00
Francis Lam
5530b0cbed fuzz_linearizer: reduce debug verbosity and make easier for CI usage (#3942)
* fuzz_linearizer: reduce debug verbosity and make easier for CI usage

* rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset)
* skip simple ASTs (easier to use with LOGOPS output)
* don't fuzz a previously seen AST
* add options to allow non-zero --expected-failures

* clean up naming and use set
2024-03-26 16:25:24 -04:00
nimlgen
e2d6f76723 _alloc and _free with options (#3934)
* _alloc has options

* linter

* fix hsa
2024-03-26 09:11:41 -07:00
chenyu
72d617a37d opencl on OSX does not support fp16 extension (#3931)
running `GPU=1 python -m pytest -rA test/test_dtype.py::TestHalfDtype::test_casts_from` on mac would fail.
2024-03-25 19:50:17 -04:00
chenyu
4ecd5789ab #include <tgmath.h> in ops_clang (#3927)
* different clang sqrt/log2/exp2/sin function based on dtype

fixed softmax_argmax issue in #3552 for clang.

* tgmath.h

* revert those
2024-03-25 17:48:57 -04:00
Arseny Kapoulkine
514c43201d Fix issues with pointer provenance in load/store through ALU (#3916)
* Track pointer provenance in load/store through ALU

Previously load/store could be incorrectly rendered into
ld.global/st.global when the input was an ALU op that performed an
address computation with DEFINE_LOCAL on one of the arguments.

* Simplify the load provenance workaround

The issue is that we can render the same code twice, and on the second
run the opstream is already modified so that vin[0] isn't a DEFINE_*,
which overwrites the initially correct .shared with .global.

* Add a couple tests for basic local use

* Skip local tests on LLVM since it doesn't implement DEFINE_LOCAL
2024-03-25 14:41:05 -07:00
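A hedged sketch of the provenance idea (hypothetical uop tuples, not the real UOp class): to choose between `.shared` and `.global` addressing, walk the address expression back through ALU ops until a DEFINE_LOCAL or DEFINE_GLOBAL is found, instead of only inspecting the immediate input:
```python
# uops modeled as (op, sources) tuples for illustration only
def pointer_space(uop):
  op, srcs = uop
  if op == "DEFINE_LOCAL": return "shared"
  if op == "DEFINE_GLOBAL": return "global"
  if op == "ALU":
    # an address computed from a local pointer is still a local address
    for src in srcs:
      space = pointer_space(src)
      if space is not None: return space
  return None

local_buf = ("DEFINE_LOCAL", ())
addr = ("ALU", (local_buf, ("CONST", ())))
assert pointer_space(addr) == "shared"   # should render as ld.shared / st.shared
```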
chenyu
83f39a8ceb env var to change default float (#3902)
* env var to change default float to fp16 or bf16

looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.

working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
    __bf16 cast0 = (nv_bfloat16)(val0);
```

remove that in cifar

* DEFAULT_FLOAT

* default of default

* unit test

* don't check default

* tests work on linux
2024-03-24 20:33:57 -04:00
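A hedged usage sketch of the env var added here (variable name from the commit; the import path and `dtypes.default_float` attribute are assumptions): set it before tinygrad is imported and newly created float tensors pick up that dtype:
```python
# e.g. run as: DEFAULT_FLOAT=HALF python script.py
import os
os.environ.setdefault("DEFAULT_FLOAT", "HALF")   # must be set before tinygrad is imported

from tinygrad import Tensor, dtypes              # assumed import path
print(dtypes.default_float)                      # expected: dtypes.half
print(Tensor.rand(2, 2).dtype)                   # new float tensors use the default
```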
George Hotz
03899a74bb increase atol on reset train 2024-03-24 15:17:31 -07:00
qazal
d8fafca13a assign regression (#3907)
* infra

* track mutations

* assign levels

* add seen back

* add test

* infra 2.0

* add assign targets

* dont need levels

* delete

* Update test_assign.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-24 15:12:31 -07:00
Patrick Tsai
e27129a798 Fix linearizer failure 26 test (#3906)
* Adjust adds between WHERE and PHI

* Not much better

* undo recursive change

* hm

* iterate over where, not factored op

* oo

* consts only for loop

* Undo var name change

* update

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-24 16:34:13 -04:00
wozeparrot
9a9cac58f9 add lars to nn (#3750)
* feat: add lars

* feat: don't remove this comment

* clean: smaller diff

* clean: shorter line

* feat: remove mlperf lars, switch resnet

* fix: fully remove mlperf lars

* clean: comment

* feat: contiguous

* feat: no weight decay on skip params

* feat: optimizergroup

* feat: classic momentum

* fix: pylint

* clean: move comment

* fix: correct algo

* feat: lrschedulergroup

* feat: skip list tests

* feat: :| forgot that params are a thing

* feat: remove skip_list params from main params

* feat: set moment

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-24 11:43:12 -04:00
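For reference, a hedged NumPy sketch of the classic LARS step this optimizer implements (the formula from You et al., not the tinygrad nn.optim code): each layer scales its update by a trust ratio of weight norm to gradient norm, with weight decay folded in and classic (heavy-ball) momentum on top:
```python
import numpy as np

def lars_step(w, g, v, lr=0.01, momentum=0.9, weight_decay=1e-4, trust_coeff=0.001):
  w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
  # layer-wise trust ratio; fall back to 1.0 when either norm is zero
  trust = trust_coeff * w_norm / (g_norm + weight_decay * w_norm) if w_norm > 0 and g_norm > 0 else 1.0
  update = trust * (g + weight_decay * w)
  v = momentum * v + lr * update          # classic momentum, as the commit notes
  return w - v, v

w, g, v = np.ones(4), np.full(4, 0.1), np.zeros(4)
w, v = lars_step(w, g, v)
```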
chenyu
2c69888654 include negative float in test_dtype (#3884)
* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow
2024-03-24 02:39:15 -04:00
Francis Lam
0145366323 wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it was incorrectly aliasing 16 into the size-8 upcast
on the store alias. now it splits properly into 8, with the
remaining 2 going into the correct local stride
2024-03-23 21:17:42 -04:00
chenyu
a2b2597fc2 replace dtype.name str with render_dtype (#3903)
fixed a bf16 cast issue since bf16 does not have `.name`.
also more robust if there are language-specific type overrides
2024-03-23 19:25:48 -04:00
Alejandro F Queiruga
556dcfb8f2 Fix the result permutation in einsum (#3895)
* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-23 15:48:19 -04:00
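The behavior being fixed, illustrated with numpy's einsum as the reference semantics: the order of the output subscripts dictates a permutation of the result, so `ij,jk->ki` must return the transpose of `ij,jk->ik`:
```python
import numpy as np

a, b = np.arange(6).reshape(2, 3), np.arange(12).reshape(3, 4)
ik = np.einsum("ij,jk->ik", a, b)   # plain matmul, shape (2, 4)
ki = np.einsum("ij,jk->ki", a, b)   # output axes permuted, shape (4, 2)
assert np.array_equal(ki, ik.T)
```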
chenyu
2d3ce53348 touchup test_dtype.test_gradient_dtype (#3887)
add back bad merge from #3613 and add float.double and float.bfloat16 to test
2024-03-22 20:56:45 -04:00