Commit Graph

10633 Commits

Author SHA1 Message Date
Patrick Tsai
0147174ad6 Embedding in one kernel (#4036)
* Embedding is in one kernel

* embedding is one kernel

* rm extra line

* newline

* bert test counts state vars?

* add a test?

* move items around

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-04-02 11:38:21 -04:00
George Hotz
506b1c5892 multigpu works (#4040) 2024-04-02 08:29:37 -07:00
chenyu
05e7f930ee use clang as default instead of llvm for cpu (#4035)
llvm has problems with fp16 and with nan comparisons
2024-04-02 00:02:18 -04:00
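For context on the nan half of this: IEEE-754 ordered comparisons involving NaN are always false, and a backend that miscompiles them silently changes results. A quick illustration in plain Python (illustrative only, unrelated to LLVM's actual codegen):

```python
nan = float("nan")
# every ordered comparison with nan is false; only != is true
assert not (nan < 1.0) and not (nan > 1.0) and not (nan == nan)
assert nan != nan
```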
Dan Hoffman
5311b45053 re-enable has_local check for linearizer test (#4034)
Co-authored-by: Dan Hoffman <daniel.hoffman@intel.com>
2024-04-02 00:02:03 -04:00
George Hotz
bec2aaf404 add beautiful_mnist_multigpu example 2024-04-02 00:54:04 +00:00
George Hotz
7425a0c646 CommandQueue is the future (#3950)
* start of command queue

* cq work

* runs

* cleanup

* outs set

* read is gone

* future buffer work

* command queue is better

* command queue works

* loadops

* delete unneeded

* command queue works

* upd

* fix tests

* use CommandQueue in compile

* delay sync
2024-04-01 17:35:48 -07:00
chenyu
0a34d6016b move exec_alu from uops to ops (#4033)
will use this for const folding in lazy too
2024-04-01 17:20:53 -07:00
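A hedged sketch of an exec_alu-style helper (simplified; the real table in ops.py covers far more ops and also takes a dtype): a single op-to-lambda dispatch that the uop interpreter and lazy const folding can both share:

```python
import operator
from enum import Enum, auto

class BinaryOps(Enum):
    ADD = auto(); MUL = auto(); MAX = auto()

# one shared table: interpret a uop with it, or fold a const expression with it
python_alu = {BinaryOps.ADD: operator.add, BinaryOps.MUL: operator.mul, BinaryOps.MAX: max}

def exec_alu(op: BinaryOps, operands: list):
    return python_alu[op](*operands)

assert exec_alu(BinaryOps.MUL, [3, 4]) == 12
```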
chenyu
82440d3416 don't call contiguous for unpadded const into multi tensor (#4032)
* don't call contiguous for unpadded const into multi tensor

fixed multi const folding for sharded const.
still wip; need to be careful that this does not break the multi-device cache somewhere

* ehh need a memory test for that

* simple sharded memory test
2024-04-01 19:22:14 -04:00
nimlgen
d6ba44bc1e kfd free buffers (#4027)
* kfd free buffers

* unmap

* all test passes

* better pm4

* forgot these

* invalidate only range

* better cache

* forgot

* comments

* fixes
2024-04-01 15:50:58 -07:00
chenyu
77a68fc52f test examples for multi tensor const folding (#4031)
works with a literal const operand now because it's copied to each shard and handled by lazy.
does not work for a sharded const
2024-04-01 16:53:43 -04:00
chenyu
379d52548d const fold left const operand for ADD and MUL (#4029)
* const fold left const operand for ADD and MUL

* neg have dtype issue
2024-04-01 15:09:04 -04:00
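A minimal sketch of the normalization (helper names assumed, not tinygrad's exact code): ADD and MUL are commutative, so a left-hand constant can be swapped to the right and the existing right-side folding rules cover both cases. SUB is excluded since rewriting it via NEG hit the dtype issue noted above:

```python
def fold(op: str, lhs, rhs, lhs_is_const: bool):
    if op in ("ADD", "MUL") and lhs_is_const:
        lhs, rhs = rhs, lhs                  # normalize: const goes on the right
    if op == "ADD" and rhs == 0: return lhs  # x + 0 -> x
    if op == "MUL" and rhs == 1: return lhs  # x * 1 -> x
    if op == "MUL" and rhs == 0: return 0    # x * 0 -> 0
    return None                              # nothing to fold

assert fold("ADD", 0, "x", True) == "x"      # 0 + x folds after the swap
```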
chenyu
0e02d074bd fix Tensor.pow folding for exponent 0 and 1 (#4025) 2024-03-31 19:57:23 -04:00
mmmkkaaayy
a4ae9352bd delete irrelevant JIT regression test (#4024) 2024-03-31 19:35:35 -04:00
chenyu
23c912e338 use *0+1 for Tensor.pow base case, remove function.Zero (#4023)
one less mlops!
2024-03-31 19:20:44 -04:00
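The rewrite in miniature, with numpy standing in for Tensor: x**0 becomes x*0 + 1, which yields ones with x's shape and dtype while keeping x in the graph, so the dedicated Zero mlop can go:

```python
import numpy as np

x = np.array([2.0, -3.0, 0.5], dtype=np.float32)
assert np.array_equal(x * 0 + 1, x ** 0)  # ones_like(x); shape and dtype preserved
```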
chenyu
276ef8eb87 move div folding from tensor to lazy (#4022) 2024-03-31 18:07:39 -04:00
nimlgen
7fa233e8c9 kfd fix kernels with private memory (#4018)
* kfd fix kernels with private memory

* linter
2024-04-01 00:01:30 +03:00
Francis Lam
dcb58d3bed extra/gemm/simple_matvec: add simple_matvec.py (#4021)
we can test with this or add it to CI for benchmarks
2024-03-31 16:38:52 -04:00
chenyu
d3f27761b0 move const folding of ADD/SUB/MUL from tensor to lazy (#4020)
* move const folding of ADD/SUB/MUL from tensor to lazy

will do div and pow separately.

* fix onnx adding with None
2024-03-31 16:35:36 -04:00
chenyu
7f859593b8 fix _to_const_val and const folding around it (#4017)
* fix _to_const_val and const folding around it

is_unrealized_contiguous_const is too strict and almost never hit if the const is expanded.
it suffices to check that there's no pad

* that test is folded

* test_const_folding
2024-03-31 13:09:23 -04:00
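Why checking for pad suffices, illustrated with numpy: expanding a const leaves the same value everywhere (still foldable), while padding introduces values outside the original region (no longer a single const):

```python
import numpy as np

c = np.full((1, 1), 3.0)               # a scalar const
expanded = np.broadcast_to(c, (4, 4))  # expand: one value everywhere -> foldable
padded = np.pad(c, ((0, 3), (0, 3)))   # pad: zeros outside the region -> not foldable
assert np.unique(expanded).size == 1
assert np.unique(padded).size == 2
```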
George Hotz
2abb474d43 kfd driver wip (#3912)
* kfd driver wip

* cleanups

* kfd almost ready to ring doorbell

* ding dong?

* issues with signals

* something

* works

* ops kfd

* add amd_signal_t

* works...sometimes

* program runs

* _gpu_alloc cleanup

* cleanups

* work

* header + enable profiling (#3959)

* header + enable profiling

* just cleaner

* measure

* only local time domain

* remove old comments

* fix with master

* elf parsing (#3965)

* elf parsing

* fix kernels with private

* not used

* clean up

* clean up 2

* add flags

* kfd sdma (#3970)

* working sdma

* remove driver, shorter

* all commands we might need

* svm

* kfd remove hardcoded values (#4007)

* remove hardcoded values

* match above line

* 7k lines + revert hsa

* update that from origin

* fix sdma reg gen

* not the updated SDMA

* compiler_opts

* don't require kfd_ioctl

* get ioctls from python

* get ioctls from python

* remove build_sdma_command

* merge into 64-bit fields

* shorter

* fix property spelling and off by one

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-30 15:08:12 -07:00
chenyu
bee8eeae55 Revert "don't simplify st in _recursive_lazyop when unbind (#4011)" (#4013)
This reverts commit 2b704d7452.
2024-03-30 17:36:17 -04:00
chenyu
2b704d7452 don't simplify st in _recursive_lazyop when unbind (#4011)
st here should be the same; calling simplify.unbind generates a different st and breaks the cache
2024-03-30 17:03:40 -04:00
Francis Lam
04746022b1 extra/gemm/hip_matmul: fix to use new HSA devices and no headers (#3999)
* extra/gemm/hip_matmul: fix to use new HSA devices and no headers

* remove compile_hip import
2024-03-30 15:42:23 -04:00
nimlgen
478c040e1c hsa terminate without exceptions (#4006)
* hsa terminate without exceptions

* cleaner

* linter
2024-03-30 16:03:46 +03:00
chenyu
aa76d566c2 cleanup mamba (#4004)
make it read nicer, clean up some movement methods, and simplify some math.
the 790m, 1.4b, and 2.8b models do not really run.
sampling is not implemented.
jit is incorrect.
some dead code / wrong code paths and stuff copied from torch remain.
2024-03-30 02:50:13 -04:00
George Hotz
f35f9d32f2 rename mlops to function (#4003) 2024-03-29 21:49:00 -07:00
chenyu
c71627fee6 move GlobalCounter to helpers (#4002)
break circular import between ops and buffer
2024-03-30 00:30:30 -04:00
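The usual shape of this fix, as a hedged sketch (field names assumed): the shared state moves into a leaf module that imports neither ops nor buffer, so both sides can import it without forming a cycle:

```python
# helpers.py: imports nothing from ops or buffer, so either can import it freely
class GlobalCounters:
    global_ops: int = 0   # class-level counters, mutated from anywhere
    global_mem: int = 0
    mem_used: int = 0
```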
George Hotz
9eef44521b ScheduleItem uses Buffer (#3995)
* schedule Buffer

* update

* update tests

* master

* works

* remove LoadOps.WAIT

* fix compile2

* bad test

* rename and note
2024-03-29 20:50:27 -07:00
George Hotz
1bd4f01da2 size instead of st.size (#4001) 2024-03-29 19:59:02 -07:00
George Hotz
8f1e34a2a0 early src delete (#3996)
* early src delete

* fix bad test

* fix test_linearizer
2024-03-29 19:46:07 -07:00
Szymon Ożóg
31c8ba8b84 Move transformations to PatternMatcher + clean up existing patterns (#3997) 2024-03-29 19:42:39 -07:00
George Hotz
f916aadaea external that test 2024-03-29 19:35:50 -07:00
George Hotz
c42ed8e99c don't reschedule 2024-03-29 19:17:37 -07:00
chenyu
ecf38f498e beam search resnet eval too in BENCHMARK (#4000) 2024-03-29 21:07:23 -04:00
chenyu
b43e470f80 always use f32 for rand source of randn (#3998)
* always use f32 for source of randn

fixed bfloat16 randn to not have inf.
don't really care about float64. threefry is float32-based too

* HSA is broken
2024-03-29 17:04:34 -04:00
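The idea sketched with numpy (tinygrad's randn differs in detail, e.g. it is threefry-based): run Box-Muller in float32 and cast to the target dtype only at the end, since taking log of a low-precision uniform that rounded to 0 yields inf:

```python
import numpy as np

def randn(*shape, dtype=np.float16):
    u1 = np.random.random(shape).astype(np.float32)  # uniforms stay in f32
    u2 = np.random.random(shape).astype(np.float32)
    out = np.sqrt(-2 * np.log(1 - u1)) * np.cos(2 * np.pi * u2)  # Box-Muller
    return out.astype(dtype)  # cast only after the math ran in f32

assert np.isfinite(randn(1000)).all()
```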
chenyu
6b6461122e test case Tensor.randn should be finite (#3994)
* test case Tensor.randn should be finite

there's a hack to fix float16; we need a generic solution that works with bf16 and threefry

* skip not supported

* bfloat16 local is wrong

* skip RHIP
2024-03-29 14:51:02 -04:00
chenyu
d9ff636cf5 use is to compare with enum (#3993)
* use is to compare with enum

currently it's mixed between `==` and `is`, moved all to `is`

* more
2024-03-29 13:02:56 -04:00
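The pattern in isolation: enum members are singletons, so identity comparison is safe, never matches a non-enum value, and skips __eq__ dispatch:

```python
from enum import Enum, auto

class Ops(Enum):
    ADD = auto(); MUL = auto()

op = Ops.ADD
assert op is Ops.ADD   # identity: the member is a singleton
assert op == Ops.ADD   # also true, but routes through __eq__
assert not (op is Ops.MUL)
```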
Akshit Talwar
0affbbf81c update amx gemm (#3991) 2024-03-29 11:45:03 -04:00
chenyu
4abb8245a6 rhs_order in einsum is argsort twice (#3990)
* rhs_order in einsum is argsort twice

* comment
2024-03-29 11:42:04 -04:00
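The identity behind the title, shown with numpy: argsort applied twice yields the inverse permutation, i.e. the rank of each element:

```python
import numpy as np

x = np.array([30, 10, 20])
order = np.argsort(x)      # [1, 2, 0]: indices that would sort x
ranks = np.argsort(order)  # [2, 0, 1]: rank of each element = inverse permutation
assert (x[order][ranks] == x).all()
```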
chenyu
7bc560ec49 remove outdated bf16 comments in test_dtype (#3987) 2024-03-29 00:56:18 -04:00
uuuvn
8a40d7d423 Shape changing bitcast and assert bitcast in disk (#3973)
* Shape changing bitcast

* only support it on disk

* basic test

* more tests

* RuntimeError instead of assert

* create unique temp files

* move tests that use disk to test_disk_tensor

* linter

* remove assert on error messages

* that's RuntimeError now

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-28 21:49:10 -07:00
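The byte-reinterpretation rule this implements, in numpy terms (the PR restricts it to disk tensors): when the itemsize changes, the last dimension scales so the total byte count is unchanged:

```python
import numpy as np

a = np.zeros((3, 4), dtype=np.int32)
b = a.view(np.int16)      # same bytes, half the itemsize -> last dim doubles
assert b.shape == (3, 8)
```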
chenyu
793ab0512e use ctypes to truncate float64 and float32 in uops (#3986)
this fixed the softmax.argmax bug for ops_python as the float is truncated to float32
2024-03-28 23:56:50 -04:00
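The core trick, runnable as-is: round-tripping a Python float (always float64) through ctypes.c_float truncates it to float32 precision:

```python
import ctypes

def truncate_f32(x: float) -> float:
    return ctypes.c_float(x).value  # stored as a real 32-bit float, read back

print(truncate_f32(1.1))            # 1.100000023841858
assert truncate_f32(1.1) != 1.1     # 1.1 is not representable in float32
```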
chenyu
101a0c683d use ctypes for uops truncate (#3985) 2024-03-28 23:31:34 -04:00
George Hotz
1bf0a7a2d1 move assign logic into lazy.py (#3984)
* move assign logic into lazy.py

* don't check the buffer
2024-03-28 20:26:38 -07:00
chenyu
3fee689ded fix ops_python for test_uops (#3982) 2024-03-28 22:48:55 -04:00
George Hotz
9a6ac2a50a create the buffer with the LazyBuffer (#3977)
* create the buffer with the LazyBuffer

* fixes

* hack underlying buffer when we change dtype

* we only care about allocated buffers

* asserts
2024-03-28 19:31:28 -07:00
chenyu
c4c243f79d update test_uops _equal to use assert_allclose (#3981)
it handles nan
2024-03-28 22:14:45 -04:00
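What "handles nan" means here: numpy's assert_allclose treats NaNs in matching positions as equal (equal_nan=True by default), while plain elementwise equality never matches NaN:

```python
import numpy as np

a = np.array([1.0, float("nan")])
b = np.array([1.0, float("nan")])
np.testing.assert_allclose(a, b)  # passes: matching NaNs compare equal
assert not (a == b).all()         # plain ==: nan != nan
```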
reddyn12
9b5e15db6e Mamba Implementation (#3456)
* first commit

* state back to orig

* mamba comparisons

* rm file

* rename file

* use Tensor.einsum and make default model 370M

* Cleaned code and made a comparison test

* Simplify pull request. Only has 1 mamba implementation now.

* Update prompt

* rm whitespaces

* last space

* remove Einops dependency

* rm unused code

* add tests

* rm print statement

* rm imports

* skip CLANG

* Update skipIf description

* skip model test in CI and add CLANG fix

* rm Device import

* don't be stupid

* Fix conv assign

When the prompt is too short, the logic for the conv_state assign messes up. This can be fixed by padding the tokenized array to a min length of 4. I padded using the empty string token, but I don't know if proper practice is to use the PAD token.

* fix p1

* temp

* fix jit import

---------

Co-authored-by: schlimeszn <schlimeszn@gmail.com>
Co-authored-by: reddyn <nikidsniper@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-28 17:49:12 -07:00
George Hotz
d085837179 hotfix: that mem_used was in the wrong place 2024-03-28 17:09:04 -07:00
chenyu
1fa0351acb fix DEFINE_ACC invalid_value to have same type as localtype (#3980) 2024-03-28 19:21:17 -04:00