Commit Graph

4051 Commits

Author SHA1 Message Date
Szymon Ożóg
ba118abfec improved caching for pointer arithmetic in ptx (#3922)
* improved caching for pointer arithmetic

* Add test for pointer arithmetic caching

* Refactor test
2024-04-04 07:33:48 -07:00
Szymon Ożóg
68fe3527f1 Tensor core ptx (#3894)
* tensor cores

* Merge from master

* faster program start in llvm (#3897)

* Fix the result permutation in einsum (#3895)

* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>

* touchup einsum (#3900)

don't need rhs_letters

* hotfix check ckpts before writing achieved model (#3901)

this killed tinybox green run

* replace dtype.name str with render_dtype (#3903)

fixed some bf16 cast issues since bf16 does not have `.name`.
also more robust if there are lang-specific type overrides

* add --minimal flag to nvrtc (#3899)

* wmma: fix the AMD TC threads to split the first 16 threads (#3904)

previously it was incorrectly aliasing 16 into the size-8 upcast
on the store alias. now it splits it properly into 8, with the
remaining 2 going into the correct local stride

* training cifar with BF16 on CUDA (#3905)

* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls in dataset preprocessing, which convert to float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda

* include negative float in test_dtype (#3884)

* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow

* add to benchmark

* change var name to satisfy mypy

* spacing

* Update to new TensorCore format

* Spacing

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Alejandro F Queiruga <33233447+afqueiruga@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: sekstini <127142660+sekstini@users.noreply.github.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-04 07:32:31 -07:00
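
For the einsum result-permutation fix merged above (#3895), a quick reference check of the intended semantics, using numpy as the reference implementation (tinygrad's einsum must agree with it; the arrays here are illustrative):

```python
import numpy as np

a, b = np.arange(6).reshape(2, 3), np.arange(12).reshape(3, 4)
# the output subscripts "ki" permute the natural result order "ik",
# so the result axes must be permuted to match
assert np.array_equal(np.einsum("ij,jk->ki", a, b), (a @ b).T)
```
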
Szymon Ożóg
92378fb5b6 Ptx mulacc (#3937)
* mulacc

* Move more stuff to pattern matcher

* disable callable from the == check

* disable function passing in pattern matcher

* Add set of dtypes pattern matching + refactor mulacc pattern
2024-04-04 00:15:25 -07:00
George Hotz
3e72d745ea hotfix: make KFD timings right 2024-04-04 05:55:29 +00:00
George Hotz
58d162315c Continuing KFD work (#4065)
* cleanups

* fix kernargs ptr

* mypy passes
2024-04-03 22:48:13 -07:00
chenyu
d219aba962 prepend CLANG_PROGRAM_HEADER in ClangCompiler.render instead of compile (#4063)
src header should be part of the rendered output, and DEBUG=4 includes the header this way
2024-04-03 23:17:56 -04:00
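
A minimal, self-contained sketch of the idea in this commit; CLANG_PROGRAM_HEADER is the name from the commit, while the renderer and compiler bodies below are hypothetical stand-ins. The point is that the header becomes part of what render returns, so the source DEBUG=4 prints is exactly the source compile receives:

```python
CLANG_PROGRAM_HEADER = "#include <math.h>\n"  # illustrative contents only

class ClangCompiler:
  def render(self, name: str, body: str) -> str:
    # the header is part of the rendered output, not bolted on in compile()
    return CLANG_PROGRAM_HEADER + f"void {name}(void) {{ {body} }}\n"

  def compile(self, src: str) -> bytes:
    return src.encode()  # stand-in for the real clang invocation

src = ClangCompiler().render("E_4", "/* kernel body */")
print(src)  # what DEBUG=4 shows: the complete source, header included
```
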
George Hotz
7181ffd630 HWCopyQueue in KFD (#4042)
* HWCopyQueue in KFD

* hw compute queue

* test

* move test

* more tests

* fix wait

* fix multimap

* mes crash

* tests pass but slow

* stuff is working

* one more test
2024-04-03 20:14:24 -07:00
chenyu
e3c0ac9fbf remove old envvar "OPT" (#4060) 2024-04-03 14:55:21 -04:00
chenyu
406cb5fd90 const fold ReduceOps (#4059) 2024-04-03 14:39:28 -04:00
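
Why reducing a constant can fold, as a worked check (numpy is used here only to show the closed forms, not tinygrad internals):

```python
import numpy as np

c, shape = 2.0, (3, 4)
# reductions over a tensor that is the same constant everywhere have
# closed forms, so no kernel needs to run
assert np.full(shape, c).sum() == c * np.prod(shape)  # SUM folds to c * n
assert np.full(shape, c).max() == c                   # MAX folds to c
```
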
chenyu
fe03725b21 const fold cast unrealized_unpadded_const (#4047)
* const fold unrealized_unpadded_const

changed the underlying arg directly

* CAST_BEFORE_VIEW folds some

* fix const index in getitem
2024-04-03 12:31:24 -04:00
Szymon Ożóg
e5a9bff899 Add pattern matcher tests, move uop transforms from assembly to pattern matcher (#4056)
2024-04-03 09:06:43 -07:00
qazal
1ea8fcbe1b graph schedule items (#4054) 2024-04-03 08:52:37 -07:00
George Hotz
52ee5b73b2 update logo (#4055)
* update logo

* update svg

* put svg in file

* Revert "put svg in file"

This reverts commit 735528047a.

* better

* move a tag

* remove extra
2024-04-03 07:16:57 -07:00
chenyu
f61ed869f5 Use exec_alu for lazy const folding (#4039) 2024-04-02 20:52:05 -04:00
Francis Lam
88dcdae485 search: fix counting of upcasts to ignore TC upcasts (#4045)
TC upcasts don't impact the size or complexity of the kernel code
2024-04-02 19:52:05 -04:00
Szymon Ożóg
ccf3c16d6a Refactor the use of pattern matcher in ptx (#4043) 2024-04-02 14:19:51 -07:00
chenyu
85edc493b0 uops const fold rules to prevent tautological compare warnings (#4041)
* uops const fold rules to prevent tautological compare warnings

`bool < false` is false, `true < bool` is false, `a == a` is true, `a != a` is false

* not true for nan

* and nan does not work with llvm

* full truth table test

* revert a==a

* comments and indents
2024-04-02 16:45:58 -04:00
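
The rules above, checked against plain Python semantics; this is why the `a == a` fold had to be reverted:

```python
# the bool rules always hold: False is the minimum bool, True the maximum
for b in (False, True):
  assert (b < False) is False  # `bool < false` is false
  assert (True < b) is False   # `true < bool` is false

# the float rules do not: NaN compares unequal to itself
nan = float("nan")
assert (nan == nan) is False   # so folding `a == a` to true is unsound
assert (nan != nan) is True    # and folding `a != a` to false is too
```
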
Léo
e879e16c48 docs: add warning message for conda users when using METAL (#3917)
* docs: add warning message for conda users when using METAL

* fix: conda metal warning too long. disabled line length check

* docs: changed conda METAL warning to include DISABLE_COMPILER_CACHE=1

* fix(metal): now detecting invalid library magic

* format: removed noqa E501

* fix(metal): conda error line len

* fix: typo

---------

Co-authored-by: Léo Paillé <leo.paille@enseirb-matmeca.fr>
2024-04-02 09:22:24 -07:00
Patrick Tsai
0147174ad6 Embedding in one kernel (#4036)
* Embedding is in one kernel

* embedding is one kernel

* rm extra line

* newline

* bert test counts state vars?

* add a test?

* move items around

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-04-02 11:38:21 -04:00
George Hotz
506b1c5892 multigpu works (#4040) 2024-04-02 08:29:37 -07:00
chenyu
05e7f930ee use clang as default instead of llvm for cpu (#4035)
llvm has problems with fp16 and comparing with nan
2024-04-02 00:02:18 -04:00
Dan Hoffman
5311b45053 re-enable has_local check for linearizer test (#4034)
Co-authored-by: Dan Hoffman <daniel.hoffman@intel.com>
2024-04-02 00:02:03 -04:00
George Hotz
bec2aaf404 add beautiful_mnist_multigpu example 2024-04-02 00:54:04 +00:00
George Hotz
7425a0c646 CommandQueue is the future (#3950)
* start of command queue

* cq work

* runs

* cleanup

* outs set

* read is gone

* future buffer work

* command queue is better

* command queue works

* loadops

* delete unneeded

* command queue works

* upd

* fix tests

* use CommandQueue in compile

* delay sync
2024-04-01 17:35:48 -07:00
chenyu
0a34d6016b move exec_alu from uops to ops (#4033)
will use this for const folding in lazy too
2024-04-01 17:20:53 -07:00
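
A simplified sketch of what an exec_alu-style helper does (the name is from the commit; the signature and op table here are trimmed illustrations, tinygrad's real helper also takes a dtype and covers the full op set):

```python
import math, operator

python_alu = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul,
              "MAX": max, "CMPLT": operator.lt, "SQRT": math.sqrt}

def exec_alu(op: str, srcs: tuple):
  # evaluate an ALU op on python values, so both uops and lazy
  # can const fold through the same table
  return python_alu[op](*srcs)

print(exec_alu("MUL", (3, 4)))  # 12, computed at graph-build time
```
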
chenyu
82440d3416 don't call contiguous for unpadded const into multi tensor (#4032)
* don't call contiguous for unpadded const into multi tensor

fixed multi const folding for sharded const.
still wip, need to be careful that this does not break multi device cache somewhere

* ehh need a memory test for that

* simple sharded memory test
2024-04-01 19:22:14 -04:00
nimlgen
d6ba44bc1e kfd free buffers (#4027)
* kfd free buffers

* unmap

* all test passes

* better pm4

* forgot these

* invalidate only range

* better cache

* forgot

* comments

* fixes
2024-04-01 15:50:58 -07:00
chenyu
77a68fc52f test examples for multi tensor const folding (#4031)
works with a literal const operand now because it's copied to each shard and handled by lazy.
does not work for sharded const
2024-04-01 16:53:43 -04:00
chenyu
379d52548d const fold left const operand for ADD and MUL (#4029)
* const fold left const operand for ADD and MUL

* neg has a dtype issue
2024-04-01 15:09:04 -04:00
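
A toy version of the left-operand fold (a hypothetical helper, not tinygrad's code): ADD and MUL are commutative, so a constant on the left folds by first normalizing it to the right.

```python
def fold_commutative(op, lhs, rhs):
  if op in ("ADD", "MUL") and isinstance(lhs, (int, float)):
    lhs, rhs = rhs, lhs                      # put the constant on the right
  if isinstance(rhs, (int, float)):
    if op == "ADD" and rhs == 0: return lhs  # x + 0 -> x
    if op == "MUL" and rhs == 1: return lhs  # x * 1 -> x
    if op == "MUL" and rhs == 0: return 0    # x * 0 -> 0
  return (op, lhs, rhs)

print(fold_commutative("MUL", 1, "x"))  # 'x' -- a left const now folds too
```
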
chenyu
0e02d074bd fix Tensor.pow folding for exponent 0 and 1 (#4025) 2024-03-31 19:57:23 -04:00
mmmkkaaayy
a4ae9352bd delete irrelevant JIT regression test (#4024) 2024-03-31 19:35:35 -04:00
chenyu
23c912e338 use *0+1 for Tensor.pow base case, remove function.Zero (#4023)
one less mlops!
2024-03-31 19:20:44 -04:00
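
The base case as the commit states it: x**0 lowers to x*0+1, which yields ones of the right shape and dtype without a dedicated Zero function. A quick check, assuming tinygrad's public Tensor API:

```python
from tinygrad import Tensor

x = Tensor([1.0, 2.0, 3.0])
assert (x * 0 + 1).tolist() == [1.0, 1.0, 1.0]  # what pow(0) builds
assert x.pow(0).tolist() == [1.0, 1.0, 1.0]
```
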
chenyu
276ef8eb87 move div folding from tensor to lazy (#4022) 2024-03-31 18:07:39 -04:00
nimlgen
7fa233e8c9 kfd fix kernels with private memory (#4018)
* kfd fix kernels with private memory

* linter
2024-04-01 00:01:30 +03:00
Francis Lam
dcb58d3bed extra/gemm/simple_matvec: add simple_matvec.py (#4021)
we can test with this or add it to CI for benchmarks
2024-03-31 16:38:52 -04:00
chenyu
d3f27761b0 move const folding of ADD/SUB/MUL from tensor to lazy (#4020)
* move const folding of ADD/SUB/MUL from tensor to lazy

will do div and pow separately.

* fix onnx adding with None
2024-03-31 16:35:36 -04:00
chenyu
7f859593b8 fix _to_const_val and const folding around it (#4017)
* fix _to_const_val and const folding around it

is_unrealized_contiguous_const is too strict and almost never hit if const is expanded.
it suffices to check that there's no pad

* that test is folded

* test_const_folding
2024-03-31 13:09:23 -04:00
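
The distinction the fix relies on, in a toy numpy picture (illustrative only): an expanded const is still the same value everywhere, so it folds; a padded one is not.

```python
import numpy as np

base = np.array(2.0)
expanded = np.broadcast_to(base, (4,))  # [2. 2. 2. 2.] -> still a const, folds
padded = np.pad(expanded, (1, 1))       # [0. 2. 2. 2. 2. 0.] -> not a const
print(np.all(expanded == 2.0), np.all(padded == 2.0))  # True False
```
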
George Hotz
2abb474d43 kfd driver wip (#3912)
* kfd driver wip

* cleanups

* kfd almost ready to ring doorbell

* ding dong?

* issues with signals

* something

* works

* ops kfd

* add amd_signal_t

* works...sometimes

* program runs

* _gpu_alloc cleanup

* cleanups

* work

* header + enable profiling (#3959)

* header + enable profiling

* just cleaner

* measure

* only local time domain

* remove old comments

* fix with master

* elf parsing (#3965)

* elf parsing

* fix kernels with private

* not used

* clean up

* clean up 2

* add flags

* kfd sdma (#3970)

* working sdma

* remove driver, shorter

* all commands we might need

* svm

* kfd remove hardcoded values (#4007)

* remove hardcoded values

* match above line

* 7k lines + revert hsa

* update that from origin

* fix sdma reg gen

* not the updated SDMA

* compiler_opts

* don't require kfd_ioctl

* get ioctls from python

* get ioctls from python

* remove build_sdma_command

* merge into 64-bit fields

* shorter

* fix property spelling and off by one

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-30 15:08:12 -07:00
chenyu
bee8eeae55 Revert "don't simplify st in _recursive_lazyop when unbind (#4011)" (#4013)
This reverts commit 2b704d7452.
2024-03-30 17:36:17 -04:00
chenyu
2b704d7452 don't simplify st in _recursive_lazyop when unbind (#4011)
st here should be the same; calling simplify.unbind generates a different st and breaks the cache
2024-03-30 17:03:40 -04:00
Francis Lam
04746022b1 extra/gemm/hip_matmul: fix to use new HSA devices and no headers (#3999)
* extra/gemm/hip_matmul: fix to use new HSA devices and no headers

* remove compile_hip import
2024-03-30 15:42:23 -04:00
nimlgen
478c040e1c hsa terminate without exceptions (#4006)
* hsa terminate without exceptions

* cleaner

* linter
2024-03-30 16:03:46 +03:00
chenyu
aa76d566c2 cleanup mamba (#4004)
make it read nicer and clean up some movement methods and math simplifications.
the 790m, 1.4b, and 2.8b models do not really run.
sampling is not implemented.
jit is incorrect.
some dead code / wrong code paths and stuff copied from torch remain.
2024-03-30 02:50:13 -04:00
George Hotz
f35f9d32f2 rename mlops to function (#4003) 2024-03-29 21:49:00 -07:00
chenyu
c71627fee6 move GlobalCounter to helpers (#4002)
break circular import between ops and buffer
2024-03-30 00:30:30 -04:00
George Hotz
9eef44521b ScheduleItem uses Buffer (#3995)
* schedule Buffer

* update

* update tests

* master

* works

* remove LoadOps.WAIT

* fix compile2

* bad test

* rename and note
2024-03-29 20:50:27 -07:00
George Hotz
1bd4f01da2 size instead of st.size (#4001) 2024-03-29 19:59:02 -07:00
George Hotz
8f1e34a2a0 early src delete (#3996)
* early src delete

* fix bad test

* fix test_linearizer
2024-03-29 19:46:07 -07:00
Szymon Ożóg
31c8ba8b84 Move transformations to PatternMatcher + clean up existing patterns (#3997) 2024-03-29 19:42:39 -07:00
George Hotz
f916aadaea external that test 2024-03-29 19:35:50 -07:00