10633 Commits

Author SHA1 Message Date
andresgit
7fd12aba85 graph remove input buffer references (#4100)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-08 16:49:16 -04:00
chenyu
078d841479 add SPLIT_REDUCEOP to disable reduce split (#4115)
verify with `SPLIT_REDUCEOP=0 BIG=2 MPS=1 python3 -m pytest -rA test/test_speed_v_torch.py -k sum`. 10X slower on mac
2024-04-08 16:31:08 -04:00
qazal
eea42d864f account for all outputs (#4113) 2024-04-08 10:04:19 -07:00
chenyu
dbd39ab78a setitem support setting python const (#4111) 2024-04-08 11:37:50 -04:00
chenyu
f8dc82a8a7 use single tensor for llama kv cache (#4108)
similar to optimization in gpt2
2024-04-08 00:38:32 -04:00
chenyu
92c0675ccf setitem initial support (#4093)
* wip setitem

it's an eager assign to output shapetracker view

* cleanups and tests

* more cleanups
2024-04-07 20:35:22 -04:00
geohotstan
183708b3fd broadcast expand to match torch (#4085)
* initial version

* heh gimme grrrreen

* version 2

* clean ups

* some test confusion

* fix onnx

* rename to _broadcast_tensors

* improved errors and test

* fixed?

* some test fixup

* version 3 lol

* comments

* cleaner

* add failure test for expand to 0 test

* 1 more assertRaises test

* make err msg better

* also rewrite the expand onnx op? :s
2024-04-07 16:23:13 -04:00
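
For readers unfamiliar with the torch-style broadcasting that the commit above targets, the shape rule can be sketched in plain Python. This is only an illustration of the general NumPy/torch convention, not the code from #4085; the helper name `broadcast_shape` is hypothetical and does not correspond to tinygrad's `_broadcast_tensors`.

```python
def broadcast_shape(*shapes):
    """Return the broadcast result shape for the given shapes.

    Hypothetical helper illustrating the NumPy/torch rule; it is not
    tinygrad's implementation.
    """
    ndim = max(len(s) for s in shapes)
    # Left-pad shorter shapes with 1s so all shapes have the same rank.
    padded = [(1,) * (ndim - len(s)) + tuple(s) for s in shapes]
    out = []
    for dims in zip(*padded):
        # Along each axis, every size must be 1 or agree with the others.
        non_one = {d for d in dims if d != 1}
        if len(non_one) > 1:
            raise ValueError(f"cannot broadcast sizes {dims} along one axis")
        out.append(non_one.pop() if non_one else 1)
    return tuple(out)


print(broadcast_shape((3, 1), (1, 4)))
print(broadcast_shape((5,), (2, 1, 5)))
```

Under this rule a size-1 axis may expand to any size, including 0, while two different sizes that are both not 1 (for example 2 and 0) are rejected.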
uuuvn
2b81d9b334 Fix broken test (#4104) 2024-04-07 12:02:12 -04:00
chenyu
9a95d87366 metal CI run llama with 4 shards (#4103)
this can catch multi tensor issue on mac.
2024-04-07 11:04:08 -04:00
George Hotz
444d2a7487 hotfix: fix SDMA read_pointer_address in KFD 2024-04-07 13:13:15 +00:00
uuuvn
bb7567b365 Fix metal (#4101) 2024-04-07 05:21:19 -07:00
chenyu
bdbcac67f1 assign jit test case with other tensor as input (#4098)
hmm it works
2024-04-06 14:41:14 -04:00
George Hotz
e4a1858471 revert command queue (#4097) 2024-04-06 08:58:18 -07:00
George Hotz
97c402d69e use imagenet spawn (#4096) 2024-04-06 08:34:10 -07:00
George Hotz
fffd9b05f5 mock mnist data for imagenet trainer (#4095)
* mock mnist data for imagenet

* move print and test

* needed to reshape
2024-04-06 08:08:40 -07:00
George Hotz
8739d33fe9 kfd: disable copy_from_fd while debugging (#4091)
* kfd: disable copy_from_fd while debugging

* increase timeout to a minute
2024-04-05 18:02:58 -07:00
George Hotz
93824e59eb support MOCKDATA=1 for resnet (#4090)
* mockdata for resnet

* fix eval, revert hsa
2024-04-05 17:19:18 -07:00
George Hotz
164329a8ea address kfd feedback (#4087)
* address kfd feedback

* signals cleanup

* signals cleanup

* handle 2 doorbell pages correctly

* signal reset cleanup

* signals cleanup

* more GTT

* cleanups

* minor cleanups
2024-04-05 15:24:41 -07:00
geohotstan
dafa42e864 clean up (#4081)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-05 11:57:44 -04:00
Akshit Talwar
750ecf8fef replace slice by pad/shrink in _pool (#4082) 2024-04-05 11:47:22 -04:00
George Hotz
a337922c44 more work on kfd (#4079)
* more work on kfd

* fix multitensor test on kfd

* stuff
2024-04-05 08:36:36 -07:00
chenyu
e7ff5102cf failed test in test_pattern_matcher (#4080)
something about the PTX rewrite is incorrect: it has duplicated rewritten uops
2024-04-05 02:53:50 -04:00
chenyu
a023a1ed87 update github action to actions/cache@v4 (#4077)
get rid of warning `Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/cache@v3.`
2024-04-04 22:24:26 -04:00
George Hotz
28ec6c67be hotfix: hlb_cifar KFD works 2024-04-05 02:19:14 +00:00
chenyu
1de9778949 import Buffer and BufferOption from tinygrad.buffer (#4076) 2024-04-04 22:12:23 -04:00
chenyu
9e0ebf8979 remove dtype from FlopCounter (#4075)
the annoying thing to remove all FlopCounter is that for device that does not support local, matmul index alu is huge.
we can remove the dtype first.

sneak in updating `ruff` command to `ruff check`
2024-04-04 21:23:28 -04:00
George Hotz
3de855ea50 don't use SVM memory in KFD (#4072)
* don't use SVM memory in KFD

* copy from fd

* cleanups

* transfer

* hacks

* ops_hsa

* tighter API
2024-04-04 17:33:21 -07:00
chenyu
5e6e6c9a67 use ConstType in various const function type hint (#4074) 2024-04-04 20:32:07 -04:00
chenyu
c1cffed1df add LazyOp.dtype (#4073)
an inferred cached_property.
removed all cases that use get_lazyop_info just to get the dtype of an op.
prereq to remove InterpretedFlopCounter
2024-04-04 17:38:19 -04:00
chenyu
f836d6a03f is_unrealized_unpadded_const -> is_unrealized_unmasked_const (#4071)
realized #3580 was doing the same thing. unmasked is more accurate
2024-04-04 14:25:17 -04:00
Szymon Ożóg
82b7b9655f test for dtype set (#4069) 2024-04-04 11:24:33 -04:00
geohotstan
1a1dd1c1a7 add and enable tests for indexing const folding (#4068)
* enable test in test_indexing

* added tests

* rename stuff

* del a test case cuz it's loadops.copy
2024-04-04 10:46:28 -04:00
Szymon Ożóg
ba118abfec improved caching for pointer arithmetics in ptx (#3922)
* improved caching for pointer arithmetics

* Add test for pointer arithmetics caching

* Refactor test
2024-04-04 07:33:48 -07:00
Szymon Ożóg
68fe3527f1 Tensor core ptx (#3894)
* tensor cores

* Merge from master

* faster program start in llvm (#3897)

* Fix the result permutation in einsum (#3895)

* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>

* touchup einsum (#3900)

don't need rhs_letters

* hotfix check ckpts before writing achieved model (#3901)

this killed tinybox green run

* replace dtype.name str with render_dtype (#3903)

fixed some bf16 cast issue since it does not have `.name`.
also more robust if there are lang specific type override

* add --minimal flag to nvrtc (#3899)

* wmma: fix the AMD TC threads to split the first 16 threads (#3904)

previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias.  now it splits it properly into 8 and the
remaining 2 into the correct local stride

* training cifar with BF16 on CUDA (#3905)

* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls on dataset preprocessing, which converts into float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda

* include negative float in test_dtype (#3884)

* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow

* add to benchmark

* change var name to satisfy mypy

* spacing

* Update to new TensorCore format

* Spacing

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Alejandro F Queiruga <33233447+afqueiruga@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: sekstini <127142660+sekstini@users.noreply.github.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-04 07:32:31 -07:00
Szymon Ożóg
92378fb5b6 Ptx mulacc (#3937)
* mulacc

* Move more stuff to pattern matcher

* disable callable from the == check

* disable function passing in pattern matcher

* Add set of dtypes pattern matching + refactor mulacc pattern
2024-04-04 00:15:25 -07:00
George Hotz
3e72d745ea hotfix: make KFD timings right 2024-04-04 05:55:29 +00:00
George Hotz
58d162315c Continuing KFD work (#4065)
* cleanups

* fix kernargs ptr

* mypy passes
2024-04-03 22:48:13 -07:00
chenyu
d219aba962 prepend CLANG_PROGRAM_HEADER in ClangCompiler.render instead of compile (#4063)
src header should be part of the rendered output, and DEBUG=4 includes the header this way
2024-04-03 23:17:56 -04:00
George Hotz
7181ffd630 HWCopyQueue in KFD (#4042)
* HWCopyQueue in KFD

* hw compute queue

* test

* move test

* more tests

* fix wait

* fix multimap

* mes crash

* tests pass but slow

* stuff is working

* one more test
2024-04-03 20:14:24 -07:00
chenyu
e3c0ac9fbf remove old envvar "OPT" (#4060) 2024-04-03 14:55:21 -04:00
chenyu
406cb5fd90 const fold ReduceOps (#4059) 2024-04-03 14:39:28 -04:00
chenyu
fe03725b21 const fold cast unrealized_unpadded_const (#4047)
* const fold unrealized_unpadded_const

changed the underlying arg directly

* CAST_BEFORE_VIEW folds some

* fix const index in getitem
2024-04-03 12:31:24 -04:00
Szymon Ożóg
e5a9bff899 Add pattern matcher tests, move uop transforms from assembly to pattern matcher (#4056)
2024-04-03 09:06:43 -07:00
qazal
1ea8fcbe1b graph schedule items (#4054) 2024-04-03 08:52:37 -07:00
George Hotz
52ee5b73b2 update logo (#4055)
* update logo

* update svg

* put svg in file

* Revert "put svg in file"

This reverts commit 735528047a.

* better

* move a tag

* remove extra
2024-04-03 07:16:57 -07:00
chenyu
f61ed869f5 Use exec_alu for lazy const folding (#4039) 2024-04-02 20:52:05 -04:00
Francis Lam
88dcdae485 search: fix counting of upcasts to ignore TC upcasts (#4045)
TC upcasts don't impact the size or complexity of the kernel code
2024-04-02 19:52:05 -04:00
Szymon Ożóg
ccf3c16d6a Refactor the use of pattern matcher in ptx (#4043) 2024-04-02 14:19:51 -07:00
chenyu
85edc493b0 uops const fold rules to prevent tautological compare warnings (#4041)
* uops const fold rules to prevent tautological compare warnings

`bool < false` is false, `true < bool` is false, `a == a` is true, `a != a` is false

* not true for nan

* and nan does not work with llvm

* full truth table test

* revert a==a

* comments and indents
2024-04-02 16:45:58 -04:00
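
The comparison identities listed in the commit above, and the reason the `a == a` rule was reverted, can be checked with a small plain-Python sketch. It only enumerates the truth table using Python's own `bool` and `float` semantics; it is not the uops rewrite code, and all variable names are illustrative.

```python
# Full truth table for `a < b` over booleans.
table = {(a, b): a < b for a in (False, True) for b in (False, True)}

# `bool < false` is false for every value of the left-hand side.
always_false_rhs_false = all(table[(a, False)] is False for a in (False, True))

# `true < bool` is false for every value of the right-hand side.
always_false_lhs_true = all(table[(True, b)] is False for b in (False, True))

# The self-comparison rules do not hold for floats: NaN is never equal to itself.
nan = float("nan")
self_equal_for_nan = nan == nan      # False, so `a == a` cannot fold to true
self_not_equal_for_nan = nan != nan  # True, so `a != a` cannot fold to false

print(always_false_rhs_false, always_false_lhs_true)
print(self_equal_for_nan, self_not_equal_for_nan)
```

The two boolean rules are safe to fold because they hold for every input, whereas the self-comparison rules would need to exclude floating-point operands.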
Léo
e879e16c48 docs: add warning message for conda users when using METAL (#3917)
* docs: add warning message for conda users when using METAL

* fix: conda metal warning too long. disabled line length check

* docs: changed conda METAL warning to include DISABLE_COMPILER_CACHE=1

* fix(metal): now detecting invalid library magic

* format: removed noqa E501

* fix(metal): conda error line len

* fix: typo

---------

Co-authored-by: Léo Paillé <leo.paille@enseirb-matmeca.fr>
2024-04-02 09:22:24 -07:00