Commit Graph

10417 Commits

George Hotz
56f44bd10e move the compiler cache to be global (#2957)
* move the compiler cache to be global

* remove non-robust test

* remove dead code
2024-01-01 10:59:56 -08:00
George Hotz
063f465604 simpler webgpu (#2956)
* simpler webgpu

* skip that test
2024-01-01 10:28:59 -08:00
Shawn Hagler
fea20d71b3 add /opt/cuda/include directory (#2920) 2023-12-30 08:16:42 -08:00
chenyu
0d6e264c48 cleanup Tensor.triu and Tensor.tril (#2953)
`.where` handles the dtype and shape conversion for the scalar 0, so there's no need to use zeros_like
2023-12-29 22:27:18 -05:00
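A minimal numpy sketch of the pattern this cleanup relies on (a hypothetical helper, not the repo's exact code): a boolean mask plus `where(x, 0)` replaces zeros_like, since where casts and broadcasts the scalar 0 itself.

```python
import numpy as np

def triu_sketch(x: np.ndarray, k: int = 0) -> np.ndarray:
    rows, cols = np.indices(x.shape[-2:])
    keep = (cols - rows) >= k      # True on and above the k-th diagonal
    return np.where(keep, x, 0)    # the 0 picks up x's dtype and shape

print(triu_sketch(np.arange(9).reshape(3, 3)))
```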
chenyu
e53b96fdbb fix TC=2 tensor core op test (#2951)
* print DEBUG for TC=2 in CI

* enable TC=2

* no need to check src type

* LOAD has side effect

* don't push any local buffer

* update comment

* and BARRIER
2023-12-29 21:39:49 -05:00
chenyu
ad4472e6e8 cleanup llama apply_rotary_emb and other helpers (#2950)
* cleanup llama apply_rotary_emb and other helpers

used Ellipsis and other higher-level tensor functions.
disabled the half @ half -> half tensor core as it fails uop dtype checks

* keep hip 8x8->8 wmma
2023-12-29 11:39:15 -05:00
chenyu
61e255d197 use max for gpt2 and llama (#2949)
not using argmax yet because there's a multinomial outside of the function.
2023-12-28 23:26:00 -05:00
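A hedged sketch of what sampling with max instead of argmax can look like (hypothetical helper, not the repo's code): the max becomes a one-hot distribution, which a multinomial living outside the function can still consume.

```python
import numpy as np

def greedy_probs(logits: np.ndarray) -> np.ndarray:
    # one-hot at the max, renormalized so it stays a valid distribution
    one_hot = (logits == logits.max(axis=-1, keepdims=True)).astype(np.float32)
    return one_hot / one_hot.sum(axis=-1, keepdims=True)

print(greedy_probs(np.array([0.1, 2.0, 0.3])))  # [0. 1. 0.]
```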
chenyu
c7b106bf9c hotfix float4 only supports float and half (#2948)
#2942 broke coder
2023-12-28 20:23:52 -05:00
chenyu
2f67f1e580 remove obsolete TODO in beautiful_mnist (#2946)
the compiler error was due to `error: call to 'max' is ambiguous` when we had max(int, float) in a kernel.
it was first fixed in 4380ccb1, the non-fp32 math PR, and further solidified by the dtype refactor
2023-12-28 17:09:23 -05:00
chenyu
50f2e31d26 cleanup float4 grouping in global_load and global_store (#2942)
* cleanup float4 grouping in global_load and global_store

* fix test decorator
2023-12-27 14:10:04 -05:00
chenyu
54629b56d2 minor cleanup in kernel and linearizer (#2937)
* minor cleanup in kernel and linearizer

fewer long lines, fixed spacing, and colocated variables

* no deadline in hypothesis test
2023-12-26 12:05:32 -05:00
chenyu
820f2e054e fix PADTO optimization (#2935)
the correct condition is that PADTO cannot be applied to a reduce axis, not that Reduce.MAX appears in the ops.
even for Reduce.SUM it's possible that the reduce axis had a div before it, so the padded 0 becomes inf and the sum over it is incorrect.
2023-12-25 22:52:49 -05:00
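A tiny numpy illustration of the bug this commit describes (not the kernel IR itself): when the reduce axis saw a div, the 0 that PADTO pads in becomes inf and poisons even a SUM reduce.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])
padded = np.pad(x, (0, 1))      # PADTO-style pad with 0: [1, 2, 4, 0]
print((1.0 / x).sum())          # 1.75, the correct reduction
print((1.0 / padded).sum())     # inf, because 1/0 = inf
```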
qazal
dca5e4fe74 tensor == tensor should be bool (#2916)
* return bool

* add tests to the type spec

* fix multinomial

* fix tril

* fix round

* fix NegativeLogLikelihoodLoss

* rm debug

* webgpu

* more webgpu

* bitwise or for adding two bools

* onnx ops dont need to cast anymore

* Revert "bitwise or for adding two bools"

This reverts commit b413babffa.

* workaround for metal neg

* just the tests in the type spec
2023-12-25 12:38:47 -05:00
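A quick check of the rule this PR establishes; the dtypes import path is an assumption, as it has moved between tinygrad versions.

```python
from tinygrad.tensor import Tensor
from tinygrad.dtype import dtypes  # import path assumed

# elementwise == now yields a bool tensor instead of the operands' dtype
assert (Tensor([1, 2, 3]) == Tensor([1, 0, 3])).dtype == dtypes.bool
```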
chenyu
8a8aed23d2 test dtypes of return values of cumsum, argmax/min, multinomial (#2933)
* test dtypes of return values of cumsum, argmax/min, multinomial

cumsum behaves like sum, and functions that return an index return dtypes.default_int

* because webgpu is different
2023-12-25 11:33:17 -05:00
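A condensed version of what the new tests check (import path assumed):

```python
from tinygrad.tensor import Tensor
from tinygrad.dtype import dtypes  # import path assumed

t = Tensor([2, 1, 3])
assert t.cumsum(0).dtype == t.sum().dtype      # cumsum promotes like sum
assert t.argmax().dtype == dtypes.default_int  # index results use default_int
```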
qazal
12996d3a7d green linearizer asserts for ops (#2800)
* these asserts should pass

* fix that assert

* ALU dtypes

* acc dtype for group_for_reduce

* cast image ALUs to the base dtype

* remove all casts from linearizer

* fix argmax

* fix multinomial

* fix __getitem__

* Revert "fix __getitem__"

This reverts commit 62ad719bfa.

* fix MemBuffer outputs being wrong when there is an arange + ALU with a different dtype

e.g. fancy slicing (int, float), bert embeddings (int, long)

this should be fixed in lazy instead of having to break the kernel

* cleanup argmax fix

* fix matmul in ints

cast in the end

* fix llama

* skip wrong hardcoded asts in the worlds dataset

* fix llama p2

* cleanup missing parts of the diff

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2023-12-25 10:41:54 -05:00
chenyu
1fb815e77e hotfix fix coder. RMSNorm cannot have float16 input (#2932)
* hotfix fix coder. RMSNorm cannot have float16 input

* update real world test due to new kernels

* more type casts
2023-12-25 02:28:11 -05:00
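A numpy illustration of why RMSNorm can't take float16 input: squaring moderate fp16 values already overflows (fp16 max is about 65504), so the mean-of-squares needs float32.

```python
import numpy as np

x = np.array([300.0], dtype=np.float16)
print((x * x).mean())                      # inf: 300 * 300 = 90000 > 65504
print((x.astype(np.float32) ** 2).mean())  # 90000.0, the intended value
```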
chenyu
b469fe3723 add CMPEQ (#2931)
* CMPEQ

* work

* fix onnx

* fix round

* fix webgpu

* prettier

* no PADTO in actions
2023-12-25 00:15:55 -05:00
Will
016aebcd84 Fixed Tensor.randint() not accepting tuple shapes (#2923)
* ww/Fixed Tensor.randint() to accept shape tuples ()

* ww/Wrote a test to cover this typo

* ww/Updated Tensor random objects to optionally take (,) or *() to be more consistent

* ww/no lint no worries

* ww/Made peace with linter

* ww/Added a new line; can't reduce line size without reducing readability

* ww/reverted to using .mul
2023-12-24 20:32:26 -05:00
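The two call styles this PR makes consistent, as a sketch; the keyword names are assumed from the usual randint signature.

```python
from tinygrad.tensor import Tensor

a = Tensor.randint(2, 3, low=0, high=10)    # varargs shape
b = Tensor.randint((2, 3), low=0, high=10)  # tuple shape, the case this fixes
assert a.shape == b.shape == (2, 3)
```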
chenyu
2dc99af169 clean up manual cast dtypes in tensor.py (#2930)
don't need a contiguous between two casts, and don't need to cast bool into float before mul
2023-12-24 13:03:41 -05:00
Isalia20
8de1fc2539 Einsum space fix (#2927)
* space removal in formula and a single test to cover it

* space in torch einsum as well

* replacing spaces in a `formula` variable to support stripping all the spaces
2023-12-24 01:23:27 -05:00
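A sketch of the behavior the new test covers: a formula with spaces parses the same as its compact form.

```python
from tinygrad.tensor import Tensor

x, y = Tensor.ones(2, 3), Tensor.ones(3, 4)
a = Tensor.einsum("ij,jk->ik", x, y)
b = Tensor.einsum("i j, j k -> i k", x, y)  # spaces are stripped now
assert a.shape == b.shape == (2, 4)
```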
chenyu
b55b55d56e use at least int32 and uint32 for sum output (#2926)
* use at least int32 and uint32 for sum output

* use the correct type for acc

* fix opencl

* llvm mulacc
2023-12-24 01:14:54 -05:00
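A quick check of the promotion rule (import path assumed): summing a small-int tensor accumulates in at least 32 bits.

```python
from tinygrad.tensor import Tensor
from tinygrad.dtype import dtypes  # import path assumed

t = Tensor([1, 2, 3], dtype=dtypes.int8)
assert t.sum().dtype == dtypes.int32  # promoted from int8 by this change
```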
chenyu
d424babe2c tensor.py cleanup around Tensor.slice (#2921)
use None for no-op slice and pad
2023-12-22 19:46:39 -05:00
chenyu
089703a390 cleanup test_dtype_alu (#2919)
wrapped long lines and lowered atol for METAL.sin to 2, since the absolute difference of two sines is bounded by 2
2023-12-22 17:29:31 -05:00
chenyu
3ba591c3fd less outdated abstraction.py (#2917)
removed some old terms and updated types and code pointers
2023-12-22 15:31:02 -05:00
chenyu
50927defad s/lazydata.realized/lazydata.base.realized/g (#2914)
* s/lazydata.realized/lazydata.base.realized/g

* not that
2023-12-22 14:45:13 -05:00
chenyu
2783e1b50d bugfix Tensor.item when it's unbased (#2913)
it's possible for a numel-1 tensor's lazydata to be unbased, so it should call lazydata.base.realized
2023-12-22 13:50:06 -05:00
Oleg Rybalko
c3133adb8c Disk shm refactor (#2912)
* better support for platform dependent flags

* osx test support

* removed unused import and made line length <150

* changed osx ci shm

* lstrip in case SharedMemory._name is passed
2023-12-22 09:23:37 -08:00
chenyu
3855432265 don't use numpy to create Tensor(None) (#2909)
* don't use numpy to create Tensor(None)

empty suffices

* parentheses
2023-12-22 01:07:44 -05:00
chenyu
50cfb1fb3a update onnx model links (#2908)
updated in https://github.com/onnx/models/pull/644
2023-12-22 00:19:41 -05:00
chenyu
1bbeb3fe2f remove the different rtol / atol for openpilot CUDA in benchmark (#2907)
not sure what the issue was, but it seems to be fixed on master
2023-12-21 22:23:39 -05:00
chenyu
a543d8bea8 fuzz default dtypes for some test_dtype tests (#2906)
* fuzz default dtypes for some test_dtype tests

* ocd

* setUp and tearDown
2023-12-21 22:00:21 -05:00
wozeparrot
5f3d5cfb02 catch cycles in print_tree (#2891)
* feat: smaller tree on references

* fix: shorter line

* fix: huh

* fix: should be all

* feat: cleaner

* fix: extra imports

* fix: pass by reference
2023-12-21 18:40:37 -08:00
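A hedged sketch of the shape of the fix (hypothetical code, not the repo's print_tree): nodes already seen are rendered as a reference instead of being recursed into, so a cyclic or heavily shared graph can't loop forever.

```python
def print_tree(node, seen=None, depth=0):
    if seen is None: seen = set()
    if id(node) in seen:
        print("  " * depth + f"<ref {type(node).__name__}>")  # cycle/shared node
        return
    seen.add(id(node))
    print("  " * depth + type(node).__name__)
    for child in getattr(node, "src", ()):
        print_tree(child, seen, depth + 1)
```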
George Hotz
4432cb17bb minor cleanups / remove that op (#2905) 2023-12-21 18:24:20 -08:00
chenyu
fd0ba33b38 onnx_ops formatting cleanup (#2904)
also removed a case in safe_numpy that always converted 0-dim arrays to 1-dim
2023-12-21 20:06:06 -05:00
George Hotz
5cac6338a4 apply the multitensor optimizations in lazy.py (#2901)
* apply the multitensor optimizations in lazy.py

* less lines

* hack for webgpu

* save a line
2023-12-21 13:55:49 -08:00
chenyu
5bf43c9634 reenable one onnx test failed due to dtype (#2902) 2023-12-21 15:50:02 -05:00
chenyu
677ae7673d use np.less and torch.lt for CMPLT (#2899)
also removed one unused output_type
2023-12-21 14:37:24 -05:00
qazal
d2e9245de8 render_locals takes a dtype (#2873)
Co-authored-by: chenyu <chenyu@fastmail.com>
2023-12-21 14:15:28 -05:00
chenyu
6116039f7b don't match dtype with first input in where (#2898)
* don't match dtype with first input in where

in `Tensor([1, 2, 3]).where(1.2, 2.3)`, the first `[1, 2, 3]` can be cast directly into bool without first casting to float (the broadcasted type)

* cast in one place
2023-12-21 13:02:15 -05:00
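The example from the PR description as runnable code (import path assumed): the condition casts straight to bool, and the output dtype comes from the two branches alone.

```python
from tinygrad.tensor import Tensor
from tinygrad.dtype import dtypes  # import path assumed

out = Tensor([1, 2, 3]).where(1.2, 2.3)
assert out.dtype == dtypes.default_float  # not influenced by the int condition
```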
chenyu
7dc3352877 increase stable diffusion validation threshold 1e-4 -> 3e-4 (#2897)
saw a flaky CI failure with 1.1e-4, and 3e-4 is a good number
2023-12-21 11:45:25 -05:00
qazal
24e79e0f53 Move the webgpu CMPLT hack to one place (#2895)
* move hacks to one place

* no casting in mlops, move to tensor

* ruff fix
2023-12-21 11:14:56 -05:00
George Hotz
852ef57ba4 fix readme typo 2023-12-21 08:06:24 -08:00
George Hotz
193109a88c hotfix: compare on ids 2023-12-20 23:47:50 -08:00
George Hotz
f6c7833f9f fast compare for lazyop (#2893) 2023-12-20 23:32:27 -08:00
chenyu
1500aca43d remove output_type in ops_cpu and ops_torch (#2892)
now that the input types are matched and checked in lazy, we can remove these output_types.
also removed the usage of least_upper_dtype in ops.py since we can just use the input type
2023-12-21 02:11:27 -05:00
chenyu
2d2c4980fe assert for elementwise dtypes in lazy (#2888)
* assert for elementwise dtypes in lazy

* no image hack

* check dtype of scalar for IMAGE=2
2023-12-21 01:42:32 -05:00
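A minimal sketch of the invariant this PR asserts (hypothetical helper, not the repo's code): by the time an elementwise op is built in lazy, all of its sources must already share one dtype.

```python
def check_elementwise_dtypes(srcs):
    dts = [s.dtype for s in srcs]
    assert all(dt == dts[0] for dt in dts), f"mismatched elementwise dtypes: {dts}"
```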
George Hotz
41b2a25be6 Fix exponential behavior in lazyops (#2890)
* add cache to ast_parse and lazyop builder

* add caches
2023-12-20 22:06:50 -08:00
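A generic sketch of the fix (hypothetical names, not the repo's code): an uncached recursive walk over a lazyop DAG re-expands shared subtrees exponentially often; memoizing on node identity makes each node cost O(1) after its first visit.

```python
def walk(op, cache=None):
    if cache is None: cache = {}
    if id(op) in cache:            # shared subtree: reuse, don't re-expand
        return cache[id(op)]
    res = cache[id(op)] = tuple(walk(s, cache) for s in op.src)
    return res
```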
George Hotz
8c4a0f8e15 Fix int child count (#2882)
* pad ops broke coder

* that contiguous fixes it

* Update lazy.py

* recursive add

* fix all

* revert that

* todo test
2023-12-20 21:06:27 -08:00
chenyu
8a04107d30 move the op casting logic from mlops to tensor try 2 (#2887)
* unary works

* where works

* add sub mul

* xor div

* CMPLT

* sparse_categorical_crossentropy

* image const

* sparse_categorical_crossentropy
2023-12-20 23:50:37 -05:00
George Hotz
7da2325dc7 get_lazyops() -> lazyops (#2884)
* get_lazyops() -> lazyops

* don't compare empty mem
2023-12-20 18:04:49 -08:00