Commit Graph

3769 Commits

George Hotz
2b089bfd18 rewrite recip to div (#3690)
* rewrite recip to div

* fix bug in uops add
2024-03-11 15:52:24 -07:00
qazal
aec4c4f01b linearizer ast as a tuple of lazyops (#3689)
* multi store op linearizer

* currently we do only one output per kernel

* named opts
2024-03-11 15:39:04 -07:00
chenyu
d0bcc9a66b replace all if dim < 0: dim += self.ndim with _resolve_dim (#3688) 2024-03-11 17:33:36 -04:00
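A hypothetical sketch of what such a helper does (the real `_resolve_dim` signature in tinygrad may differ): normalize a possibly negative axis and validate its range in one place.

```python
# hypothetical sketch, not tinygrad's exact code: one checked call replaces
# the scattered `if dim < 0: dim += self.ndim` pattern
def _resolve_dim(dim: int, ndim: int) -> int:
  if not -ndim <= dim < ndim:
    raise IndexError(f"dim {dim} out of range for tensor of {ndim} dimensions")
  return dim + ndim if dim < 0 else dim
```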
George Hotz
bcf6fbd3b2 bring reciprocal back (#3687)
* bring reciprocal back

* better

* explicit dtype for recip

* llvm tighter

* sigmoid can use RECIP
2024-03-11 14:19:54 -07:00
Francis Lam
9f13960f72 search: catch RuntimeError when timing acted_lins (#3664)
When compilation succeeds but runtime fails due to thread limits
on METAL, this allows a beam search to proceed, treating the runtime
failure the same way as a compile failure (see the sketch below).
2024-03-11 16:14:03 -04:00
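A minimal sketch of that pattern, assuming a generic timing helper (names are illustrative, not tinygrad's exact search API):

```python
import math

def safe_time(candidate, time_fn):
  # treat a runtime failure (e.g. METAL thread-limit errors) like a compile
  # failure: report infinite time so the beam search skips this candidate
  try:
    return time_fn(candidate)
  except RuntimeError:
    return math.inf
```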
rnxyfvls
490c5a3ec3 examples/stable_diffusion: support model checkpoints without alphas_cumprod key (#3681)
* examples/stable_diffusion: support model checkpoints without alphas_cumprod key

(which is most models on civitai)

* fix indent

---------

Co-authored-by: a <a@a.aa>
2024-03-11 16:05:52 -04:00
Francis Lam
3219a527d6 search: add a tool that beam searches one or more kernels (#3685) 2024-03-11 16:02:17 -04:00
chenyu
b68fbd7d81 View.__add__ to merge_view (#3686)
verified that the cases that used real_strides are redundant
2024-03-11 15:52:34 -04:00
nimlgen
76ade20b89 hsa driver tiny cleanups (#3684) 2024-03-11 22:32:43 +03:00
chenyu
d69170e27e add llama 2 70B in ci and verify output (#3682)
* add llama 2 70B in ci and verify output

* ln -s llama2 dir
2024-03-11 12:48:22 -04:00
chenyu
e10ee2ed3f llama beam tinybox ci (#3680) 2024-03-11 01:35:39 -04:00
George Hotz
3415b0ee54 hotfix: mixtral copies norms together for 2% speed 2024-03-11 01:28:03 +00:00
Skosh
e8c350fdac fix: make Tensor.rand produce correct values for float16 (#3654)
* fix: make Tensor.rand produce correct values for float16

Due to precision loss when casting to float16, the data distribution created by custom_random isn't correctly in the interval ]0, 1[, but instead in the interval ]0, 1], which causes Tensor.randn to incorrectly generate values of infinity.

The solution uses a scaling value to make sure the values stay under 1 when using half precision (see the sketch below).

Closes #3611

* update implementation to truncate to closest f16 value to 1

* chore: fix whitespace

* test larger distribution

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-10 18:48:00 -04:00
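A numpy sketch of the idea (the actual fix lives inside tinygrad's Tensor.rand; the scaling constant below, the largest float16 strictly below 1, is what the follow-up commit truncates to):

```python
import numpy as np

def rand_f16(shape):
  # uniform floats in [0, 1) generated at float32 precision
  u = np.random.random(shape).astype(np.float32)
  # casting straight to float16 can round values near 1.0 up to exactly 1.0,
  # which later turns into infinities in randn via a log(0) term.
  # scaling by the largest float16 below 1 keeps the cast inside [0, 1).
  max_below_one = np.float32(1 - 2**-11)  # 0.99951171875, exact in float16
  return (u * max_below_one).astype(np.float16)

assert (rand_f16((100000,)) < 1).all()
```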
chenyu
bad6adaf8c add mixtral and 6 gpus cifar to tinybox ci (#3676)
* add mixtral and 6 gpus cifar to tinybox ci

* print total ram used at the end of loading
2024-03-10 18:25:31 -04:00
George Hotz
44a67bf783 constant folding (#3675)
* constant fold

* bool math

* fix ptx
2024-03-10 14:47:24 -07:00
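A generic sketch of the technique (not tinygrad's actual uop code): when every input to an ALU op is already a constant, evaluate it at build time, bool math included.

```python
import operator

# op name -> python evaluation used when folding (illustrative subset)
OPS = {"ADD": operator.add, "MUL": operator.mul, "CMPLT": operator.lt}

def const_fold(op: str, srcs: tuple):
  # fold only when the op is known and every source is a constant
  if op in OPS and all(isinstance(s, (bool, int, float)) for s in srcs):
    return OPS[op](*srcs)
  return None  # not foldable; emit the op as usual

assert const_fold("ADD", (2, 3)) == 5
assert const_fold("CMPLT", (1, 2)) is True
```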
George Hotz
25aede6fd9 truncate for exec_alu (#3674) 2024-03-10 14:19:04 -07:00
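Presumably this makes the Python-side ALU emulation match device integer semantics; a sketch of the kind of truncation involved (the exact exec_alu behavior is an assumption):

```python
def truncate_int32(x: int) -> int:
  # python ints are arbitrary precision; wrap the result back into the
  # 32-bit two's-complement range so emulated math overflows like the device
  return (x + 2**31) % 2**32 - 2**31

print(truncate_int32(2**31))  # -2147483648: int32 overflow wraps as on device
```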
Francis Lata
957ae9b594 Fix Tensor's __repr__ for printing out grad (#3673)
* update check for Tensor's __repr__ with grad

* add test for repr with grad bugfix
2024-03-10 17:04:29 -04:00
George Hotz
0f16729023 RDNA3: restore launch bounds (#3672)
* bring launch bounds back

* works

* that second flag didn't do anything

* fix linter
2024-03-10 10:27:52 -07:00
chenyu
d7452c2a20 clean up llvmir builder (#3671)
```
_block -> block
builder._block.module -> builder.module
var_dtype -> dtype
```
2024-03-09 21:19:36 -05:00
George Hotz
1143c62519 tensor.py touchups (#3667)
* tensor.py touchups

* put back
2024-03-09 16:12:20 -08:00
George Hotz
69ca7f7bf9 changes for teenygrad (#3665)
* changes for teenygrad

* upd

* simpler test
2024-03-09 15:30:34 -08:00
Quentin Wach
89b8b5d549 Fix missing import. (#3666) 2024-03-09 14:55:23 -08:00
Maximilian Wolf
8ae85b2cf5 add inference_mode context manager with decorator support (#3621)
* add inference_mode context manager with decorator support

* change val to mode for train and inference_mode

* fix wrong rename
2024-03-09 08:38:26 -08:00
Obada Khalili
b5cbf1792a Fix Tensor.cumsum when axis of length 0 is selected (#3473)
* fix Tensor.cumsum when axis of length 0 is selected

* add cumsum regression test

* define padding left size in a separate line
2024-03-09 08:26:41 -08:00
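A quick repro of the case this covers (assuming a tinygrad checkout with the fix):

```python
from tinygrad.tensor import Tensor

t = Tensor.ones(3, 0)     # axis 1 has length 0
print(t.cumsum(1).shape)  # (3, 0); this case previously failed
```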
chenyu
915f98791c use custom KernelOptError in kernel opt (#3661)
Be more specific about invalid kernel opts; used that in test_linearizer_failures.

Make BEAM kernel search work even with assertions disabled:

`BEAM=2 python3 -O examples/llama.py --temperature=0 --count=10 --prompt="Hello." --timing`
2024-03-08 15:36:16 -05:00
George Hotz
ac02e7347d ptx timing vs cuda timing (#3659) 2024-03-08 10:17:49 -08:00
uuuvn
daa4034e80 No more metal flakiness (#3643) 2024-03-08 08:54:44 -08:00
chenyu
e25879d50e don't get new var_val for the same ast in fuzz_linearizer (#3657)
fixed result comparison for kernels with variables
2024-03-08 09:49:24 -05:00
chenyu
1130c73844 add FUZZ_NTH to fuzz_linearizer (#3656)
* add FUZZ_NTH to fuzz_linearizer

also update tests in test_linearizer_failures to not just run on METAL

* update failures for HIP/HSA

* test_failure_21 LLVM PADTO
2024-03-08 09:16:49 -05:00
David Hou
9f66dcf718 PolynomialDecayWithWarmup + tests (#3649)
* working PolynomialDecayWithWarmup + tests.......

add lars_util.py, oops

* keep lars_util.py as intact as possible, simplify our interface

* whitespace

* clean up

* clean up

* asserts

* test polylr for full resnet training run

* add comment

* rename

* fix do_optim

* don't cast lr

* info

* calculate from train_files

* skip it
2024-03-07 18:53:36 -05:00
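For reference, a hedged sketch of this schedule's usual shape in MLPerf-style training (tinygrad's exact implementation and defaults may differ): linear warmup to the base LR, then polynomial decay toward an end LR.

```python
def poly_decay_with_warmup(step: int, base_lr: float, end_lr: float,
                           warmup_steps: int, total_steps: int,
                           power: float = 2.0) -> float:
  if step < warmup_steps:
    # linear warmup from ~0 up to base_lr
    return base_lr * (step + 1) / warmup_steps
  # polynomial decay from base_lr down to end_lr over the remaining steps
  frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
  return (base_lr - end_lr) * (1 - frac) ** power + end_lr
```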
chenyu
57df8e8d82 update fuzz_linearizer (#3648)
Included non-reduce kernels and kernels with variables, and print a green message when everything passes.
Since creating rawbufs can fail due to a memory error, that is now included in the failure cases.
2024-03-07 18:41:22 -05:00
chenyu
b282a45e39 fix direct store float4 with same vin (#3652)
In a kernel that stores an expanded value, the vin of a float4 can come from the same source, and we only remove it once in that case.
2024-03-07 18:11:50 -05:00
chenyu
a66ffec6d3 update kernel dataset to exclude the disktensor ones (#3651)
Disk tensor loads contain big offsets and are not meant to be run by the GPU.

repro steps
```
time ./extra/optimization/generate_dataset.sh
gzip /tmp/sops
mv /tmp/sops.gz extra/datasets/
```
2024-03-07 17:35:19 -05:00
chenyu
fcf4a5ccf2 fix example that calls Tensor.__bool__ (#3650)
also removed `.cpu()` calls in mask_rcnn so `python3 examples/mlperf/model_spec.py` runs
2024-03-07 16:59:26 -05:00
George Hotz
6e50582e62 working to improve ptx (#3647)
* working to improve ptx

* fix compile fail
2024-03-07 12:39:31 -08:00
Zaffer
1853ec9a02 add tests for bfloat16 on HIP (#3638)
* Fix bug in login functionality

* Remove HSA backend test and add bfloat16 dtype tests that run in CI

* Skip tests on HIPCPU

* skip tests causing segfault on LLVM backend

* Exclude bfloat16 tests causing segfaults in LLVM backend

* move bf16 cast tests to only test on HIP
2024-03-07 10:45:36 -08:00
chenyu
0cef284aac fix typing FlopCounter.flops can be sint (#3646) 2024-03-07 12:49:17 -05:00
chenyu
906cc3a69b cleanup tests Device[Device.DEFAULT] is always Compiled (#3645) 2024-03-07 11:15:42 -05:00
qazal
bdd62c7fd8 make the bf16 include dynamic (#3642)
* dynamic prefix

* add common ones above

these are common dtypes

aesthetics

* regression test

fuzz it

test

* run in CI

* use .append

* faster
2024-03-07 10:31:35 -05:00
chenyu
4552248c84 fix Tensor.to preserves grad.data (#3636) 2024-03-06 21:44:49 -05:00
chenyu
d33311ebe0 remove parens of ALU if it has associative property (#3635)
SUB needs to be excluded since it's possible to have (const - (const - const)), as in test/test_ops.py::TestOps::test_cos,
in which case the parens of the children cannot be removed (see the toy sketch below).
2024-03-06 21:12:11 -05:00
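A toy sketch of the rule (the real change is in the renderer): a child may shed its parentheses only when it shares an associative op with its parent, so SUB never qualifies.

```python
ASSOCIATIVE = {"+", "*"}

def render(op: str, lhs: str, rhs_op: str, rhs: str) -> str:
  # a - (b - c) != a - b - c, so SUB keeps the child's parens
  if rhs_op == op and op in ASSOCIATIVE:
    return f"{lhs} {op} {rhs}"
  return f"{lhs} {op} ({rhs})"

print(render("+", "a", "+", "b + c"))  # a + b + c
print(render("-", "a", "-", "b - c"))  # a - (b - c)
```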
chenyu
fe6b6e38c1 remove parentheses of GEP if it's from SSA (#3634)
fixed some "bracket nesting level exceeded maximum of 256" errors
2024-03-06 20:22:46 -05:00
David Hou
0afaf70d57 lars optimizer + tests (#3631)
* lars optimizer + tests

* fix skip list!

* use id to compare in skip list

* go back to using set

* Tensor(bool) * Tensor(bool) is and

* don't lint external/mlperf_resnet

* whitespace

* add external_test_optim to opencl tests

* give mlperf task a name

* mlperf under onnx

* remove track_gnorm

* contiguous instead of realize

* assert momentum and weight decay positive

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-06 18:11:01 -05:00
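For context, a hedged sketch of the LARS update rule this optimizer implements (per-layer trust-ratio scaling; tinygrad's actual code and hyperparameters may differ):

```python
import numpy as np

def lars_step(w, g, buf, lr, wd=1e-4, eta=0.001, momentum=0.9):
  # per-layer trust ratio scales the step by ||w|| / (||g|| + wd * ||w||)
  w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
  trust = eta * w_norm / (g_norm + wd * w_norm) if w_norm > 0 and g_norm > 0 else 1.0
  buf[:] = momentum * buf + trust * (g + wd * w)  # momentum on the scaled update
  return w - lr * buf
```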
chenyu
b2e92d44fa skip METAL sin test in test_dtype_alu (#3633)
Revert this part of #3629; this test is flaky.
2024-03-06 17:29:19 -05:00
chenyu
8f10bfa2ff ban __bool__ on Tensor (#3632)
* ban __bool__ on Tensor

avoid misuse

* test case

* fix tests

* fix more tests
2024-03-06 17:12:35 -05:00
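What the ban looks like from user code (the exact exception type and message are assumptions):

```python
from tinygrad.tensor import Tensor

t = Tensor([1, 2, 3])
try:
  if t: pass  # implicit Tensor.__bool__, now banned to avoid misuse
except TypeError as e:
  # compare explicitly or pull concrete values out with .numpy() instead
  print(e)
```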
George Hotz
81baf3eed3 bring ptx back (#3623)
* bring ptx back

* ptx back

* fix define var

* fix a few bugs

* bugfixes

* fixes

* fix llvm bug

* fix test bug
2024-03-06 13:34:21 -08:00
chenyu
c270d54c32 update test_dtype_alu for METAL (#3629) 2024-03-06 14:55:19 -05:00
qazal
abc5f3a6a0 hip bf16 hotfix (#3630)
* hip bf16

* remu dev mac

* Revert "remu dev mac"

This reverts commit 465069a0dc3c7f2045f3348b312a1dcbf1587acd.

* skip disk tests in CI

* bring float8 back
2024-03-06 11:42:30 -08:00
chenyu
bc2a13a5f7 test case to show clang and python doing math in double (#3628) 2024-03-06 13:49:03 -05:00
George Hotz
568353fa84 hotfix: bump line count to 6500 2024-03-06 07:52:18 -08:00