Commit Graph

10633 Commits

Author SHA1 Message Date
David Hou
2befdf86d9 dataloader worker/shm cleanup (#3710) 2024-03-12 21:44:24 -04:00
chenyu
e1b2a82d89 fix st.real_size can be negative if valid is always false (#3708)
Two follow-ups after this: (1) if a buffer is never accessed in the kernel, it can be removed from the inputs; (2) real_size can be smaller conditional on valid being true (the old validhack stuff).
2024-03-12 20:34:07 -04:00
chenyu
b13457e4a7 explicit dtypes in hlb_cifar (#3707)
Prep for the bfloat16 change: added float() and cast(default_float) in whitening, and explicitly set dtype in various places that convert between numpy and Tensor.
2024-03-12 18:20:23 -04:00
Francis Lam
b6e2495fdd kernel: limit shared memory usage when adding opts (#3705)
* kernel: limit shared memory usage when adding opts

* search: remove unnecessary limit on search space

apply_opt will do the more correct check
2024-03-12 17:06:21 -04:00
George Hotz
2024b24f35 add some graph tests (#3702)
* add some graph tests

* PatternMatcher class

* speedup

* const cast test

* fix tests

* itertools chain
2024-03-12 09:49:47 -07:00
chenyu
f599c6e7f4 test output dtypes match in test_ops (#3703)
Need to cast some torch outputs to int32 because torch returns int64 by default for index-related functions.

close #2797
2024-03-12 12:44:40 -04:00
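A small illustration of the dtype mismatch described in the entry above (torch usage only; the int32 target on the tinygrad side is taken from the commit message):
```
import torch

t = torch.tensor([3.0, 1.0, 2.0])
idx = t.argmax()                # index-related ops return int64 by default in torch
assert idx.dtype == torch.int64
idx32 = idx.to(torch.int32)     # cast so the reference output matches the int32 result
assert idx32.dtype == torch.int32
```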
nimlgen
798970cfad fix gpu hangs when exiting while aql queues are executing (#3700) 2024-03-12 19:23:23 +03:00
chenyu
02ca067bdf use default_float.np to construct test data in test_ops (#3701)
first step of #2797
2024-03-12 11:58:20 -04:00
George Hotz
6755a9254f constant fold pattern match (#3696)
* constant fold pattern match

* match

* better match

* fix bug in pattern

* more folding
2024-03-12 08:48:07 -07:00
nimlgen
dd1a1c12df rocm path in autogen (#3697) 2024-03-12 14:06:43 +03:00
Patrick Tsai
971d7f5d7c O(n) arange attempt (#3530)
* It works?

* Clamp correctly

* Refactor

* Make code better

* Undo some stuff

* First step to trying to make floats work

* Floats work in Python op but not metal because int div is different

Python integer division was implemented as `//`, which rounds toward negative infinity, but C integer division rounds toward 0, so there is an off-by-one division error (see the sketch after this entry).

* arange does cumsum with ints and then multiplies by step

This is so the loop optimization can remain int-only

* Undo a lot of symbolic changes

* Final check

* Cleanup

* There can be multiple phis

* Fix multiple phi op removal

* const sets dtype correctly

* Fix bugs

* Fix a couple bugs and add loop vars to resolve

* missed one

* Don't trim too many ops

* Fix symbolic test

* Use ones instead of full

* Delete test

* Lint passes

* max node error

* Small updates to loop logic

* Remove unnecessary changes

* We are getting somewhere

* Simple case

* Fix

* rm, prn

* Better

* If NumNode doesn't work then continue

* clamp is needed for arange(256)

* Move everything into the optim fn

* Replace correctly

* Order optimizations better

* Delete

* mypy

* Test for simplification

* Rename

* Fix test

* update test description

* Undo more

* Cleanup

* No replaced_ops map

* Fix lint

* AssertionError

* back again

* Reinstate assertion

* Return true and make diff not as big

* Bigger range for test

* Change cumsum impl

* fix bug

* make big cumsum work

* lint

* Undo cumsum 2-stage removal

* No while helper

* optional min/max clamping

* floats work

* rm giant arange test

* fix python cast None

* Check phi parents

* one phi allowed per where

* Fix one phi per where

* Rework iteration

* Delete assertions

* convert to int

* Try mul -1 instead of neg for hip..?

* Remove one phi per where requirements

* one accum only

* Lint

* should simplify a loop at a time

* Don't get rid of loop explicitly

* Need to iterate backwards

* lint

* unary neg

* Make optim work for onnx and sum_pad_collapse

* Better message

* filter alu ops correctly

* Fix the limiter

* lint and simplify

* Add it back

* off by one error

* test wheres and phis

* test max ops and non-if stuff

* <=

* cast_scalar

* Oops

* Change test

* Pass loop uops instead of a modified map

* Cut param transfer between linearizer and uops

* Fix issues

* Fix lint

* fix efficientnet python 3.8 invalid syntax

* distinct vars in seen_vars

* accurate var names

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-11 16:09:20 -07:00
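A rough sketch of the two numeric points made in the entry above: Python floor division vs C truncating division, and building arange from an int-only cumsum that is scaled by step afterwards. This is illustrative, not the linearizer code:
```
# Python's // rounds toward negative infinity; C integer division truncates toward zero,
# so a naive translation is off by one for negative operands.
assert -7 // 2 == -4        # Python: floor
assert int(-7 / 2) == -3    # C-style: truncate

# arange as an integer cumsum of ones multiplied by step, so the loop stays int-only.
def arange_via_cumsum(start, stop, step):
  n = max(0, -(-(stop - start) // step))  # ceil division, assuming step > 0
  acc, out = 0, []
  for _ in range(n):
    out.append(start + acc * step)
    acc += 1                              # integer cumsum of ones
  return out

assert arange_via_cumsum(0, 10, 3) == [0, 3, 6, 9]
```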
George Hotz
a5d023dff8 reciprocal mlop (#3694) 2024-03-11 16:08:46 -07:00
George Hotz
3af1c1051a Revert "bring reciprocal back (#3687)" (#3692)
This reverts commit bcf6fbd3b2.
2024-03-11 15:55:14 -07:00
George Hotz
ef44c8959b Revert "rewrite recip to div (#3690)" (#3691)
This reverts commit 2b089bfd18.
2024-03-11 15:54:58 -07:00
George Hotz
2b089bfd18 rewrite recip to div (#3690)
* rewrite recip to div

* fix bug in uops add
2024-03-11 15:52:24 -07:00
qazal
aec4c4f01b linearizer ast as a tuple of lazyops (#3689)
* multi store op linearizer

* currently we do only one output per kernel

* named opts
2024-03-11 15:39:04 -07:00
chenyu
d0bcc9a66b replace all if dim < 0: dim += self.ndim with _resolve_dim (#3688) 2024-03-11 17:33:36 -04:00
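A minimal sketch of what such a helper can look like (the actual tinygrad implementation may differ):
```
def _resolve_dim(self, dim:int) -> int:
  # normalize a possibly-negative axis and validate the range in one place
  if not -self.ndim <= dim < self.ndim:
    raise IndexError(f"dim {dim} out of range for {self.ndim}-dimensional tensor")
  return dim + self.ndim if dim < 0 else dim
```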
George Hotz
bcf6fbd3b2 bring reciprocal back (#3687)
* bring reciprocal back

* better

* explicit dtype for recip

* llvm tighter

* sigmoid can use RECIP
2024-03-11 14:19:54 -07:00
Francis Lam
9f13960f72 search: catch RuntimeError when timing acted_lins (#3664)
When compilation succeeds but runtime fails due to thread limits on METAL, this allows a beam search to proceed, treating the failure the same way as a compile failure.
2024-03-11 16:14:03 -04:00
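Conceptually, treating a runtime failure like a compile failure just means giving that candidate an infinite time so the beam never selects it. A hedged sketch, not the actual search.py code:
```
def time_candidate(lin, compile_fn, run_fn):
  # compile_fn/run_fn stand in for the real compile and benchmark steps
  try:
    prg = compile_fn(lin)
    return run_fn(prg)        # may raise RuntimeError, e.g. METAL thread limits
  except RuntimeError:
    return float("inf")       # treated like a compile failure: never picked by the beam
```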
rnxyfvls
490c5a3ec3 examples/stable_diffusion: support model checkpoints without alphas_cumprod key (#3681)
* examples/stable_diffusion: support model checkpoints without alphas_cumprod key

(which is most models on civitai)

* fix indent

---------

Co-authored-by: a <a@a.aa>
2024-03-11 16:05:52 -04:00
Francis Lam
3219a527d6 search: add a tool that beam searches one or more kernels (#3685) 2024-03-11 16:02:17 -04:00
chenyu
b68fbd7d81 View.__add__ to merge_view (#3686)
verified that the cases that used real_strides are redundant
2024-03-11 15:52:34 -04:00
nimlgen
76ade20b89 hsa driver tiny cleanups (#3684) 2024-03-11 22:32:43 +03:00
chenyu
d69170e27e add llama 2 70B in ci and verify output (#3682)
* add llama 2 70B in ci and verify output

* ln -s llama2 dir
2024-03-11 12:48:22 -04:00
chenyu
e10ee2ed3f llama beam tinybox ci (#3680) 2024-03-11 01:35:39 -04:00
George Hotz
3415b0ee54 hotfix: mixtral copies norms together for 2% speed 2024-03-11 01:28:03 +00:00
Skosh
e8c350fdac fix: make Tensor.rand produce correct values for float16 (#3654)
* fix: make Tensor.rand produce correct values for float16

Due to precision loss when casting to float16, the data distribution created by custom_random isn't correctly in the interval ]0, 1[ but instead in ]0, 1], which causes Tensor.randn to incorrectly generate infinite values.

The solution uses a scaling value to make sure the values stay under 1 when using half precision.

Closes #3611

* update implementation to truncate to closest f16 value to 1

* chore: fix whitespace

* test larger distribution

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-10 18:48:00 -04:00
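The float16 rounding issue above is easy to see with numpy (illustrative only; the PR's fix scales/truncates so generated values stay strictly below 1):
```
import numpy as np

# a uniform sample close to 1 rounds up to exactly 1.0 when cast to float16,
# turning the intended open interval ]0, 1[ into ]0, 1] ...
assert np.float16(0.9999) == 1.0

# ... so the upper end has to be clamped to the largest float16 below 1.
largest_below_one = np.float16(1 - 2**-11)
assert largest_below_one < 1.0
```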
chenyu
bad6adaf8c add mixtral and 6 gpus cifar to tinybox ci (#3676)
* add mixtral and 6 gpus cifar to tinybox ci

* print total ram used at the end of loading
2024-03-10 18:25:31 -04:00
George Hotz
44a67bf783 constant folding (#3675)
* constant fold

* bool math

* fix ptx
2024-03-10 14:47:24 -07:00
George Hotz
25aede6fd9 truncate for exec_alu (#3674) 2024-03-10 14:19:04 -07:00
Francis Lata
957ae9b594 Fix Tensor's __repr__ for printing out grad (#3673)
* update check for Tensor's __repr__ with grad

* add test for repr with grad bugfix
2024-03-10 17:04:29 -04:00
George Hotz
0f16729023 RDNA3: restore launch bounds (#3672)
* bring launch bounds back

* works

* that second flag didn't do anything

* fix linter
2024-03-10 10:27:52 -07:00
chenyu
d7452c2a20 clean up llvmir builder (#3671)
```
_block -> block
builder._block.module -> builder.module
var_dtype -> dtype
```
2024-03-09 21:19:36 -05:00
George Hotz
1143c62519 tensor.py touchups (#3667)
* tensor.py touchups

* put back
2024-03-09 16:12:20 -08:00
George Hotz
69ca7f7bf9 changes for teenygrad (#3665)
* changes for teenygrad

* upd

* simpler test
2024-03-09 15:30:34 -08:00
Quentin Wach
89b8b5d549 Fix missing import. (#3666) 2024-03-09 14:55:23 -08:00
Maximilian Wolf
8ae85b2cf5 add inference_mode context manager with decorator support (#3621)
* add inference_mode context manager with decorator support

* change val to mode for train and inference_mode

* fix wrong rename
2024-03-09 08:38:26 -08:00
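A context manager that also works as a decorator is typically built on contextlib.ContextDecorator. A minimal sketch of the pattern (the flag it toggles and the exact tinygrad API are assumptions):
```
from contextlib import ContextDecorator
from tinygrad.tensor import Tensor

class inference_mode(ContextDecorator):
  def __init__(self, mode:bool=True): self.mode = mode
  def __enter__(self): self.prev, Tensor.training = Tensor.training, not self.mode
  def __exit__(self, *exc): Tensor.training = self.prev

# usable both ways:
#   with inference_mode(): out = model(x)
#   @inference_mode()
#   def evaluate(model, x): return model(x)
```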
Obada Khalili
b5cbf1792a Fix Tensor.cumsum when axis of length 0 is selected (#3473)
* fix Tensor.cumsum when axis of length 0 is selected

* add cumsum regression test

* define padding left size in a separate line
2024-03-09 08:26:41 -08:00
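The regression case is a cumsum along an axis of length 0, which should return an empty tensor of the same shape instead of erroring; a hedged usage example:
```
from tinygrad.tensor import Tensor

t = Tensor.ones(3, 0, 2)
out = t.cumsum(1)               # axis 1 has length 0
assert out.shape == (3, 0, 2)   # same (empty) shape back, no exception
```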
chenyu
915f98791c use custom KernelOptError in kernel opt (#3661)
Be more specific about invalid kernel opts; used that in test_linearizer_failures.

Make BEAM kernel search work even with assertions disabled.

`BEAM=2 python3 -O examples/llama.py  --temperature=0 --count=10 --prompt="Hello." --timing`
2024-03-08 15:36:16 -05:00
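The reason a plain assert is not enough: `python3 -O` strips assert statements, so invalid opts must raise a real exception for the BEAM search to keep rejecting them. A minimal sketch of the idea (the check shown is illustrative):
```
class KernelOptError(Exception): pass

def check(cond:bool, msg:str=""):
  # unlike `assert`, this still fires when Python runs with -O
  if not cond: raise KernelOptError(msg)

# e.g. inside apply_opt:
#   check(opt.axis < self.shape_len, "invalid axis for this opt")
```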
George Hotz
ac02e7347d ptx timing vs cuda timing (#3659) 2024-03-08 10:17:49 -08:00
uuuvn
daa4034e80 No more metal flakiness (#3643) 2024-03-08 08:54:44 -08:00
chenyu
e25879d50e don't get new var_val for the same ast in fuzz_linearizer (#3657)
fixed result comparison for kernels with variables
2024-03-08 09:49:24 -05:00
chenyu
1130c73844 add FUZZ_NTH to fuzz_linearizer (#3656)
* add FUZZ_NTH to fuzz_linearizer

also update tests in test_linearizer_failures to not just run on METAL

* update failures for HIP/HSA

* test_failure_21 LLVM PADTO
2024-03-08 09:16:49 -05:00
David Hou
9f66dcf718 PolynomialDecayWithWarmup + tests (#3649)
* working PolynomialDecayWithWarmup + tests.......

add lars_util.py, oops

* keep lars_util.py as intact as possible, simplify our interface

* whitespace

* clean up

* clean up

* asserts

* test polylr for full resnet training run

* add comment

* rename

* fix do_optim

* don't cast lr

* info

* calculate from train_files

* skip it
2024-03-07 18:53:36 -05:00
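For reference, the usual shape of such a schedule is a linear warmup into a polynomial decay; the power and parameter names below are assumptions, not necessarily the exact lars_util.py values:
```
def poly_decay_with_warmup(step:int, base_lr:float, warmup_steps:int, total_steps:int,
                           power:float=2.0, end_lr:float=0.0) -> float:
  if step < warmup_steps:
    return base_lr * (step + 1) / warmup_steps                        # linear warmup
  frac = (total_steps - step) / max(1, total_steps - warmup_steps)    # fraction of decay remaining
  return (base_lr - end_lr) * frac ** power + end_lr                  # polynomial decay to end_lr
```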
chenyu
57df8e8d82 update fuzz_linearizer (#3648)
Included non-reduce kernels and kernels with variables; print a green message when everything passes.
It's possible that creating rawbufs fails due to a memory error; included that in the failure cases.
2024-03-07 18:41:22 -05:00
chenyu
b282a45e39 fix direct store float4 with same vin (#3652)
In a kernel that stores an expanded value, the vin of a float4 can come from the same source, and we only remove it once in that case.
2024-03-07 18:11:50 -05:00
chenyu
a66ffec6d3 update kernel dataset to exclude the disktensor ones (#3651)
A disk tensor load contains a big offset and is not meant to be run by the GPU.

repro steps
```
time ./extra/optimization/generate_dataset.sh
gzip /tmp/sops
mv /tmp/sops.gz extra/datasets/
```
2024-03-07 17:35:19 -05:00
chenyu
fcf4a5ccf2 fix example that calls Tensor.__bool__ (#3650)
also removed `.cpu()` calls in mask_rcnn so `python3 examples/mlperf/model_spec.py` runs
2024-03-07 16:59:26 -05:00
George Hotz
6e50582e62 working to improve ptx (#3647)
* working to improve ptx

* fix compile fail
2024-03-07 12:39:31 -08:00
Zaffer
1853ec9a02 add tests for bfloat16 on HIP (#3638)
* Fix bug in login functionality

* Remove HSA backend test and add bfloat16 dtype tests that run in CI

* Skip tests on HIPCPU

* skip tests causing segfault on LLVM backend

* Exclude bfloat16 tests causing segfaults in LLVM backend

* move bf16 cast tests to only test on HIP
2024-03-07 10:45:36 -08:00