Commit Graph

3957 Commits

Author SHA1 Message Date
George Hotz
629cbc5587 only abstractions 2 (#3947) 2024-03-26 20:02:18 -07:00
chenyu
77589bc7a5 rename Scalar to ConstType and cast_scalar to as_const (#3946)
prereq cleanup to make the const arg the same Python type as its dtype
2024-03-26 22:39:58 -04:00
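
A minimal sketch of the idea in #3946 above: coerce a const argument to the Python type that matches its dtype. The helper below is hypothetical and standard-library only, not the tinygrad code.

```
from typing import Union

ConstType = Union[float, int, bool]  # the sort of alias the rename above refers to

def as_const(val: ConstType, is_int: bool = False, is_bool: bool = False) -> ConstType:
  # coerce the const arg to the Python type matching its dtype
  if is_bool: return bool(val)
  return int(val) if is_int else float(val)

assert as_const(3.0, is_int=True) == 3 and isinstance(as_const(3.0, is_int=True), int)
```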
uuuvn
d6d902afe9 wtf (#3944) 2024-03-26 17:49:28 -07:00
Francis Lam
5530b0cbed fuzz_linearizer: reduce debug verbosity and make easier for CI usage (#3942)
* fuzz_linearizer: reduce debug verbosity and make easier for CI usage

* rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset)
* skip simple ASTs (easier to use with LOGOPS output)
* don't fuzz a previously seen AST
* add options to allow non-zero --expected-failures

* clean up naming and use set
2024-03-26 16:25:24 -04:00
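
A hypothetical sketch of the CLI shape described in the bullets of #3942: the --expected-failures flag and the "don't fuzz a previously seen AST" set come from the commit, everything else (function names, the stubbed fuzz step) is illustrative.

```
import argparse

def fuzz_one(ast) -> int: return 0          # stand-in: returns 1 on failure

def run(asts, argv=None) -> int:
  parser = argparse.ArgumentParser()
  parser.add_argument("--expected-failures", type=int, default=0,
                      help="number of failures that still counts as a passing run")
  args = parser.parse_args(argv)
  seen, failures = set(), 0
  for ast in asts:
    if ast in seen: continue                # don't fuzz a previously seen AST
    seen.add(ast)
    failures += fuzz_one(ast)
  return 0 if failures <= args.expected_failures else 1

print(run(["ast_a", "ast_a", "ast_b"], argv=["--expected-failures", "1"]))
```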
chenyu
8df6587c41 hotfix 97.3 for beautiful_mnist (#3941) 2024-03-26 15:02:53 -04:00
chenyu
b1e3817e18 correctly handle Tensor.rand when default_float = bf16 (#3940)
always casting to float32 makes the default half dtype slow
2024-03-26 14:56:16 -04:00
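
A small usage check related to #3940, assuming a working tinygrad install; nothing is realized here, it only confirms the requested dtype is kept.

```
from tinygrad import Tensor, dtypes

x = Tensor.rand(4, 4, dtype=dtypes.bfloat16)  # should stay in the requested dtype, no float32 detour
print(x.dtype)
```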
chenyu
f6ff76be21 check only upcast int amount in upcasted_axis (#3938)
fixed typing and fixed #3932
2024-03-26 12:54:57 -04:00
nimlgen
e2d6f76723 _alloc and _free with options (#3934)
* _alloc has options

* linter

* fix hsa
2024-03-26 09:11:41 -07:00
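
A simplified sketch of the options plumbing in #3934; the class and field names below are assumptions, not the tinygrad allocator API.

```
from dataclasses import dataclass

@dataclass(frozen=True)
class BufferOptions:
  host: bool = False      # e.g. pinned host memory
  uncached: bool = False

class ExampleAllocator:
  def alloc(self, size: int, options: BufferOptions = BufferOptions()):
    return self._alloc(size, options)
  def free(self, buf, size: int, options: BufferOptions = BufferOptions()):
    self._free(buf, options)
  # backends override these two and honor the options they care about
  def _alloc(self, size, options): return bytearray(size)
  def _free(self, buf, options): pass

buf = ExampleAllocator().alloc(16, BufferOptions(host=True))
```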
nimlgen
739f47eb0f check on cuEventSynchronize (#3933) 2024-03-26 16:14:38 +03:00
George Hotz
778d17fbd3 intel matmul (#3830)
* almost right

* intel xmx
2024-03-25 22:37:20 -07:00
chenyu
ef537672bf bf16 support in metal (#3929)
it runs if the device GPU supports bfloat. updated the CI benchmark too
2024-03-25 23:17:36 -04:00
chenyu
72d617a37d opencl on OSX does not support fp16 extension (#3931)
running `GPU=1 python -m pytest -rA test/test_dtype.py::TestHalfDtype::test_casts_from` on mac would fail.
2024-03-25 19:50:17 -04:00
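
One way to check the extension #3931 is about, sketched with pyopencl (an assumption; tinygrad's GPU backend uses its own bindings):

```
import pyopencl as cl

for plat in cl.get_platforms():
  for dev in plat.get_devices():
    # half kernels need this extension; on macOS OpenCL it is typically absent
    print(dev.name, "cl_khr_fp16" in dev.extensions)
```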
Arseny Kapoulkine
cb6e7b57a6 examples: Fix parameter bandwidth accounting for quantized LLama (#3930)
Instead of assuming every parameter is 2 bytes, just add up tensor sizes
in bytes
2024-03-25 18:41:05 -04:00
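
A sketch of the accounting change in #3930: sum each tensor's actual bytes instead of assuming 2 bytes per parameter. The helper is hypothetical and uses numpy for illustration.

```
import numpy as np

def model_bytes(params) -> int:
  # sum the actual storage of every tensor instead of assuming 2 bytes per element
  return sum(t.nbytes for t in params.values())

params = {"w_fp16": np.zeros((4, 4), dtype=np.float16), "w_int8": np.zeros((4, 4), dtype=np.int8)}
print(model_bytes(params))  # 32 + 16 = 48 bytes, not 2 * 32
```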
chenyu
4ecd5789ab #include <tgmath.h> in ops_clang (#3927)
* different clang sqrt/log2/exp2/sin function based on dtype

fixed softmax_argmax issue in #3552 for clang.

* tgmath.h

* revert those
2024-03-25 17:48:57 -04:00
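
For context on #3927: `<tgmath.h>` provides type-generic math macros, so the renderer does not need to pick a dtype-specific libm function. The table below only illustrates what that avoids; it is not tinygrad's renderer.

```
# without <tgmath.h>, something like this per-dtype table would be needed;
# with it, plain sqrt/sin/exp2/log2 dispatch on the argument type in C
LIBM_BY_DTYPE = {
  "float":  {"sqrt": "sqrtf", "sin": "sinf", "exp2": "exp2f", "log2": "log2f"},
  "double": {"sqrt": "sqrt",  "sin": "sin",  "exp2": "exp2",  "log2": "log2"},
}
print(LIBM_BY_DTYPE["float"]["sqrt"])  # -> sqrtf
```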
Arseny Kapoulkine
514c43201d Fix issues with pointer provenance in load/store through ALU (#3916)
* Track pointer provenance in load/store through ALU

Previously load/store could be incorrectly rendered into
ld.global/st.global when the input was an ALU op that performed an
address computation with DEFINE_LOCAL on one of the arguments.

* Simplify the load provenance workaround

The issue is that we can render the same code twice, and on the second
run the opstream is already modified so that vin[0] isn't a DEFINE_*,
which overwrites the initially correct .shared with .global.

* Add a couple tests for basic local use

* Skip local tests on LLVM since it doesn't implement DEFINE_LOCAL
2024-03-25 14:41:05 -07:00
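
A conceptual sketch of the provenance idea in #3916: walk the address computation back to its DEFINE_* op and pick `.shared` vs `.global` from that. Op names mirror tinygrad UOps, but the structure is simplified and not the PTX renderer.

```
from dataclasses import dataclass, field

@dataclass
class UOp:
  op: str                  # e.g. "DEFINE_GLOBAL", "DEFINE_LOCAL", "ALU", "CONST"
  vin: tuple = field(default_factory=tuple)

def mem_space(addr: UOp) -> str:
  u = addr
  while u.op == "ALU" and u.vin:                       # walk back through the address arithmetic
    u = next((v for v in u.vin if v.op.startswith("DEFINE_")), u.vin[0])
  return ".shared" if u.op == "DEFINE_LOCAL" else ".global"

local = UOp("DEFINE_LOCAL")
addr = UOp("ALU", (local, UOp("CONST")))
print(mem_space(addr))  # .shared, not .global
```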
chenyu
d651835ef5 verify beautiful_mnist.py eval acc and put into benchmark ci (#3926)
* verify beautiful_mnist and put in ci

* 97.5 for eval verification
2024-03-25 16:47:49 -04:00
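
A sketch of the CI gate idea in #3926; the env var name and helper are hypothetical, the real check lives in examples/beautiful_mnist.py.

```
import os

def check_eval_acc(acc_pct: float):
  # hypothetical env var: the gate only fires when a target is set
  target = float(os.getenv("TARGET_EVAL_ACC_PCT", "0"))
  assert acc_pct >= target, f"eval accuracy {acc_pct:.2f}% is below target {target:.2f}%"

check_eval_acc(97.6)  # passes unless TARGET_EVAL_ACC_PCT is set above 97.6
```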
chenyu
dc508022a9 clean up clang src header (#3925)
don't need to define int64 and uchar
2024-03-25 15:18:35 -04:00
uuuvn
2080325e8d output_buffer isn't used anymore (#3919) 2024-03-25 16:03:56 +03:00
nimlgen
f2a9ea4ea9 lru allocator for copyin host buffers (#3918)
* lru allocator for copyin host buffers

* linter happy
2024-03-25 15:57:18 +03:00
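
A simplified sketch of the reuse idea in #3918: keep freed host staging buffers in per-size free lists so repeat copyins of the same size skip allocation. Not tinygrad's LRUAllocator.

```
from collections import defaultdict

class HostBufferCache:
  def __init__(self): self.free_bufs = defaultdict(list)          # size -> idle buffers
  def alloc(self, size: int) -> bytearray:
    return self.free_bufs[size].pop() if self.free_bufs[size] else bytearray(size)
  def release(self, buf: bytearray):
    self.free_bufs[len(buf)].append(buf)                          # keep for the next copyin of this size

cache = HostBufferCache()
a = cache.alloc(1024); cache.release(a)
assert cache.alloc(1024) is a                                     # reused, no fresh allocation
```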
George Hotz
e0e234bf94 hotfix, str compare version for cuda 2024-03-24 20:35:24 -07:00
Arseny Kapoulkine
715850aef9 Fix sm89 PTX=1 compilation (#3915)
* Fix sm89 PTX=1 compilation

The minimum PTX version that supports sm89 is 7.8 (same version also
supports sm90); without this ptxas fails when running tinygrad with
PTX=1 on RTX 4090.

* Use int(arch[3:]) for forward compat with SM10.0 if that happens
2024-03-24 20:32:29 -07:00
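
A sketch of the version selection in #3915. Only the sm_89 -> PTX ISA 7.8 pairing and the `int(arch[3:])` trick come from the commit; the fallback version is an assumption.

```
def ptx_version(arch: str) -> str:
  num = int(arch[3:])                    # "sm_89" -> 89; also works if a 3-digit arch ever shows up
  return "7.8" if num >= 89 else "7.5"   # fallback version is an assumption

print(ptx_version("sm_89"))              # 7.8, the minimum ISA that ptxas accepts for sm_89
```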
chenyu
83f39a8ceb env var to change default float (#3902)
* env var to change default float to fp16 or bf16

looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.

working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
    __bf16 cast0 = (nv_bfloat16)(val0);
```

remove that in cifar

* DEFAULT_FLOAT

* default of default

* unit test

* don't check default

* tests work on linux
2024-03-24 20:33:57 -04:00
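
A standalone sketch of the env-var plumbing in #3902. `DEFAULT_FLOAT` is the variable the commit adds; the accepted names and the helper below are illustrative.

```
import os

SUPPORTED = {"FLOAT32": "float32", "HALF": "float16", "BFLOAT16": "bfloat16"}  # names are illustrative

def default_float() -> str:
  name = os.getenv("DEFAULT_FLOAT", "FLOAT32").upper()
  assert name in SUPPORTED, f"unsupported default float {name}"
  return SUPPORTED[name]

print(default_float())  # e.g. DEFAULT_FLOAT=HALF python script.py -> float16
```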
George Hotz
03899a74bb increase atol on reset train 2024-03-24 15:17:31 -07:00
qazal
d8fafca13a assign regression (#3907)
* infra

* track mutations

* assign levels

* add seen back

* add test

* infra 2.0

* add assign targets

* dont need levels

* delete

* Update test_assign.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-24 15:12:31 -07:00
Szymon Ożóg
2d0bfdf01c ptx cleanup (#3893) 2024-03-24 14:54:45 -07:00
chenyu
2e39f57594 move lines around in ops_python wmma (#3911) 2024-03-24 17:14:26 -04:00
Patrick Tsai
e27129a798 Fix linearizer failure 26 test (#3906)
* Adjust adds between WHERE and PHI

* Not much better

* undo recursive change

* hm

* iterate over where, not factored op

* oo

* consts only for loop

* UNdo var name change

* update

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-24 16:34:13 -04:00
chenyu
10673d1447 tiny search cleanup (#3910)
* tiny search cleanup

removed some `assert isinstance(dev, Compiled)` and lines

* remove import
2024-03-24 14:20:55 -04:00
wozeparrot
9a9cac58f9 add lars to nn (#3750)
* feat: add lars

* feat: don't remove this comment

* clean: smaller diff

* clean: shorter line

* feat: remove mlperf lars, switch resnet

* fix: fully remove mlperf lars

* clean: comment

* feat: contiguous

* feat: no weight decay on skip params

* feat: optimizergroup

* feat: classic momentum

* fix: pylint

* clean: move comment

* fix: correct algo

* feat: lrschedulergroup

* feat: skip list tests

* feat: :| forgot that params are a thing

* feat: remove skip_list params from main params

* feat: set moment

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-24 11:43:12 -04:00
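
A minimal numpy sketch of a LARS step with classic momentum and a skip list that bypasses weight decay, as described in the bullets of #3750. Illustrative only, not the tinygrad nn optimizer.

```
import numpy as np

def lars_step(w, g, m, lr=0.1, momentum=0.9, wd=1e-4, trust=0.001, skip=False):
  if skip:                                        # skip-list params: no weight decay, no trust ratio
    local_lr, g_eff = lr, g
  else:
    w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
    ratio = trust * w_norm / (g_norm + wd * w_norm + 1e-12) if w_norm > 0 and g_norm > 0 else 1.0
    local_lr, g_eff = lr * ratio, g + wd * w
  m[:] = momentum * m + local_lr * g_eff          # classic momentum buffer
  return w - m

w, g, m = np.ones(4), np.full(4, 0.5), np.zeros(4)
print(lars_step(w, g, m))
```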
chenyu
8c8b57fd5f cleanup ops python (#3908)
i just want to merge lars!
2024-03-24 11:36:31 -04:00
chenyu
2c69888654 include negative float in test_dtype (#3884)
* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow
2024-03-24 02:39:15 -04:00
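
An illustration of the "pack can overflow" note in #3884: packing a value that does not fit in float16 raises.

```
import struct

try:
  struct.pack("e", 1e30)   # 1e30 does not fit in a float16
except OverflowError as e:
  print("OverflowError:", e)
```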
chenyu
e22d78b3d2 training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls in dataset preprocessing, which convert to float.

* simpler bf16 functions

* bf16 cifar works for HSA too, just very slow

* simpler bf16 functions, we love cuda
2024-03-24 01:37:47 -04:00
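
A sketch of the kind of helper the "simpler bf16 functions" in #3905 refers to: a float32 <-> bfloat16 round trip by bit manipulation (truncation, not rounding). Shown with numpy; not the CUDA/HSA code in the commit.

```
import numpy as np

def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
  # keep the top 16 bits of the float32 representation (truncation)
  return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(b: np.ndarray) -> np.ndarray:
  # put the bf16 bits back into the high half of a float32
  return (b.astype(np.uint32) << 16).astype(np.uint32).view(np.float32)

x = np.array([1.5, -2.25, 3.14159], dtype=np.float32)
print(bf16_bits_to_f32(f32_to_bf16_bits(x)))  # close to x, at bf16 precision
```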
Francis Lam
0145366323 wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it was incorrectly aliasing 16 into the size-8 upcast
on the store alias. now it splits it properly into 8 and the
remaining 2 into the correct local stride
2024-03-23 21:17:42 -04:00
sekstini
7c3632fd1e add --minimal flag to nvrtc (#3899) 2024-03-23 16:38:31 -07:00
chenyu
a2b2597fc2 replace dtype.name str with render_dtype (#3903)
fixed some bf16 cast issues since bf16 does not have `.name`.
also more robust if there are lang-specific type overrides
2024-03-23 19:25:48 -04:00
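
A toy illustration of why #3903 renders through a method instead of `dtype.name`: backends can override names such as bf16, which has no portable C spelling. The renderer class below is hypothetical; the CUDA spelling follows the error message quoted in #3902 above.

```
class ToyRenderer:
  # per-backend overrides; bf16 needs one because there is no generic C name for it
  type_map = {"bfloat16": "nv_bfloat16", "half": "half", "float": "float"}
  def render_dtype(self, dtype_name: str) -> str:
    return self.type_map.get(dtype_name, dtype_name)   # fall back to the generic name

print(ToyRenderer().render_dtype("bfloat16"))  # nv_bfloat16, instead of relying on dtype.name
```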
chenyu
24d004a89b hotfix check ckpts before writing achieved model (#3901)
this killed tinybox green run
2024-03-23 17:16:38 -04:00
chenyu
4d566f12b1 touchup einsum (#3900)
don't need rhs_letters
2024-03-23 16:46:39 -04:00
Alejandro F Queiruga
556dcfb8f2 Fix the result permutation in einsum (#3895)
* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-23 15:48:19 -04:00
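
A quick reference check of the behavior #3895 fixes: the result axes must follow the output subscripts' order. numpy is used here as the reference implementation.

```
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)
out = np.einsum("ij,jk->ki", a, b)   # note the transposed output spec
assert out.shape == (4, 2)           # axes ordered as "ki", not in contraction order
```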
nimlgen
4e18dd78d3 faster program start in llvm (#3897) 2024-03-23 15:20:15 +03:00
George Hotz
46a3501cec nv ioctl sniffer (#3892)
* nv ioctl sniffer

* unused import

* Update __init__.py

* that work

* that fix it
2024-03-23 00:29:30 -07:00
chenyu
18e0cef14d cheap less lines in ptx (#3890)
enough to merge lars
2024-03-23 01:12:31 -04:00
George Hotz
f0c4e06ffd fix cuda sync (#3888) 2024-03-22 19:02:30 -07:00
chenyu
2d3ce53348 touchup test_dtype.test_gradient_dtype (#3887)
add back bad merge from #3613 and add float.double and float.bfloat16 to test
2024-03-22 20:56:45 -04:00
David Hou
fc11808a79 initialize Tensor grad same type as self (#3613)
* initialize Tensor grad same type as self

* also test different default float

* check dtype + try/finally

* don't test_gradient_dtype if f16 is not supported

* fix bad merge

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-22 20:33:18 -04:00
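
A smoke-test sketch of the property #3613 enforces (assumes a working tinygrad install on a device with fp16 support):

```
from tinygrad import Tensor, dtypes

x = Tensor([1.0, 2.0, 3.0], dtype=dtypes.float16, requires_grad=True)
x.sum().backward()
assert x.grad.dtype == x.dtype   # grad should match the tensor's dtype, not the default float
```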
Francis Lam
8db7a6bbcc debug: add optional detailed BEAM_LOG logging (#3883)
* debug: add optional detailed BEAM_LOG logging

show uop count, compile and run times for each candidate in search

also add --timing to verify_kernel.py to make it easier to explore
hand-crafted applied opts

* fix linter
2024-03-22 19:23:31 -04:00
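
A bare-bones sketch of the per-candidate timing that BEAM_LOG in #3883 prints; the structure and names are assumptions, not tinygrad's search code.

```
import time

def time_candidate(compile_fn, run_fn):
  t0 = time.perf_counter(); prog = compile_fn(); t1 = time.perf_counter()
  run_fn(prog);             t2 = time.perf_counter()
  return {"compile_ms": (t1 - t0) * 1e3, "run_ms": (t2 - t1) * 1e3}

print(time_candidate(lambda: object(), lambda prog: None))
```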
chenyu
f7f67e0cc5 simple fix llama shard with quantize (#3882)
copy scale to all devices for now. naive sharding does not work because scale needs an expand to really save memory.

70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES.

`python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing --quantize`

13B on 6 GPUs uses 47 GB vs. 34 GB quantized
2024-03-22 18:15:37 -04:00
chenyu
ee502c8055 fixup to_movement_ops and add back to CI (#3881) 2024-03-22 18:14:49 -04:00
nimlgen
16e31f7f0d init multidevice cuda graph (#3858)
* init multidevice cuda graph

* cuda just works!

* clean

* linter happier

* liners happy

* update transfer inputs

* do not change free

* useless check for cuda

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 13:49:48 -07:00
George Hotz
0c197b9cf3 hotfix: hip bfloat formatting 2024-03-22 11:52:05 -07:00
George Hotz
54dc48aa47 fix assign (#3878)
* fix assign

* remove terrible optimizer hack

* oops, not realized assigns
2024-03-22 11:48:48 -07:00