Commit Graph

4618 Commits

Author SHA1 Message Date
chenyu
4ecd5789ab #include <tgmath.h> in ops_clang (#3927)
* different clang sqrt/log2/exp2/sin function based on dtype

fixed softmax_argmax issue in #3552 for clang.

* tgmath.h

* revert those
2024-03-25 17:48:57 -04:00
Arseny Kapoulkine
514c43201d Fix issues with pointer provenance in load/store through ALU (#3916)
* Track pointer provenance in load/store through ALU

Previously load/store could be incorrectly rendered into
ld.global/st.global when the input was an ALU op that performed an
address computation with DEFINE_LOCAL on one of the arguments.

* Simplify the load provenance workaround

The issue is that we can render the same code twice, and on the second
run the opstream is already modified so that vin[0] isn't a DEFINE_*,
which overwrites initially correct .shared wth .global.

* Add a couple tests for basic local use

* Skip local tests on LLVM since it doesn't implement DEFINE_LOCAL
2024-03-25 14:41:05 -07:00
chenyu
83f39a8ceb env var to change default float (#3902)
* env var to change default float to fp16 or bf16

looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.

working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
    __bf16 cast0 = (nv_bfloat16)(val0);
```

remove that in cifar

* DEFAULT_FLOAT

* default of default

* unit test

* don't check default

* tests work on linux
2024-03-24 20:33:57 -04:00
George Hotz
03899a74bb increase atol on reset train 2024-03-24 15:17:31 -07:00
qazal
d8fafca13a assign regression (#3907)
* infra

* track mutations

* assign levels

* add seen back

* add test

* infra 2.0

* add assign targets

* dont need levels

* delete

* Update test_assign.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-24 15:12:31 -07:00
Patrick Tsai
e27129a798 Fix linearizer failure 26 test (#3906)
* Adjust adds between WHERE and PHI

* Not much better

* undo recursive change

* hm

* iterate over where, not factored op

* oo

* consts only for loop

* UNdo var name change

* update

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-24 16:34:13 -04:00
wozeparrot
9a9cac58f9 add lars to nn (#3750)
* feat: add lars

* feat: don't remove this comment

* clean: smaller diff

* clean: shorter line

* feat: remove mlperf lars, switch resnet

* fix: fully remove mlperf lars

* clean: comment

* feat: contiguous

* feat: no weight decay on skip params

* feat: optimizergroup

* feat: classic momentum

* fix: pylint

* clean: move comment

* fix: correct algo

* feat: lrschedulergroup

* feat: skip list tests

* feat: :| forgot that params are a thing

* feat: remove skip_list params from main params

* feat: set moment

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-24 11:43:12 -04:00
chenyu
2c69888654 include negative float in test_dtype (#3884)
* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow
2024-03-24 02:39:15 -04:00
Francis Lam
0145366323 wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias.  now it splits it properly into 8 and the
remaining 2 into the correct local stride
2024-03-23 21:17:42 -04:00
chenyu
a2b2597fc2 replace dtype.name str with render_dtype (#3903)
fixed some bf16 cast issue since it does not have `.name`.
also more robust if there are lang specific type override
2024-03-23 19:25:48 -04:00
Alejandro F Queiruga
556dcfb8f2 Fix the result permutation in einsum (#3895)
* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-23 15:48:19 -04:00
chenyu
2d3ce53348 touchup test_dtype.test_gradient_dtype (#3887)
add back bad merge from #3613 and add float.double and float.bfloat16 to test
2024-03-22 20:56:45 -04:00
David Hou
fc11808a79 initialize Tensor grad same type as self (#3613)
* initialize Tensor grad same type as self

* also test different default float

* check dtype + try/finally

* don't test_gradient_dtype if f16 is not supported

* fix bad merge

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-22 20:33:18 -04:00
Francis Lam
8db7a6bbcc debug: add optional detailed BEAM_LOG logging (#3883)
* debug: add optional detailed BEAM_LOG logging

show uop count, compile and run times for each candidate in search

also add --timing to verify_kernel.py to make it easier to explore
hand-crafted applied opts

* fix linter
2024-03-22 19:23:31 -04:00
George Hotz
54dc48aa47 fix assign (#3878)
* fix assign

* remove terrible optimizer hack

* oops, not realized assigns
2024-03-22 11:48:48 -07:00
Francis Lam
5587594a00 fuzz_linearizer: add --ast and --file params to read kernels (#3877)
also fix up ast_str_to_str to support the new tuple of LazyOps
2024-03-22 14:27:40 -04:00
chenyu
c5467e5bd6 diverse test value in test_dtype DATA based on dtype (#3864)
* diverse test value in test_dtype DATA based on dtype

* eh fix typo

* that too?

* PTX does not support i8 and s8

* skip that

* unused line

* pus the hack back

* remove that
2024-03-22 14:22:06 -04:00
George Hotz
86ee36e697 preschedule all (#3875) 2024-03-22 11:20:06 -07:00
Szymon Ożóg
d8c3f1894a Use UOpGraph in test (#3876) 2024-03-22 14:12:38 -04:00
qazal
fe6ceff15f proposal: multioutput JIT spec (#3856)
* corealize JIT

* requirements
2024-03-21 21:28:30 -07:00
uuuvn
6729f20aab Ring allreduce try 2 (#3852)
* Ring allreduce v3

* Configurable size, number of gpus and jit in benchmark

* ScheduleBarrier v0

* GB/s that make sense

* ScheduleBarrier v0.1

* Fallback on 2 GPUs

* ScheduleBarrier v0.2

* ScheduleBarrier v0.3

* ScheduleBarrier v0.3.1

* ScheduleBarrier v0.3.2

* Replace ScheduleBarrier with automatic optimization

* unused import

* fix comment

* typing

* better fallback

* python 3.8

* RING=2 and use ContextVar

* DEBUG >= 2 and change name

* linter

* type

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-21 19:17:51 -04:00
Francis Lam
3c0478bfab fuzz_linearizer: add additional DEBUG info for comparison errors (#3866) 2024-03-21 18:58:10 -04:00
chenyu
e50b7abe4f diversed buf inputs based on dtype in fuzz_linearizer (#3863) 2024-03-21 16:23:11 -04:00
chenyu
30fa03243e reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures (#3861) 2024-03-21 14:12:27 -04:00
chenyu
33dd99acf4 remove helper_add_store from test_linearizer_failures (#3860) 2024-03-21 12:53:31 -04:00
chenyu
6bf0b82267 alloc new output in fuzz_linearizer between baseline and real one (#3859)
if the kernel is an assign `a += 1`, the rawbufs[0] is updated twice and gives false compare_error
2024-03-21 11:36:05 -04:00
nimlgen
85691c8e20 fix hsa sync issue (#3847)
* fix hsa sync issue

* linter
2024-03-21 04:00:30 +03:00
chenyu
f271cd682b user _resolve_dim in argmax (#3846)
also added comment of the behavior if there are multple, and more tests
2024-03-20 20:17:30 -04:00
Francis Lam
6d5dec2fef log optimized kernels and a script to compare with non-optimized ones (#3829)
* search: add BEAM_VERIFY option to validate search results

refactor fuzz_linearizer comparison to allow it to be used in for
BEAM_VERIFY in device.py

* search: fix to verify the beam_search result and not the fastest

* search: fix typing and clean up

* device: remove imports from test and add LOGKERN options

LOGKERN output can be used with test/external/verify_kernel.py
to validate correctness

* fix example in verify_kernel.py

* cleanup fixes

* fix to use f-strings
2024-03-20 19:22:08 -04:00
chenyu
519336cfea factor out partial in SumNode div int (#3841)
* factor out partial in SumNode div int

* div not rem

* space
2024-03-20 16:34:33 -04:00
George Hotz
8cb5215885 Revert "Ring allreduce in multitensor (#3000)" (#3840)
This reverts commit c5bf9e4c96.
2024-03-20 11:41:49 -07:00
uuuvn
c5bf9e4c96 Ring allreduce in multitensor (#3000)
* Ring allreduce v3

* Configurable size, number of gpus and jit in benchmark

* ScheduleBarrier v0

* GB/s that make sense

* ScheduleBarrier v0.1

* Fallback on 2 GPUs

* ScheduleBarrier v0.2

* ScheduleBarrier v0.3

* ScheduleBarrier v0.3.1

* ScheduleBarrier v0.3.2

* Replace ScheduleBarrier with automatic optimization

* unused import

* fix comment

* typing

* better fallback

* python 3.8

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-20 11:20:01 -07:00
chenyu
455f7bea9b test example from half resnet that idx has number outside of int32 (#3838)
* test example from half resnet that idx has number outside of int32

* ruff
2024-03-20 13:44:20 -04:00
chenyu
d17900bc45 use int32 instead of default_int in simplify_phi_loops (#3828)
* use int32 instead of default_int in simplify_phi_loops

indices are in int32 now and is separated from buffer dtype. fix #3823

* return early if not supported

* it's not that

* why is it failing for RHIP
2024-03-19 17:49:58 -04:00
chenyu
99cbc24390 use dtypes.int32 as return dtype for functions that return indices (#3827)
behavior matches jax. It's fine to have a tensor greater than max int8 size even if we set default int to int8
2024-03-19 17:06:57 -04:00
chenyu
fa1921ec7d move test_dtype tests to test dtype and output value (#3826) 2024-03-19 16:31:27 -04:00
Francis Lam
131bbb6563 test_linearizer_failure: add failure 27 from a gpt2 kernel (#3825)
* test_linearizer_failure: add failure 27 from a gpt2 kernel

found during a full fuzz test of applied_opts combos to a
depth of 4 on the gpt2 kernels w/o GROUPTOP.

added additional examples to failure 26 that don't have GROUPTOP

* add other platform failure
2024-03-19 16:29:50 -04:00
Francis Lam
9851e2c3b9 test_linearizer_failure: add failure 26 from a gpt2 kernel (#3821)
found during a full fuzz test of all applied_opts combos to a
depth of 3 on the gpt2 kernels
2024-03-19 13:19:54 -04:00
Patrick Tsai
b436c9792f Fix factoring bug (O(n) arange related) (#3817)
* Factoring bug

* Another one in case

* It works now so change tests back

* large arange cumsum optimization

* More cleanup

* symbolic no factor div test

* name change

* Rename test

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-19 11:49:42 -04:00
chenyu
a6ed2ae3c6 use old cumsum optimization for arange (#3813)
revert to old cumsum opt while phi simplification is disabled.

added a flops complexity test for this
2024-03-18 20:01:03 -04:00
chenyu
ac866eaf5a disable simplify_phi_loops (#3812)
* disble simplify_phi_loops

this breaks BEAM search GPT2.

* skip that
2024-03-18 19:25:26 -04:00
George Hotz
4c4d3cb3e3 restrict assignment to base (#3809)
* restrict assignment to base

* add some restrictions there

* more restrictions
2024-03-18 15:33:06 -07:00
chenyu
20681d5c4a remove old dist multigpu (#3811) 2024-03-18 18:31:05 -04:00
chenyu
5dd048a378 remove HIP in core tinygrad (#3810)
* remove HIP in core tinygrad

ci test uses device RHIP and HSA compiler (LinearizerOpt), so fine to remove HIP from tc.
Also updated README and EMULATE tc test flag

* EMULATE_CUDA
2024-03-18 18:19:27 -04:00
Francis Lam
a7afd2f6bf test_linearizer_failures: add failing kernel from GPT2 CUDA (#3808)
* test_linearizer_failures: add failing kernel from GPT2 CUDA

* test_linearizer_failure: remove "HIP" from failed_platforms
2024-03-18 17:16:40 -04:00
George Hotz
d8296d4a3f simple assign tests (#3807) 2024-03-18 13:57:01 -07:00
wozeparrot
a0ab755317 threefry again (#3785)
* feat: initial xor

* feat: initial threefly

* feat: remove custom random

* fix: really need to install precommit

* feat: lmao forgot that this is rotate not a shift

* clean: put that there

* feat: numpy xor

* feat: quick test for xor

* feat: llvm xor

* feat: slightly working xor in torch

* feat: rand works in jit

* clean: save a line

* feat: match jax

* feat: maybe test against jax

* feat: requires_grad

* fix: fix test_symbolic_ops

* feat: lower alpha

* feat: just pad

* fix: maybe fix training tests?

* fix: fix some llvm stuff

* feat: cursed realize on the way out

* feat: testing jax

* fix: why is the jax install process not simple

* fix: maybe passing test

* fix: symbolic workarounds

* clean: still need that precommit

* fix: aaaa

* fix: more test fixes

* fix: quick fix for wgsl

* feat: need to set requires_grad on the final tensor

* feat: one more tensor

* feat: don't take forever

* feat: seeing y ci is brok

* feat: can't allocate 64GiB lmao

* fix: fix this

* feat: hope this doesn't break smth before i go to bed

* feat: don't destroy ram

* feat: int

* feat: remove jax

* feat: properish workaround?

* feat: skip slow webgpu tests

* feat: no longer fails

* feat: use dtypes

* feat: real number

* fix: torch

* fix: don't test against reference for torch

* feat: to device

* feat: fix advanced indexing

* feat: correct casting

* feat: even rng_counter

* feat: match master

* feat: this was actually bad

* fix: maybe?

* feat: store

* feat: remove realizes

* feat: somehow this is important

* feat: somehow this is also important

* feat: save a line

* fix: don't need that anymore

* feat: restore this

* fix: linter

* feat: remove realizes

* fix: realized is in base now

* fix: add back cast

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: :(

* fix: :(

* fix: not being dumb

* feat: try changing less tests

* feat: shouldn't have to change that

* feat: contiguous bumps it by one

* fix: hmm

* fix: numpy memory moment

* fix: cl_khr_fp16

* fix: torch has different tensor count

* fix: missing contiguous

* hmm: hmm

* fix: some fixes

* fix: typing

* feat: dont do that

* feat: typing fixes

* feat: why is this realize required?

* feat: ngl kinda odd typing

* feat: oh

* feat: remove realizes

* feat: why is this realize required?

* fix: hacky patch for cudacpu

* fix: without this realize pytest crashes?????

* fix: shorter line

* fix: cudacpu fixes

* fix: cudacpu fixes

* feat: real buffer

* feat: don't search when searching lmao

* fix: can't use contiguous things

* fix: no more 100GB arrays

* fix: revert

* fix: skip 7 and 10

* feat: working ish beam

* feat: minimize changes

* feat: seed 0 stable diffusion example changed

* fix: different on ci

* fix: no beam

* feat: make threefry optional

* fix: check value

* fix: unused import

* feat: threefry default

* fix: 5d

* feat: allow non upcast div

* fix: 5d better

* fix: 5d better

* fix: save all dtype

* feat: proper error

* feat: lazyop key

* fix: check float

* feat: try removing this realize now

* feat: disable threefry for uops hip tensor cores

* feat: don't need that

* feat: only check upcast

* fix: disable threefry for some metal tests

* feat: disable for metal tensor uops as well

* feat: disable for most uops

* fix: disable threefry for new uops tests

* feat: multitensor

* fix: typing

* feat: threefry default off

* feat: skip threefry half rand

* feat: restore old

* fix: bad git

* clean: ruff

* feat: bfloat16 fix

* fix: :|

* feat: restore old

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-18 16:47:07 -04:00
nimlgen
629757eaa1 hotfix: update inputs of correct transfers in hsagraph (#3800)
* hotfix: update inputs of correct transfers in hsagraph

* test it

* run in ci?
2024-03-18 15:52:27 -04:00
George Hotz
0183a05f0a test assign (#3798)
* Reapply "add failing assign test (#3796)" (#3797)

This reverts commit 1e1beb888c.

* no realized check
2024-03-18 08:58:04 -07:00
George Hotz
1e1beb888c Revert "add failing assign test (#3796)" (#3797)
This reverts commit 2dea12832c.
2024-03-18 08:55:36 -07:00