Commit Graph

10490 Commits

wozeparrot
8845a5dbfd feat: begin immediate (#5539) 2024-07-17 16:11:21 -07:00
George Hotz
a6e70f8a71 clean up expand function [run_process_replay] (#5538)
* clean up expand function [run_process_replay]

* lil cleaner

* add a type
2024-07-17 15:02:00 -07:00
qazal
61ee02e93d start multireduce lowerer work (var/std) (#5537)
* multireduce no-opts works

* passed test_var_multireduce

* cleanup

* double reduce

* extra check for range_group

* more checking for range_groups

* cleaning up debug prints

* cleanup diff

* linters

* revert kernel changes

* these are uops toposort

---------

Co-authored-by: timmy <timmy0x@proton.me>
2024-07-17 23:43:46 +03:00
qazal
67ea4af01f depth first recurse_reduceops (#5536)
* early recurse

p2

* yeah, the cache shouldn't be there
2024-07-17 23:27:53 +03:00
Francis Lam
c4eb30a04c test/test_linearizer_failures: add a new beautiful_mnist one (#5531)
* test/test_linearizer_failures: add a new beautiful_mnist one

this one is from a DEPTH=2 fuzz_linearizer search

* add GPU to test_failure_40

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-17 16:27:04 -04:00
qazal
0259d76183 use Context only in replaying Kernel [run_process_replay] (#5535) 2024-07-18 03:46:14 +08:00
George Hotz
1a68854766 PatternMatcher add (#5532)
* PatternMatcher add [run_process_replay]

* f4 dynamic

* test_failure_36 is fixed

* fix PTX
2024-07-17 12:44:42 -07:00
qazal
d3c137d478 utility for computing reduceop output_shape (#5534)
* refactor to reduce_st

* update lazy
2024-07-17 22:40:07 +03:00
qazal
0a7872a62f use exec_alu in uops flop counting (#5511)
* use exec_alu for uops flop counting

* deal with sint
2024-07-17 22:39:27 +03:00
qazal
a7706e05f9 option to [skip_process_replay] (#5533) 2024-07-17 22:30:46 +03:00
chenyu
4193095f67 fix handcode_opt.py with DEBUG=2 (#5530)
only one ast per kernel now
2024-07-17 14:50:47 -04:00
chenyu
466555cd17 touchup Tensor.interpolate (#5525)
* touchup Tensor.interpolate and Tensor.lerp

rewrite lerp to save one sub and thus FLOPs.
use Tensor.lerp for interpolate, plus some minor cleanups

* revert lerp change
2024-07-17 13:35:57 -04:00
George Hotz
1242b302fa expand UOps with rewrite rules (#5501)
* expand UOps with rewrite rules [run_process_replay]

* progress

* much closer

* close, way less bugs

* bunch of expander tests

* fix contract

* ops tests pass

* fix barrier

* mostly passing

* bitcast in expanded ops

* support more expand merges

* all tests pass maybe

* fix empty EXPAND

* fix LIN fuzzing

* add ALL_SAME assert

* all same

* all same work

* raise CompileError

* pass fuzz linearizer

* revert whitespace

* fix nv tensor core test

* fix mypy

* bug fix

* fuzzer passes

* put tests back

* expand arg to idx
2024-07-17 10:17:50 -07:00
George Hotz
158221b36b expand tests from uop_expander [run_process_replay] (#5524)
* expand tests from uop_expander

* more changes from the branch
2024-07-17 09:22:36 -07:00
George Hotz
42c25cc961 fix fixup_ast (#5523)
* fix fixup_ast

* these lin failures are fixed
2024-07-17 08:52:21 -07:00
qazal
fbe0233be3 infra for multi reduce asts (#5522)
* add reduce_info

* _recurse_reduceops base

* derive output shape

* refactor

* delete reduce_for_op

* save lines

* more line saving
2024-07-17 17:23:46 +03:00
nimlgen
dcd462860f elf loader (#5508)
* elf loader

* cleanup

* cleaner

* cleaner

* fixes

* revert this

* fix div 0

* fix nv

* amd fix

* fix mockgpu

* amd better?

* restore relocs for <12.4

* linter

* this is fixed now

* revert this

* process cdefines as function

* cleaner

* align

* save lines

* revert this change
2024-07-17 17:09:34 +03:00
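For context on the elf loader work above: any such loader starts by validating the ELF identification bytes. This is an illustrative sketch of that first step only (not tinygrad's actual loader code), following the ELF specification's `e_ident` layout:

```python
# Minimal ELF ident parser (illustrative; not tinygrad's implementation).
def parse_elf_ident(data: bytes) -> tuple[int, int]:
    # All ELF files begin with the magic bytes 0x7f 'E' 'L' 'F'.
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class = data[4]  # 1 = ELFCLASS32, 2 = ELFCLASS64
    ei_data = data[5]   # 1 = little-endian, 2 = big-endian
    return ei_class, ei_data
```

A 64-bit little-endian object, as produced by typical GPU toolchains, would report `(2, 1)`.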
nimlgen
661da32aff nv do not map regions twice (#5521) 2024-07-17 11:20:02 +03:00
Francis Lam
2d53abb04a test/external/fuzz_linearizer: fix for new AST changes (#5519)
* test/external/fuzz_linearizer: fix for new AST changes

also add beautiful_mnist failures

* add CLANG and LLVM to test_failure_35 failed_platforms

* fix test_linearizer_failure names
2024-07-17 00:08:07 -04:00
Tobias Fischer
85d4ca7caa FID Inception Model (#5516)
* added model impl

* minor cleanups

* extracted weights loading into from_pretrained

* reorganized model for better weight loading

* removed lru cache for state dict loading
2024-07-16 23:12:03 -04:00
chenyu
4ad83d032e remove Kernel.lazyops [run_process_replay] (#5517)
always use Kernel.ast.lazyops
2024-07-16 19:47:42 -04:00
wozeparrot
1c1d6d3a4a feat: show caller when tracemeta >= 2 (#5514) 2024-07-16 15:06:02 -07:00
chenyu
5aad043522 cleanup fixup_ast local shape long line [run_process_replay] (#5513) 2024-07-16 17:29:38 -04:00
chenyu
6e405b0a2b add 0d tensor to trunc/floor/ceil/round tests (#5512)
the existing trunc test passes backward, but its backward is incorrect in general; added tests that would fail
2024-07-16 16:48:25 -04:00
chenyu
0afcbfae84 docs: add Tensor.interpolate to doc page (#5510) 2024-07-16 14:17:19 -04:00
Tobias Fischer
87a2ef2bc2 Add Interpolate Function (#5482)
* add interpolate function

* fixed linter issue

* reduced sizes in test

---------

Co-authored-by: wozeparrot <wozeparrot@gmail.com>
2024-07-16 09:44:01 -07:00
gswangg
203161c75d refactor VECTORIZE/GEP rules (#5507) 2024-07-16 09:41:23 -07:00
qazal
173064c69c (re)start multireduce in codegen/* (#5391)
* test_var_multireduce

* run verify_lazyop

* test_var_multireduce

* assert lazyop

* add test_indexing_multireduce

* arange fuses (crude)

* note: extra reshape

* start readble

* test_arange_simple

* test_arange_expanded

* test_indexing_multireduce

* cleanups

* skip ptx

* skip nv and amd ci

* skip arange expanded too

* GPU=1 is slow too in CI
2024-07-16 14:20:48 +03:00
chenyu
07ff4b7d24 test_failure_33 ast that has UOps.UNMUL after linearize (#5504)
* test_failure_33 ast that has UOps.UNMUL after linearize

* smaller
2024-07-15 22:54:23 -04:00
chenyu
1ccd987e6a simpler tc permaxis in fixup_ast.fix_st [run_process_replay] (#5502) 2024-07-15 21:35:32 -04:00
George Hotz
9d4c3c553c prepare expand to support multiexpand [run_process_replay] (#5503) 2024-07-15 18:21:24 -07:00
chenyu
fd43d33b7d shave some lines from transcend math [run_process_replay] (#5500)
* shave some lines from transcend math [run_process_replay]

* put input_dtype back
2024-07-15 21:02:24 -04:00
chenyu
63990705b5 test kernel opts case for 4 local and 4 groups (#5499)
make sure local grouped dim is correct
2024-07-15 20:09:38 -04:00
Alessandro Benetti
13e200b437 add strict mkdocs check (#5497) 2024-07-15 14:21:37 -07:00
nimlgen
8dfd11c1d8 docs: hcq add types (#5495)
* docs: hcq add types

* linter
2024-07-15 22:14:48 +03:00
George Hotz
aab1e8c6dc uniform init to match torch (#5494) 2024-07-15 12:07:44 -07:00
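As background for the "uniform init to match torch" commit: PyTorch draws Linear weights from a uniform distribution bounded by 1/sqrt(fan_in). This is a hedged sketch of that scheme, not the commit's actual diff:

```python
import math
import random

# Torch-style uniform init sketch: weights ~ U(-k, k), k = 1 / sqrt(fan_in).
def linear_weight_init(in_features: int, out_features: int) -> list[list[float]]:
    k = 1.0 / math.sqrt(in_features)
    return [[random.uniform(-k, k) for _ in range(in_features)]
            for _ in range(out_features)]
```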
George Hotz
338b7590b9 hotfix: docs for BatchNorm 2024-07-15 12:04:17 -07:00
nimlgen
c9ec7ce070 start hcq docs (#5411)
* start hcq docs

* more hcq docs

* docs

* docs

* linter

* correct args

* linter

* ts returns int
2024-07-15 21:31:11 +03:00
Edward Wang
9a7d5a148e move colorize_float to helpers.py (#5490)
* add colorize_float to helpers.py

* update references
2024-07-15 11:29:03 -07:00
P4ssenger
a347d91e0e remove outdated thread local aliases (#5493) 2024-07-15 11:28:11 -07:00
qazal
ac08f0eb00 reshape rawbufs in test_linearizer (#5492)
* reshape rawbufs in test_linearizer

* fix helper_linearizer_ast
2024-07-15 19:14:38 +03:00
qazal
ae4cb7994e run process replay with DEBUG=0 (#5491)
* process replay with DEBUG=0

* graceful shutdown

* use and
2024-07-15 16:30:57 +03:00
Tobias Fischer
e219103677 Add Pad to Pooling (#5488) 2024-07-14 21:50:20 -07:00
chenyu
eef43c9f49 include dims in kernel/nv invalid err msg (#5487) 2024-07-14 22:51:30 -04:00
chenyu
c80801c266 len(full_shape)-ki.upcasted -> first_upcasted (#5485)
[run_process_replay]
2024-07-14 20:21:18 -04:00
Tobias Fischer
5849130cbb gather negative dim fix (#5486) 2024-07-14 20:20:53 -04:00
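The "gather negative dim fix" above addresses a standard tensor-API concern: a negative `dim` indexes from the end. A common normalization (illustrative only, not tinygrad's exact code) looks like:

```python
# Map a possibly negative dim into [0, ndim), e.g. dim=-1 -> ndim - 1.
def canonicalize_dim(dim: int, ndim: int) -> int:
    if not -ndim <= dim < ndim:
        raise IndexError(f"dim {dim} out of range for {ndim}-D tensor")
    return dim % ndim
```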
qazal
3c378efcb6 process replay docs improvements (#5481)
* minor cleanups

* docs and logs

* shorter

* comma

* s/print/logging.info [run_process_replay]

* use logging.warn

* process name is noise

* revert lowerer change [run_process_replay]
2024-07-15 00:09:28 +03:00
chenyu
613a1dbeed render lidx starting with 0 (#5478)
* render lidx starting with 0

changed from
```
  int gidx0 = gid.x; /* 4096 */
  int lidx4 = lid.x; /* 8 */
  int gidx1 = gid.y; /* 7 */
  int lidx5 = lid.y; /* 8 */
  int gidx2 = gid.z; /* 7 */
  int lidx6 = lid.z; /* 2 */
```
to
```
  int gidx0 = gid.x; /* 4096 */
  int lidx0 = lid.x; /* 8 */
  int gidx1 = gid.y; /* 7 */
  int lidx1 = lid.y; /* 8 */
  int gidx2 = gid.z; /* 7 */
  int lidx2 = lid.z; /* 2 */
```

the existing numbering continued from the pre-limited global dims, which skip numbers when there are more than 3 global dims

* don't need start_dim

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-07-14 16:34:04 -04:00
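The before/after snippets in the commit above boil down to giving local indices their own counter instead of continuing the global numbering. A minimal sketch of the renumbering (hypothetical helper, derived only from the shown output):

```python
# Globals and locals each count from 0 independently.
def name_dims(n_global: int, n_local: int) -> list[str]:
    return ([f"gidx{i}" for i in range(n_global)] +
            [f"lidx{i}" for i in range(n_local)])
```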
qazal
671779f280 limit process replay diff to ~20% of kernels (#5480)
* render lidx starting with 0

changed from
```
  int gidx0 = gid.x; /* 4096 */
  int lidx4 = lid.x; /* 8 */
  int gidx1 = gid.y; /* 7 */
  int lidx5 = lid.y; /* 8 */
  int gidx2 = gid.z; /* 7 */
  int lidx6 = lid.z; /* 2 */
```
to
```
  int gidx0 = gid.x; /* 4096 */
  int lidx0 = lid.x; /* 8 */
  int gidx1 = gid.y; /* 7 */
  int lidx1 = lid.y; /* 8 */
  int gidx2 = gid.z; /* 7 */
  int lidx2 = lid.z; /* 2 */
```

the existing numbering continued from the pre-limited global dims, which skip numbers when there are more than 3 global dims

* don't need start_dim

* add changed

* env var

* more early exit

* simpler?

* Revert "Merge branch 'lidx0' into process_replay_limit"

This reverts commit cbadcfa5e9, reversing
changes made to fc9bf37ee7.

* minor cleanup

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-14 23:10:08 +03:00
chenyu
f8a47608cc test dtype.min and dtype.max (#5479)
compared with np.iinfo for integer dtypes
2024-07-14 15:31:37 -04:00
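The dtype.min/max test above checks tinygrad's bounds against NumPy's `np.iinfo`. As a hedged illustration of what such a comparison verifies, integer bounds follow directly from the bit width:

```python
# Integer dtype bounds from bit width (matches what np.iinfo reports).
def int_bounds(bits: int, signed: bool) -> tuple[int, int]:
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1
```

For example, `int_bounds(8, True)` gives the familiar int8 range of -128 to 127.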