Commit Graph

4897 Commits

Author SHA1 Message Date
hikettei
ad1ca7da64 [Feature] Added BinaryOps.AND/BinaryOps.OR (#5223)
* [Feature] Added BinaryOps.AND/BinaryOps.OR

* Add: __rand__, __ror__
2024-06-29 17:20:25 -07:00
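
A minimal sketch of what this adds at the Tensor level, assuming the usual `from tinygrad import Tensor` entry point; `&`/`|` map to the new BinaryOps.AND/BinaryOps.OR and `__rand__`/`__ror__` handle the reflected forms (exact dtype handling may differ):

```python
from tinygrad import Tensor

a = Tensor([True, True, False, False])
b = Tensor([True, False, True, False])

print((a & b).tolist())     # elementwise AND -> [True, False, False, False]
print((a | b).tolist())     # elementwise OR  -> [True, True, True, False]
print((True | a).tolist())  # reflected form, dispatched through __ror__
```
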
chenyu
50b05dd3f4 tqdm minor cleanup (#5229)
combined some if branches
2024-06-29 18:58:24 -04:00
chenyu
b2ea610df8 fix tqdm unit_scale and support hours in time (#5227)
* fix tqdm unit_scale and support hours in time

previously it only supported MM:SS.
added more characters to unit scales, stripped trailing "." and " " in formatting, and added more tests

* simpler
2024-06-29 14:48:51 -04:00
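
A hypothetical formatter illustrating the MM:SS to HH:MM:SS behavior this fix describes (not the actual `tinygrad.helpers.tqdm` code):

```python
def fmt_time(seconds: float) -> str:
  # roll minutes over into hours only once the elapsed time crosses an hour
  m, s = divmod(int(seconds), 60)
  h, m = divmod(m, 60)
  return f"{h:02d}:{m:02d}:{s:02d}" if h else f"{m:02d}:{s:02d}"

assert fmt_time(75) == "01:15"
assert fmt_time(3725) == "01:02:05"
```
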
qazal
f374fb77af assert bool dtype for valid [run_process_replay] (#5214)
* valid is always bool

* prevent NumNode to begin with

* part 2

* test: disable pattern matchers, asserts should pass

* test: store without cast

* test: if (0)

* cleanup time

* only pattern match bool literal

* better for upstream debug
2024-06-29 21:20:32 +03:00
qazal
3f4eeb8b54 late UOps.IF generation [run_process_replay] [no_assert] (#5027)
* find all places

* test gates

* test

* gate based on depths

* add ctx

* that cache was so wrong

* delete useless things

* dont double write if

* self.if_cond

* move UOps.IF to gated store

* test_padto_where_multioutput

* test_padto_group

* minor cleanup

* hmm this actually works?

* need a good barrier

* merge 2

* delete ctx

* p1

* maybe p2

* p3

* minor fixup

* fixup 2

* smart thing from the Lowerer branch

* refactoring

* refactoring 2

* maybe before graph_rewrite

* slightly more acceptable Linearizer diff

* more correct

* [run_process_replay] [no_assert]
2024-06-29 12:22:14 -04:00
chenyu
42d1f92fc1 simpler tqdm (#5221)
can do more, but many cases are not tested
2024-06-29 07:41:46 -04:00
nimlgen
dd7eef7d71 libc defs to autogen (#5217)
* libc defs to autogen

* amd import libc

* linter

* better a bit

* remove comment, check this

* not hardcoded path
2024-06-29 14:37:33 +03:00
nimlgen
6b08cb5e38 ptx runs on nv in benchmarks (#5224) 2024-06-29 11:06:44 +03:00
nimlgen
b4c49ae3fa remove cudacpu in favour of mockgpu (#5225)
* remove cudacpu in favour of mockgpu

* remove unused import

* not used as well
2024-06-29 11:05:16 +03:00
nimlgen
ee02dcb98e nv supports PTX=1 (#5222)
* nv supports PTX=1

* not needed

* split nv compiler into nvrtc autogen

* remove to_c_array

* test

* Revert "test"

This reverts commit f0b56f308b.
2024-06-29 10:46:29 +03:00
wozeparrot
7bcb74ab23 feat: tag 0.9.1 (#5220) v0.9.1 2024-06-28 20:16:14 -07:00
George Hotz
7f46bfa587 hotfix: docs touchup 2024-06-28 14:36:20 -07:00
nimlgen
c941a58581 amd refactor queue creation (#5216)
* amd refactor queue creation

* fixes

* use data64_le

* fix linter
2024-06-28 23:24:49 +03:00
chenyu
7ba4938510 simplify View.permute arg check [run_process_replay] (#5218)
it checks if `axis` is a valid permutation, which is the same as `sorted(axis) == list(range(len(self.shape)))`
2024-06-28 16:18:46 -04:00
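
A plain-Python illustration of the equivalence this simplification relies on (not the actual View.permute source):

```python
def is_valid_permutation(axis: tuple, ndim: int) -> bool:
  # axis is a permutation of the dims iff sorting it yields 0..ndim-1
  return sorted(axis) == list(range(ndim))

assert is_valid_permutation((2, 0, 1), 3)
assert not is_valid_permutation((0, 0, 1), 3)  # duplicates rejected
assert not is_valid_permutation((0, 1, 3), 3)  # out-of-range index rejected
```
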
George Hotz
80ac21200b hotfix: linearizer test fixup 2024-06-28 10:52:25 -07:00
George Hotz
c9714dfcf4 rename graph to children [run_process_replay] (#5215) 2024-06-28 09:53:52 -07:00
kormann
6c456b6d66 remove uopgraph dedup + slight speedup (#5199)
* rm dedup

* rm dedup

* tests

* reduce diff

* oups

* reduce diff

* rm UOp.tuple
2024-06-28 09:26:32 -07:00
nimlgen
9b08a9397c amd inline bf16 funcs (#5212) 2024-06-28 18:45:00 +03:00
chenyu
7090eac8cb validate sdxl output and put it in benchmark (#5211)
* validate sdxl output and put it in benchmark

* don't print fetch progress_bar in CI
2024-06-28 11:40:52 -04:00
chenyu
63fa4e2a0e fix seed = 0 in sdxl (#5209)
also removed a few unneeded realize and contiguous calls
2024-06-28 08:48:59 -04:00
Tobias Fischer
4688f97d48 Add SDXL Inference to Examples (#5206)
* added sdxl inference code

* fixed trailing whitespace

* use original impl code, removed unneeded numpy calls
2024-06-28 07:42:28 -04:00
qazal
3e56c8422c remu err handling (#5208)
* add error handling

* use pre release

* minor

* works
2024-06-28 13:15:18 +03:00
nimlgen
7f7fa26e03 allow hugepage failure in memadvise (#5207) 2024-06-28 11:41:10 +03:00
chenyu
73395b998b better error msg for TinyJit inside TinyJit (#5202)
it's possible to support TinyJit inside TinyJit, but there are edge cases like two TinyJit functions sharing another TinyJit function, so just give a more precise error for now
2024-06-27 18:09:19 -04:00
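
A minimal sketch of the pattern the new error targets, assuming the usual `TinyJit` decorator usage (the exact message and capture behavior may differ):

```python
from tinygrad import Tensor, TinyJit

@TinyJit
def inner(x: Tensor) -> Tensor:
  return (x * 2).realize()

@TinyJit
def outer(x: Tensor) -> Tensor:
  # a jitted function calling another jitted function: the case that now
  # raises a clearer error instead of failing in a confusing way
  return (inner(x) + 1).realize()
```
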
nimlgen
ac748cccdb nv apply relocs (#5165)
* nv do reloc

* a bit cleaner
2024-06-27 23:54:16 +03:00
Roelof van Dijk
540ebdf47c missing init files (#5196) 2024-06-27 15:30:02 -04:00
chenyu
d8dc43ad06 remove JIT_BATCH_SIZE=4 from gpt2 NV benchmark (#5198)
this no longer helps
2024-06-27 15:20:34 -04:00
George Hotz
345bcc2099 move graph_dedup out of class [run_process_replay] (#5197) 2024-06-27 12:04:00 -07:00
George Hotz
d094a6828f single pass rewrite (#5159)
* single pass rewrite

* claude cleanups

* claude cleanups

* skip those tests

* restrict that to ints

* comment

* asserts i don't expect to fail do fail

* simplest...rewrite...ever

* simplest...rewrite...ever

* add that rule back

* tests pass?

* only collapse reduce loops

* second SHL/SHR arg must be 4 bytes

* fix verify

* no SHL/SHR in ptx

* put that back

* skip them in PTX...bad tests
2024-06-27 11:36:05 -07:00
Roelof van Dijk
1ff9bbaa61 ruff: close file handle (#5180)
* close file handle

* some more open file handles

* must stay open

* remove this close, stays open
2024-06-27 11:29:47 -07:00
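
A generic before/after of the pattern this lint pass enforces (illustrative file name, not the actual call sites):

```python
with open("config.json", "w") as f:
  f.write("{}")  # create a small file so the example below actually runs

# before: the handle is left for the garbage collector to close
data = open("config.json").read()

# after: the context manager closes the file deterministically
with open("config.json") as f:
  data = f.read()
```
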
chenyu
83da8b3558 use NV instead of CUDA in benchmark (#5192)
also reenabled mixtral on green
2024-06-27 13:52:58 -04:00
chenyu
0c6c7c5f7b CACHELEVEL=0 -> IGNORE_BEAM_CACHE=1 in benchmark (#5191)
ignoring the beam cache while still using the compile cache should be fine, and it saves some benchmark time.

also updated `beam_search` to check flag value before accessing diskcache
2024-06-27 13:15:18 -04:00
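
A hedged sketch of the described control flow (illustrative names, not the real `beam_search` internals); `getenv` is from `tinygrad.helpers`, and the cache read here is a stub:

```python
from tinygrad.helpers import getenv

def _diskcache_lookup(key):  # stand-in for the real disk-cache read
  return None

def lookup_beam_cache(key):
  # consult the flag before touching the disk cache at all, so BEAM results are
  # recomputed while the compiled-kernel cache keeps working
  if getenv("IGNORE_BEAM_CACHE"):
    return None
  return _diskcache_lookup(key)
```
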
chenyu
c12de4f47d benchmark use JITBEAM for llama and gpt2 (#5189) 2024-06-27 12:56:02 -04:00
chenyu
ad91962dcf CACHECOLLECTING -> CAPTURING and don't capture clear_l2 (#5190)
fixed first-time BEAM slowness
2024-06-27 12:32:28 -04:00
Roelof van Dijk
01e8838b65 ruff: suppressible-exception (#5182)
* fix: use contextlib to suppress errors

* enable rule SIM105

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-06-27 08:23:44 -07:00
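
The shape of the SIM105 rewrite, with an illustrative call site:

```python
import contextlib, os

# before (what SIM105 flags): an except clause that only passes
try:
  os.remove("stale.lock")
except FileNotFoundError:
  pass

# after: contextlib.suppress states the intent directly
with contextlib.suppress(FileNotFoundError):
  os.remove("stale.lock")
```
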
Roelof van Dijk
9704c7d4d4 ruff rule if-exp-instead-of-or-operator (FURB110) (#5178)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-06-27 08:22:19 -07:00
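
The shape of the FURB110 rewrite, with illustrative names:

```python
explicit, default = None, "fallback"

# before: a conditional expression that only provides a fallback
value = explicit if explicit else default

# after: the or-operator says the same thing
value = explicit or default

assert value == "fallback"
```
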
chenyu
5b8fda3c65 fix: JIT=0 means no JIT (#5188) 2024-06-27 10:31:37 -04:00
qazal
3af17849bf safely parse quoted titles [run_process_replay] (#5183) 2024-06-27 16:39:48 +03:00
Roelof van Dijk
975b811ad9 names shadowing builtins (#5179)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-06-27 08:15:01 -04:00
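
An illustrative rename of the kind this change makes (hypothetical function, not an actual call site):

```python
# before: `input` and `type` shadow Python builtins inside the function
def parse(input, type):
  return type(input)

# after: renamed parameters keep the builtins reachable and read less ambiguously
def parse(raw: str, target_type: type):
  return target_type(raw)

assert parse("42", int) == 42
```
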
Roelof van Dijk
26e254c42b ruff: else-raise and else-return (#5175)
* ruff: enable else-raise and else-return

* ruff: add error names

* fix order

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-06-27 07:54:59 -04:00
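
The shape of the else-return/else-raise rewrite, on a made-up helper:

```python
# before: the else branch is flagged because the if branch already returns
def device_or_default(device):
  if device is not None:
    return device
  else:
    raise ValueError("no device given")

# after: dropping the redundant else flattens the function
def device_or_default(device):
  if device is not None: return device
  raise ValueError("no device given")

assert device_or_default("CUDA") == "CUDA"
```
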
Roelof van Dijk
f88f71d73a ruff: unnecessary-comprehension (#5174)
* enable ruff C416 unnecessary-comprehension

* already a list
2024-06-27 07:45:29 -04:00
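
The shape of the C416 rewrite, with an illustrative iterable:

```python
items = range(5)

# before: the comprehension only re-iterates its input
copied = [x for x in items]

# after: the constructor expresses the copy directly
copied = list(items)

assert copied == [0, 1, 2, 3, 4]
```
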
reddyn12
f1c7944c44 Fix batchnorm shapes for resnet.load_pretrained (#5167)
* Fix batchnorm shapes

* make it general reshape
2024-06-26 18:44:10 -04:00
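
A hedged sketch of the "general reshape" idea (hypothetical helper, not the actual `load_pretrained` code): when a pretrained tensor and the model parameter hold the same number of elements but different shapes, e.g. a batchnorm weight stored as (1, 64, 1, 1) versus the model's (64,), reshape rather than error:

```python
from math import prod
from tinygrad import Tensor

def load_param(param: Tensor, loaded: Tensor) -> Tensor:
  if param.shape != loaded.shape:
    assert prod(param.shape) == prod(loaded.shape), "element count mismatch"
    loaded = loaded.reshape(param.shape)  # general reshape to the destination shape
  return param.assign(loaded)
```
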
George Hotz
396ce6cfc9 clean up graph dedup function [run_process_replay] (#5169) 2024-06-26 15:07:34 -07:00
kormann
3a04e518ec print_tree UPat +fix (#5132)
* fix and extend print_tree

* typing

* typing

* fix upat

* fix none

* ws

* rm prefix

* mv luop dag

* typo

* test print_tree
2024-06-26 15:02:19 -07:00
chenyu
0ba093dea0 hotfix: only validate stable diffusion when using threefry (#5166) 2024-06-26 16:50:38 -04:00
chenyu
e4a5870b36 validate stable_diffusion output (#5163)
changed default steps, forgot to update validation
2024-06-26 16:42:21 -04:00
nimlgen
21b225ac45 llama3 download works (#5160) 2024-06-26 22:45:13 +03:00
wozeparrot
c91b3c4079 shard llama3 on 0 sometimes (#5157) 2024-06-26 11:50:57 -07:00
Roelof van Dijk
294bd1a9ff refactor: name check [run_process_replay] (#5158) 2024-06-26 11:39:41 -07:00
Roelof van Dijk
2c80583e14 perf: cache const UOp creation [run_process_replay] (#5156) 2024-06-26 11:13:14 -07:00