Commit Graph

4923 Commits

Vyacheslav Pachkov
d3e4e21759 add return type for HCQCompatAllocator _alloc (#5267)
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-07-03 10:25:44 +03:00
chenyu
191463a919 add timing to SDXL (#5273) 2024-07-02 23:29:54 -04:00
chenyu
b2c3a28a5e nn.RMSNorm (#5272)
the norm itself does not add significant value as a Tensor method, but we would want Tensor.normalize
2024-07-02 21:39:01 -04:00
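For reference, a minimal sketch of the standard RMSNorm computation this module presumably implements (the function signature, parameter names, and eps default here are illustrative, not necessarily tinygrad's):

```python
from tinygrad import Tensor

def rms_norm(x: Tensor, weight: Tensor, eps: float = 1e-6) -> Tensor:
  # normalize by the root mean square over the last axis, then apply the learned scale
  return x * (x.square().mean(axis=-1, keepdim=True) + eps).rsqrt() * weight

out = rms_norm(Tensor.randn(2, 8), Tensor.ones(8))
```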
chenyu
9a2a82a77f test stable diffusion unet in ci (#5268)
unet is parameterized now, so CI can test a smaller one
2024-07-02 21:37:52 -04:00
chenyu
ce52b10f6f add a flag DISABLE_LOOP_COLLAPSE (#5270)
workaround if a user encounters an UNMUL error
2024-07-02 20:01:11 -04:00
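Like other tinygrad flags, this is presumably read from the environment, so a run that hits the UNMUL error could be retried along these lines (illustrative usage, not from the commit):

```python
import os
os.environ["DISABLE_LOOP_COLLAPSE"] = "1"  # set before tinygrad reads the flag
from tinygrad import Tensor
# ... then run the offending workload as usual
```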
George Hotz
e53b164e1a small changes from lowerer (#5266) 2024-07-02 15:03:54 -07:00
nimlgen
7be776f9af add _alloc_signal/_free_signal to hcq (#5264)
* add _alloc_signal/_free_signal api

* oops, revert this

* linter
2024-07-02 23:35:39 +03:00
Tobias Fischer
9a25ee0b9a fixed unet call params (#5262) 2024-07-02 12:40:27 -04:00
qazal
59bc837ad1 refactor gated load rendering [run_process_replay] (#5259)
* refactor gated load rendering [run_process_replay]

* hotfix: extra line

* remove llvm diff
2024-07-02 15:13:10 +03:00
nimlgen
e050603b4b nv close fds after mapping (#5246) 2024-07-02 13:57:46 +03:00
qazal
d3cfb6c2e3 refactor UOps.LOAD barrier [run_process_replay] (#5258) 2024-07-02 13:48:47 +03:00
qazal
a1044e6063 iterate over scoped uops once [run_process_replay] (#5255) 2024-07-02 09:21:09 +03:00
wozeparrot
dfbee4f0f5 feat: add blobfile to testing (#5254) 2024-07-01 19:33:58 -07:00
Tobias Fischer
8c9c1cf62f Pulled CLIP and UNet into Separate Files (#5253)
* pulled clip and unet into separate files

* reference cleanup, lru cache fix

* better pool indexing
2024-07-01 22:33:01 -04:00
chenyu
5808c37302 hotfix disable flaky llama3 beam benchmark on green (#5249) 2024-07-01 15:00:47 -04:00
chenyu
b9122ecdaf revert stable diffusion validation with threefry (#5248)
* Revert "use threefry in stable diffusion benchmark (#4988)"

This reverts commit 44dfa37c70.

* sdxl and validation fix

* relax threshold
2024-07-01 14:43:47 -04:00
nimlgen
57e89645cd hcq spec test (#5226)
* start hcq spec test

* more test

* fixes

* run on amd as well

* test amdgpu exec

* fix amd

* amd mockgpu support sdma timestamp
2024-07-01 17:36:37 +03:00
Carson Powers
d7839fdc5f Add x!=0 -> (bool)x pattern [run_process_replay] [no_assert] (#5237)
* x!=0 -> (bool)x pattern

* bool != bool pattern

* redundant upat
2024-06-30 17:48:45 -07:00
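The equivalences behind these rewrites can be sanity-checked in plain Python (this illustrates the rules themselves, not the actual UPat definitions):

```python
# for any integer x, (x != 0) is the same as casting x to bool
assert all((x != 0) == bool(x) for x in range(-3, 4))
# for booleans, (a != b) behaves like XOR, so a separate pattern covers bool != bool
assert all((a != b) == (a ^ b) for a in (False, True) for b in (False, True))
```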
George Hotz
14980f79dd hotfix: unbreak llama 2024-06-30 15:27:54 -07:00
George Hotz
146eb3a811 hotfix: add repeat_interleave docs 2024-06-30 15:25:18 -07:00
George Hotz
3df47bc21e OpenELM + repeat_interleave (#5234)
* start writing openelm

* progress...hit bug

* repeat_interleave support

* gqa

* add rotary embedding

* spp

* i think it runs correctly

* broken

* output is good now

* cleanups

* no io_uring on android
2024-06-30 15:18:39 -07:00
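The new repeat_interleave presumably mirrors the torch op of the same name; a small usage example under that assumption:

```python
from tinygrad import Tensor

t = Tensor([1, 2, 3])
# each element is repeated in place before moving to the next
print(t.repeat_interleave(2).tolist())  # [1, 1, 2, 2, 3, 3]
```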
nimlgen
7b7b751513 simple hip backend for debugging (#5201)
* hip backend

* fix mypy

* shorter

* fixes

* tiny changes
2024-06-30 23:00:11 +03:00
chenyu
88763eb9ff fix stable_diffusion with fp16 (#5239) 2024-06-30 12:59:31 -04:00
chenyu
649641a2f2 fix tqdm with generator without __len__ (#5238)
it should be treated as total = 0 (just show the iteration count).
Also removed a duplicated ": " in fetch and fixed unit scale with total = 0.
2024-06-30 12:20:59 -04:00
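A minimal sketch of the described fallback, assuming total defaults to 0 when the iterable has no `__len__`:

```python
def get_total(iterable) -> int:
  # generators have no __len__, so fall back to 0 (just show the iteration count)
  return len(iterable) if hasattr(iterable, "__len__") else 0

assert get_total([1, 2, 3]) == 3
assert get_total(x for x in range(3)) == 0
```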
chenyu
fd53b6d901 tqdm supports fractional blocks (#5233)
enabled progress bar matching in tests; it matches perfectly now
2024-06-29 22:30:18 -04:00
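A sketch of the idea, assuming the usual approach of full blocks plus one eighth-width partial block for the remainder (not the exact implementation):

```python
def render_bar(frac: float, width: int = 10) -> str:
  filled = frac * width
  full, rem = int(filled), filled - int(filled)
  # pick a partial block from the eighth-width characters for the fractional part
  partial = " ▏▎▍▌▋▊▉"[int(rem * 8)] if full < width else ""
  return (chr(0x2588) * full + partial).ljust(width)

print(render_bar(0.37))  # '███▋' followed by padding
```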
chenyu
ae10ae4722 simplify tqdm scale math (#5231)
expanded the log-of-log expressions
2024-06-29 21:17:40 -04:00
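For context, unit scaling typically picks an SI prefix from a log base 1000 of the value; a standalone sketch of that math (not the commit's exact code):

```python
import math

def scale(n: float) -> str:
  if n <= 0: return f"{n:.2f}"
  # 0 -> no prefix, 1 -> k, 2 -> M, 3 -> G
  k = max(0, min(int(math.log(n, 1000)), 3))
  return f"{n / 1000**k:.2f}{('', 'k', 'M', 'G')[k]}"

assert scale(1_500_000) == "1.50M"
```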
hikettei
ad1ca7da64 [Feature] Added BinaryOps.AND/BinaryOps.OR (#5223)
* [Feature] Added BinaryOps.AND/BinaryOps.OR

* Add: __rand__, __ror__
2024-06-29 17:20:25 -07:00
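A usage sketch under the assumption that these behave like the other elementwise ops, with `__rand__`/`__ror__` covering the reflected case:

```python
from tinygrad import Tensor

a, b = Tensor([True, True, False]), Tensor([True, False, False])
print((a & b).tolist())     # [True, False, False]
print((a | b).tolist())     # [True, True, False]
print((True & a).tolist())  # __rand__ handles a plain bool on the left: [True, True, False]
```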
chenyu
50b05dd3f4 tqdm minor cleanup (#5229)
combined some if branches
2024-06-29 18:58:24 -04:00
chenyu
b2ea610df8 fix tqdm unit_scale and support hours in time (#5227)
* fix tqdm unit_scale and support hours in time

previously it only supported MM:SS.
Added more chars to unit scales, stripped trailing "." and " " in formatting, and added more tests.

* simpler
2024-06-29 14:48:51 -04:00
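A sketch of the extended formatting under the stated behavior (MM:SS before, hours added now); the exact field widths are assumptions:

```python
def fmt_time(t: float) -> str:
  m, s = divmod(int(t), 60)
  h, m = divmod(m, 60)
  # show hours only once elapsed time passes an hour
  return f"{h:d}:{m:02d}:{s:02d}" if h else f"{m:02d}:{s:02d}"

assert fmt_time(75) == "01:15"
assert fmt_time(3 * 3600 + 62) == "3:01:02"
```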
qazal
f374fb77af assert bool dtype for valid [run_process_replay] (#5214)
* valid is always bool

* prevent NumNode to begin with

* part 2

* test: disable pattern matchers, asserts should pass

* test: store without cast

* test: if (0)

* cleanup time

* only pattern match bool literal

* better for upstream debug
2024-06-29 21:20:32 +03:00
qazal
3f4eeb8b54 late UOps.IF generation [run_process_replay] [no_assert] (#5027)
* find all places

* test gates

* test

* gate based on depths

* add ctx

* that cache was so wrong

* delete useless things

* dont double write if

* self.if_cond

* move UOps.IF to gated store

* test_padto_where_multioutput

* test_padto_group

* minor cleanup

* hmm this actually works?

* need a good barrier

* merge 2

* delete ctx

* p1

* maybe p2

* p3

* minor fixup

* fixup 2

* smart thing from the Lowerer branch

* refactoring

* refactoring 2

* maybe before graph_rewrite

* slightly more acceptable Linearizer diff

* more correct

* [run_process_replay] [no_assert]
2024-06-29 12:22:14 -04:00
chenyu
42d1f92fc1 simpler tqdm (#5221)
can do more, but many cases are not tested
2024-06-29 07:41:46 -04:00
nimlgen
dd7eef7d71 libc defs to autogen (#5217)
* libc defs to autogen

* amd import libc

* linter

* better a bit

* remove comment, check this

* not hardcoded path
2024-06-29 14:37:33 +03:00
nimlgen
6b08cb5e38 ptx runs on nv in benchmarks (#5224) 2024-06-29 11:06:44 +03:00
nimlgen
b4c49ae3fa remove cudacpu in favour of mockgpu (#5225)
* remove cudacpu in favour of mockgpu

* remove unused import

* not used as well
2024-06-29 11:05:16 +03:00
nimlgen
ee02dcb98e nv supports PTX=1 (#5222)
* nv supports PTX=1

* not needed

* split nv compiler into nvrtc autogen

* remove to_c_array

* test

* Revert "test"

This reverts commit f0b56f308b.
2024-06-29 10:46:29 +03:00
wozeparrot
7bcb74ab23 feat: tag 0.9.1 (#5220) v0.9.1 2024-06-28 20:16:14 -07:00
George Hotz
7f46bfa587 hotfix: docs touchup 2024-06-28 14:36:20 -07:00
nimlgen
c941a58581 amd refactor queue creation (#5216)
* amd refactor queue creation

* fixes

* use data64_le

* fix linter
2024-06-28 23:24:49 +03:00
chenyu
7ba4938510 simplify View.permute arg check [run_process_replay] (#5218)
it checks if `axis` is a valid permutation, which is the same as `sorted(axis) == list(range(len(self.shape)))`
2024-06-28 16:18:46 -04:00
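A standalone check mirroring the equivalence stated above:

```python
def is_valid_permutation(axis: tuple, shape: tuple) -> bool:
  # axis must be a rearrangement of (0, 1, ..., len(shape) - 1)
  return sorted(axis) == list(range(len(shape)))

assert is_valid_permutation((2, 0, 1), (4, 5, 6))
assert not is_valid_permutation((0, 0, 1), (4, 5, 6))
```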
George Hotz
80ac21200b hotfix: linearizer test fixup 2024-06-28 10:52:25 -07:00
George Hotz
c9714dfcf4 rename graph to children [run_process_replay] (#5215) 2024-06-28 09:53:52 -07:00
kormann
6c456b6d66 remove uopgraph dedup + slight speedup (#5199)
* rm dedup

* rm dedup

* tests

* reduce diff

* oups

* reduce diff

* rm UOp.tuple
2024-06-28 09:26:32 -07:00
nimlgen
9b08a9397c amd inline bf16 funcs (#5212) 2024-06-28 18:45:00 +03:00
chenyu
7090eac8cb validate sdxl output and put it in benchmark (#5211)
* validate sdxl output and put it in benchmark

* don't print fetch progress_bar in CI
2024-06-28 11:40:52 -04:00
chenyu
63fa4e2a0e fix seed = 0 in sdxl (#5209)
removed a few unneeded realize and contiguous calls too
2024-06-28 08:48:59 -04:00
Tobias Fischer
4688f97d48 Add SDXL Inference to Examples (#5206)
* added sdxl inference code

* fixed trailing whitespace

* use original impl code, removed unneeded numpy calls
2024-06-28 07:42:28 -04:00
qazal
3e56c8422c remu err handling (#5208)
* add error handling

* use pre release

* minor

* works
2024-06-28 13:15:18 +03:00
nimlgen
7f7fa26e03 allow hugepage failure in memadvise (#5207) 2024-06-28 11:41:10 +03:00
chenyu
73395b998b better error msg for TinyJit inside TinyJit (#5202)
it's possible to support TinyJit inside TinyJit, but there are edge cases, like two TinyJit functions sharing another TinyJit function, so just give a more precise error for now
2024-06-27 18:09:19 -04:00
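A hypothetical sketch of the nesting this error now reports (the function names and shapes are illustrative):

```python
from tinygrad import Tensor, TinyJit

@TinyJit
def inner(x: Tensor) -> Tensor: return (x * 2).realize()

@TinyJit
def outer(x: Tensor) -> Tensor: return (inner(x) + 1).realize()

# outer(Tensor.randn(4))  # would raise the more precise "TinyJit inside TinyJit" error
```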