Commit Graph

5354 Commits

Author SHA1 Message Date
chenyu
2cadf21684 include "mkdocs" in setup docs (#5798) 2024-07-29 15:54:52 -04:00
chenyu
471b188d79 fix mypy errors in latest mypy (#5794)
* fix mypy errors in latest mypy

mypy has stricter partial and api arg checks now

* PYTHONPATH="."
2024-07-29 14:53:30 -04:00
samm393
573e0f9a48 remove float division from idiv in python_alu (#5777)
* removes float division from idiv in python_alu

* add test

* cleaner logic

* pass clang unsigned literals correctly

* suffix ULL instead of U

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-29 12:14:12 -04:00
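The motivation for removing float division from integer division can be sketched in plain Python (this is an illustration, not tinygrad's actual python_alu code): routing integer division through `/` goes via float64 and loses precision above 2**53, and a backend-style idiv truncates toward zero while Python's `//` floors.

```python
# Hypothetical sketch: truncating integer division using only integer ops,
# avoiding the float path that corrupts large values.

def idiv(a: int, b: int) -> int:
    # Truncate toward zero, like C integer division.
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

big = 2**53 + 1
assert int(big / 1) != big      # float division silently loses the low bit
assert idiv(big, 1) == big      # integer-only path is exact
assert idiv(-7, 2) == -3        # toward zero; Python's -7 // 2 would be -4
```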
samm393
2c94316bd2 ull literal support and test (#5789)
* ull literal support and test

* missing .numpy()
2024-07-29 11:50:49 -04:00
nimlgen
71e1472290 hcq more types (#5791)
* hcq more types

* linter

* pylint

* docs: bind
2024-07-29 18:03:23 +03:00
P4ssenger
9c80f9adf9 fix bug in assert message (#5787) 2024-07-29 15:46:23 +03:00
nimlgen
ab3839a80a cleanup nv/cuda compilers (#5767)
* cleanup nv/cuda compilers

* destroy prog

* small test

* fix test

* nv ptx rewrite key

* jitlink free

* ptx is part of cuda
2024-07-29 13:50:03 +03:00
chenyu
76840fd65a minor ops cleanup [run_process_replay] (#5786) 2024-07-29 02:30:38 -04:00
chenyu
e7a14f398e more uop_symbolic tests for divmod pairs (#5785) 2024-07-28 21:27:06 -04:00
George Hotz
76d191ab94 move consts to end of add (#5783)
* move consts to end of add

* better

* fix infinite loop
2024-07-28 17:38:57 -07:00
George Hotz
5b84a7db1a hotfix: ptx threads match cuda threads 2024-07-28 16:53:24 -07:00
chenyu
460b120d62 apply more .alu syntactic sugar [run_process_replay] (#5782) 2024-07-28 19:43:48 -04:00
George Hotz
0392123e6e TC=2 still sets tensor cores (and TC=3 support for locals) (#5780)
* TC=2 still sets tensor cores

* add TC=3 support for using locals

* bugfix

* lines + TC=3 tests

* CUDA can use threads, fix fuzz linearizer
2024-07-28 16:16:53 -07:00
chenyu
71a64d8252 UOps.MUL bound when one is negative (#5781)
* UOps.MUL bound when one is negative

also one more distribute_mul rule

* don't always expand
2024-07-28 19:02:47 -04:00
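The bound fix above follows the standard interval-arithmetic rule for a product: when either operand's range can be negative, taking only lo*lo and hi*hi is wrong, and all four endpoint products must be considered. A minimal sketch (not tinygrad's actual UOp code):

```python
# Interval bound for x*y given x in [a0, a1] and y in [b0, b1].
def mul_bounds(a: tuple[int, int], b: tuple[int, int]) -> tuple[int, int]:
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return min(products), max(products)

assert mul_bounds((-2, 3), (4, 5)) == (-10, 15)   # naive lo*lo, hi*hi would give (-8, 15)
assert mul_bounds((-3, -1), (-4, 2)) == (-6, 12)  # endpoint products: 12, -6, 4, -2
```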
qazal
b775db6b60 high-level benchmark timing diff (#5776)
* high level timings

benchmark times

fix defs

* use the name map

* skip last task
2024-07-28 23:42:57 +03:00
chenyu
600a39771d fix Tensor.arange if (stop-start) and step have different signs (#5775) 2024-07-28 14:34:10 -04:00
David González Martínez
d0fd84e617 feat: allow passing gradient to .backward() to compute vjp (#5771)
* feat: allow passing gradient to .backward() to compute vjp

* fix

* refactor

* fix trailing whitespace
2024-07-28 11:13:18 -07:00
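Conceptually, passing a gradient to backward computes a vector-Jacobian product: for outputs y = f(x) and a cotangent v, it yields J^T v instead of assuming v is all ones. A pure-Python illustration for a tiny two-output function (tinygrad's autograd does this through the graph; the function here is made up for the example):

```python
# f(x0, x1) = (x0 * x1, x0 + x1) has Jacobian [[x1, x0], [1, 1]].
# backward with cotangent (v0, v1) corresponds to J^T v, written out by hand.

def vjp_f(x0: float, x1: float, v0: float, v1: float) -> tuple[float, float]:
    d_x0 = x1 * v0 + 1.0 * v1
    d_x1 = x0 * v0 + 1.0 * v1
    return d_x0, d_x1

assert vjp_f(2.0, 3.0, 1.0, 1.0) == (4.0, 3.0)
assert vjp_f(2.0, 3.0, 1.0, 0.0) == (3.0, 2.0)  # gradient of the product output only
```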
qazal
e0e7293b0a make process replay unique in retries [run_process_replay] (#5773) 2024-07-28 20:44:15 +03:00
nimlgen
ea27ec4cd0 nv switch classlist_v2 to classlist (#5763)
* nv switch classlist_v2 to classlist

* support in mockgpu

* fix mockgpu
2024-07-28 20:24:42 +03:00
nimlgen
73fda023d3 amd better comments for ENABLE_SGPR_DISPATCH_PTR (#5768)
* amd better comments for ENABLE_SGPR_DISPATCH_PTR

* fix linter
2024-07-28 16:23:38 +03:00
qazal
95dda8dadf more unmatching vectorize/gep asserts [run_process_replay] (#5760)
* merge vectorize/gep rules [run_process_replay]

* assert dtypes

* src=

* float2=(float4.x,float4.y)
2024-07-28 15:08:54 +08:00
chenyu
bfbd7c5461 more generic UOp mul mod folding (#5765) 2024-07-27 20:20:35 -04:00
chenyu
80c6475757 update test_uop_symbolic to test UOp min and max (#5764)
covers #5750, #5748, #5741
2024-07-27 19:53:21 -04:00
nimlgen
1903542c2d nv/cuda compilers touchup (#5759)
* nv/cuda compilers touchup

* fix cuda check + move nv disasm

* remove includes

* fix nvrtc_check
2024-07-28 00:15:28 +03:00
chenyu
3c79faaf77 remove redundant UOps max folding [run_process_replay] (#5762)
all covered by generic max folding
2024-07-27 16:46:51 -04:00
chenyu
05748e5a84 fix vmax of Uop.RANGE off by 1 (#5750)
with this, can remove several redundant max folding rules, do it separately to check kernel diff
2024-07-27 16:30:46 -04:00
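The off-by-one being fixed is the classic one, stated in plain Python terms: a range over n iterations takes values 0 through n-1, so its maximum value is n-1, not n.

```python
# Maximum value produced by a range of n iterations (illustrative one-liner).
def range_vmax(n: int) -> int:
    return n - 1

assert range_vmax(10) == max(range(10)) == 9
```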
nimlgen
fff19b961b docs: user runtime docs (#5756) 2024-07-27 23:21:54 +03:00
nimlgen
5d53fa491b amd autogened kfd ioctls (#5757)
* amd autogened kio

* unused import

* linter
2024-07-27 22:49:48 +03:00
nimlgen
ed1d784077 test profiler timer sync across devs (#5751)
* test profiler timer sync across devs

* more correct

* typo
2024-07-27 16:47:37 +03:00
qazal
e5fb08acbc simpler expand UOps acc [run_process_replay] (#5754) 2024-07-27 15:20:56 +03:00
gswangg
de66d93859 PTX render vec CONST (#5729)
* dedupe PTX vec CONST render

* fix linter errors

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-07-27 13:40:19 +03:00
qazal
890e11ce11 fix UOps.STORE folding returning NOp [run_process_replay] (#5753) 2024-07-27 13:32:54 +03:00
qazal
3e49d86c01 process replay diffs 3 things now (#5731)
* github api infra

* process replay is 3 parts now

* parse benchmarks

* add gh_token

* complete diff

* move process replay tests

* last successful run

* add tempdir

* skip master
2024-07-27 12:52:20 +03:00
qazal
57b4a8e98d assert process replay asserts (#5737)
* assert process replay asserts

* one ci job is fine

* test: Revert "separate process replay main loop (#5734)"

This reverts commit 94d578396f.

* mac sed needs that

* Revert "test: Revert "separate process replay main loop (#5734)""

This reverts commit e4ad7684d5.

* disable process replay capture

* save time

* amd is tiny

* send to /dev/null
2024-07-27 12:07:50 +03:00
George Hotz
f8972ace38 test flops (and allow wide ALU in UOps) [run_process_replay] (#5749)
* flops test in external_test_speed_theoretical.py

* test speed theo

* min SZMAX

* allow wide ALU for things that support it

* needed for mypy
2024-07-26 21:07:28 -07:00
George Hotz
2fde2d2914 hotfix: external_test_speed_theoretical works on 24GB 2024-07-26 18:41:52 -07:00
chenyu
b75d1e8793 UOp._min_max for IDIV (#5748) 2024-07-26 21:40:16 -04:00
George Hotz
829262a5ee add external_test_speed_theoretical 2024-07-26 17:45:22 -07:00
chenyu
5f168e7499 remove the optimization in AndNode.substitute (#5747)
was used in the old linearizer but no longer needed. still need substitute because some fuzz tests call sym_infer on AndNode
2024-07-26 20:08:07 -04:00
kormann
c50e354936 NOp clean up any_len passing [run_process_replay] (#5743)
* clean allow_any_len

* min
2024-07-26 17:00:31 -07:00
George Hotz
db1d093b29 reenable LLaMA-3 8B BEAM on NV (#5746) 2024-07-26 16:56:41 -07:00
chenyu
c6b2d96474 minor uop uopgraph cleanups (#5745) 2024-07-26 19:23:48 -04:00
chenyu
3686b6726a move GraphException to jit.py (#5744)
same place where GraphRunner is defined
2024-07-26 19:01:12 -04:00
kormann
a5ede535ef NOp field name [run_process_replay] (#5742)
* rm def name

* add field name
2024-07-26 18:45:59 -04:00
chenyu
0d7d4dd731 UOp._min_max for MUL and MOD (#5741) 2024-07-26 18:38:10 -04:00
George Hotz
c50e374bb6 multiple locals + get_kernel_modifier + fix valid (#5739)
* multiple locals + get_kernel_modifier + fix valid

* fix test pattern matcher
2024-07-26 15:10:10 -07:00
nimlgen
f6c0e17a2c optimize symbolic-related updates in graphs (#5727)
* try

* faster

* cleaner

* better?

* better?

* cleaner

* fixes

* unused

* mypy

* fix clang

* remove comment

* better var names

* rename

* fix cuda

* rename
2024-07-27 00:57:59 +03:00
chenyu
dc7483ee6f UOp simple div folding (#5740)
made UOp.divides return the Optional[quotient] and used it for simple div folding
2024-07-26 17:14:32 -04:00
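The shape of the API change described above can be sketched as follows (hypothetical helper names, not tinygrad's actual UOp methods): returning the quotient when division is exact, and None otherwise, makes the div-folding rule a one-liner.

```python
from typing import Optional

def divides(x: int, c: int) -> Optional[int]:
    # Exact quotient if c divides x, else None.
    return x // c if x % c == 0 else None

def fold_div(const_factor: int, divisor: int) -> Optional[int]:
    # e.g. folding the constant part of (a * 8) // 4 down to a * 2.
    return divides(const_factor, divisor)

assert divides(12, 4) == 3
assert divides(12, 5) is None
assert fold_div(8, 4) == 2
```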
chenyu
671259417f reuse UOp __repr__ for NOp (#5738) 2024-07-26 16:59:55 -04:00
kormann
b0c1dba299 named UOp class "NOP" [run_process_replay] (#5728)
* NOP

* fix const + simplify compile

* rm VAR for NOOP

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-07-26 13:25:53 -07:00