chenyu
2cadf21684
include "mkdocs" in setup docs (#5798)
2024-07-29 15:54:52 -04:00
chenyu
471b188d79
fix mypy errors in latest mypy (#5794)
...
* fix mypy errors in latest mypy
mypy has stricter partial and api arg checks now
* PYTHONPATH="."
2024-07-29 14:53:30 -04:00
samm393
573e0f9a48
remove float division from idiv in python_alu (#5777)
...
* removes float division from idiv in python_alu
* add test
* cleaner logic
* pass clang unsigned literals correctly
* suffix ULL instead of U
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-29 12:14:12 -04:00
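The float-free idiv in the entry above can be sketched as truncating division built from pure integer ops (a minimal illustrative stand-in, not python_alu's actual code):

```python
def idiv(a: int, b: int) -> int:
    # C-style truncating division without any float ops:
    # divide the magnitudes with floor division, then restore the sign.
    if b == 0:
        raise ZeroDivisionError("integer division by zero")
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

assert idiv(7, 2) == 3
assert idiv(-7, 2) == -3   # Python's -7 // 2 would give -4 (floor, not truncate)
assert idiv(7, -2) == -3
```

The point of avoiding `int(a / b)` is that float division loses precision once the operands exceed 2**53.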
samm393
2c94316bd2
ull literal support and test (#5789)
...
* ull literal support and test
* missing .numpy()
2024-07-29 11:50:49 -04:00
nimlgen
71e1472290
hcq more types (#5791)
...
* hcq more types
* linter
* pylint
* docs: bind
2024-07-29 18:03:23 +03:00
P4ssenger
9c80f9adf9
fix bug in assert message (#5787)
2024-07-29 15:46:23 +03:00
nimlgen
ab3839a80a
cleanup nv/cuda compilers (#5767)
...
* cleanup nv/cuda compilers
* destroy prog
* small test
* fix test
* nv ptx rewrite key
* jitlink free
* ptx is part of cuda
2024-07-29 13:50:03 +03:00
chenyu
76840fd65a
minor ops cleanup [run_process_replay] (#5786)
2024-07-29 02:30:38 -04:00
chenyu
e7a14f398e
more uop_symbolic tests for divmod pairs (#5785)
2024-07-28 21:27:06 -04:00
George Hotz
76d191ab94
move consts to end of add (#5783)
...
* move consts to end of add
* better
* fix infinite loop
2024-07-28 17:38:57 -07:00
George Hotz
5b84a7db1a
hotfix: ptx threads match cuda threads
2024-07-28 16:53:24 -07:00
chenyu
460b120d62
apply more .alu syntactic sugar [run_process_replay] (#5782)
2024-07-28 19:43:48 -04:00
George Hotz
0392123e6e
TC=2 still sets tensor cores (and TC=3 support for locals) (#5780)
...
* TC=2 still sets tensor cores
* add TC=3 support for using locals
* bugfix
* lines + TC=3 tests
* CUDA can use threads, fix fuzz linearizer
2024-07-28 16:16:53 -07:00
chenyu
71a64d8252
UOps.MUL bound when one is negative (#5781)
...
* UOps.MUL bound when one is negative
also one more distribute_mul rule
* don't always expand
2024-07-28 19:02:47 -04:00
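Bounding a product when an operand can be negative is classic interval arithmetic: the extremes are always attained at corner products. A small sketch of the idea (illustrative only, not tinygrad's `_min_max` code):

```python
def mul_bounds(lo1: int, hi1: int, lo2: int, hi2: int) -> tuple:
    # for x in [lo1, hi1] and y in [lo2, hi2], the min and max of x*y
    # are among the four corner products, even when bounds are negative.
    corners = [lo1 * lo2, lo1 * hi2, hi1 * lo2, hi1 * hi2]
    return min(corners), max(corners)

assert mul_bounds(-3, 2, 4, 5) == (-15, 10)   # negative lower bound flips the min
assert mul_bounds(-2, -1, -3, -1) == (1, 6)   # two negatives give a positive range
```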
qazal
b775db6b60
high-level benchmark timing diff (#5776)
...
* high level timings
benchmark times
fix defs
* use the name map
* skip last task
2024-07-28 23:42:57 +03:00
chenyu
600a39771d
fix Tensor.arange if (stop-start) and step have different signs (#5775)
2024-07-28 14:34:10 -04:00
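The intended semantics of that arange fix: when (stop-start) and step have opposite signs the result should be empty, so the element count must clamp at zero. A hypothetical sketch of the length computation:

```python
import math

def arange_len(start: float, stop: float, step: float) -> int:
    # numpy-style arange length; the max(0, ...) clamp is what makes
    # opposite-sign (stop-start) and step yield an empty range.
    return max(0, math.ceil((stop - start) / step))

assert arange_len(0, 10, 2) == 5    # 0, 2, 4, 6, 8
assert arange_len(10, 0, -3) == 4   # 10, 7, 4, 1
assert arange_len(0, 10, -1) == 0   # signs differ -> empty
```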
David González Martínez
d0fd84e617
feat: allow passing gradient to .backward() to compute vjp (#5771)
...
* feat: allow passing gradient to .backward() to compute vjp
* fix
* refactor
* fix trailing whitespace
2024-07-28 11:13:18 -07:00
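Passing a gradient (cotangent) to .backward() amounts to a vector-Jacobian product. A minimal dense sketch of what a vjp computes, independent of tinygrad's Tensor API (names here are hypothetical):

```python
def vjp(jacobian: list, v: list) -> list:
    # v^T @ J: weight each row of the Jacobian by the cotangent v and
    # sum down the rows, giving the gradient with respect to the inputs.
    rows, cols = len(jacobian), len(jacobian[0])
    return [sum(v[i] * jacobian[i][j] for i in range(rows)) for j in range(cols)]

# y = (2*x0, 3*x1) has Jacobian [[2, 0], [0, 3]]; backward with
# gradient [1, 1] yields dL/dx = [2, 3].
assert vjp([[2, 0], [0, 3]], [1, 1]) == [2, 3]
```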
qazal
e0e7293b0a
make process replay unique in retries [run_process_replay] (#5773)
2024-07-28 20:44:15 +03:00
nimlgen
ea27ec4cd0
nv switch classlist_v2 to classlist (#5763)
...
* nv switch classlist_v2 to classlist
* support in mockgpu
* fix mockgpu
2024-07-28 20:24:42 +03:00
nimlgen
73fda023d3
amd better comments for ENABLE_SGPR_DISPATCH_PTR (#5768)
...
* amd better comments for ENABLE_SGPR_DISPATCH_PTR
* fix linter
2024-07-28 16:23:38 +03:00
qazal
95dda8dadf
more unmatching vectorize/gep asserts [run_process_replay] (#5760)
...
* merge vectorize/gep rules [run_process_replay]
* assert dtypes
* src=
* float2=(float4.x,float4.y)
2024-07-28 15:08:54 +08:00
chenyu
bfbd7c5461
more generic UOp mul mod folding (#5765)
2024-07-27 20:20:35 -04:00
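One identity of the kind mul-mod folding exploits (illustrative; not necessarily the PR's exact rewrite rule): for m, c > 0, `(x*c) % (m*c)` reduces to `(x % m) * c`, letting a symbolic mod be rewritten in terms of a smaller one.

```python
def fold_mul_mod(x: int, m: int, c: int) -> int:
    # folded form of (x*c) % (m*c), valid for m, c > 0
    return (x % m) * c

# brute-force check of the identity, including negative x
for x in range(-20, 20):
    for m in (2, 3, 5):
        for c in (1, 2, 4):
            assert (x * c) % (m * c) == fold_mul_mod(x, m, c)
```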
chenyu
80c6475757
update test_uop_symbolic to test UOp min and max (#5764)
...
covers #5750, #5748, #5741
2024-07-27 19:53:21 -04:00
nimlgen
1903542c2d
nv/cuda compilers touchup (#5759)
...
* nv/cuda compilers touchup
* fix cuda check + move nv disasm
* remove includes
* fix nvrtc_check
2024-07-28 00:15:28 +03:00
chenyu
3c79faaf77
remove redundant UOps max folding [run_process_replay] (#5762)
...
all covered by generic max folding
2024-07-27 16:46:51 -04:00
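The generic max folding that subsumes the removed rules works off value bounds: if the bounds already decide the comparison, the max disappears. A hypothetical sketch:

```python
def fold_max(vmin: int, vmax: int, c: int):
    # x is known to lie in [vmin, vmax]:
    # max(x, c) folds to the constant when x can never exceed c,
    # and to x itself when x can never fall below c.
    if vmax <= c: return "const"
    if vmin >= c: return "x"
    return None  # bounds overlap c: cannot fold

assert fold_max(0, 5, 10) == "const"
assert fold_max(3, 9, 2) == "x"
assert fold_max(0, 5, 3) is None
```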
chenyu
05748e5a84
fix vmax of UOp.RANGE off by 1 (#5750)
...
with this, can remove several redundant max folding rules, do it separately to check kernel diff
2024-07-27 16:30:46 -04:00
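The off-by-one is the classic half-open range: a loop index from RANGE(start, end) takes the values start .. end-1, so its vmax is end-1, not end. Sketched (illustrative, not the UOp code):

```python
def range_min_max(start: int, end: int) -> tuple:
    # half-open range [start, end): the largest value the index takes is end-1
    return start, end - 1

assert range_min_max(0, 8) == (0, 7)
assert max(range(0, 8)) == 7   # matches Python's own half-open range
```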
nimlgen
fff19b961b
docs: user runtime docs (#5756)
2024-07-27 23:21:54 +03:00
nimlgen
5d53fa491b
amd autogened kfd ioctls (#5757)
...
* amd autogened kio
* unused import
* linter
2024-07-27 22:49:48 +03:00
nimlgen
ed1d784077
test profiler timer sync across devs (#5751)
...
* test profiler timer sync across devs
* more correct
* typo
2024-07-27 16:47:37 +03:00
qazal
e5fb08acbc
simpler expand UOps acc [run_process_replay] (#5754)
2024-07-27 15:20:56 +03:00
gswangg
de66d93859
PTX render vec CONST (#5729)
...
* dedupe PTX vec CONST render
* fix linter errors
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-07-27 13:40:19 +03:00
qazal
890e11ce11
fix UOps.STORE folding returning NOp [run_process_replay] (#5753)
2024-07-27 13:32:54 +03:00
qazal
3e49d86c01
process replay diffs 3 things now (#5731)
...
* github api infra
* process replay is 3 parts now
* parse benchmarks
* add gh_token
* complete diff
* move process replay tests
* last successful run
* add tempdir
* skip master
2024-07-27 12:52:20 +03:00
qazal
57b4a8e98d
assert process replay asserts (#5737)
...
* assert process replay asserts
* one ci job is fine
* test: Revert "separate process replay main loop (#5734)"
This reverts commit 94d578396f.
* mac sed needs that
* Revert "test: Revert "separate process replay main loop (#5734)""
This reverts commit e4ad7684d5.
* disable process replay capture
* save time
* amd is tiny
* send to /dev/null
2024-07-27 12:07:50 +03:00
George Hotz
f8972ace38
test flops (and allow wide ALU in UOps) [run_process_replay] (#5749)
...
* flops test in external_test_speed_theoretical.py
* test speed theo
* min SZMAX
* allow wide ALU for things that support it
* needed for mypy
2024-07-26 21:07:28 -07:00
George Hotz
2fde2d2914
hotfix: external_test_speed_theoretical works on 24GB
2024-07-26 18:41:52 -07:00
chenyu
b75d1e8793
UOp._min_max for IDIV (#5748)
2024-07-26 21:40:16 -04:00
George Hotz
829262a5ee
add external_test_speed_theoretical
2024-07-26 17:45:22 -07:00
chenyu
5f168e7499
remove the optimization in AndNode.substitute (#5747)
...
was used in the old linearizer but no longer needed. still need substitute because some fuzz tests call sym_infer on AndNode
2024-07-26 20:08:07 -04:00
kormann
c50e354936
NOp clean up any_len passing [run_process_replay] (#5743)
...
* clean allow_any_len
* min
2024-07-26 17:00:31 -07:00
George Hotz
db1d093b29
reenable LLaMA-3 8B BEAM on NV (#5746)
2024-07-26 16:56:41 -07:00
chenyu
c6b2d96474
minor uop uopgraph cleanups (#5745)
2024-07-26 19:23:48 -04:00
chenyu
3686b6726a
move GraphException to jit.py (#5744)
...
same place where GraphRunner is defined
2024-07-26 19:01:12 -04:00
kormann
a5ede535ef
NOp field name [run_process_replay] (#5742)
...
* rm def name
* add field name
2024-07-26 18:45:59 -04:00
chenyu
0d7d4dd731
UOp._min_max for MUL and MOD (#5741)
2024-07-26 18:38:10 -04:00
George Hotz
c50e374bb6
multiple locals + get_kernel_modifier + fix valid (#5739)
...
* multiple locals + get_kernel_modifier + fix valid
* fix test pattern matcher
2024-07-26 15:10:10 -07:00
nimlgen
f6c0e17a2c
optimize symbolic-related updates in graphs (#5727)
...
* try
* faster
* cleaner
* better?
* better?
* cleaner
* fixes
* unused
* mypy
* fix clang
* remove comment
* better var names
* rename
* fix cuda
* rename
2024-07-27 00:57:59 +03:00
chenyu
dc7483ee6f
UOp simple div folding (#5740)
...
made UOp.divides return the Optional[quotient] and used it for simple div folding
2024-07-26 17:14:32 -04:00
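A divides helper that returns the Optional[quotient] lets the caller test divisibility and reuse the quotient in one step. An integer-level sketch (the real UOp.divides operates on UOps, not plain ints):

```python
from typing import Optional

def divides(num: int, divisor: int) -> Optional[int]:
    # return the exact quotient when divisor evenly divides num, else None
    if divisor != 0 and num % divisor == 0:
        return num // divisor
    return None

# simple div folding: (x*6)//3 can fold to x*2 because divides(6, 3) == 2
assert divides(6, 3) == 2
assert divides(7, 3) is None
```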
chenyu
671259417f
reuse UOp __repr__ for NOp (#5738)
2024-07-26 16:59:55 -04:00
kormann
b0c1dba299
named UOp class "NOP" [run_process_replay] (#5728)
...
* NOP
* fix const + simplify compile
* rm VAR for NOOP
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-07-26 13:25:53 -07:00