George Hotz
bc178d14a9
matmul example on metal showing off tensor core ( #13033 )
* matmul example on metal showing off tensor core
* flip the args of placeholder
* mat_idx
* imp
2025-10-31 19:40:36 +08:00
George Hotz
e066b3176b
hotfix: types and names for custom kernel test
2025-10-31 17:34:55 +08:00
George Hotz
54f48f93c6
working backward pass in custom kernel ( #13032 )
* working backward pass in custom kernel
* custom_kernel tensor method
* no SPEC=2
2025-10-31 17:26:18 +08:00
George Hotz
b791d70725
support custom UOp kernels ( #13028 )
* support custom UOp kernels
* no number
* multioutput works
* backward kernel runs
* move kernel class
* grad later
* work
* no tags in kernel graph
* test arange
* arange + contig
* delete comment
2025-10-31 15:51:39 +08:00
chenyu
f6430a0559
add script for one slow openpilot conv ( #12953 )
* add script for one slow openpilot conv
* fix ruff
2025-10-30 18:08:41 -04:00
Sieds Lykles
4c8362128b
New symbolic renderer + strip parens ( #13017 )
* new uop renderer
* better tester
* strip parens
* update tests
* split method check_uop_against_string
* use ctx.update instead of add_rendered method
* strip parens based on precedence
* update test
* new symbolic renderer
* add comment
2025-10-30 16:41:32 +01:00
George Hotz
e456f2cb1e
more uop programs ( #13007 )
* more uop programs
* test_matmul_relu
* tests fix
2025-10-30 14:57:59 +08:00
George Hotz
e64d4b3b44
uops programs ( #13005 )
* uops programs
* work
* work
* more syntax
* more syntax
* comments
2025-10-30 12:28:10 +08:00
George Hotz
2da02f1ae1
add loads at the end ( #12988 )
* add loads at the end
* simpler
* late load
* tests passing
* fix matvec
* spec test passes
* fix where on load
* fix abs2
* fix more tests
2025-10-30 10:42:19 +08:00
nimlgen
4b001ec723
amd: pmc in mockgpu ( #13000 )
* amd: pmc in mockgpu
* fix
* do not open in ci
2025-10-30 01:52:02 +08:00
Sieds Lykles
70bce62c67
don't collapse possibly empty symbolic range ( #12994 )
* don't collapse a symbolic range based on min/max
* refactor z3 renderer
* include sink explicitly instead of dtypes.void
* use dtype.scalar()
2025-10-29 12:17:09 +01:00
George Hotz
819592ee67
hotfix: disable DoubleMatmul for PTX
2025-10-29 16:37:17 +08:00
George Hotz
30ca3f2af8
all double matmul ( #12993 )
* fix more double matmuls
* a few more
* all double matmul passes
* opts for flash attention
* fix spec
* comment
2025-10-29 16:25:27 +08:00
Sieds Lykles
9f39f6391c
shared_codegen_spec and fix index spec ( #12967 )
* split shared_codegen_spec and fix index
* add VCONST to program_spec and move index to shared_codegen_spec
* working ignore_oob=0
* cleanup
* fix spec
* undo that
* move barrier and special earlier
* fix more spec issues
* more updates
* remove special from program_spec
* cleanup and fixes
* move more to shared
* special is not in shared_spec
* some comments
* don't do a bounds check there
2025-10-29 09:14:11 +01:00
George Hotz
1c362736aa
fix more double matmuls ( #12991 )
* fix more double matmuls
* a few more
2025-10-29 16:09:48 +08:00
George Hotz
8c47cf4323
pcontig double matmul works ( #12899 )
* pcontig double matmul works
* tests
* contract
* closer
* works-ish
* add that broadcast
* 2 more work
* something
* disable broken ones
* llvm
* align 16
2025-10-29 13:06:43 +08:00
George Hotz
b147e7e8e6
flatten bufferize ( #12984 )
* flatten bufferize
* simpler
* tests pass
* flat
* not flat
2025-10-29 11:23:43 +08:00
chenyu
ef16e6c68c
unwrap instead of cast [pr] ( #12982 )
2025-10-28 21:29:23 -04:00
George Hotz
5e01cc299b
zero len ranges fail ( #12974 )
* zero len ranges fail
* fix Python backend
* fix llvm
* fix ptx
* yolo fix nir
* this works...
* always store...
* always store...
* Revert "always store..."
This reverts commit 0816cf344d.
2025-10-28 22:49:55 +08:00
George Hotz
f5a3b33d33
add fun with nhwc convs
2025-10-28 17:12:22 +08:00
George Hotz
907499b02c
clean up GROUP/SINK ( #12969 )
* clean up GROUP/SINK
* fix end
* range_str color
2025-10-28 16:08:10 +08:00
Sieds Lykles
e22c5e7e73
process_replay uses opts argument for KernelInfo.opts_to_apply ( #12946 )
* opts_to_apply is opts
* skip beamed kernels
* simpler change
* fix the tensor cores tests for process replay
* use opts
2025-10-28 09:00:28 +01:00
George Hotz
b0da173f2f
add unique to const, fix longstanding bug ( #12965 )
* add unique to const, fix longstanding bug
* _force_unique=True
* fix tests
* fix more tests
2025-10-28 15:11:37 +08:00
Sieds Lykles
e110f4632a
split cat (on cpu) ( #12864 )
* split ranges but only on cpu
* except KernelOptError for threads
* use GROUP and END
* no more flatten_range needed
* remove noop end
* always process replay for openpilot
* update test
* skip test
* fix in outs calculation
With the new linearizer the toposort is a problem; this matches the spec now.
* undo that
2025-10-28 07:55:19 +01:00
George Hotz
4d817a289e
simplify spec ( #12958 )
* simplify spec
* more
2025-10-28 09:52:32 +08:00
chenyu
a79832b01f
control_flow.py -> linearizer.py [pr] ( #12948 )
2025-10-27 12:38:13 -04:00
George Hotz
25c2da1579
check SPEC=2 in CI ( #12945 )
* check SPEC=2 in CI
* split SPEC=2
* fast enough
2025-10-27 21:53:57 +08:00
George Hotz
701a632907
move VECTORIZE/CONST ( #12942 )
2025-10-27 17:37:13 +08:00
George Hotz
804133cffd
rename RECIP to RECIPROCAL ( #12939 )
2025-10-27 16:53:13 +08:00
chenyu
e18922f111
limit AND const min max to ints [pr] ( #12918 )
2025-10-25 16:07:52 -04:00
George Hotz
b4f6a2c7a3
add kernel spec ( #12911 )
* add kernel spec
* fix kernel spec
2025-10-25 11:49:20 +08:00
George Hotz
8a941d95a4
SPEC=2 is full spec, SPEC=1 is default ( #12910 )
* SPEC=1 passes all tests
* just use SPEC, not __debug__
2025-10-25 11:10:43 +08:00
chenyu
4b7329001d
clean up test_avg_pool3d ( #12905 )
2025-10-24 14:31:36 -04:00
George Hotz
6b35467f53
stores don't end ranges ( #12902 )
* early endrange
* bugfixes
2025-10-24 23:05:03 +08:00
Sieds Lykles
e1f8c82938
Onnx Layer/Group/RMS/Batch-Norm ReduceL2 fp32 intermediates for fp16 ( #12109 )
* match onnx spec
* use least_upper_dtype
* promote the square
* just cast before the square
2025-10-24 12:26:11 +02:00
George Hotz
0bde87d8d7
cleanups from flash attention branch ( #12897 )
2025-10-24 14:14:56 +08:00
wozeparrot
9dac505565
variable bs keccak ( #10731 )
2025-10-23 14:10:21 -07:00
Sieds Lykles
c1db62ff7c
move reduce collapse to rangeify ( #12845 )
2025-10-23 15:44:17 +02:00
George Hotz
ff68a6263b
move locals into codegen (dedup works) ( #12885 )
* move locals into codegen (dedup works)
* move in optimize
2025-10-23 17:07:39 +08:00
George Hotz
ddb53d1d48
PCONTIG=3 both saves ram and flops ( #12884 )
* PCONTIG=3 both saves ram and flops
* group
* gate locals
* should be correct
2025-10-23 16:37:26 +08:00
George Hotz
e85cee0aad
flip Ops.END srcs ( #12882 )
* flip Ops.END srcs
* backward
* late end split
2025-10-23 12:47:50 +08:00
George Hotz
74b4cfe44b
Ops.GROUP + range check ( #12880 )
* simpler
* fix that
* Ops.GROUP + range check
* fix bugs
* fix linter
* fix test
2025-10-23 12:05:21 +08:00
George Hotz
7762b3558b
clean up the spec ( #12868 )
* tighten up the spec
* move validate into a different file
* that moved to validate
* after(barr)
2025-10-22 19:50:42 +08:00
George Hotz
726988fa4b
late ifs try 2 ( #12865 )
* late ifs try 2
* fix image
* fix that test
* panic
* ptx fixups
* preserve toposort
* those pass locally
* Revert "those pass locally"
This reverts commit 063409f828.
* no ls
* make that explicit
2025-10-22 18:49:27 +08:00
Sieds Lykles
8d0256c46b
Move gate to load for loaded index ( #12861 )
* change condition
* change test to better represent how the uop looks irl
2025-10-22 09:53:07 +02:00
George Hotz
92778c7a8b
rename opts to ren, add store ranges back ( #12856 )
* rename opts to ren
* fix docs and bring store back
2025-10-22 09:15:38 +08:00
b1tg
60d7e232f2
cuda fp8 ( #12782 )
* cuda fp8
* tensor core
* tc test
* clean
* clean pm
2025-10-21 15:05:25 -04:00
chenyu
8baa61bd67
use torch 2.9 and its Muon in test ( #12773 )
* use torch 2.9 and its Muon in test
* relax and disable
2025-10-21 13:35:17 -04:00
chenyu
f51f9aaa16
muon ns_params -> ns_coefficients ( #12850 )
match the official torch one
2025-10-21 12:35:52 -04:00
wozeparrot
62e7b8b870
feat: just use compile3 ( #12849 )
2025-10-21 07:56:50 -07:00