4847 Commits

George Hotz
bc178d14a9 matmul example on metal showing off tensor core (#13033)
* matmul example on metal showing off tensor core

* flip the args of placeholder

* mat_idx

* imp
2025-10-31 19:40:36 +08:00
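
A minimal sketch of the kind of half-precision matmul that exercises the tensor-core path on the Metal backend; the shapes and the METAL/DEBUG flags are illustrative assumptions, not the exact example added in #13033.

    # illustrative: run as METAL=1 DEBUG=2 python3 example.py to see the generated kernel
    from tinygrad import Tensor, dtypes
    a = Tensor.rand(512, 512, dtype=dtypes.half)
    b = Tensor.rand(512, 512, dtype=dtypes.half)
    c = (a @ b).realize()     # a matmul with shapes/dtype that can hit the tensor-core path
    print(c.numpy()[0][:4])
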
George Hotz
e066b3176b hotfix: types and names for custom kernel test 2025-10-31 17:34:55 +08:00
George Hotz
54f48f93c6 working backward pass in custom kernel (#13032)
* working backward pass in custom kernel

* custom_kernel tensor method

* no SPEC=2
2025-10-31 17:26:18 +08:00
George Hotz
b791d70725 support custom UOp kernels (#13028)
* support custom UOp kernels

* no number

* multioutput works

* backward kernel runs

* move kernel class

* grad later

* work

* no tags in kernel graph

* test arange

* arange + contig

* delete comment
2025-10-31 15:51:39 +08:00
chenyu
f6430a0559 add script for one slow openpilot conv (#12953)
* add script for one slow openpilot conv

* fix ruff
2025-10-30 18:08:41 -04:00
Sieds Lykles
4c8362128b New symbolic renderer + strip parens (#13017)
* new uop renderer

* better tester

* strip parens

* update tests

* split method check_uop_against_string

* use ctx.update instead of add_rendered method

* strip parens based on precedence

* update test

* new symbolic renderer

* add comment
2025-10-30 16:41:32 +01:00
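
A rough illustration of the kind of symbolic expression the renderer has to print; the variable name and bounds are made up, and the top-level Variable import and the render() call are assumptions based on how tinygrad's symbolic tests usually look.

    from tinygrad import Variable
    i = Variable("i", 0, 127)     # a bound symbolic variable
    expr = (i * 4 + 2) // 2       # arithmetic on it builds a symbolic UOp expression
    print(expr.render())          # the renderer decides where parentheses are actually needed
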
George Hotz
e456f2cb1e more uop programs (#13007)
* more uop program

* test_matmul_relu

* tests fix
2025-10-30 14:57:59 +08:00
George Hotz
e64d4b3b44 uops programs (#13005)
* uops programs

* work

* work

* more syntax

* more syntax

* comments
2025-10-30 12:28:10 +08:00
George Hotz
2da02f1ae1 add loads at the end (#12988)
* add loads at the end

* simpler

* late load

* tests passing

* fix matvec

* spec test passes

* fix where on load

* fix abs2

* fix more tests
2025-10-30 10:42:19 +08:00
nimlgen
4b001ec723 amd: pmc in mockgpu (#13000)
* amd: pmc in mockgpu

* fix

* do not open in ci
2025-10-30 01:52:02 +08:00
Sieds Lykles
70bce62c67 don't collapse possibly empty symbolic range (#12994)
* don't collapse a symbolic range based on min/max

* refactor z3 renderer

* include sink explicitly instead of dtypes.void

* use dtype.scalar()
2025-10-29 12:17:09 +01:00
George Hotz
819592ee67 hotfix: disable DoubleMatmul for PTX 2025-10-29 16:37:17 +08:00
George Hotz
30ca3f2af8 all double matmul (#12993)
* fix more double matmuls

* a few more

* all double matmul passes

* opts for flash attention

* fix spec

* comment
2025-10-29 16:25:27 +08:00
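
"Double matmul" here presumably means two chained matmuls in a single expression (the pattern at the core of flash attention); a minimal sketch with made-up shapes:

    from tinygrad import Tensor
    a, b, c = Tensor.rand(64, 64), Tensor.rand(64, 64), Tensor.rand(64, 64)
    out = (a @ b) @ c     # two matmuls in one expression
    out.realize()         # run with DEBUG=2 to see how many kernels this lowers to
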
Sieds Lykles
9f39f6391c shared_codegen_spec and fix index spec (#12967)
* split shared_codegen_spec and fix index

* add VCONST to program_spec and move index to shared_codegen_spec

* working ignore_oob=0

* cleanup

* fix spec

* undo that

* move barrier and special earlier

* fix more spec issues

* more updates

* remove special from program_spec

* cleanup and fixes

* move more to shared

* special is not in shared_spec

* some comments

* don't do bounds check there
2025-10-29 09:14:11 +01:00
George Hotz
1c362736aa fix more double matmuls (#12991)
* fix more double matmuls

* a few more
2025-10-29 16:09:48 +08:00
George Hotz
8c47cf4323 pcontig double matmul works (#12899)
* pcontig double matmul works

* tests

* contract

* closer

* works-ish

* add that broadcast

* 2 more work

* something

* disable broken ones

* llvm

* align 16
2025-10-29 13:06:43 +08:00
George Hotz
b147e7e8e6 flatten bufferize (#12984)
* flatten bufferize

* simpler

* tests pass

* flat

* not flat
2025-10-29 11:23:43 +08:00
chenyu
ef16e6c68c unwrap instead of cast [pr] (#12982) 2025-10-28 21:29:23 -04:00
George Hotz
5e01cc299b zero len ranges fail (#12974)
* zero len ranges fail

* fix Python backend

* fix llvm

* fix ptx

* yolo fix nir

* this works...

* always store...

* always store...

* Revert "always store..."

This reverts commit 0816cf344d.
2025-10-28 22:49:55 +08:00
George Hotz
f5a3b33d33 add fun with nhwc convs 2025-10-28 17:12:22 +08:00
George Hotz
907499b02c clean up GROUP/SINK (#12969)
* clean up GROUP/SINK

* fix end

* range_str color
2025-10-28 16:08:10 +08:00
Sieds Lykles
e22c5e7e73 process_replay uses opts argument for KernelInfo.opts_to_apply (#12946)
* opts_to_apply is opts

* skip beamed kernels

* simpler change

* fix the tensor cores tests for process replay

* use opts
2025-10-28 09:00:28 +01:00
George Hotz
b0da173f2f add unique to const, fix longstanding bug (#12965)
* add unique to const, fix longstanding bug

* _force_unique=True

* fix tests

* fix more tests
2025-10-28 15:11:37 +08:00
Sieds Lykles
e110f4632a split cat (on cpu) (#12864)
* split ranges but only on cpu

* except KernelOptError for threads

* use GROUP and END

* no more flatten_range needed

* remove noop end

* always process replay for openpilot

* update test

* skip test

* fix in outs calculation

With the new linearizer the toposort is a problem; this matches the spec now.

* undo that
2025-10-28 07:55:19 +01:00
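
For reference, the shape of op this PR touches; whether the cat really splits into separate kernels depends on the device (the change is CPU-only), so the snippet below is just an illustration with arbitrary sizes:

    from tinygrad import Tensor
    xs = [Tensor.rand(16, 8) for _ in range(4)]
    y = xs[0].cat(*xs[1:], dim=0)   # concatenate four tensors along dim 0
    y.realize()                     # on CPU this may now lower to one kernel per input
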
George Hotz
4d817a289e simplify spec (#12958)
* simplify spec

* more
2025-10-28 09:52:32 +08:00
chenyu
a79832b01f control_flow.py -> linearizer.py [pr] (#12948) 2025-10-27 12:38:13 -04:00
George Hotz
25c2da1579 check SPEC=2 in CI (#12945)
* check SPEC=2 in CI

* split SPEC=2

* fast enough
2025-10-27 21:53:57 +08:00
George Hotz
701a632907 move VECTORIZE/CONST (#12942) 2025-10-27 17:37:13 +08:00
George Hotz
804133cffd rename RECIP to RECIPROCAL (#12939) 2025-10-27 16:53:13 +08:00
chenyu
e18922f111 limit AND const min max to ints [pr] (#12918) 2025-10-25 16:07:52 -04:00
George Hotz
b4f6a2c7a3 add kernel spec (#12911)
* add kernel spec

* fix kernel spec
2025-10-25 11:49:20 +08:00
George Hotz
8a941d95a4 SPEC=2 is full spec, SPEC=1 is default (#12910)
* SPEC=1 passes all tests

* just use SPEC, not __debug__
2025-10-25 11:10:43 +08:00
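
A hedged example of turning the full spec checks on from Python; this assumes SPEC is read from the environment when tinygrad is imported, which is how the CI usage in #12945 appears to work.

    import os
    os.environ["SPEC"] = "2"     # 2 = full spec, 1 = default (per #12910); set before importing tinygrad
    from tinygrad import Tensor
    (Tensor.rand(8, 8) @ Tensor.rand(8, 8)).realize()   # graphs are now validated against the full spec
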
chenyu
4b7329001d clean up test_avg_pool3d (#12905) 2025-10-24 14:31:36 -04:00
George Hotz
6b35467f53 stores don't end ranges (#12902)
* early endrange

* bugfixes
2025-10-24 23:05:03 +08:00
Sieds Lykles
e1f8c82938 Onnx Layer/Group/RMS/Batch-Norm ReduceL2 fp32 intermediates for fp16 (#12109)
* match onnx spec

* use least_upper_dtype

* promote the square

* just cast before the square
2025-10-24 12:26:11 +02:00
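
The trick in the last bullet ("just cast before the square"), sketched as an RMS-style norm in tinygrad; the shapes and epsilon are arbitrary and this is not the ONNX code path itself.

    from tinygrad import Tensor, dtypes
    x = Tensor.rand(4, 128, dtype=dtypes.half)
    # square and reduce in fp32 so fp16 inputs don't lose precision, then cast the result back
    ms = x.cast(dtypes.float32).square().mean(axis=-1, keepdim=True)
    y = (x.cast(dtypes.float32) * (ms + 1e-6).rsqrt()).cast(dtypes.half)
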
George Hotz
0bde87d8d7 cleanups from flash attention branch (#12897) 2025-10-24 14:14:56 +08:00
wozeparrot
9dac505565 variable bs keccak (#10731) 2025-10-23 14:10:21 -07:00
Sieds Lykles
c1db62ff7c move reduce collapse to rangeify (#12845) 2025-10-23 15:44:17 +02:00
George Hotz
ff68a6263b move locals into codegen (dedup works) (#12885)
* move locals into codegen (dedup works)

* move in optimize
2025-10-23 17:07:39 +08:00
George Hotz
ddb53d1d48 PCONTIG=3 both saves ram and flops (#12884)
* PCONTIG=3 both saves ram and flops

* group

* gate locals

* should be correct
2025-10-23 16:37:26 +08:00
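
PCONTIG is an environment knob whose exact semantics are internal to the scheduler, so treat this as a sketch of how one might compare kernel counts and flops with it set, not a documented interface.

    import os
    os.environ["PCONTIG"] = "3"   # per #12884, the setting said to save both RAM and flops
    from tinygrad import Tensor
    a, b, c = Tensor.rand(256, 256), Tensor.rand(256, 256), Tensor.rand(256, 256)
    ((a @ b) @ c).realize()       # run with DEBUG=2 to inspect kernel count and estimated flops
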
George Hotz
e85cee0aad flip Ops.END srcs (#12882)
* flip Ops.END srcs

* backward

* late end split
2025-10-23 12:47:50 +08:00
George Hotz
74b4cfe44b Ops.GROUP + range check (#12880)
* simpler

* fix that

* Ops.GROUP + range check

* fix bugs

* fix linter

* fix test
2025-10-23 12:05:21 +08:00
George Hotz
7762b3558b clean up the spec (#12868)
* tighten up the spec

* move validate into a different file

* that moved to validate

* after(barr)
2025-10-22 19:50:42 +08:00
George Hotz
726988fa4b late ifs try 2 (#12865)
* late ifs try 2

* fix image

* fix that test

* panic

* ptx fixups

* preserve toposort

* those pass locally

* Revert "those pass locally"

This reverts commit 063409f828.

* no ls

* make that explicit
2025-10-22 18:49:27 +08:00
Sieds Lykles
8d0256c46b Move gate to load for loaded index (#12861)
* change condition

* change test to better represent how the uop looks irl
2025-10-22 09:53:07 +02:00
George Hotz
92778c7a8b rename opts to ren, add store ranges back (#12856)
* rename opts to ren

* fix docs and bring store back
2025-10-22 09:15:38 +08:00
b1tg
60d7e232f2 cuda fp8 (#12782)
* cuda fp8

* tensor core

* tc test

* clean

* clean pm
2025-10-21 15:05:25 -04:00
chenyu
8baa61bd67 use torch 2.9 and its Muon in test (#12773)
* use torch 2.9 and its Muon in test

* relax and disable
2025-10-21 13:35:17 -04:00
chenyu
f51f9aaa16 muon ns_params -> ns_coefficients (#12850)
match the official torch one
2025-10-21 12:35:52 -04:00
wozeparrot
62e7b8b870 feat: just use compile3 (#12849) 2025-10-21 07:56:50 -07:00