George Hotz
b4f6a2c7a3
add kernel spec ( #12911 )
...
* add kernel spec
* fix kernel spec
2025-10-25 11:49:20 +08:00
George Hotz
8a941d95a4
SPEC=2 is full spec, SPEC=1 is default ( #12910 )
...
* SPEC=1 passes all tests
* just use SPEC, not __debug__
2025-10-25 11:10:43 +08:00
chenyu
4b7329001d
clean up test_avg_pool3d ( #12905 )
2025-10-24 14:31:36 -04:00
George Hotz
6b35467f53
stores don't end ranges ( #12902 )
...
* early endrange
* bugfixes
2025-10-24 23:05:03 +08:00
Sieds Lykles
e1f8c82938
Onnx Layer/Group/RMS/Batch-Norm ReduceL2 fp32 intermediates for fp16 ( #12109 )
...
* match onnx spec
* use least_upper_dtype
* promote the square
* just cast before the square
2025-10-24 12:26:11 +02:00
George Hotz
0bde87d8d7
cleanups from flash attention branch ( #12897 )
2025-10-24 14:14:56 +08:00
wozeparrot
9dac505565
variable bs keccak ( #10731 )
2025-10-23 14:10:21 -07:00
Sieds Lykles
c1db62ff7c
move reduce collapse to rangeify ( #12845 )
2025-10-23 15:44:17 +02:00
George Hotz
ff68a6263b
move locals into codegen (dedup works) ( #12885 )
...
* move locals into codegen (dedup works)
* move in optimize
2025-10-23 17:07:39 +08:00
George Hotz
ddb53d1d48
PCONTIG=3 both saves ram and flops ( #12884 )
...
* PCONTIG=3 both saves ram and flops
* group
* gate locals
* should be correct
2025-10-23 16:37:26 +08:00
George Hotz
e85cee0aad
flip Ops.END srcs ( #12882 )
...
* flip Ops.END srcs
* backward
* late end split
2025-10-23 12:47:50 +08:00
George Hotz
74b4cfe44b
Ops.GROUP + range check ( #12880 )
...
* simpler
* fix that
* Ops.GROUP + range check
* fix bugs
* fix linter
* fix test
2025-10-23 12:05:21 +08:00
George Hotz
7762b3558b
clean up the spec ( #12868 )
...
* tighten up the spec
* move validate into a different file
* that moved to validate
* after(barr)
2025-10-22 19:50:42 +08:00
George Hotz
726988fa4b
late ifs try 2 ( #12865 )
...
* late ifs try 2
* fix image
* fix that test
* panic
* ptx fixups
* preserve toposort
* those pass locally
* Revert "those pass locally"
This reverts commit 063409f828.
* no ls
* make that explicit
2025-10-22 18:49:27 +08:00
Sieds Lykles
8d0256c46b
Move gate to load for loaded index ( #12861 )
...
* change condition
* change test to better represent how the uop looks irl
2025-10-22 09:53:07 +02:00
George Hotz
92778c7a8b
rename opts to ren, add store ranges back ( #12856 )
...
* rename opts to ren
* fix docs and bring store back
2025-10-22 09:15:38 +08:00
b1tg
60d7e232f2
cuda fp8 ( #12782 )
...
* cuda fp8
* tensor core
* tc test
* clean
* clean pm
2025-10-21 15:05:25 -04:00
chenyu
8baa61bd67
use torch 2.9 and its Muon in test ( #12773 )
...
* use torch 2.9 and its Muon in test
* relax and disable
2025-10-21 13:35:17 -04:00
chenyu
f51f9aaa16
muon ns_params -> ns_coefficients ( #12850 )
...
match the official torch one
2025-10-21 12:35:52 -04:00
wozeparrot
62e7b8b870
feat: just use compile3 ( #12849 )
2025-10-21 07:56:50 -07:00
George Hotz
8960ac54f3
remove RewriteStep premature optimization ( #12840 )
...
* remove RewriteStep premature optimization
* fix ebs
* core line count
2025-10-21 21:45:20 +08:00
Sieds Lykles
7f798a9630
Cleanup const buffers ( #12829 )
...
* split pm_cleanups
* update test_schedule
* shrink when we remove bufferize
* dont do shrink if shape is empty
* update tests
* remove *1 from metadata
* deal with the noop bufferize
* only noop on cvar
* cleanup
* fix if
* rename
2025-10-21 14:53:49 +02:00
George Hotz
20a232f1c5
bugfixes from multioutput + PCONTIG=3 for fa bw memory fix ( #12837 )
...
* bugfixes from multioutput
* PCONTIG=3 fixes fa memory usage
* that's base
2025-10-21 19:21:02 +08:00
George Hotz
7d9551ce2e
move to late/control_flow.py ( #12835 )
2025-10-21 18:15:06 +08:00
George Hotz
d711a4b933
delete old linearizer ( #12834 )
...
* new linearizer with early endrange
* cleanups
* second stage removal
* not store
* do that later
* end cleanup
* fix globals
* end
* multi end
* fix ends earlier
* work
* do_merge_ends
* mini change
* range_gate
* fix cpu
* test fixups
* ranges on index
* not for ptx
* delete linearizer
* remove more junk
* delete that test
* we insert endif
* all ends
2025-10-21 17:52:18 +08:00
George Hotz
c780cd9abb
new linearizer with early endrange ( #12823 )
...
* new linearizer with early endrange
* cleanups
* second stage removal
* not store
* do that later
* end cleanup
* fix globals
* end
* multi end
* fix ends earlier
* work
* do_merge_ends
* mini change
* range_gate
* fix cpu
* test fixups
* ranges on index
* not for ptx
2025-10-21 17:37:48 +08:00
George Hotz
d59d4cdbe4
lil less is okay
2025-10-21 17:09:44 +08:00
qazal
32af1ff84b
viz graph drawing small cleanups ( #12830 )
...
* viz graph drawing small cleanups
* str literal
2025-10-21 15:51:32 +08:00
George Hotz
a71a41f6d1
rename Ops.ENDRANGE -> Ops.END ( #12824 )
2025-10-21 11:32:18 +08:00
George Hotz
203a93363c
Revert "after clean up of locals ( #12813 )" ( #12814 )
...
This reverts commit 5d0d3d7aac.
2025-10-20 19:33:35 +08:00
George Hotz
5d0d3d7aac
after clean up of locals ( #12813 )
2025-10-20 19:24:24 +08:00
Sieds Lykles
a8e4614436
remove REAL_SUBSTITUTE=0 and make it fast ( #12809 )
...
* fast REAL_substitute
* remove REAL_SUBSTITUTE=0
2025-10-20 12:44:20 +02:00
George Hotz
2e9082e0bc
after op ( #12801 )
...
* after op
* fix tests
2025-10-20 12:27:56 +08:00
George Hotz
ba593f7b98
don't render index ( #12796 )
...
* don't render index
* update to ignore_indexing
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2025-10-20 09:48:36 +08:00
chenyu
63a23dfe80
test step 0 in TestTrainingOnnxOps ( #12790 )
...
and tighter rtol
2025-10-19 09:15:49 -04:00
chenyu
e8158afd4b
update test_qlinear_add_round_half_to_even ( #12789 )
...
this does not pass locally
2025-10-19 08:47:27 -04:00
Sieds Lykles
fd6ef4801c
rangeify uses symbolic_flat ( #12786 )
...
* symbolic_simple -> symbolic_flat
* remove expected failures
2025-10-19 12:27:14 +02:00
qazal
c8ef4b60f6
viz: share match tracing and TINY device profiler ( #12783 )
...
* set a default name for the traces
* set profile_matches + renames
* profile_matches test
* traces 4 steps total
2025-10-19 14:30:07 +08:00
chenyu
30ff84d050
update test_conv2d_ceildiv_edge_case ( #12779 )
2025-10-18 16:43:32 -04:00
nimlgen
442218266d
qcom: fix profiler ( #12778 )
...
* qcom: fix profiler
* this way
2025-10-19 01:27:59 +08:00
wozeparrot
82f10cfe2e
feat: assert on bufferview math ( #12772 )
2025-10-17 14:20:08 -07:00
chenyu
fcdf4ab37e
remove a contiguous in LARS ( #12770 )
2025-10-17 17:07:30 -04:00
George Hotz
062a6d68d7
test flash attention backward ( #12762 )
...
* test flash attention backward
* TODO: fix pcontig
* end ranges
* render colors
* very big
* multiout at every level
* reset ending ranges
* fix tests
* ugh
2025-10-17 23:15:59 +08:00
George Hotz
c9a3464f76
those decimals never mattered ( #12760 )
...
* those decimals never mattered
* this
* improve debug
* real substitute fixes pcontig
* locals are different buffers
2025-10-17 17:16:24 +08:00
qazal
0160f034d6
viz: show display name for copy runners ( #12761 )
...
* viz: show display name for copy runners
* more u32
2025-10-17 16:59:51 +08:00
qazal
253d32b065
viz: add metadata to buffer user list ( #12758 )
...
* simple failing test
* encodings
* test passing
* key is deduped
2025-10-17 16:28:54 +08:00
George Hotz
935a60db72
bring back partial contig and flash attention ( #12756 )
...
* bring back partial contig and flash attention
* why not 2
* work
* that
* fix pcontig
2025-10-17 16:19:05 +08:00
qazal
dfb8f9fc9e
viz: annotate buffer mutability in the memory graph ( #12750 )
2025-10-17 11:53:02 +08:00
chenyu
9561803cb0
fix assert in test_schedule ( #12745 )
...
* fix assert in test_schedule
updated kernel counts and some old tests
* fix
2025-10-16 15:39:50 -04:00
chenyu
285534ce64
delete DONT_REALIZE_EXPAND and DONT_GROUP_REDUCES ( #12744 )
...
does nothing now
2025-10-16 14:11:33 -04:00