Commit Graph

10490 Commits

George Hotz
d81acbeef6 multi: move shrink after copy (#10109)
* multi: move shrink after copy

* passing now
2025-04-30 10:29:51 -04:00
qazal
67bd8489ad grouper cleanups [pr] (#10113) 2025-04-30 18:54:47 +08:00
nimlgen
b4c9a3d8f4 hcq: use mmio iface in copies (#10111)
* hcq: use mmio iface in copies

* linter

* fix_am

* am
2025-04-30 11:05:13 +03:00
nimlgen
5c7d004da5 hcq: refactor int ptrs to hcqbuffers (#10105)
* hcq: refactor int ptrs to hcqbuffers

* more refactors

* linter

* use in allocator

* test fix

* fix

* ops

* final?

* simpler

* keep this for now
2025-04-30 00:12:18 +03:00
chenyu
573bbb9746 Revert "remove TransformerBlock contiguous in llama (#10104)" (#10108)
This reverts commit b8d07dcc54.
2025-04-29 15:28:38 -04:00
chenyu
4a04098389 fix llama3 with nf4 quantize (#10107)
also int8 outputs are wrong
2025-04-29 15:14:36 -04:00
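
The nf4 fix above concerns 4-bit NormalFloat (NF4) quantization from the QLoRA paper: weights are scaled per block by their absmax, and each value is stored as the 4-bit index of the nearest entry in a fixed 16-level codebook. A minimal round-trip sketch, assuming a generic uniform codebook rather than the actual NF4 levels or tinygrad's kernels:

```python
import numpy as np

# hypothetical 16-entry codebook; real NF4 levels are normal-distribution
# quantiles from the QLoRA paper, not this uniform grid
CODEBOOK = np.linspace(-1.0, 1.0, 16)

def quantize_block(w: np.ndarray) -> tuple[np.ndarray, float]:
    scale = float(np.abs(w).max())                     # per-block absmax scale
    idx = np.abs(w[:, None] / scale - CODEBOOK).argmin(axis=1)
    return idx.astype(np.uint8), scale                 # 4-bit indices + one scale

def dequantize_block(idx: np.ndarray, scale: float) -> np.ndarray:
    return (CODEBOOK[idx] * scale).astype(np.float32)  # lookup, then rescale

w = np.random.randn(64).astype(np.float32)
idx, scale = quantize_block(w)
print(np.abs(dequantize_block(idx, scale) - w).max())  # small reconstruction error
```
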
George Hotz
9c1b80499f names for graph rewrites + null device supports exp and friends (#10106) 2025-04-29 14:28:20 -04:00
chenyu
b8d07dcc54 remove TransformerBlock contiguous in llama (#10104) 2025-04-29 14:15:39 -04:00
Ignacio Sica
9d5677c12c fix ptx linearizer bug 2 [pr] (#9967)
* check for local buffer

* hotfix

* add test_tensor_cores_emulation run for ptx
2025-04-29 14:30:07 -03:00
qazal
a59d18da21 hack for VIZ=1 with examples/llama (#10103)
* hack for VIZ=1 with examples/llama

* move it alongside BEAM=0
2025-04-29 23:42:17 +08:00
qazal
93bf8764f2 do not open devices in lowering (#10101)
* do not open devices in lowering [pr]

* ctx=opts

* ctx

* fuzz test
2025-04-29 23:18:16 +08:00
George Hotz
c3ff308abb range has only one src now [pr] (#10100)
* range has only one op now

* fix z3 checker

* ci fix

* needs shell

* try pip ensure update

* that ensurepip is useless

* upgrade pip before cache

* windows happy?
2025-04-29 10:31:05 -04:00
George Hotz
427471550a hotfix: amd tflops to 74 and some external_benchmark_sdxl_softmax stuff 2025-04-29 09:02:27 -04:00
Ignacio Sica
58cf8cd493 add support for "shared_mem" for LLVM (#10093)
* init llvm shared

* add test_tensor_cores_emulation run for llvm
2025-04-29 08:56:36 -04:00
qazal
ad7546c931 assert in test_indexing_two_bind instead of silent fail (#10099)
* assert in test_indexing_two_bind instead of silent fail

* debuggable

* skip test_simple_train
2025-04-29 20:23:25 +08:00
George Hotz
cee220a1ab always expand ssa on wheres (#9697)
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-04-29 20:08:41 +08:00
qazal
3b67f56c02 kernelize some llama realizes (#10098) 2025-04-29 18:39:56 +08:00
qazal
cbf7347cd6 display viz rewrites with tabbing if they are subrewrites (#10097)
* display viz rewrites with tabbing if they are subrewrites

* update viz api
2025-04-29 17:57:21 +08:00
George Hotz
73c2f6602f test sdxl softmax (#10096) 2025-04-28 21:55:50 -04:00
George Hotz
eaceafecae do fusion locally (#10095)
* do fusion locally

* oops, that's the right way

* explicit delete closure
2025-04-28 20:45:37 -04:00
chenyu
3eba3d6ee9 don't pass model in convert_from_huggingface and convert_from_gguf (#10094)
it only needs n_layers
2025-04-28 20:11:19 -04:00
George Hotz
a2d0684fc1 test_attention_simple_view (#10092)
* test_attention_simple_view

* correct comment
2025-04-28 20:01:22 -04:00
Ignacio Sica
bda116d773 fix use_tensor_cores propagation (#10048)
* propagate use_tensor_cores

* add use_tensor_core to arg in test and search

* bugfix

* get TC val from ContextVar in search

* revert minor space change

* add tc emulation test to ci and benchmark

* revert

* revert whitespace change

* remove test for ptx

* add comment and remove llvm test run
2025-04-28 19:30:50 -03:00
George Hotz
d32f5e9f3a improve rendering of shapes in viz + investigate symbolic [pr] (#10091) 2025-04-28 16:44:09 -04:00
Sieds Lykles
dbb7aee02e Split constant in div with negative x (#10088)
* add rule

* change test

* lower complexity limit

* remove offset in fold_unrolled_divs

* remove import

* add one more condition
2025-04-28 16:24:14 -04:00
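
The constant split above rests on a floor-division identity that holds for every integer x, negative or not: writing c = q*d + r with 0 <= r < d, the multiple-of-d part of an added constant pulls out of the division. A quick check of the arithmetic (this shows the identity itself, not tinygrad's exact rewrite rule):

```python
# (x + c)//d == q + (x + r)//d  where  c = q*d + r  and  0 <= r < d
def split_const(x: int, c: int, d: int) -> int:
    q, r = divmod(c, d)      # Python guarantees 0 <= r < d for d > 0
    return q + (x + r) // d

assert all(split_const(x, 7, 4) == (x + 7) // 4 for x in range(-100, 100))
```
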
chenyu
610ee79b22 cherry pick mlperf5.0 branch to master (#10089) 2025-04-28 15:36:56 -04:00
chenyu
459a223202 simpler Literal annotation in code_for_workitem [pr] (#10087) 2025-04-28 14:59:25 -04:00
nimlgen
dcd9a633c3 am: load minimum fw (#10083)
* am: load minimum psp parts

* try this

* remove ME & PFP
2025-04-28 21:28:05 +03:00
George Hotz
ecff82a698 fixing single kernel softmax: resolve (#10086)
* fixing single kernel softmax: resolve

* add failing lin test
2025-04-28 13:46:20 -04:00
George Hotz
4c242b0483 hotfix: tests all pass on metal local 2025-04-28 12:09:00 -04:00
George Hotz
690dac79b5 don't modify the ranges on reduce rewrite (#10062)
* bug in div range folding

* simpler

* oh, this is right for indexing, but the div mod folding needs to be fixed

* reenable

* Passing test_complexity_w_unroll2 (#10068)

* Passing

* remove non_folded_divs

* Add check for negative term in div folding

* Add test

* bump that limit

* fix casted

---------

Co-authored-by: Sieds Lykles <93992551+S-Lykles@users.noreply.github.com>
2025-04-28 12:01:19 -04:00
quortus
5130759605 Make sure clang always inlines batched functions (#10037) 2025-04-28 10:48:24 -04:00
George Hotz
c4a50f9d89 fix full shape in kernel.py [pr] (#10085)
* fix full shape in kernel.py

* fix that heuristic

* full shape in shapetracker is fast

* fix process replay [pr]

* simpler

* this

* i'm just going to ignore that one
2025-04-28 09:32:58 -04:00
qazal
ac37510f60 remu: only write v_cmp result if exec is set (#10084) 2025-04-28 20:31:52 +08:00
qazal
d6b436a815 remu bugfix with -0.0 negation (#10082) 2025-04-28 15:46:42 +08:00
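
Signed-zero negation is a classic emulator pitfall: implementing float negation as `0.0 - x` loses the sign flip on zeros, because IEEE 754 negation must toggle the sign bit unconditionally while `0.0 - 0.0` rounds to `+0.0`. A two-line illustration of the general pitfall (not necessarily remu's exact bug):

```python
import math

x = 0.0
print(math.copysign(1.0, -x))       # -1.0: true negation yields -0.0
print(math.copysign(1.0, 0.0 - x))  # +1.0: subtraction from zero does not
```
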
nimlgen
15e4302784 am: optimize zeroing out boot structs (#10081) 2025-04-28 10:15:32 +03:00
nimlgen
68e5ab8552 am: fix typo in fw loading (#10080) 2025-04-28 09:45:00 +03:00
chenyu
e996584685 olmoe in mac benchmark (#10077) 2025-04-27 21:07:02 -04:00
George Hotz
732e172961 don't require contiguous after fuse (#10074) 2025-04-27 13:17:22 -04:00
qazal
1aed04ec12 cpu is ground truth in VALIDATE_WITH_CPU=1 [pr] (#10067) 2025-04-28 01:14:21 +08:00
George Hotz
129bddde74 lin failure from SINGLE_KERNEL_SOFTMAX (#10073)
* lin failure from SINGLE_KERNEL_SOFTMAX

* fix lin issue

* more pure diff
2025-04-27 13:02:10 -04:00
George Hotz
b341296304 hotfix: save sdxl ram 2025-04-27 12:09:45 -04:00
George Hotz
68c5f7ba80 load fast in sdxl (#10072)
* load fast in sdxl

* back to that with the ret

* no context
2025-04-27 11:58:51 -04:00
George Hotz
768eb94c3e disable debug for load_state_dict [pr] (#10070) 2025-04-27 11:11:56 -04:00
George Hotz
4b8ef6ce78 hotfix: sdxl corealize 2025-04-27 10:41:46 -04:00
George Hotz
b6d2effaf5 assign is contiguous (#10066)
* assign is contiguous

* disable process replay for SDXL
2025-04-27 08:40:33 -04:00
George Hotz
1253819151 make beautiful indexing use a Variable (#10063)
* make beautiful indexing use a Variable

* stunning test

* better color

* training is broken

* fix tests

* fix variable indexing

* fix test

* no contiguous

* revert that

* revert that too

* indexing two bind

* skip for webgpu

* make not slow
2025-04-27 08:22:38 -04:00
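
For context, a bound Variable is tinygrad's public symbolic-int mechanism (the commit changes how the indexing internals use one). A minimal sketch of the user-facing API, assuming the top-level Variable export used in examples/llama.py:

```python
from tinygrad import Variable

# a Variable is a named symbolic int with a [min, max] range; bind() attaches a
# concrete runtime value so one compiled kernel can serve every value in range
v = Variable("start_pos", 0, 127).bind(17)
var, val = v.unbind()  # recover the symbolic Variable and its bound value
print(var, val)
```
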
Rory Clear
a13a43c4fe yolo 416 to 640 res (#10047) 2025-04-26 20:45:58 -04:00
chenyu
4c1ce1a299 don't simplify if div folding resulted in negative numerator (#10064)
* don't simplify if div folding resulted in negative numerator

* test
2025-04-26 17:01:18 -04:00
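
One way a div fold goes wrong once the numerator can be negative: floor division (what the symbolic rules assume) and C-style truncated division disagree exactly there, so a simplification that is sound for non-negative numerators can change results downstream. A minimal illustration of the divergence:

```python
def trunc_div(a: int, b: int) -> int:
    # rounds toward zero, like C's integer division
    return -(-a // b) if (a < 0) != (b < 0) else a // b

for a in (3, -3):
    print(a, a // 4, trunc_div(a, 4))  # agree at 3; at -3, floor gives -1, trunc gives 0
```
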
George Hotz
1805403821 fix rand arange folding (#10060)
* test rand range

* --amend

* fix rand arange folding

* reduce_rangeless fix
2025-04-26 12:24:05 -04:00
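
The arange folding above refers to tinygrad's optimization of reducing over Tensor.arange in closed form, so gather-style patterns never materialize the counter tensor. A sketch of the kind of pattern the fold targets (illustrative; the fold itself happens during scheduling):

```python
from tinygrad import Tensor

emb = Tensor.rand(1000, 64)  # embedding table
ids = Tensor([3, 41, 7])     # token ids to look up

# gather written as a matmul against a one-hot mask built from arange; the
# arange comparison can fold to index math instead of a 1000-wide tensor
onehot = (Tensor.arange(1000).reshape(1, 1000) == ids.reshape(3, 1)).float()
rows = (onehot @ emb).realize()  # shape (3, 64)
```
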