chenyu
b8d07dcc54
remove TransformerBlock contiguous in llama (#10104)
2025-04-29 14:15:39 -04:00
Ignacio Sica
9d5677c12c
fix ptx linearizer bug 2 [pr] (#9967)
* check for local buffer
* hotfix
* add test_tensor_cores_emulation run for ptx
2025-04-29 14:30:07 -03:00
qazal
a59d18da21
hack for VIZ=1 with examples/llama (#10103)
* hack for VIZ=1 with examples/llama
* move it alongside BEAM=0
2025-04-29 23:42:17 +08:00
qazal
93bf8764f2
do not open devices in lowering (#10101)
* do not open devices in lowering [pr]
* ctx=opts
* ctx
* fuzz test
2025-04-29 23:18:16 +08:00
George Hotz
c3ff308abb
range has only one src now [pr] (#10100)
* range has only one op now
* fix z3 checker
* ci fix
* needs shell
* try pip ensure update
* that ensurepip is useless
* upgrade pip before cache
* windows happy?
2025-04-29 10:31:05 -04:00
George Hotz
427471550a
hotfix: amd tflops to 74 and some external_benchmark_sdxl_softmax stuff
2025-04-29 09:02:27 -04:00
Ignacio Sica
58cf8cd493
add support for "shared_mem" for LLVM (#10093)
* init llvm shared
* add test_tensor_cores_emulation run for llvm
2025-04-29 08:56:36 -04:00
qazal
ad7546c931
assert in test_indexing_two_bind instead of silent fail (#10099)
* assert in test_indexing_two_bind instead of silent fail
* debuggable
* skip test_simple_train
2025-04-29 20:23:25 +08:00
George Hotz
cee220a1ab
always expand ssa on wheres (#9697)
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-04-29 20:08:41 +08:00
qazal
3b67f56c02
kernelize some llama realizes (#10098)
2025-04-29 18:39:56 +08:00
qazal
cbf7347cd6
display viz rewrites with tabbing if they are subrewrites (#10097)
* display viz rewrites with tabbing if they are subrewrites
* update viz api
2025-04-29 17:57:21 +08:00
George Hotz
73c2f6602f
test sdxl softmax (#10096)
2025-04-28 21:55:50 -04:00
George Hotz
eaceafecae
do fusion locally (#10095)
* do fusion locally
* oops, that's the right way
* explicit delete closure
2025-04-28 20:45:37 -04:00
chenyu
3eba3d6ee9
don't pass model in convert_from_huggingface and convert_from_gguf (#10094)
it only needs n_layers
2025-04-28 20:11:19 -04:00
George Hotz
a2d0684fc1
test_attention_simple_view (#10092)
* test_attention_simple_view
* correct comment
2025-04-28 20:01:22 -04:00
Ignacio Sica
bda116d773
fix use_tensor_cores propagation (#10048)
* propagate use_tensor_cores
* add use_tensor_core to arg in test and search
* bugfix
* get TC val from ContextVar in search
* revert minor space change
* add tc emulation test to ci and benchmark
* revert
* revert whitespace change
* remove test for ptx
* add comment and remove llvm test run
2025-04-28 19:30:50 -03:00
George Hotz
d32f5e9f3a
improve rendering of shapes in viz + investigate symbolic [pr] (#10091)
2025-04-28 16:44:09 -04:00
Sieds Lykles
dbb7aee02e
Split constant in div with negative x (#10088)
* add rule
* change test
* lower complexity limit
* remove offset in fold_unrolled_divs
* remove import
* add one more condition
2025-04-28 16:24:14 -04:00
chenyu
610ee79b22
cherry pick mlperf5.0 branch to master (#10089)
2025-04-28 15:36:56 -04:00
chenyu
459a223202
simpler Literal annotation in code_for_workitem [pr] (#10087)
2025-04-28 14:59:25 -04:00
nimlgen
dcd9a633c3
am: load minimum fw (#10083)
* am: load minimum psp parts
* try this
* remove me & pfp
2025-04-28 21:28:05 +03:00
George Hotz
ecff82a698
fixing single kernel softmax: resolve (#10086)
* fixing single kernel softmax: resolve
* add failing lin test
2025-04-28 13:46:20 -04:00
George Hotz
4c242b0483
hotfix: tests all pass on metal local
2025-04-28 12:09:00 -04:00
George Hotz
690dac79b5
don't modify the ranges on reduce rewrite (#10062)
* bug in div range folding
* simpler
* oh, this is right for indexing, but the div mod folding needs to be fixed
* reenable
* Passing test_complexity_w_unroll2 (#10068)
* Passing
* remove non_folded_divs
* Add check for negative term in div folding
* Add test
* bump that limit
* fix casted
---------
Co-authored-by: Sieds Lykles <93992551+S-Lykles@users.noreply.github.com>
2025-04-28 12:01:19 -04:00
quortus
5130759605
Make sure clang always inlines batched functions (#10037)
2025-04-28 10:48:24 -04:00
George Hotz
c4a50f9d89
fix full shape in kernel.py [pr] (#10085)
* fix full shape in kernel.py
* fix that heuristic
* full shape in shapetracker is fast
* fix process replay [pr]
* simpler
* this
* i'm just going to ignore that one
2025-04-28 09:32:58 -04:00
qazal
ac37510f60
remu: only write v_cmp result if exec is set (#10084)
2025-04-28 20:31:52 +08:00
qazal
d6b436a815
remu bugfix with -0.0 negation (#10082)
2025-04-28 15:46:42 +08:00
nimlgen
15e4302784
am: optimize zeroing out boot structs (#10081)
2025-04-28 10:15:32 +03:00
nimlgen
68e5ab8552
am: fix typo in fw loading (#10080)
2025-04-28 09:45:00 +03:00
chenyu
e996584685
olmoe in mac benchmark (#10077)
2025-04-27 21:07:02 -04:00
George Hotz
732e172961
don't require contiguous after fuse (#10074)
2025-04-27 13:17:22 -04:00
qazal
1aed04ec12
cpu is ground truth in VALIDATE_WITH_CPU=1 [pr] (#10067)
2025-04-28 01:14:21 +08:00
George Hotz
129bddde74
lin failure from SINGLE_KERNEL_SOFTMAX (#10073)
* lin failure from SINGLE_KERNEL_SOFTMAX
* fix lin issue
* more pure diff
2025-04-27 13:02:10 -04:00
George Hotz
b341296304
hotfix: save sdxl ram
2025-04-27 12:09:45 -04:00
George Hotz
68c5f7ba80
load fast in sdxl (#10072)
* load fast in sdxl
* back to that with the ret
* no context
2025-04-27 11:58:51 -04:00
George Hotz
768eb94c3e
disable debug for load_state_dict [pr] (#10070)
2025-04-27 11:11:56 -04:00
George Hotz
4b8ef6ce78
hotfix: sdxl corealize
2025-04-27 10:41:46 -04:00
George Hotz
b6d2effaf5
assign is contiguous (#10066)
* assign is contiguous
* disable process replay for SDXL
2025-04-27 08:40:33 -04:00
George Hotz
1253819151
make beautiful indexing use a Variable (#10063)
* make beautiful indexing use a Variable
* stunning test
* better color
* training is broken
* fix tests
* fix variable indexing
* fix test
* no contiguous
* revert that
* revert that too
* indexing two bind
* skip for webgpu
* make not slow
2025-04-27 08:22:38 -04:00
Rory Clear
a13a43c4fe
yolo 416 to 640 res (#10047)
2025-04-26 20:45:58 -04:00
chenyu
4c1ce1a299
don't simplify if div folding resulted in negative numerator (#10064)
* don't simplify if div folding resulted in negative numerator
* test
2025-04-26 17:01:18 -04:00
George Hotz
1805403821
fix rand arange folding (#10060)
* test rand range
* --amend
* fix rand arange folding
* reduce_rangeless fix
2025-04-26 12:24:05 -04:00
qazal
d13c100981
don't sort dims in verify_sink_dims [pr] (#10059)
* don't sort dims in verify_sink_dims [pr]
* 1 can exist with n
* put process_replay warn last
* assert shape is the same
* bring that back
2025-04-26 23:24:30 +08:00
George Hotz
c80fe6d5fc
handle some fancier reduces (#10057)
* reduce_unparented
* handle fancier reduces
* fold more
* bugfix
2025-04-26 11:20:15 -04:00
nimlgen
e08270c1ba
nv: fix program init for no-args kernels (#10058)
2025-04-26 18:08:53 +03:00
George Hotz
11113c9d07
reduce_unparented (#10056)
2025-04-26 09:48:16 -04:00
George Hotz
ea5dddc537
reduce collapse generic (#10045)
* reduce collapse generic
* new arange folder
* new range folding
* correct with sym
* all tests pass
* indexing ops passes
* failing tests
* fix tests, remove unused
* revert that
* torch indexing is fast
* skip on webgpu
* touchups
* comments
2025-04-26 09:13:24 -04:00
quortus
5cdc96409e
Update outdated renderer.render calls (#10044)
2025-04-26 07:35:19 -04:00
nimlgen
e055b9422f
am: fix mmap failures (#10054)
2025-04-26 14:21:28 +03:00