qazal
616e9c1483
CDNA assembly gemm in tensor.py with flag ( #14310 )
...
* work
* work
* the assembly
* remove the old one
* remove ws bufs, assert splitk
* notes cleanup
* work
* gemm args
* gemm in mixins would be nice
* add gemm gradient
* print counters
* the realize is for DEBUG=2 aesthetics
* dedup
* rewrite to python dsl, no list copies
* leave that
* add B, M, N, K to gemm name
* it's M0 not NULL
* fp16 support
* test cleanup + more gemms
* work from viz
* more work
* gemm batch_size
* xccg path work
* tiny comments on the label naming
* s_waitcnt
2026-01-31 22:34:14 +09:00
chenyu
55f806b713
tighter late_buffer_view match [pr] ( #14456 )
...
src must be len 2 at that point
2026-01-31 07:28:26 -05:00
qazal
d69bc5aa1a
make DEV=NULL EMULATE=AMD amd_asm_matmul run ( #14460 )
2026-01-31 20:45:24 +09:00
qazal
4976544bf9
multi ram usage tests on the NULL device ( #14457 )
2026-01-31 14:14:53 +09:00
chenyu
99b44121bc
failed test case for non-consecutive disk read ( #14455 )
...
silently fail now
2026-01-30 23:44:04 -05:00
George Hotz
b705c9143c
assembly/amd: test more instructions ( #14365 )
...
* assembly/amd: test more instructions
* more
* passing
* revert
* no const fold
* remove junk
* cleaner
2026-01-31 12:40:22 +08:00
George Hotz
c9a3ddb341
benchmark llama walltime script ( #14454 )
...
* benchmark llama walltime script
* adj layers
2026-01-31 10:21:54 +08:00
George Hotz
f5346d6a1a
fix USE_ATOMICS for non float dtypes and make it the default ( #14444 )
...
* embedded multistep test
* complex test
* with jit
* fix dtypes and reenable USE_ATOMICS
* that test didn't catch anything
2026-01-31 09:44:16 +08:00
Christopher Milan
e575dd8275
prevent UB in long decomp and more emulated tests ( #14447 )
2026-01-30 19:38:41 -05:00
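Context for the entry above: "long decomp" refers to decomposing 64-bit (`long`) operations into 32-bit halves for backends that emulate them (see the `EMULATED_DTYPES=long` entries below), and the classic UB hazards in a C decomposition are signed overflow and over-wide shifts. A minimal Python sketch of the idea, assuming (hypothetically) that a 64-bit add is emulated as two 32-bit limbs with an explicit carry; the actual tinygrad decomposition may differ:

```python
MASK32 = 0xFFFFFFFF

def add64(a_lo, a_hi, b_lo, b_hi):
    # Emulate a 64-bit add using two 32-bit limbs. All arithmetic is done on
    # masked unsigned values, so no signed overflow can occur (the UB hazard
    # in the equivalent C code).
    lo = (a_lo + b_lo) & MASK32
    carry = 1 if lo < (a_lo & MASK32) else 0
    hi = (a_hi + b_hi + carry) & MASK32
    return lo, hi

def split64(x):
    # Split a 64-bit value into (low, high) 32-bit limbs.
    return x & MASK32, (x >> 32) & MASK32

lo, hi = add64(*split64(0xFFFFFFFF), *split64(1))
print(hex((hi << 32) | lo))  # -> 0x100000000 (carry propagated into the high limb)
```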
chenyu
3204f94454
correct var_vals schedule filter ( #14451 )
...
complete_create_schedule_with_vars returns var_vals that is used in the schedule
2026-01-30 17:10:07 -05:00
chenyu
cfcd1debb5
test schedule with multiple AFTER ( #14449 )
2026-01-30 15:59:00 -05:00
nimlgen
486d53d646
device: call free for external_ptr ( #14448 )
...
* device: call free for external_ptr
* lin
2026-01-30 23:53:17 +03:00
nimlgen
e0978498dc
amd: read_ptr/write_ptr/doorbells are not lists ( #14445 )
2026-01-30 23:11:57 +03:00
Christopher Milan
1803ee939d
EMULATED_DTYPES=long works with CPU_LLVM ( #14446 )
2026-01-30 13:54:43 -05:00
chenyu
03613e83ad
update TestTensorMetadata ( #14443 )
...
run with SCACHE=0 some more TODOs
2026-01-30 12:39:01 -05:00
George Hotz
cbb1eed57b
hotfix: partial revert of 9eb449f88, caused llama NaN
2026-01-30 17:19:27 +00:00
chenyu
26f5c00265
move TestTensorMetadata to unit ( #14442 )
2026-01-30 12:14:21 -05:00
chenyu
c05a0b85ae
flip unique const src order [pr] ( #14441 )
...
* flip unique const src order [pr]
matches buffer, simplifies replace_input_buffer
* combine rules
2026-01-30 11:44:18 -05:00
George Hotz
ee2c78709d
mlperf/llama: disable USE_ATOMICS for now
2026-01-31 00:42:08 +08:00
chenyu
beecac4d85
expand ranges -> unroll outer ranges [pr] ( #14440 )
2026-01-30 11:26:05 -05:00
chenyu
9eb449f882
clean up toposort sched_sink [pr] ( #14439 )
2026-01-30 10:18:28 -05:00
George Hotz
838cd078bc
use atomics for embedding backward ( #14400 )
...
* embedding is slow
* failing
* float is fine
* null
* it fails
* simplify embedding with broadcasting
* ATOMIC_ADD incoming
* min change
* simpler test
* better test
* fix test
* real test
* simpler
* cleanups
* types and names
* _zero_kernel
* grad multi
* hack
* none
* multi unshard
* more for call
* don't tag in call
* good
* call_multi
* call_multi wow claude is useless
* embedding backward mutli test
(typo in the commit bullet above: "mutli" should read "multi")
* test passes
* fix as_param
* shape_to_shape_arg
* add clip
* before cast
* fix spec=2, use atomics
2026-01-30 18:10:59 +08:00
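The entry above moves embedding backward to atomic adds. The core problem is that duplicate token indices must accumulate into the same gradient row, and a plain scatter drops those duplicates. A minimal numpy sketch of that failure mode (numpy stands in for tinygrad here; `np.add.at` plays the role of the on-device ATOMIC_ADD):

```python
import numpy as np

# Hypothetical shapes: vocab of 4, embedding dim of 3, a batch of 5 token ids.
vocab_size, dim = 4, 3
idx = np.array([1, 2, 1, 0, 1])            # note index 1 appears three times
grad_out = np.ones((len(idx), dim))        # upstream gradient, one row per token

# Naive scatter: with duplicate indices, numpy buffers the writes and only the
# last one survives -- the same lost-update race atomics prevent on a GPU.
bad = np.zeros((vocab_size, dim))
bad[idx] += grad_out
print(bad[1])    # -> [1. 1. 1.], wrong: should have accumulated 3 contributions

# np.add.at performs an unbuffered, accumulating scatter-add: the serial
# equivalent of doing each += with an atomic add.
grad_w = np.zeros((vocab_size, dim))
np.add.at(grad_w, idx, grad_out)
print(grad_w[1])  # -> [3. 3. 3.]
```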
nimlgen
1998e0bb28
nv: add prof props to dev ( #14437 )
2026-01-30 12:51:43 +03:00
George Hotz
7a9dee4e50
add call/param UOps ( #14433 )
...
* add call/param UOps
* resolve call
* skip that for now
* grad on call
* fix tests
2026-01-30 14:51:45 +08:00
qazal
66d6a68016
viz: sqtt work from cdna gemm ( #14434 )
...
* it's the tag
* initialize rows based on the disasm
* test_cfg with Ops.BINARY
* pyremu wants s_code_end?
* test_diamond
* diff cleanup
2026-01-30 14:00:56 +09:00
Christopher Milan
88caf57ef4
ci: unify python versions ( #14430 )
2026-01-29 21:42:03 -05:00
chenyu
86a204d22a
allow Tensor setitem input to be list/tuple ( #14432 )
...
matches assign, and generally matches numpy
2026-01-29 21:26:58 -05:00
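The commit body above says the new `Tensor` setitem behavior "matches assign, and generally matches numpy". For reference, the numpy behavior being matched (numpy shown here, not tinygrad itself):

```python
import numpy as np

t = np.zeros((2, 3))
t[0] = [1, 2, 3]   # a plain list on the RHS is converted like an array
t[1] = (4, 5, 6)   # tuples behave the same way
print(t)           # -> [[1. 2. 3.] [4. 5. 6.]]
```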
chenyu
4a80319093
clean up split_store final logic [pr] ( #14429 )
...
explicitly check the structure
2026-01-29 18:40:07 -05:00
Christopher Milan
e47f12f671
ci: replace testing_minimal with testing_unit ( #14427 )
2026-01-29 18:02:43 -05:00
wozeparrot
c2fb8b208f
fa: 32 block size ( #14416 )
2026-01-29 13:59:13 -08:00
chenyu
a979fafae5
cleanup around disk buffer [pr] ( #14428 )
...
style change, prep for refactor
2026-01-29 16:18:44 -05:00
nimlgen
dc977a03b0
nv_pma: bw decoder ( #14424 )
...
* nv_pma: bw decoder
* decoder fix
* better
2026-01-30 00:12:39 +03:00
chenyu
ddc041854b
failed test case for disk setitem ( #14426 )
...
strided setitem is wrong
2026-01-29 14:54:19 -05:00
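The failing test above pins down strided setitem semantics for disk buffers. The expected behavior is what numpy does for an in-memory array: a stride-N slice assignment writes only every Nth element, leaving the gaps untouched. A small numpy illustration of those semantics (numpy as the reference; the commit reports the disk device gets this wrong):

```python
import numpy as np

buf = np.zeros(8, dtype=np.int32)
buf[::2] = np.array([1, 2, 3, 4], dtype=np.int32)  # stride-2 setitem
print(buf)  # -> [1 0 2 0 3 0 4 0], only even offsets are written
```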
chenyu
31706bf6bc
add few more types [pr] ( #14425 )
2026-01-29 14:04:09 -05:00
nimlgen
2d5c24879f
nv: pma for 5090 ( #14420 )
...
* nv: pma for 5090
* hm
* 4090
2026-01-29 20:06:01 +03:00
nimlgen
c8dc6332d2
memory: read_fields is not universal ( #14348 )
2026-01-29 20:00:00 +03:00
chenyu
dbe8f034a7
pass z3.Context in validate ctx [pr] ( #14423 )
...
does not need to pass the whole solver
2026-01-29 11:11:47 -05:00
chenyu
033ce1b885
types for validate.py ( #14422 )
2026-01-29 10:56:50 -05:00
nimlgen
230d08ec70
test for am recovery and faults handling ( #14421 )
...
* test for am recovery and faults handling
* linter
2026-01-29 17:11:24 +03:00
George Hotz
793afbd473
simplify nn.Embedding, support AFTER in CUSTOM_KERNEL ( #14419 )
2026-01-29 17:22:13 +08:00
Christopher Milan
0c855d6149
ci: remove unused pydeps ( #14418 )
2026-01-29 01:51:26 -05:00
wozeparrot
4845e42135
llama3 gradacc fixes ( #14414 )
2026-01-28 19:12:39 -08:00
chenyu
37cde4a01a
add one line mypy report ( #14415 )
2026-01-28 20:39:32 -05:00
chenyu
15aed51544
return types for all math.py function ( #14413 )
...
calling int() on sint -> int; I think it's better supported, since some UOps can be safely cast to int
2026-01-28 20:10:11 -05:00
nimlgen
aec1ae0de1
llama: set manual_seed ( #14409 )
2026-01-28 14:40:00 -08:00
chenyu
0870ed28b1
add Self type to MathMixin ( #14411 )
...
these don't cause errors
2026-01-28 16:59:38 -05:00
chenyu
079f33c208
fix type in Tensor.mean and Tensor.var ( #14410 )
...
use Tensor.from_uop to wrap UOp from symbolic shape, kernels are the same
2026-01-28 15:24:02 -05:00
chenyu
2b5e99ccc1
minor type cleanups [pr] ( #14408 )
...
mypy --warn-redundant-casts has false negative
2026-01-28 14:11:50 -05:00
chenyu
726415dbc8
import sint directly in movement.py TYPE_CHECKING ( #14406 )
...
avoid creating string TypeAlias, fixed warning in `TYPED=1 python test/test_tiny.py`
2026-01-28 12:47:26 -05:00
nimlgen
acb2fc36ba
nv_pma: add decoder ( #14404 )
...
* nv_pma: add decoder
* cl
2026-01-28 20:44:02 +03:00