For SZ = 32 31 vector registers: v0-v31 * 1024 element (19-bit TF32) wide * 32x32 "matrix" # Support masked execution of everything with v0 # Helper instructions to load sz into mask? Movement: mov v1, v2 mov v1, x0 # load 0 Vector Engine (1024 ALUs): vfadd.vv vd, vs2, vs1, vm add sub mul div pow mac add v3, v1, v2 # vector/vector add add v3, v1, t0 # vector/scalar add Matrix Engine: matmul v3, v1, v2 Load/Store Engine (R-type instructions): load v1, t0(
), t1() store t0(
), v1, t1() Permute Engine: # interleave - shuffle - unpack lo/hi - table interleave transpose v1, v0