* check SPEC=2 in CI
* split SPEC=2
* fast enough

* add some docs about speed [pr]
* better torch gemm
* enable locals on llvm/clang
* disable locals for beam speed on LLVM/CLANG
* 0x20 alignment in llvm allows ymm use

* fast asm
* torch gemm
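
A note on the "0x20 alignment in llvm allows ymm use" item: a ymm register is 256 bits wide, which is 32 (0x20) bytes, and aligned AVX loads and stores require addresses that are multiples of 32. The sketch below only illustrates that arithmetic. `YMM_BYTES`, `align_up`, and `is_aligned` are hypothetical helper names for illustration, not functions from this repository, and how the change itself applies the alignment is not shown here.

```python
# Illustrative only: names below are hypothetical, not part of the codebase.
YMM_BYTES = 0x20  # one 256-bit ymm register holds 32 bytes


def align_up(n: int, alignment: int = YMM_BYTES) -> int:
    """Round n up to the next multiple of alignment (must be a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # Adding alignment-1 then clearing the low bits rounds up to a multiple.
    return (n + alignment - 1) & ~(alignment - 1)


def is_aligned(addr: int, alignment: int = YMM_BYTES) -> bool:
    """Return True if addr is a multiple of alignment."""
    return addr % alignment == 0


if __name__ == "__main__":
    for size in (0, 1, 32, 33, 100):
        print(size, "->", align_up(size))
    print(hex(0x1000), is_aligned(0x1000))
    print(hex(0x1010), is_aligned(0x1010))
```

A buffer whose base address and size are both multiples of 0x20 can be walked in full ymm-width steps without a misaligned head or tail, which is presumably what makes the wider registers usable here.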