* gemm
* off by factor of 5
* 50 GFLOPS
* works
* 91 GFLOPS
* working at 50 GFLOPS
* works
* iy
* 150 GFLOPS
* 150 GFLOPS
* N=2048 is still fast
* threading soon
* multithread
* pinning
* throttling is sad
* Align matrices to cacheline width (#361)
Co-authored-by: cloud <Cloud11665@gmail.com>