* extra/gemm/max_matmul: start of custom kernels for GEMM
* add an unoptimized FP16/FP16 MMA example
* add slow 3-stage fp16 acc example
* add correct 3-stage pipeline with unswizzled/flat smem input (slow)
* add acc fp16 example with 3 stages and swizzle (no bank conflicts)
* add max version of NV fp16_fp16_fp16
* fix up comments and removed unused code in max variations
* add start of no_xor example
* fix to account for UOps to Ops