* wmma: refactor to remove wmma_func and create TC funcs as needed
* test_linearizer: disable bf16 CUDA during emulation testing
* cstyle: clean up creation of CUDA vec dtypes
* extra/gemm: add option to accumulate to bfloat16
* cleanups
* benchmark: add CUDA bfloat16 matmul
* more cleanups
* extra/gemm: add a simple_conv.py along with correctness check
The goal is to easily test tensor core triggering situations
* test: add tests for acc_dtype handling and fixed typing