* cleanup: llama `apply_rotary_emb` and other helpers now use Ellipsis indexing and other higher-level tensor functions
* disabled the half @ half -> half tensor core, since it fails uop dtype checks
* keep the HIP 8x8->8 WMMA
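
For context, the Ellipsis-indexing style the cleanup refers to can be sketched like this. This is a minimal NumPy sketch of a rotary-embedding helper, not tinygrad's actual `apply_rotary_emb`; the function name, argument shapes, and pairing convention here are assumptions for illustration.

```python
import numpy as np

def apply_rotary_emb(x, freqs):
    # Hypothetical sketch: pair up the last dimension and rotate each
    # (even, odd) pair by the precomputed angles. `...` (Ellipsis)
    # indexes without naming the leading batch/head dimensions.
    x2 = x.reshape(*x.shape[:-1], -1, 2)      # (..., d/2, 2)
    a, b = x2[..., 0], x2[..., 1]             # Ellipsis indexing
    cos, sin = np.cos(freqs), np.sin(freqs)   # (d/2,) broadcasts over leading dims
    out = np.stack([a * cos - b * sin, a * sin + b * cos], axis=-1)
    return out.reshape(x.shape)
```

Since each pair is just rotated, the transform preserves vector norms, which makes it easy to sanity-check.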