* add llama attention test for multigpu
* test fails
* kv cache trying to shrink on sharded axis
* mask None works for scale dot product
* kv cache seems to be working but scale dot product breaks
* scaled dot product works, but the last linear layer failed
* running into the reshape case where it could be wrong for multigpu
* making sure it was the reshape
* adding contiguous doesn't solve
* need to shard more properly
* remove reshape test
* minor adjustment to scale dot product attention test
* weights are sharded wrong
* continue fix new weight sharding
* clean up
* fix attention when start_pos is 0
* remove print
* add TODOs for the best mutigpu interface