This adds a new pass that converts `scf.forall` loops into nested
`scf.for` operations. The conversion carries parallel output tensors
from the original loop as dependencies through the loop nest and
replaces any occurrence of `tensor.parallel_insert_slice` operations
in the `scf.forall.in_parallel` terminator with equivalent
`tensor.insert_slice` operations.