fixed ring allreduce pattern and recovered most of the BERT step time regression (10% faster); will double-check all benchmarks