mirror of https://github.com/ROCm/ROCm.git (synced 2026-02-21 03:00:39 -05:00)
* Add waves_per_eu in the tuning space
* Do not allocate tensor on device during kernel compilation step
* Add breakdown elapsed time
* Parallelize the post-processing step
* Parallelize the profile step with --ngpus
* Better timing info printout