Mirror of https://github.com/pseXperiments/icicle.git (synced 2026-01-08 20:48:06 -05:00)
## Describe the changes

Add a non-blocking `warmup` function to `CudaStream`.

> When you run the benchmark (e.g. the MSM example you have), the first instance is always slow, with a constant overhead of 200~300 ms for CUDA stream warmup. I want to get rid of that in my application by warming the stream up in parallel while my host does something else.
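The quoted motivation describes a general pattern: start a one-time, slow initialization early, let the host do unrelated work while it runs, and wait for it only before the first real use. The sketch below illustrates that overlap in plain Rust. It does not call CUDA or icicle, and it is not the `CudaStream::warmup` implementation from this change. `FakeStream`, `warmup_blocking`, `warmup_in_background`, and `host_work` are hypothetical names, and a short `thread::sleep` stands in for the real warmup cost.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for a device stream (NOT icicle's `CudaStream`).
/// It counts how many times the warmup ran.
struct FakeStream {
    warmups: AtomicUsize,
}

impl FakeStream {
    fn new() -> Self {
        FakeStream { warmups: AtomicUsize::new(0) }
    }

    /// Simulated slow first-use cost. A real implementation would issue
    /// its first GPU runtime calls here; we just sleep for 50 ms (assumption).
    fn warmup_blocking(&self) {
        thread::sleep(Duration::from_millis(50));
        self.warmups.fetch_add(1, Ordering::SeqCst);
    }
}

/// Start the warmup on a background thread and return immediately,
/// so the caller is not blocked by the one-time cost.
fn warmup_in_background(stream: Arc<FakeStream>) -> JoinHandle<()> {
    thread::spawn(move || stream.warmup_blocking())
}

/// Placeholder host-side work that overlaps with the warmup.
fn host_work() -> u64 {
    (1..=1_000u64).sum()
}

fn main() {
    let stream = Arc::new(FakeStream::new());

    // Kick off the warmup without waiting for it.
    let handle = warmup_in_background(Arc::clone(&stream));

    // Host work runs concurrently with the warmup.
    let result = host_work();

    // Wait for the warmup to finish before the first real use of the stream.
    handle.join().expect("warmup thread panicked");

    println!(
        "host result = {result}, warmups = {}",
        stream.warmups.load(Ordering::SeqCst)
    );
}
```

Running it prints `host result = 500500, warmups = 1`. The explicit `join` marks the point where the program must be sure the warmup has completed; with a real GPU stream, work enqueued on the same stream executes in issue order, so an asynchronously enqueued warmup may not need a host-side wait at all.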