retinanet eval combine output on GPUS[0] (#9966)

eval 35 sec -> 20 sec. The eval loop was spending 13 seconds assembling the sharded output tensor on the CPU backend; reassembling it on GPUS[0] first avoids that. GPUS[0] seems to have enough memory; otherwise we can lower EVAL_BS.
This commit is contained in:
chenyu
2025-04-22 07:43:51 -04:00
committed by GitHub
parent 7b55846e08
commit fb89d9a584


@@ -410,7 +410,8 @@ def train_retinanet():
   @TinyJit
   def _eval_step(model, x, **kwargs):
     out = model(normalize(x, GPUS), **kwargs)
-    return out.realize()
+    # reassemble on GPUS[0] before sending back to CPU for speed
+    return out.to(GPUS[0]).realize()
   # ** hyperparameters **
   config["seed"] = SEED = getenv("SEED", random.SystemRandom().randint(0, 2**32 - 1))
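The win here is reducing the number of device-to-host copies: instead of the CPU backend pulling each GPU's output shard across the bus and assembling them host-side, the shards are concatenated on GPUS[0] and crossed over in one transfer. A minimal sketch of that trade-off, using NumPy with a hypothetical `to_host` counter as a stand-in for the bus crossing (none of these names are tinygrad API):

```python
import numpy as np

TRANSFERS = {"count": 0}

def to_host(shard: np.ndarray) -> np.ndarray:
  """Hypothetical stand-in for a device-to-host copy; counts bus crossings."""
  TRANSFERS["count"] += 1
  return shard.copy()

# Illustrative per-GPU output shards (eval output is sharded across GPUS)
shards = [np.full((2, 3), i, dtype=np.float32) for i in range(8)]

# Before: the CPU backend assembles the output, one transfer per shard
TRANSFERS["count"] = 0
before = np.concatenate([to_host(s) for s in shards])
n_before = TRANSFERS["count"]  # one crossing per GPU

# After: concatenate on GPUS[0] (modelled as a device-side concat),
# then a single transfer of the already-assembled tensor
TRANSFERS["count"] = 0
assembled_on_gpu0 = np.concatenate(shards)
after = to_host(assembled_on_gpu0)
n_after = TRANSFERS["count"]   # a single crossing

assert np.array_equal(before, after)
assert (n_before, n_after) == (8, 1)
```

Both paths produce the same tensor; only the number of host transfers changes, which is where the 13 seconds went.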