fix(direct_benchmark): don't mark timed-out challenges as passed

Previously, the evaluator ran on all results, including those from
timed-out challenges. If the agent happened to write a working solution
before timing out, evaluation would pass and set success=True, overriding
the timeout and producing contradictory output that showed both PASS and
"timed out".

Now we skip evaluation for timed-out challenges and mark them as failed
(success=False, score=0.0) - they cannot pass.
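
A minimal sketch of the guard described above, for illustration only. The
ChallengeResult class here is a hypothetical stand-in with just the three
fields this commit touches, and the solution_works flag is an assumption
that replaces the real ground-truth check, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class ChallengeResult:
    # Hypothetical stand-in: only the fields this commit touches.
    timed_out: bool = False
    success: bool = False
    score: float = 0.0


def evaluate(result: ChallengeResult, solution_works: bool) -> ChallengeResult:
    """Toy evaluator; `solution_works` stands in for the real ground-truth check."""
    if result.timed_out:
        # A timed-out run is a failure even if a working solution was written.
        result.success = False
        result.score = 0.0
        return result
    # Only runs that finished in time are actually evaluated.
    result.success = solution_works
    result.score = 1.0 if solution_works else 0.0
    return result


# Same working solution in both cases; only the timeout flag differs.
timed_out = evaluate(ChallengeResult(timed_out=True), solution_works=True)
finished = evaluate(ChallengeResult(timed_out=False), solution_works=True)
```

With the early return in place, the timed-out result keeps success=False
even though the solution would have passed evaluation, so PASS and
"timed out" can no longer be reported together.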

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Nicholas Tindle
2026-01-20 00:25:41 -06:00
parent f07dff1cdd
commit 0e65785228

@@ -17,6 +17,13 @@ class Evaluator:
         self, result: ChallengeResult, challenge: Challenge
     ) -> ChallengeResult:
         """Evaluate a challenge result and update success/score."""
+        # If the challenge timed out or had an error, don't override with evaluation
+        # A timed-out challenge cannot be considered a pass
+        if result.timed_out:
+            result.success = False
+            result.score = 0.0
+            return result
         ground = challenge.ground_truth
         if not ground: