fix(direct_benchmark): don't mark timed-out challenges as passed

Previously, the evaluator ran on every result, including timed-out challenges. If the agent happened to write a working solution before timing out, evaluation would pass and set success=True, overriding the timeout and producing contradictory output that showed both PASS and "timed out".

Now evaluation is skipped for timed-out challenges: they cannot pass.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-04-30 03:00:41 -04:00 · 2026-01-20 00:25:41 -06:00
parent f07dff1cdd
commit 0e65785228
1 changed file with 7 additions and 0 deletions
--- a/classic/direct_benchmark/direct_benchmark/evaluator.py
+++ b/classic/direct_benchmark/direct_benchmark/evaluator.py
@@ -17,6 +17,13 @@ class Evaluator:
        self, result: ChallengeResult, challenge: Challenge
    ) -> ChallengeResult:
        """Evaluate a challenge result and update success/score."""
+        # Skip evaluation for timed-out challenges so it cannot override the result.
+        # A timed-out challenge cannot be considered a pass.
+        if result.timed_out:
+            result.success = False
+            result.score = 0.0
+            return result
+
        ground = challenge.ground_truth

        if not ground:
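
The effect of the guard added in this hunk can be illustrated with a minimal, self-contained sketch. Everything here except the `timed_out`, `success`, `score`, and `ground_truth` names is an assumption made for illustration: the dataclasses are hypothetical stand-ins for the real `ChallengeResult` and `Challenge`, the method name `evaluate` is a guess (the diff does not show it), and the substring check is only a placeholder for whatever the real evaluator does with the ground truth.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChallengeResult:
    # Hypothetical stand-in: only timed_out, success, and score appear in the diff.
    timed_out: bool = False
    success: bool = False
    score: float = 0.0
    output: str = ""  # assumed field holding whatever the agent produced


@dataclass
class Challenge:
    # Hypothetical stand-in: the real ground_truth is probably richer than a string.
    ground_truth: Optional[str] = None


class Evaluator:
    def evaluate(
        self, result: ChallengeResult, challenge: Challenge
    ) -> ChallengeResult:
        """Evaluate a challenge result and update success/score."""
        # Guard from the commit: a timed-out challenge never reaches evaluation.
        if result.timed_out:
            result.success = False
            result.score = 0.0
            return result

        # Placeholder evaluation (assumption): pass if the expected text appears.
        ground = challenge.ground_truth
        passed = ground is not None and ground in result.output
        result.success = passed
        result.score = 1.0 if passed else 0.0
        return result


challenge = Challenge(ground_truth="42")
evaluator = Evaluator()

# Same working output in both cases; only the timeout flag differs.
timed_out = evaluator.evaluate(
    ChallengeResult(timed_out=True, output="the answer is 42"), challenge
)
finished = evaluator.evaluate(ChallengeResult(output="the answer is 42"), challenge)

print(timed_out.success, timed_out.score)  # False 0.0
print(finished.success, finished.score)  # True 1.0
```

Both results carry an identical, correct output, so without the guard the timed-out run would also be marked as passed, which is the contradictory PASS plus "timed out" report the commit message describes.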