mirror of
https://github.com/Significant-Gravitas/AutoGPT.git
synced 2026-04-30 03:00:41 -04:00
fix(direct_benchmark): don't mark timed-out challenges as passed
Previously, the evaluator ran on all results, including timed-out challenges. If the agent happened to write a working solution before timing out, evaluation would pass and override the result with success=True, producing contradictory output that showed both PASS and "timed out". Now evaluation is skipped for timed-out challenges: they cannot pass. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
@@ -17,6 +17,13 @@ class Evaluator:
         self, result: ChallengeResult, challenge: Challenge
     ) -> ChallengeResult:
         """Evaluate a challenge result and update success/score."""
+        # If the challenge timed out or had an error, don't override with evaluation
+        # A timed-out challenge cannot be considered a pass
+        if result.timed_out:
+            result.success = False
+            result.score = 0.0
+            return result
+
         ground = challenge.ground_truth

         if not ground:
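The guard added above can be sketched as a minimal, self-contained version of the evaluator. The field and method names (`ChallengeResult`, `timed_out`, `success`, `score`, `ground_truth`, `evaluate`) come from the diff; the dataclass shapes and the post-guard logic are assumptions for illustration, not the real implementation:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Challenge:
    # Assumed shape: the real Challenge class has more fields.
    ground_truth: Optional[dict] = None


@dataclass
class ChallengeResult:
    success: bool = False
    score: float = 0.0
    timed_out: bool = False


class Evaluator:
    def evaluate(
        self, result: ChallengeResult, challenge: Challenge
    ) -> ChallengeResult:
        """Evaluate a challenge result and update success/score."""
        # If the challenge timed out, don't override with evaluation:
        # a timed-out challenge cannot be considered a pass.
        if result.timed_out:
            result.success = False
            result.score = 0.0
            return result

        ground = challenge.ground_truth
        if not ground:
            # No ground truth to check against; leave the result as-is.
            return result

        # Placeholder for the real ground-truth matching logic.
        return result
```

The key property is that the timeout check runs before any ground-truth matching, so a solution written just before the timeout can no longer flip `success` back to True.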