Direct Benchmark Harness

A high-performance benchmark harness for AutoGPT that instantiates agents directly, without HTTP server overhead, enabling parallel execution of multiple strategy/model configurations.

Features

  • Direct Agent Instantiation: No HTTP server, no Agent Protocol overhead
  • Parallel Execution: Run multiple strategy/model combinations concurrently
  • Multiple Attempts: Run each challenge multiple times for statistical reliability
  • Rich UI: Live progress display with Rich library
  • Multiple Output Modes: Default (rich), quiet, verbose, JSON for CI
  • Full CLI Compatibility: All flags from the original agbenchmark supported

Installation

cd classic/direct_benchmark
poetry install

Usage

# Run benchmarks with default settings
poetry run python -m direct_benchmark run

# Run specific strategies and models
poetry run python -m direct_benchmark run \
    --strategies one_shot,rewoo \
    --models claude,openai \
    --parallel 4

# Run a single test
poetry run python -m direct_benchmark run \
    --strategies one_shot \
    --tests ReadFile

# Run multiple attempts per challenge
poetry run python -m direct_benchmark run \
    --strategies one_shot \
    --attempts 3

# Run only regression tests (previously beaten)
poetry run python -m direct_benchmark run --maintain

# Run only non-regression tests (not consistently beaten)
poetry run python -m direct_benchmark run --improve

# Run only never-beaten challenges
poetry run python -m direct_benchmark run --explore

# List available challenges
poetry run python -m direct_benchmark list-challenges

# List model presets
poetry run python -m direct_benchmark list-models

# List strategies
poetry run python -m direct_benchmark list-strategies

CLI Options

Challenge Selection

  • --strategies, -s: Comma-separated strategies (one_shot, rewoo, plan_execute, reflexion, tree_of_thoughts)
  • --models, -m: Comma-separated model presets (claude, openai, etc.)
  • --categories, -c: Filter by challenge categories
  • --skip-category, -S: Exclude categories
  • --tests, -t: Filter by test names

Execution Control

  • --attempts, -N: Number of times to run each challenge
  • --parallel, -p: Maximum parallel runs (default: 4)
  • --timeout: Per-challenge timeout in seconds (default: 300)
  • --cutoff: Alias for --timeout
  • --no-cutoff, --nc: Disable time limit
  • --max-steps: Maximum steps per challenge (default: 50)

Challenge Filtering Modes

  • --maintain: Run only regression tests (previously beaten consistently)
  • --improve: Run only non-regression tests (not consistently beaten)
  • --explore: Run only challenges that have never been beaten
  • --no-dep: Run all challenges regardless of dependency success/failure

Output & Debug

  • --quiet, -q: Minimal output
  • --verbose, -v: Detailed per-challenge output
  • --json: JSON output for CI/scripting (see the gate-script sketch after this list)
  • --debug: Enable debug output
  • --keep-answers: Keep answer files for debugging
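
For CI pipelines, the --json output can be piped into a small gate script that fails the build when too few challenges pass. A minimal Python sketch, assuming the payload carries a results list whose entries expose a boolean success field (these field names are assumptions, not the documented schema):

# ci_gate.py - hypothetical gate script for `direct_benchmark run --json`
import json
import sys

def main(min_pass_rate: float = 0.8) -> int:
    payload = json.load(sys.stdin)            # --json output piped in on stdin
    results = payload.get("results", [])      # assumed field name
    if not results:
        print("no results found in benchmark output", file=sys.stderr)
        return 1
    passed = sum(1 for r in results if r.get("success"))  # assumed field name
    rate = passed / len(results)
    print(f"passed {passed}/{len(results)} challenges ({rate:.0%})")
    return 0 if rate >= min_pass_rate else 1

if __name__ == "__main__":
    sys.exit(main())

Example invocation: poetry run python -m direct_benchmark run --json | python ci_gate.py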

Paths

  • --workspace: Workspace root directory
  • --challenges-dir: Path to challenges directory
  • --reports-dir: Path to reports directory

Available Strategies

Strategy           Description
one_shot           Single-pass reasoning (default, most reliable)
rewoo              Reasoning without observation (ReWOO)
plan_execute       Plan, then execute
reflexion          Self-reflection loop
tree_of_thoughts   Multiple reasoning paths

Available Model Presets

Claude

  • claude: sonnet-4 smart, haiku fast (default)
  • claude-smart: sonnet-4 for both
  • claude-fast: haiku for both
  • claude-opus: opus smart, sonnet fast
  • claude-opus-only: opus for both

Claude with Extended Thinking

  • claude-thinking-10k: 10k thinking tokens
  • claude-thinking-25k: 25k thinking tokens
  • claude-thinking-50k: 50k thinking tokens
  • claude-opus-thinking: opus with 25k thinking
  • claude-opus-thinking-50k: opus with 50k thinking

OpenAI

  • openai: gpt-4o smart, gpt-4o-mini fast
  • openai-smart: gpt-4o for both
  • openai-fast: gpt-4o-mini for both
  • gpt5: gpt-5 smart, gpt-4o fast
  • gpt5-only: gpt-5 for both

OpenAI Reasoning Models

  • o1, o1-mini: o1 variants
  • o1-low, o1-medium, o1-high: o1 with reasoning effort
  • o3-low, o3-medium, o3-high: o3 with reasoning effort

Reports

Reports are generated under ./reports/ with the following layout:

reports/
├── {timestamp}_{strategy}_{model}/
│   └── report.json
└── strategy_comparison_{timestamp}.json
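
The per-run report.json files can be aggregated for a quick side-by-side view of runs. A minimal Python sketch, assuming nothing about the report contents beyond the directory layout shown above (the "metrics" key below is an assumption, not the actual schema):

from pathlib import Path
import json

def summarize(reports_dir: str = "reports") -> None:
    """Print one line per {timestamp}_{strategy}_{model} run directory."""
    for report_path in sorted(Path(reports_dir).glob("*/report.json")):
        run_name = report_path.parent.name        # e.g. {timestamp}_{strategy}_{model}
        data = json.loads(report_path.read_text())
        metrics = data.get("metrics", data)       # "metrics" key is an assumption
        print(f"{run_name}: {metrics}")

if __name__ == "__main__":
    summarize()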

Key Differences from agbenchmark

agbenchmark                         direct_benchmark
subprocess.Popen + HTTP server      Direct create_agent()
HTTP/REST via Agent Protocol        Direct propose_action()/execute()
Sequential (one config at a time)   Parallel via asyncio semaphore
Port-based isolation                Workspace-based isolation
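
In outline, the harness replaces HTTP round-trips with an in-process agent loop per challenge and bounds concurrency with an asyncio semaphore. The sketch below illustrates that shape; the propose_action()/execute() names mirror the table above, but the signatures, the agent factory, and the dummy agent are illustrative assumptions rather than the harness's actual API.

import asyncio

async def run_challenge(agent_factory, challenge, semaphore, max_steps=50):
    """Run one challenge with one strategy/model config, entirely in process."""
    async with semaphore:                         # bounds concurrent runs (--parallel)
        agent = agent_factory(challenge)          # stands in for direct create_agent()
        for _ in range(max_steps):                # --max-steps guard
            action = await agent.propose_action() # direct call, no Agent Protocol
            done = await agent.execute(action)
            if done:
                return True
        return False

async def run_all(agent_factories, challenges, parallel=4):
    """Fan out every (config, challenge) pair under one shared semaphore."""
    semaphore = asyncio.Semaphore(parallel)
    jobs = [run_challenge(f, c, semaphore) for f in agent_factories for c in challenges]
    return await asyncio.gather(*jobs)

class _DummyAgent:
    """Placeholder agent so the sketch runs standalone; illustrative only."""
    async def propose_action(self):
        return "noop"

    async def execute(self, action):
        return True

if __name__ == "__main__":
    factories = [lambda challenge: _DummyAgent()] * 2   # e.g. two configs
    outcomes = asyncio.run(run_all(factories, ["ReadFile", "TicTacToe"], parallel=4))
    print(outcomes)

Bounding concurrency with a semaphore rather than spawning per-config server processes keeps isolation at the workspace level, which is the trade-off summarized in the last row of the table.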