Direct Benchmark Harness

A high-performance benchmark harness for AutoGPT that instantiates agents directly, without HTTP server overhead, enabling parallel execution of multiple strategy/model configurations.

Features

  • Direct Agent Instantiation: No HTTP server, no Agent Protocol overhead
  • Parallel Execution: Run multiple strategy/model combinations concurrently
  • Multiple Attempts: Run each challenge multiple times for statistical reliability
  • Rich UI: Live progress display with Rich library
  • Multiple Output Modes: Default (rich), quiet, verbose, JSON for CI
  • Full CLI Compatibility: All flags from the original agbenchmark supported

Installation

cd classic/direct_benchmark
poetry install

Usage

# Run benchmarks with default settings
poetry run python -m direct_benchmark run

# Run specific strategies and models
poetry run python -m direct_benchmark run \
    --strategies one_shot,rewoo \
    --models claude,openai \
    --parallel 4

# Run a single test
poetry run python -m direct_benchmark run \
    --strategies one_shot \
    --tests ReadFile

# Run multiple attempts per challenge
poetry run python -m direct_benchmark run \
    --strategies one_shot \
    --attempts 3

# Run only regression tests (previously beaten)
poetry run python -m direct_benchmark run --maintain

# Run only non-regression tests (not consistently beaten)
poetry run python -m direct_benchmark run --improve

# Run only never-beaten challenges
poetry run python -m direct_benchmark run --explore

# List available challenges
poetry run python -m direct_benchmark list-challenges

# List model presets
poetry run python -m direct_benchmark list-models

# List strategies
poetry run python -m direct_benchmark list-strategies

CLI Options

Challenge Selection

  • --strategies, -s: Comma-separated strategies (one_shot, rewoo, plan_execute, reflexion, tree_of_thoughts)
  • --models, -m: Comma-separated model presets (claude, openai, etc.)
  • --categories, -c: Filter by challenge categories
  • --skip-category, -S: Exclude categories
  • --tests, -t: Filter by test names

Execution Control

  • --attempts, -N: Number of times to run each challenge
  • --parallel, -p: Maximum parallel runs (default: 4)
  • --timeout: Per-challenge timeout in seconds (default: 300)
  • --cutoff: Alias for --timeout
  • --no-cutoff, --nc: Disable time limit
  • --max-steps: Maximum steps per challenge (default: 50)

Challenge Filtering Modes

  • --maintain: Run only regression tests (previously beaten consistently)
  • --improve: Run only non-regression tests (not consistently beaten)
  • --explore: Run only challenges that have never been beaten
  • --no-dep: Run all challenges regardless of dependency success/failure

Output & Debug

  • --quiet, -q: Minimal output
  • --verbose, -v: Detailed per-challenge output
  • --json: JSON output for CI/scripting (see the gate-script sketch after this list)
  • --debug: Enable debug output
  • --keep-answers: Keep answer files for debugging
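
For CI pipelines, the --json output can be piped into a small gate script that fails the build when too few challenges pass. A minimal Python sketch, assuming the payload carries a results list whose entries expose a boolean success field (these field names are assumptions, not the documented schema):

# ci_gate.py - hypothetical gate script for `direct_benchmark run --json`
import json
import sys

def main(min_pass_rate: float = 0.8) -> int:
    payload = json.load(sys.stdin)            # --json output piped in on stdin
    results = payload.get("results", [])      # assumed field name
    if not results:
        print("no results found in benchmark output", file=sys.stderr)
        return 1
    passed = sum(1 for r in results if r.get("success"))  # assumed field name
    rate = passed / len(results)
    print(f"passed {passed}/{len(results)} challenges ({rate:.0%})")
    return 0 if rate >= min_pass_rate else 1

if __name__ == "__main__":
    sys.exit(main())

Example invocation: poetry run python -m direct_benchmark run --json | python ci_gate.py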

Paths

  • --workspace: Workspace root directory
  • --challenges-dir: Path to challenges directory
  • --reports-dir: Path to reports directory

Available Strategies

Strategy           Description
one_shot           Single-pass reasoning (default, most reliable)
rewoo              Reasoning without observation (ReWOO)
plan_execute       Plan, then execute
reflexion          Self-reflection loop
tree_of_thoughts   Multiple reasoning paths

Available Model Presets

Claude

  • claude: sonnet-4 smart, haiku fast (default)
  • claude-smart: sonnet-4 for both
  • claude-fast: haiku for both
  • claude-opus: opus smart, sonnet fast
  • claude-opus-only: opus for both

Claude with Extended Thinking

  • claude-thinking-10k: 10k thinking tokens
  • claude-thinking-25k: 25k thinking tokens
  • claude-thinking-50k: 50k thinking tokens
  • claude-opus-thinking: opus with 25k thinking
  • claude-opus-thinking-50k: opus with 50k thinking

OpenAI

  • openai: gpt-4o smart, gpt-4o-mini fast
  • openai-smart: gpt-4o for both
  • openai-fast: gpt-4o-mini for both
  • gpt5: gpt-5 smart, gpt-4o fast
  • gpt5-only: gpt-5 for both

OpenAI Reasoning Models

  • o1, o1-mini: o1 variants
  • o1-low, o1-medium, o1-high: o1 with reasoning effort
  • o3-low, o3-medium, o3-high: o3 with reasoning effort

Reports

Reports are generated under ./reports/ with the following layout:

reports/
├── {timestamp}_{strategy}_{model}/
│   └── report.json
└── strategy_comparison_{timestamp}.json
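
The per-run report.json files can be aggregated for a quick side-by-side view of runs. A minimal Python sketch, assuming nothing about the report contents beyond the directory layout shown above (the "metrics" key below is an assumption, not the actual schema):

from pathlib import Path
import json

def summarize(reports_dir: str = "reports") -> None:
    """Print one line per {timestamp}_{strategy}_{model} run directory."""
    for report_path in sorted(Path(reports_dir).glob("*/report.json")):
        run_name = report_path.parent.name        # e.g. {timestamp}_{strategy}_{model}
        data = json.loads(report_path.read_text())
        metrics = data.get("metrics", data)       # "metrics" key is an assumption
        print(f"{run_name}: {metrics}")

if __name__ == "__main__":
    summarize()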

Key Differences from agbenchmark

agbenchmark                         direct_benchmark
subprocess.Popen + HTTP server      Direct create_agent()
HTTP/REST via Agent Protocol        Direct propose_action()/execute()
Sequential (one config at a time)   Parallel via asyncio semaphore
Port-based isolation                Workspace-based isolation
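
In outline, the harness replaces HTTP round-trips with an in-process agent loop per challenge and bounds concurrency with an asyncio semaphore. The sketch below illustrates that shape; the propose_action()/execute() names mirror the table above, but the signatures, the agent factory, and the dummy agent are illustrative assumptions rather than the harness's actual API.

import asyncio

async def run_challenge(agent_factory, challenge, semaphore, max_steps=50):
    """Run one challenge with one strategy/model config, entirely in process."""
    async with semaphore:                         # bounds concurrent runs (--parallel)
        agent = agent_factory(challenge)          # stands in for direct create_agent()
        for _ in range(max_steps):                # --max-steps guard
            action = await agent.propose_action() # direct call, no Agent Protocol
            done = await agent.execute(action)
            if done:
                return True
        return False

async def run_all(agent_factories, challenges, parallel=4):
    """Fan out every (config, challenge) pair under one shared semaphore."""
    semaphore = asyncio.Semaphore(parallel)
    jobs = [run_challenge(f, c, semaphore) for f in agent_factories for c in challenges]
    return await asyncio.gather(*jobs)

class _DummyAgent:
    """Placeholder agent so the sketch runs standalone; illustrative only."""
    async def propose_action(self):
        return "noop"

    async def execute(self, action):
        return True

if __name__ == "__main__":
    factories = [lambda challenge: _DummyAgent()] * 2   # e.g. two configs
    outcomes = asyncio.run(run_all(factories, ["ReadFile", "TicTacToe"], parallel=4))
    print(outcomes)

Bounding concurrency with a semaphore rather than spawning per-config server processes keeps isolation at the workspace level, which is the trade-off summarized in the last row of the table.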