# Direct Benchmark Harness

High-performance benchmark harness for AutoGPT that directly instantiates agents without HTTP server overhead, enabling parallel execution of multiple configurations.
## Features

- **Direct Agent Instantiation**: No HTTP server, no Agent Protocol overhead
- **Parallel Execution**: Run multiple strategy/model combinations concurrently
- **Multiple Attempts**: Run each challenge multiple times for statistical reliability
- **Rich UI**: Live progress display with the Rich library
- **Multiple Output Modes**: Default (rich), quiet, verbose, and JSON for CI
- **Full CLI Compatibility**: All flags from the original agbenchmark are supported
## Installation

All commands run from the `classic/` directory (the parent of this directory):

```bash
cd classic
poetry install
```
## Usage

```bash
# Run benchmarks with default settings
poetry run direct-benchmark run

# Run specific strategies and models
poetry run direct-benchmark run \
  --strategies one_shot,rewoo \
  --models claude,openai \
  --parallel 4

# Run a single test
poetry run direct-benchmark run \
  --strategies one_shot \
  --tests ReadFile

# Run multiple attempts per challenge
poetry run direct-benchmark run \
  --strategies one_shot \
  --attempts 3

# Run only regression tests (previously beaten)
poetry run direct-benchmark run --maintain

# Run only non-regression tests (not consistently beaten)
poetry run direct-benchmark run --improve

# Run only never-beaten challenges
poetry run direct-benchmark run --explore

# List available challenges
poetry run direct-benchmark list-challenges

# List model presets
poetry run direct-benchmark list-models

# List strategies
poetry run direct-benchmark list-strategies
```
## CLI Options

### Challenge Selection

- `--strategies`, `-s`: Comma-separated strategies (`one_shot`, `rewoo`, `plan_execute`, `reflexion`, `tree_of_thoughts`)
- `--models`, `-m`: Comma-separated model presets (`claude`, `openai`, etc.)
- `--categories`, `-c`: Filter by challenge categories
- `--skip-category`, `-S`: Exclude categories
- `--tests`, `-t`: Filter by test names
### Execution Control

- `--attempts`, `-N`: Number of times to run each challenge
- `--parallel`, `-p`: Maximum parallel runs (default: 4)
- `--timeout`: Per-challenge timeout in seconds (default: 300); see the sketch after this list
- `--cutoff`: Alias for `--timeout`
- `--no-cutoff`, `--nc`: Disable the time limit
- `--max-steps`: Maximum steps per challenge (default: 50)
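
The per-challenge time limit maps naturally onto asyncio's timeout machinery. A minimal sketch of the pattern, not the harness's actual code; `slow_challenge` is a hypothetical stand-in for one agent attempt:

```python
import asyncio

async def slow_challenge() -> str:
    """Hypothetical stand-in for one long-running agent attempt."""
    await asyncio.sleep(10)
    return "done"

async def main() -> None:
    try:
        # Equivalent in spirit to a very short --timeout for one challenge
        print(await asyncio.wait_for(slow_challenge(), timeout=1.0))
    except asyncio.TimeoutError:
        # The attempt is cut off once the limit elapses
        print("challenge timed out")

asyncio.run(main())
```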
### Challenge Filtering Modes

- `--maintain`: Run only regression tests (previously beaten consistently)
- `--improve`: Run only non-regression tests (not consistently beaten)
- `--explore`: Run only challenges that have never been beaten
- `--no-dep`: Run all challenges regardless of dependency success/failure
Output & Debug
--quiet, -q: Minimal output--verbose, -v: Detailed per-challenge output--json: JSON output for CI/scripting--debug: Enable debug output--keep-answers: Keep answer files for debugging
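
For CI scripting, the `--json` output can be consumed programmatically. A minimal sketch, assuming the flag prints a single JSON document to stdout; the exit-code convention and report schema are not documented here, so only the top-level structure is inspected:

```python
import json
import subprocess

# Run the benchmark and capture its machine-readable output.
proc = subprocess.run(
    ["poetry", "run", "direct-benchmark", "run", "--json", "--quiet"],
    capture_output=True,
    text=True,
)
report = json.loads(proc.stdout)
# Inspect the top-level structure before relying on any particular field.
print(list(report) if isinstance(report, dict) else type(report).__name__)
```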
### Paths

- `--workspace`: Workspace root directory
- `--challenges-dir`: Path to the challenges directory
- `--reports-dir`: Path to the reports directory
## Available Strategies

| Strategy | Description |
|---|---|
| `one_shot` | Single-pass reasoning (default, most reliable) |
| `rewoo` | Reasoning with observations |
| `plan_execute` | Plan then execute |
| `reflexion` | Self-reflection loop |
| `tree_of_thoughts` | Multiple reasoning paths |
## Available Model Presets

### Claude

- `claude`: sonnet-4 smart, haiku fast (default)
- `claude-smart`: sonnet-4 for both
- `claude-fast`: haiku for both
- `claude-opus`: opus smart, sonnet fast
- `claude-opus-only`: opus for both

### Claude with Extended Thinking

- `claude-thinking-10k`: 10k thinking tokens
- `claude-thinking-25k`: 25k thinking tokens
- `claude-thinking-50k`: 50k thinking tokens
- `claude-opus-thinking`: opus with 25k thinking
- `claude-opus-thinking-50k`: opus with 50k thinking

### OpenAI

- `openai`: gpt-4o smart, gpt-4o-mini fast
- `openai-smart`: gpt-4o for both
- `openai-fast`: gpt-4o-mini for both
- `gpt5`: gpt-5 smart, gpt-4o fast
- `gpt5-only`: gpt-5 for both

### OpenAI Reasoning Models

- `o1`, `o1-mini`: o1 variants
- `o1-low`, `o1-medium`, `o1-high`: o1 with reasoning effort
- `o3-low`, `o3-medium`, `o3-high`: o3 with reasoning effort
## Reports

Reports are generated in `./reports/` with the following layout:

```
reports/
├── {timestamp}_{strategy}_{model}/
│   └── report.json
└── strategy_comparison_{timestamp}.json
```
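
A minimal sketch for collecting the per-run `report.json` files; the keys inside them are not documented here, so this only lists the top-level structure:

```python
import json
from pathlib import Path

reports_dir = Path("reports")
for report_path in sorted(reports_dir.glob("*/report.json")):
    run_name = report_path.parent.name  # {timestamp}_{strategy}_{model}
    data = json.loads(report_path.read_text())
    # The report schema is an unknown here; list keys before relying on them.
    print(run_name, sorted(data) if isinstance(data, dict) else type(data).__name__)
```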
## Key Differences from agbenchmark

| agbenchmark | direct_benchmark |
|---|---|
| `subprocess.Popen` + HTTP server | Direct `create_agent()` |
| HTTP/REST via Agent Protocol | Direct `propose_action()`/`execute()` |
| Sequential (one config at a time) | Parallel via asyncio semaphore |
| Port-based isolation | Workspace-based isolation |
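
To make the table concrete, here is a minimal sketch of the direct-execution pattern: an in-process agent loop bounded by an `asyncio.Semaphore`. `DummyAgent` and `StepResult` are illustrative stand-ins; the real `create_agent()`, `propose_action()`, and `execute()` live in the AutoGPT codebase, and their signatures here are assumptions:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class StepResult:
    done: bool
    success: bool

class DummyAgent:
    """Illustrative stand-in for a directly instantiated agent."""

    def __init__(self, config: dict):
        self.config = config
        self.steps = 0

    async def propose_action(self) -> str:
        self.steps += 1
        return f"step-{self.steps}"

    async def execute(self, action: str) -> StepResult:
        # Pretend every challenge finishes successfully after two steps.
        return StepResult(done=self.steps >= 2, success=True)

async def run_challenge(sem: asyncio.Semaphore, config: dict) -> bool:
    async with sem:  # bounds concurrency, like --parallel
        agent = DummyAgent(config)  # direct instantiation: no server, no HTTP
        for _ in range(config.get("max_steps", 50)):
            result = await agent.execute(await agent.propose_action())
            if result.done:
                return result.success
        return False  # ran out of steps, like hitting --max-steps

async def main() -> None:
    sem = asyncio.Semaphore(4)  # --parallel 4
    configs = [{"strategy": s} for s in ("one_shot", "rewoo")]
    results = await asyncio.gather(*(run_challenge(sem, c) for c in configs))
    print(f"{sum(results)}/{len(results)} challenges passed")

asyncio.run(main())
```

Because each run gets its own workspace rather than its own port, many configurations can share one process and one event loop, which is what makes the semaphore-bounded `asyncio.gather` approach sufficient for isolation.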