# CLAUDE.md - Direct Benchmark Harness
This file provides guidance to Claude Code when working with the direct benchmark harness.
## Overview
The Direct Benchmark Harness is a high-performance testing framework for AutoGPT that directly instantiates agents without HTTP server overhead. It enables parallel execution of multiple strategy/model configurations.
## Quick Reference
All commands run from the `classic/` directory (the parent of this directory):
```bash
# Install (one-time setup)
cd classic
poetry install

# Run benchmarks
poetry run direct-benchmark run

# Run specific strategies and models
poetry run direct-benchmark run \
    --strategies one_shot,rewoo \
    --models claude,openai \
    --parallel 4

# Run a single test
poetry run direct-benchmark run \
    --strategies one_shot \
    --tests ReadFile

# List available challenges
poetry run direct-benchmark list-challenges

# List model presets
poetry run direct-benchmark list-models

# List strategies
poetry run direct-benchmark list-strategies
```
## CLI Options

### Run Command
| Option | Short | Description |
|---|---|---|
| `--strategies` | `-s` | Comma-separated strategies (`one_shot`, `rewoo`, `plan_execute`, `reflexion`, `tree_of_thoughts`) |
| `--models` | `-m` | Comma-separated model presets (`claude`, `openai`, etc.) |
| `--categories` | `-c` | Filter by challenge categories |
| `--skip-category` | `-S` | Exclude categories |
| `--tests` | `-t` | Filter by test names |
| `--attempts` | `-N` | Number of times to run each challenge |
| `--parallel` | `-p` | Maximum parallel runs (default: 4) |
| `--timeout` | | Per-challenge timeout in seconds (default: 300) |
| `--cutoff` | | Alias for `--timeout` |
| `--no-cutoff` | `--nc` | Disable the time limit |
| `--max-steps` | | Maximum steps per challenge (default: 50) |
| `--maintain` | | Run only regression tests |
| `--improve` | | Run only non-regression tests |
| `--explore` | | Run only never-beaten challenges |
| `--no-dep` | | Ignore challenge dependencies |
| `--workspace` | | Workspace root directory |
| `--challenges-dir` | | Path to the challenges directory |
| `--reports-dir` | | Path to the reports directory |
| `--keep-answers` | | Keep answer files for debugging |
| `--quiet` | `-q` | Minimal output |
| `--verbose` | `-v` | Detailed per-challenge output |
| `--json` | | JSON output for CI/scripting |
| `--ci` | | CI mode: no live display, shows completion blocks (auto-enabled when the `CI` env var is set or stdout is not a TTY) |
| `--fresh` | | Clear all saved state and start fresh (don't resume) |
| `--retry-failures` | | Re-run only the challenges that failed in the previous run |
| `--reset-strategy` | | Reset saved results for a specific strategy (can repeat) |
| `--reset-model` | | Reset saved results for a specific model (can repeat) |
| `--reset-challenge` | | Reset saved results for a specific challenge (can repeat) |
| `--debug` | | Enable debug output |
### State Management Commands

```bash
# Show current state
poetry run direct-benchmark state show

# Clear all state
poetry run direct-benchmark state clear

# Reset a specific strategy/model/challenge
poetry run direct-benchmark state reset --strategy reflexion
poetry run direct-benchmark state reset --model claude-thinking-25k
poetry run direct-benchmark state reset --challenge ThreeSum
```
## Available Strategies

- `one_shot` - Single-pass reasoning (default)
- `rewoo` - Reasoning with observations
- `plan_execute` - Plan, then execute
- `reflexion` - Self-reflection loop
- `tree_of_thoughts` - Multiple reasoning paths
## Available Model Presets

### Claude

- `claude` - sonnet-4 smart, haiku fast
- `claude-smart` - sonnet-4 for both
- `claude-fast` - haiku for both
- `claude-opus` - opus smart, sonnet fast
- `claude-opus-only` - opus for both

### Claude with Extended Thinking

- `claude-thinking-10k` - 10k thinking tokens
- `claude-thinking-25k` - 25k thinking tokens
- `claude-thinking-50k` - 50k thinking tokens
- `claude-opus-thinking` - opus with 25k thinking
- `claude-opus-thinking-50k` - opus with 50k thinking

### OpenAI

- `openai` - gpt-4o smart, gpt-4o-mini fast
- `openai-smart` - gpt-4o for both
- `openai-fast` - gpt-4o-mini for both
- `gpt5` - gpt-5 smart, gpt-4o fast
- `gpt5-only` - gpt-5 for both

### OpenAI Reasoning Models

- `o1`, `o1-mini` - o1 variants
- `o1-low`, `o1-medium`, `o1-high` - o1 with reasoning effort
- `o3-low`, `o3-medium`, `o3-high` - o3 with reasoning effort
- `gpt5-low`, `gpt5-medium`, `gpt5-high` - gpt-5 with reasoning effort
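Each preset resolves to a smart/fast model pair, plus an optional thinking budget. As a rough sketch of how such a preset could be modeled with pydantic (`models.py` is described below as "Pydantic models, presets"): every field name and model ID string here is an illustrative assumption, not the harness's actual schema.

```python
# Hypothetical shape of a model preset; models.py is described as
# "Pydantic models, presets", but these field names are invented.
from pydantic import BaseModel


class ModelPreset(BaseModel):
    smart_model: str                     # model for hard reasoning steps
    fast_model: str                      # model for cheap/fast steps
    thinking_tokens: int | None = None   # extended-thinking budget, if any


PRESETS = {
    "claude": ModelPreset(smart_model="claude-sonnet-4", fast_model="claude-haiku"),
    "claude-thinking-25k": ModelPreset(smart_model="claude-sonnet-4",
                                       fast_model="claude-haiku",
                                       thinking_tokens=25_000),
}
```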
## Directory Structure

```
direct_benchmark/
├── README.md               # User documentation
├── CLAUDE.md               # This file
├── .gitignore
└── direct_benchmark/
    ├── __init__.py
    ├── __main__.py         # CLI entry point
    ├── models.py           # Pydantic models, presets
    ├── harness.py          # Main orchestrator
    ├── runner.py           # AgentRunner (single agent lifecycle)
    ├── parallel.py         # ParallelExecutor (concurrent runs)
    ├── challenge_loader.py # Load challenges from JSON
    ├── evaluator.py        # Evaluate outputs vs ground truth
    ├── report.py           # Report generation
    └── ui.py               # Rich UI components
```

Dependency metadata lives in the merged `classic/pyproject.toml`; this package no longer has its own Poetry config.
## Architecture

### Execution Flow
```
CLI args → HarnessConfig
        ↓
BenchmarkHarness.run()
        ↓
ChallengeLoader.load_all() → list[Challenge]
        ↓
ParallelExecutor.execute_matrix(configs × challenges × attempts)
        ↓
[Parallel, with a semaphore limiting to N concurrent]
        ↓
AgentRunner.run_challenge():
    1. Create temp workspace
    2. Copy input artifacts to agent workspace
    3. Create AppConfig with strategy/model
    4. create_agent() - direct instantiation
    5. Run agent loop until finish/timeout
    6. Collect output files
        ↓
Evaluator.evaluate() - check against ground truth
        ↓
ReportGenerator - write reports
```
### Key Components

#### AgentRunner (runner.py)
- Manages the lifecycle of a single agent for one challenge
- Creates an isolated temp workspace per run
- Copies input artifacts to `{workspace}/.autogpt/agents/{agent_id}/workspace/`
- Instantiates the agent directly via `create_agent()`
- Runs the agent loop: `propose_action()` → `execute()` until finish/timeout (see the sketch below)
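A minimal sketch of that loop, under assumed signatures: `create_agent()`, `propose_action()`, and `execute()` are named in this document, but the helper names, the shape of the proposal, and the finish check below are hypothetical.

```python
# Hypothetical sketch of the AgentRunner loop; only the three method
# names come from this doc, the rest is illustrative scaffolding.
import asyncio
import tempfile
from pathlib import Path


async def run_challenge(challenge, config, max_steps=50, timeout=300.0):
    workspace = Path(tempfile.mkdtemp(prefix="bench-"))  # isolated per run
    copy_input_artifacts(challenge, workspace)           # hypothetical helper
    agent = create_agent(config, workspace)              # direct instantiation

    async def loop():
        for _ in range(max_steps):
            proposal = await agent.propose_action()
            if is_finish(proposal):                      # hypothetical finish check
                break
            await agent.execute(proposal)

    await asyncio.wait_for(loop(), timeout=timeout)      # per-challenge cutoff
    return collect_output_files(workspace)               # hypothetical helper
```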
#### ParallelExecutor (parallel.py)

- Manages concurrent execution with an asyncio semaphore (sketched below)
- Supports multiple attempts per challenge
- Reports progress via callbacks
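The semaphore pattern here is standard asyncio fan-out. This sketch shows the general shape only, with `run_one` standing in as a hypothetical coroutine for a single (config, challenge, attempt) run:

```python
# Illustrative semaphore-limited fan-out, not the actual ParallelExecutor.
import asyncio


async def execute_matrix(runs, parallel=4, on_done=None):
    sem = asyncio.Semaphore(parallel)       # at most `parallel` concurrent runs

    async def guarded(run):
        async with sem:                     # wait for a free slot
            result = await run_one(run)     # hypothetical single-run coroutine
        if on_done:
            on_done(result)                 # progress callback
        return result

    return await asyncio.gather(*(guarded(r) for r in runs))
```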
#### Evaluator (evaluator.py)

- String matching (`should_contain` / `should_not_contain`); see the sketch below
- Python script execution
- Pytest execution
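For the string-matching mode, the check presumably reduces to something like the following; the two field names come from this document, while the function around them is a sketch:

```python
# Sketch of should_contain / should_not_contain evaluation; field names
# are from this doc, the surrounding function is assumed.
def evaluate_strings(output, should_contain, should_not_contain):
    return (all(s in output for s in should_contain)
            and not any(s in output for s in should_not_contain))
```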
#### ReportGenerator (report.py)

- Per-config `report.json` files (compatible with the agbenchmark format)
- Comparison reports across all configs
## Report Format

Reports are written to `./reports/` with the following layout:

```
reports/
├── {timestamp}_{strategy}_{model}/
│   └── report.json
└── strategy_comparison_{timestamp}.json
```
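Because each per-config directory contains a `report.json`, downstream scripts can aggregate results directly. The sketch below assumes a particular agbenchmark-style schema (`tests` mapping to per-test `metrics.success` booleans), which is an assumption rather than a documented contract:

```python
# Hypothetical aggregation over per-config report.json files; the exact
# schema ("tests" -> {"metrics": {"success": bool}}) is assumed.
import json
from pathlib import Path


def pass_rates(reports_dir="./reports"):
    rates = {}
    for report in Path(reports_dir).glob("*/report.json"):
        data = json.loads(report.read_text())
        results = [t["metrics"]["success"] for t in data.get("tests", {}).values()]
        if results:
            rates[report.parent.name] = sum(results) / len(results)
    return rates
```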
## Dependencies

- `autogpt-forge` - Core agent framework
- `autogpt` - Original AutoGPT agent
- `click` - CLI framework
- `pydantic` - Data models
- `rich` - Terminal UI
## Key Differences from agbenchmark

| agbenchmark | direct_benchmark |
|---|---|
| `subprocess.Popen` + HTTP server | Direct `create_agent()` |
| HTTP/REST via Agent Protocol | Direct `propose_action()`/`execute()` |
| Sequential (one config at a time) | Parallel via asyncio semaphore |
| Port-based isolation | Workspace-based isolation |
| `agbenchmark run` CLI | Direct JSON parsing |
## Common Tasks

### Run Full Benchmark Suite

```bash
poetry run direct-benchmark run \
    --strategies one_shot,rewoo,plan_execute \
    --models claude \
    --parallel 8
```
### Compare Strategies

```bash
poetry run direct-benchmark run \
    --strategies one_shot,rewoo,plan_execute,reflexion \
    --models claude \
    --tests ReadFile,WriteFile,ThreeSum
```
### Debug a Failing Test

```bash
poetry run direct-benchmark run \
    --strategies one_shot \
    --tests FailingTest \
    --keep-answers \
    --verbose
```
### Resume / Incremental Runs

The benchmark automatically saves progress and resumes where the previous run left off. State is saved to `.benchmark_state.json` in the reports directory.
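Conceptually, resuming means filtering out (strategy, model, challenge) combinations already recorded in that file. A minimal sketch of such filtering follows; the state-file schema is not documented here, so the `"completed"` key is an assumption:

```python
# Sketch of resume filtering; the real .benchmark_state.json schema is
# not documented here, so the "completed" key is an assumption.
import json
from pathlib import Path


def pending_runs(all_runs, state_path):
    done = set()
    path = Path(state_path)
    if path.exists():
        done = {tuple(r) for r in json.loads(path.read_text()).get("completed", [])}
    return [r for r in all_runs
            if (r.strategy, r.model, r.challenge) not in done]
```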
```bash
# Run benchmarks - resumes from the last run automatically
poetry run direct-benchmark run \
    --strategies one_shot,reflexion \
    --models claude

# Start fresh (clear all saved state)
poetry run direct-benchmark run --fresh \
    --strategies one_shot,reflexion \
    --models claude

# Reset a specific strategy and re-run
poetry run direct-benchmark run \
    --reset-strategy reflexion \
    --strategies one_shot,reflexion \
    --models claude

# Reset a specific model and re-run
poetry run direct-benchmark run \
    --reset-model claude-thinking-25k \
    --strategies one_shot \
    --models claude,claude-thinking-25k

# Retry only the failures from the last run
poetry run direct-benchmark run --retry-failures \
    --strategies one_shot,reflexion \
    --models claude
```
### CI/Scripting Mode

```bash
# JSON output (parseable)
poetry run direct-benchmark run --json

# CI mode - shows completion blocks without a Live display
# Auto-enabled when CI=true is set in the environment or stdout is not a TTY
poetry run direct-benchmark run --ci
```
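The auto-enable rule above maps to a small check; an illustrative version (not the harness's actual code):

```python
# Illustrative CI auto-detection matching the rule described above.
import os
import sys


def ci_mode(flag=False):
    return flag or bool(os.environ.get("CI")) or not sys.stdout.isatty()
```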