CLAUDE.md - Direct Benchmark Harness

This file provides guidance to Claude Code when working with the direct benchmark harness.

Overview

The Direct Benchmark Harness is a high-performance testing framework for AutoGPT that directly instantiates agents without HTTP server overhead. It enables parallel execution of multiple strategy/model configurations.

Quick Reference

All commands run from the classic/ directory (parent of this directory):

# Install (one-time setup)
cd classic
poetry install

# Run benchmarks
poetry run direct-benchmark run

# Run specific strategies and models
poetry run direct-benchmark run \
    --strategies one_shot,rewoo \
    --models claude,openai \
    --parallel 4

# Run a single test
poetry run direct-benchmark run \
    --strategies one_shot \
    --tests ReadFile

# List available challenges
poetry run direct-benchmark list-challenges

# List model presets
poetry run direct-benchmark list-models

# List strategies
poetry run direct-benchmark list-strategies

CLI Options

Run Command

| Option | Short | Description |
|---|---|---|
| --strategies | -s | Comma-separated strategies (one_shot, rewoo, plan_execute, reflexion, tree_of_thoughts) |
| --models | -m | Comma-separated model presets (claude, openai, etc.) |
| --categories | -c | Filter by challenge categories |
| --skip-category | -S | Exclude categories |
| --tests | -t | Filter by test names |
| --attempts | -N | Number of times to run each challenge |
| --parallel | -p | Maximum parallel runs (default: 4) |
| --timeout | | Per-challenge timeout in seconds (default: 300) |
| --cutoff | | Alias for --timeout |
| --no-cutoff | --nc | Disable time limit |
| --max-steps | | Maximum steps per challenge (default: 50) |
| --maintain | | Run only regression tests |
| --improve | | Run only non-regression tests |
| --explore | | Run only never-beaten challenges |
| --no-dep | | Ignore challenge dependencies |
| --workspace | | Workspace root directory |
| --challenges-dir | | Path to challenges directory |
| --reports-dir | | Path to reports directory |
| --keep-answers | | Keep answer files for debugging |
| --quiet | -q | Minimal output |
| --verbose | -v | Detailed per-challenge output |
| --json | | JSON output for CI/scripting |
| --ci | | CI mode: no live display, shows completion blocks (auto-enabled when CI env var is set or not a TTY) |
| --fresh | | Clear all saved state and start fresh (don't resume) |
| --retry-failures | | Re-run only the challenges that failed in previous run |
| --reset-strategy | | Reset saved results for specific strategy (can repeat) |
| --reset-model | | Reset saved results for specific model (can repeat) |
| --reset-challenge | | Reset saved results for specific challenge (can repeat) |
| --debug | | Enable debug output |

State Management Commands

# Show current state
poetry run direct-benchmark state show

# Clear all state
poetry run direct-benchmark state clear

# Reset specific strategy/model/challenge
poetry run direct-benchmark state reset --strategy reflexion
poetry run direct-benchmark state reset --model claude-thinking-25k
poetry run direct-benchmark state reset --challenge ThreeSum

Available Strategies

  • one_shot - Single-pass reasoning (default)
  • rewoo - Reasoning with observations
  • plan_execute - Plan then execute
  • reflexion - Self-reflection loop
  • tree_of_thoughts - Multiple reasoning paths

Available Model Presets

Claude

  • claude - sonnet-4 smart, haiku fast
  • claude-smart - sonnet-4 for both
  • claude-fast - haiku for both
  • claude-opus - opus smart, sonnet fast
  • claude-opus-only - opus for both

Claude with Extended Thinking

  • claude-thinking-10k - 10k thinking tokens
  • claude-thinking-25k - 25k thinking tokens
  • claude-thinking-50k - 50k thinking tokens
  • claude-opus-thinking - opus with 25k thinking
  • claude-opus-thinking-50k - opus with 50k thinking

OpenAI

  • openai - gpt-4o smart, gpt-4o-mini fast
  • openai-smart - gpt-4o for both
  • openai-fast - gpt-4o-mini for both
  • gpt5 - gpt-5 smart, gpt-4o fast
  • gpt5-only - gpt-5 for both

OpenAI Reasoning Models

  • o1, o1-mini - o1 variants
  • o1-low, o1-medium, o1-high - o1 with reasoning effort
  • o3-low, o3-medium, o3-high - o3 with reasoning effort
  • gpt5-low, gpt5-medium, gpt5-high - gpt-5 with reasoning effort

Directory Structure

direct_benchmark/
├── pyproject.toml           # Poetry config
├── README.md                 # User documentation
├── CLAUDE.md                 # This file
├── .gitignore
└── direct_benchmark/
    ├── __init__.py
    ├── __main__.py           # CLI entry point
    ├── models.py             # Pydantic models, presets
    ├── harness.py            # Main orchestrator
    ├── runner.py             # AgentRunner (single agent lifecycle)
    ├── parallel.py           # ParallelExecutor (concurrent runs)
    ├── challenge_loader.py   # Load challenges from JSON
    ├── evaluator.py          # Evaluate outputs vs ground truth
    ├── report.py             # Report generation
    └── ui.py                 # Rich UI components

Architecture

Execution Flow

CLI args → HarnessConfig
    ↓
BenchmarkHarness.run()
    ↓
ChallengeLoader.load_all() → list[Challenge]
    ↓
ParallelExecutor.execute_matrix(configs × challenges × attempts)
    ↓
[Parallel with semaphore limiting to N concurrent]
    ↓
AgentRunner.run_challenge():
  1. Create temp workspace
  2. Copy input artifacts to agent workspace
  3. Create AppConfig with strategy/model
  4. create_agent() - direct instantiation
  5. Run agent loop until finish/timeout
  6. Collect output files
    ↓
Evaluator.evaluate() - check against ground truth
    ↓
ReportGenerator - write reports
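
To make the configs × challenges × attempts step concrete, here is a minimal sketch of how the run matrix expands. The names are illustrative only, not the harness's actual API (the real implementation lives in harness.py and parallel.py):

```python
# Hypothetical sketch of expanding the run matrix (configs x challenges x attempts).
# Names are illustrative; see harness.py / parallel.py for the actual code.
from itertools import product

strategies = ["one_shot", "rewoo"]
models = ["claude", "openai"]
challenges = ["ReadFile", "WriteFile", "ThreeSum"]
attempts = 2

run_matrix = [
    {"strategy": s, "model": m, "challenge": c, "attempt": a}
    for s, m, c, a in product(strategies, models, challenges, range(1, attempts + 1))
]
print(len(run_matrix))  # 2 strategies x 2 models x 3 challenges x 2 attempts = 24 runs
```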

Key Components

AgentRunner (runner.py)

  • Manages single agent lifecycle for one challenge
  • Creates isolated temp workspace per run
  • Copies input artifacts to {workspace}/.autogpt/agents/{agent_id}/workspace/
  • Instantiates agent directly via create_agent()
  • Runs agent loop: propose_action() → execute() until finish/timeout (sketched below)
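
A rough sketch of that propose/execute loop, assuming a hypothetical agent object; the finish check and the execute() signature are assumptions, and the real loop in runner.py also enforces the timeout and collects output files:

```python
# Hypothetical sketch of the AgentRunner loop; the real implementation in runner.py
# also handles workspace setup, timeouts, and output-file collection.
async def run_agent_loop(agent, max_steps: int = 50) -> str:
    for _ in range(max_steps):
        proposal = await agent.propose_action()   # agent decides the next command
        if proposal.command == "finish":          # assumed completion signal
            return "finished"
        await agent.execute(proposal)             # run the command; result feeds the next step
    return "max_steps_reached"
```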

ParallelExecutor (parallel.py)

  • Manages concurrent execution with asyncio semaphore
  • Supports multiple attempts per challenge
  • Reports progress via callbacks
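
A minimal sketch of semaphore-limited execution as described above; the run_one coroutine and result handling are placeholders, not the actual parallel.py API:

```python
# Hypothetical sketch of semaphore-limited parallel execution, as in parallel.py.
import asyncio

async def execute_all(runs, run_one, parallel: int = 4):
    sem = asyncio.Semaphore(parallel)              # at most `parallel` runs in flight

    async def bounded(run):
        async with sem:                            # hold a slot for the duration of one run
            return await run_one(run)

    return await asyncio.gather(*(bounded(r) for r in runs))
```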

Evaluator (evaluator.py)

  • String matching (should_contain/should_not_contain)
  • Python script execution
  • Pytest execution
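
For the string-matching case, evaluation reduces to substring checks against the collected output. A minimal sketch (function and field names are assumed; see evaluator.py for the real logic, including the Python-script and pytest modes):

```python
# Hypothetical sketch of should_contain / should_not_contain evaluation.
# Names are illustrative; evaluator.py also supports Python and pytest checks.
def passes_string_checks(output: str,
                         should_contain: list[str],
                         should_not_contain: list[str]) -> bool:
    if any(needle not in output for needle in should_contain):
        return False
    return not any(needle in output for needle in should_not_contain)

assert passes_string_checks("3 + 7 + (-10) = 0", ["= 0"], ["Traceback"])
```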

ReportGenerator (report.py)

  • Per-config report.json files (compatible with agbenchmark format)
  • Comparison reports across all configs

Report Format

Reports are generated in ./reports/ with the following layout:

reports/
├── {timestamp}_{strategy}_{model}/
│   └── report.json
└── strategy_comparison_{timestamp}.json
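
A hedged sketch of deriving the per-config report path from the layout above; the directory naming follows the tree, but the exact timestamp format is an assumption:

```python
# Hypothetical sketch of building the per-config report path shown above.
from datetime import datetime
from pathlib import Path

def report_path(reports_dir: Path, strategy: str, model: str) -> Path:
    timestamp = datetime.now().strftime("%Y%m%dT%H%M%S")   # timestamp format is illustrative
    return reports_dir / f"{timestamp}_{strategy}_{model}" / "report.json"

print(report_path(Path("reports"), "one_shot", "claude"))
```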

Dependencies

  • autogpt-forge - Core agent framework
  • autogpt - Original AutoGPT agent
  • click - CLI framework
  • pydantic - Data models
  • rich - Terminal UI

Key Differences from agbenchmark

| agbenchmark | direct_benchmark |
|---|---|
| subprocess.Popen + HTTP server | Direct create_agent() |
| HTTP/REST via Agent Protocol | Direct propose_action()/execute() |
| Sequential (one config at a time) | Parallel via asyncio semaphore |
| Port-based isolation | Workspace-based isolation |
| agbenchmark run CLI | Direct JSON parsing |

Common Tasks

Run Full Benchmark Suite

poetry run direct-benchmark run \
    --strategies one_shot,rewoo,plan_execute \
    --models claude \
    --parallel 8

Compare Strategies

poetry run direct-benchmark run \
    --strategies one_shot,rewoo,plan_execute,reflexion \
    --models claude \
    --tests ReadFile,WriteFile,ThreeSum

Debug a Failing Test

poetry run direct-benchmark run \
    --strategies one_shot \
    --tests FailingTest \
    --keep-answers \
    --verbose

Resume / Incremental Runs

The benchmark automatically saves progress and resumes from where it left off. State is saved to .benchmark_state.json in the reports directory.
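
Conceptually, resuming means skipping any strategy/model/challenge combination that already has a saved result. A hedged sketch follows; the actual schema of .benchmark_state.json is internal to the harness and may differ:

```python
# Hypothetical sketch of resume logic; the real .benchmark_state.json schema is not
# documented here, so the "completed" key and entry fields are assumptions.
import json
from pathlib import Path

def pending_runs(all_runs: list[dict], state_file: Path) -> list[dict]:
    done: set[tuple[str, str, str]] = set()
    if state_file.exists():
        state = json.loads(state_file.read_text())
        for entry in state.get("completed", []):
            done.add((entry["strategy"], entry["model"], entry["challenge"]))
    return [r for r in all_runs
            if (r["strategy"], r["model"], r["challenge"]) not in done]
```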

# Run benchmarks - will resume from last run automatically
poetry run direct-benchmark run \
    --strategies one_shot,reflexion \
    --models claude

# Start fresh (clear all saved state)
poetry run direct-benchmark run --fresh \
    --strategies one_shot,reflexion \
    --models claude

# Reset specific strategy and re-run
poetry run direct-benchmark run \
    --reset-strategy reflexion \
    --strategies one_shot,reflexion \
    --models claude

# Reset specific model and re-run
poetry run direct-benchmark run \
    --reset-model claude-thinking-25k \
    --strategies one_shot \
    --models claude,claude-thinking-25k

# Retry only the failures from the last run
poetry run direct-benchmark run --retry-failures \
    --strategies one_shot,reflexion \
    --models claude

CI/Scripting Mode

# JSON output (parseable)
poetry run direct-benchmark run --json

# CI mode - shows completion blocks without Live display
# Auto-enabled when CI=true env var is set or stdout is not a TTY
poetry run direct-benchmark run --ci