Some LLM providers (notably Anthropic) don't support system messages
in the middle of a conversation. Changed ChatMessage.system() to
ChatMessage.user() for all mid-conversation context messages across
components (action history, context, skills, system clock, todo,
error reporting, LATS, and multi-agent debate strategies).
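A minimal sketch of the substitution (ChatMessage.system()/ChatMessage.user() as named above; the clock text and `now` variable are illustrative):
  # Before: a system message mid-conversation, rejected by some providers
  messages.append(ChatMessage.system(f"## Clock\nThe current time is {now}."))
  # After: the same content delivered as a user message
  messages.append(ChatMessage.user(f"## Clock\nThe current time is {now}."))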
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
do_not_execute() was not calling append_user_feedback(), so feedback
from denied commands only appeared as a tool result message which the
model often ignored. Now feedback is also surfaced as a prominent
[USER FEEDBACK] user message in the next prompt.
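Rough shape of the fix (append_user_feedback and the [USER FEEDBACK] tag come from the description above; the surrounding names are illustrative):
  def do_not_execute(self, denied_proposal, user_feedback: str):
      result = ActionErrorResult(reason="Command denied by user.")
      if user_feedback:
          # Previously only the tool result carried this; now it is also queued
          # so the next prompt contains a prominent [USER FEEDBACK] user message
          self.event_history.append_user_feedback(user_feedback)
      return result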
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement the open Agent Skills standard for Classic AutoGPT, enabling
modular, progressively-loaded capabilities via SKILL.md files. Skills
are discovered from workspace (.autogpt/skills) and global
(~/.autogpt/skills) directories with three-level progressive disclosure
to minimize token usage.
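A sketch of the discovery step, assuming each skill lives in its own subdirectory containing a SKILL.md:
  from pathlib import Path

  def discover_skill_files(workspace: Path) -> list[Path]:
      """Collect SKILL.md files from the workspace and global skill directories."""
      roots = [workspace / ".autogpt" / "skills", Path.home() / ".autogpt" / "skills"]
      return [p for root in roots if root.is_dir()
              for p in sorted(root.glob("*/SKILL.md"))]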
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add a new `config` command that opens a tabbed TUI for browsing and
editing AutoGPT settings. The UI allows users to configure settings
interactively rather than manually editing .env files.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix line-too-long in test_permissions.py docstring
- Fix type annotation in validators.py (callable -> Callable)
- Add --fresh flag to benchmark tests to prevent state resumption
- Exclude direct_benchmark/adapters from pyright (optional deps)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add PlatformBlocksComponent to Agent as a default component
- Component automatically enables when PLATFORM_API_KEY env var is set
- Config now uses UserConfigurable for env var support:
  - PLATFORM_API_KEY (required to enable)
  - PLATFORM_URL (default: https://platform.agpt.co)
  - PLATFORM_BLOCKS_ENABLED (default: true)
  - PLATFORM_TIMEOUT (default: 60)
- API key stored as SecretStr for security
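Sketch of the configuration model (UserConfigurable usage as described above; the field names, import path, and from_env pattern are assumptions):
  from typing import Optional
  from pydantic import BaseModel, SecretStr
  from forge.models.config import UserConfigurable   # import path assumed

  class PlatformBlocksConfig(BaseModel):
      api_key: Optional[SecretStr] = UserConfigurable(None, from_env="PLATFORM_API_KEY")
      platform_url: str = UserConfigurable("https://platform.agpt.co", from_env="PLATFORM_URL")
      enabled: bool = UserConfigurable(True, from_env="PLATFORM_BLOCKS_ENABLED")
      timeout: int = UserConfigurable(60, from_env="PLATFORM_TIMEOUT")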
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
In CLI mode, agents now work directly in the current directory instead of
being sandboxed to .autogpt/agents/{id}/workspace/. Agent state files are
still stored in .autogpt/agents/{id}/state.json.
Server mode retains the original sandboxed behavior for isolation.
Changes:
- Add workspace_root parameter to FileManagerComponent to detect CLI mode
- Update Agent to pass workspace_root when file_storage is rooted at workspace
- Adjust save_state paths based on mode (CLI uses .autogpt/ prefix)
- Add use_tools field to ActionProposal for parallel tool execution
- Support parallel tool execution in Agent.execute()
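Illustrative summary of the CLI/server path split described above (helper name is hypothetical; paths are taken from the description):
  from pathlib import Path

  def workspace_and_state(agent_id: str, cli_mode: bool) -> tuple[Path, Path]:
      agent_dir = Path(".autogpt") / "agents" / agent_id
      if cli_mode:
          # CLI: work directly in the current directory; state stays under .autogpt/
          return Path.cwd(), agent_dir / "state.json"
      # Server: keep the original sandboxed workspace for isolation
      return agent_dir / "workspace", agent_dir / "state.json"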
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- SystemComponent: Keep both general constraints (e.g. no interaction with
physical objects) and code-specific constraints (don't modify tests, check
dependencies, no secrets)
- SystemComponent: Keep both general best practices (self-review, reflection)
and code-specific best practices (read before modify, mimic style, verify)
- LATS: Keep general phase instructions while adding coding task priorities
- one_shot: Remove redundant 'text' field from AssistantThoughts, use 'reasoning'
- one_shot: Fix intro to clarify when to use ask_user instead of contradicting it
- one_shot: Add efficiency guidelines and parallel execution support
- Update UI to display reasoning as main thoughts (remove redundant REASONING line)
- Update test fixtures to match new AssistantThoughts schema
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Simplify the platform_blocks component to fetch blocks from the
platform API (/api/v1/blocks) instead of loading them locally from
the monorepo. This removes the dependency on having the platform
backend code available.
- Remove loader.py (no longer needed)
- Update client.py with list_blocks() method
- Simplify component.py to use API for both search and execute
- Remove user_id from config (not needed by API)
- Update tests for API-based approach
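A sketch of the API-based client (endpoint from the description above; the auth header and response shape are assumptions):
  import httpx

  class PlatformClient:
      def __init__(self, base_url: str, api_key: str, timeout: float = 60.0):
          self._http = httpx.Client(
              base_url=base_url, timeout=timeout,
              headers={"Authorization": f"Bearer {api_key}"},
          )

      def list_blocks(self) -> list[dict]:
          # One catalogue call backs both block search and block execution
          response = self._http.get("/api/v1/blocks")
          response.raise_for_status()
          return response.json()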
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add search_blocks and execute_block commands that expose platform blocks
to classic agents:
- search_blocks: Local search by name, description, or category (fast, offline)
- execute_block: Execute via platform API with automatic credential handling
The loader automatically discovers the platform backend from the monorepo
structure without requiring manual PYTHONPATH configuration.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Integrate standard AI agent benchmarks into the direct_benchmark infrastructure
using a plugin-based adapter pattern:
- Add BenchmarkAdapter base class with setup(), load_challenges(), and evaluate()
- Implement GAIAAdapter for the GAIA benchmark (requires HF token)
- Implement SWEBenchAdapter for SWE-bench (requires Docker)
- Implement AgentBenchAdapter for AgentBench multi-environment benchmark
- Extend HarnessConfig with benchmark options (--benchmark, --benchmark-split, etc.)
- Modify ParallelExecutor to use adapter's evaluate() for external benchmarks
- Fix runner to record finish step (was being skipped, breaking answer extraction)
- Add optional benchmarks dependency group with datasets and huggingface-hub
- Increase default benchmark timeout to 900s
Usage:
poetry run direct-benchmark run \
--benchmark agent-bench \
--benchmark-subset dbbench \
--strategies one_shot \
--models claude
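The adapter surface, roughly (method names from the list above; the exact signatures are assumptions):
  from abc import ABC, abstractmethod
  from typing import Any, Optional

  class BenchmarkAdapter(ABC):
      """One subclass per external benchmark (GAIA, SWE-bench, AgentBench)."""

      @abstractmethod
      def setup(self) -> None:
          """Download datasets and verify prerequisites (HF token, Docker, ...)."""

      @abstractmethod
      def load_challenges(self, split: Optional[str] = None) -> list[dict[str, Any]]:
          """Return challenges in the harness's native format."""

      @abstractmethod
      def evaluate(self, challenge: dict[str, Any], workspace: str) -> bool:
          """Score the agent's output; used by ParallelExecutor for external benchmarks."""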
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Docker containers cannot have their mount bindings updated after creation.
When running benchmarks or multiple agent instances, the same container name
could be reused with a different workspace directory, causing the container
to still reference the OLD mount path. This resulted in "python: can't open
file '/workspace/temp*.py'" errors.
The fix: remove existing containers before creating new ones to ensure fresh
mount bindings to the current workspace directory.
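A sketch of the remove-then-create pattern with docker-py (the image name and mount paths are illustrative):
  import docker
  from docker.errors import NotFound

  def run_in_fresh_container(name: str, workspace: str, command: list[str]) -> bytes:
      client = docker.from_env()
      try:
          # Mount bindings are fixed at creation, so drop any stale container first
          client.containers.get(name).remove(force=True)
      except NotFound:
          pass
      return client.containers.run(
          "python:3.11-slim", command, name=name,
          volumes={workspace: {"bind": "/workspace", "mode": "rw"}},
          working_dir="/workspace", remove=True,
      )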
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When prompts encourage parallel tool execution and the LLM makes multiple
tool calls simultaneously, the Anthropic API requires a tool_result message
for EACH tool_use. Previously, we only created one tool result for the first
tool call, causing "tool_use ids were found without tool_result blocks" errors.
This fix:
- Adds _make_result_messages() to create results for ALL tool calls
- Maps tool names to their outputs from parallel execution results
- Handles errors per-tool from the _errors list
- Falls back gracefully when results are missing
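Simplified version of the new helper (the real one is _make_result_messages(); the types and attribute names here are illustrative):
  from dataclasses import dataclass

  @dataclass
  class ToolResultMessage:
      tool_call_id: str
      content: str

  def make_result_messages(tool_calls, outputs_by_name, errors):
      messages = []
      for call in tool_calls:
          if call.name in outputs_by_name:
              content = str(outputs_by_name[call.name])
          elif errors:
              content = f"Error: {errors.pop(0)}"          # per-tool error from _errors
          else:
              content = "No result was recorded for this tool call."  # graceful fallback
          # Every tool_use id gets a matching tool_result, as Anthropic requires
          messages.append(ToolResultMessage(tool_call_id=call.id, content=content))
      return messages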
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update black version to match pre-commit hook (24.10.0) and reformat
all files with the new version.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add missing strategies (lats, multi_agent_debate) to PromptStrategyName
- Fix method override signatures for reasoning_effort parameter
- Fix Pydantic Field() overload issues with helper function
- Fix BeautifulSoup Tag type narrowing in web_fetch.py
- Fix Optional member access in playwright_browser.py and rewoo.py
- Convert hasattr patterns to getattr for proper type narrowing
- Add proper type casts for Literal types
- Fix file storage path type conversions
- Exclude legacy challenges/ from pyright checking
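The hasattr-to-getattr conversion, illustrated (the names are hypothetical):
  # Before: pyright cannot narrow proposal.thoughts through hasattr()
  if hasattr(proposal, "thoughts") and proposal.thoughts is not None:
      summary = proposal.thoughts.summary()
  # After: getattr binds a local variable the checker can narrow
  thoughts = getattr(proposal, "thoughts", None)
  if thoughts is not None:
      summary = thoughts.summary()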
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update .flake8 config to exclude workspace directories and ignore E203
- Fix import sorting (isort) across multiple files
- Fix code formatting (black) across multiple files
- Remove unused imports and fix line length issues (flake8)
- Fix f-strings without placeholders and unused variables
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive sub-agent spawning infrastructure that enables prompt
strategies to coordinate multiple agents for advanced reasoning patterns.
New files:
- forge/agent/execution_context.py: ExecutionContext, ResourceBudget,
SubAgentHandle, and AgentFactory protocol for sub-agent lifecycle
- agent_factory/default_factory.py: DefaultAgentFactory implementation
- prompt_strategies/lats.py: Language Agent Tree Search using MCTS
with sub-agents for action expansion and evaluation
- prompt_strategies/multi_agent_debate.py: Multi-agent debate with
proposal, critique, and consensus phases
Key changes:
- BaseMultiStepPromptStrategy gains spawn_sub_agent(), run_sub_agent(),
spawn_and_run(), and run_parallel() methods
- Agent class accepts optional ExecutionContext and injects it into strategies
- Sub-agents enabled by default (enable_sub_agents=True)
- Resource limits: max_depth=5, max_sub_agents=25, max_cycles=25
All 7 strategies now available in benchmark:
one_shot, rewoo, plan_execute, reflexion, tree_of_thoughts, lats, multi_agent_debate
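Illustrative use inside a strategy (spawn_and_run and run_parallel are the methods listed above; their exact signatures are assumptions):
  async def debate_round(self, question: str) -> list[str]:
      # Fan out three proposer sub-agents in parallel, within the resource budget
      proposals = await self.run_parallel([
          self.spawn_and_run(task=f"Propose a solution to: {question}")
          for _ in range(3)
      ])
      return [p.output for p in proposals]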
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable agents to execute shell commands during benchmarks by setting
execute_local_commands=True and using denylist mode to block dangerous
commands (rm, sudo, chmod, kill, etc.) while allowing safe operations.
Also adds ExecutePython challenge to test code execution capability.
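Approximate shape of the benchmark-side settings (field names follow classic AutoGPT's shell-command options but should be treated as assumptions):
  config.execute_local_commands = True
  config.shell_command_control = "denylist"
  config.shell_denylist = ["rm", "sudo", "chmod", "kill"]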
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Merge forge/, original_autogpt/, and direct_benchmark/ into a single Poetry
project to eliminate cross-project path dependency issues.
Changes:
- Create classic/pyproject.toml with merged dependencies from all three projects
- Remove individual pyproject.toml and poetry.lock files from subdirectories
- Update all CLAUDE.md files to reflect commands run from classic/ root
- Update all README.md files with new installation and usage instructions
All packages are now included via the packages directive:
- forge/forge (core agent framework)
- original_autogpt/autogpt (AutoGPT agent)
- direct_benchmark/direct_benchmark (benchmark harness)
CLI entry points preserved: autogpt, serve, direct-benchmark
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When a challenge times out but the agent's solution would have passed
evaluation, this is now clearly indicated:
- Completion blocks show "TIMEOUT (would have passed)" in yellow
- Recent completions panel shows hourglass icon + "would pass" suffix
- Summary table has new "Would Pass" column
- Final summary shows "+N would pass" count
- Success rate includes "would pass" challenges
The evaluator still runs on timed-out challenges to calculate the score,
but success remains False. This gives visibility into near-misses that
just needed more time.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, the evaluator would run on all results including timed-out
challenges. If the agent happened to write a working solution before
timing out, the evaluation would pass and set success to True, resulting
in contradictory output that showed both PASS and "timed out".
Now we skip evaluation for timed-out challenges - they cannot pass.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The TicTacToe and other challenges use pytest-based test files for
evaluation. Without pytest installed in the benchmark virtualenv,
these evaluations were silently failing.
Root cause: test.py imports pytest but the package wasn't a dependency,
causing ModuleNotFoundError during evaluation subprocess.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Include config:challenge:attempt and timestamp in completion block
header for easier debugging and log correlation.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove selenium.py and test_selenium.py
- Add playwright_browser.py with WebPlaywrightComponent
- Update web component exports to use Playwright
- Update dependencies in pyproject.toml/poetry.lock
- Minor agent and reflexion strategy improvements
- Update CLAUDE.md documentation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously cost was hardcoded to 0.0. Now extracts cumulative cost
from MultiProvider.get_incurred_cost() after each step execution.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Benchmarks now automatically save progress and resume from where they
left off. State is persisted to .benchmark_state.json in reports dir.
Features:
- Auto-resume: runs skip already-completed challenges
- --fresh: clear all state and start over
- --retry-failures: re-run only failed challenges
- --reset-strategy/model/challenge: selective resets
- `state show/clear/reset` subcommands for state management
- Config mismatch detection with auto-reset
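Example invocations (flag and subcommand names from the list above):
  poetry run direct-benchmark run --retry-failures   # re-run only failed challenges
  poetry run direct-benchmark run --fresh            # discard saved state, start over
  poetry run direct-benchmark state show             # inspect .benchmark_state.json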
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add --ci flag that disables Rich Live display while preserving
completion blocks. Auto-detects CI environment via CI env var or
non-TTY stdout. Prints progress every 10 completions for visibility.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Calculate max columns based on terminal width (up to 10)
- Reduce panel width from 35 to 30 chars to fit more panels
- Wider terminals can now show more parallel runs side-by-side
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fixes:
- Use run_key (config:challenge) instead of just config_name for tracking
active runs - allows multiple challenges from same config to run in parallel
- Add asyncio.sleep(0) yields to let multiple tasks acquire semaphore
and start before any proceed with work
- Always print completion blocks (not just failures) for visibility
This should properly show 8/8 active runs when running with --parallel 8.
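A minimal sketch of the scheduling change (challenge execution is stubbed out):
  import asyncio

  active_runs: set[str] = set()

  async def run_one(run_key: str, sem: asyncio.Semaphore) -> None:
      async with sem:
          # Yield once so every task holding the semaphore is scheduled and
          # registered as active before any of them starts real work
          await asyncio.sleep(0)
          active_runs.add(run_key)          # run_key = f"{config_name}:{challenge}"
          try:
              await asyncio.sleep(0.1)      # stand-in for the actual challenge run
          finally:
              active_runs.discard(run_key)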
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
UI improvements:
- Multi-column layout: each active config gets its own panel showing
challenge name and step history (last 6 steps with status)
- Copy-paste completion blocks: when a challenge finishes (especially
failures), prints a detailed block with all steps for easy debugging
- Configurable logging: suppresses noisy LLM provider warnings unless
--debug flag is set
- Pass debug flag through harness to UI
Example active runs panel:
┌─ one_shot/claude ─┬─ rewoo/claude ────┐
│ ReadFile │ WriteFile │
│ ✓ #1 read_file │ ✓ #1 think │
│ ✓ #2 write_file │ ✓ #2 plan │
│ ● step 3: ... │ ● step 3: ... │
└───────────────────┴───────────────────┘
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add step callback to AgentRunner for real-time step logging
- BenchmarkUI now shows:
- Active runs with current step info
- Recent steps panel with colored config prefixes
- Proper Live display refresh (implements __rich_console__)
- Each config gets a distinct color for easy identification
- Verbose mode prints step logs immediately with config prefix
- Fix Live display not updating (pass UI object, not rendered content)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove vcrpy and pytest-recording dependencies
- Remove tests/vcr/ directory and vcr_cassettes submodule
- Remove .gitmodules (only had cassette submodule)
- Simplify CI workflow - no more cassette checkout/push/PAT_REVIEW
- Tests requiring API keys now skip if not set (fork PRs)
- Update CLAUDE.md files to remove cassette references
- Fix broken agbenchmark path in pyproject.toml
Security improvement: removes need for PAT with cross-repo write access.
Fork PRs will have API-dependent tests skipped (GitHub protects secrets).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove old benchmark/ folder with agbenchmark framework
- Move challenges to direct_benchmark/challenges/
- Move analysis tools (analyze_reports.py, analyze_failures.py) to direct_benchmark/
- Move challenges_already_beaten.json to direct_benchmark/
- Update CI workflow to use direct_benchmark
- Update CLAUDE.md files with new benchmarking instructions
- Add benchmarking section to original_autogpt/CLAUDE.md
The direct_benchmark harness directly instantiates agents without HTTP
server overhead, enabling parallel execution with asyncio semaphore.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add NONINTERACTIVE_MODE env var support to AppConfig for disabling
user interaction during automated runs
- Benchmark harness now sets NONINTERACTIVE_MODE=True when starting agents
- Add agent configuration logging at server startup (model, strategy, etc.)
- Harness logs env vars being passed to agent for verification
- Add --agent-output flag to show full agent server output for debugging
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add model comparison support to test harness (claude, openai, gpt5, opus presets)
- Add --models, --smart-llm, --fast-llm, --list-models CLI args
- Add real-time logging with timestamps and progress indicators
- Fix success parsing bug: read results[0].success instead of non-existent metrics.success
- Fix agbenchmark TestResult validation: use exception typename when value is empty
- Fix WebArena challenge validation: use strings instead of integers in instantiation_dict
- Fix Agent type annotations: create AnyActionProposal union for all prompt strategies
- Add pytest integration tests for the strategy benchmark harness
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When tool calls fail validation, the error messages now include:
- What arguments were actually provided
- The expected parameter schema with types and required/optional indicators
This helps LLMs understand and fix their mistakes when retrying,
rather than just being told a parameter is missing.
Example improved error:
Invalid function call for write_file: 'contents' is a required property
You provided: {"filename": 'story.txt'}
Expected parameters: {"filename": string (required), "contents": string (required)}
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add WebFetchComponent for fast HTTP-based page fetching without browser
overhead. Uses trafilatura for intelligent content extraction.
Commands:
- fetch_webpage: Extract main content as text/markdown/xml
  - Removes navigation, ads, boilerplate automatically
  - Extracts page metadata (title, description, author, date)
  - Extracts and lists page links
  - Much faster than Selenium-based read_webpage
- fetch_raw_html: Get raw HTML for structure inspection
  - Optional truncation for large pages
Features:
- Trafilatura-powered content extraction (best-in-class accuracy)
- Automatic link extraction with relative URL resolution
- Page metadata extraction (OG tags, meta tags)
- Configurable timeout, max content length, max links
- Proper error handling for timeouts and HTTP errors
- 19 comprehensive tests
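Core of the extraction path, roughly (trafilatura calls as commonly used; the component wiring and defaults are omitted):
  import trafilatura

  def fetch_webpage(url: str, max_length: int | None = None) -> str | None:
      downloaded = trafilatura.fetch_url(url)
      if downloaded is None:
          return None            # timeout or HTTP error, surfaced to the agent
      text = trafilatura.extract(
          downloaded, output_format="markdown", include_links=True
      )
      if text and max_length:
          text = text[:max_length]
      return text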
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Extract _get_tool_error_message helper method
- Replace 20+ levels of nesting with simple for loop
- Improve readability of tool_result construction
- Update benchmark poetry.lock
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace basic DuckDuckGo-only search with a modern tiered system:
1. Tavily (primary) - AI-optimized results with content extraction
   - AI-generated answer summaries
   - Relevance scoring
   - Full page content extraction via search_and_extract command
2. Serper (secondary) - Fast, cheap Google SERP results
   - $0.30-1.00 per 1K queries
   - Real Google results without scraping
3. DDGS multi-engine (fallback) - Free, no API key required
   - Automatic fallback chain: DuckDuckGo → Bing → Brave → Google → etc.
   - 8 search backends supported
Key changes:
- Upgrade duckduckgo-search to ddgs v9.10 (renamed successor package)
- Add Tavily and Serper API integrations
- Implement automatic provider selection and fallback chain
- Add search_and_extract command for research with content extraction
- Add TAVILY_API_KEY and SERPER_API_KEY to env templates
- Update benchmark httpx constraint for ddgs compatibility
- 23 comprehensive tests for all providers and fallback scenarios
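The fallback chain reduces to something like this (the provider callables stand in for the Tavily/Serper/DDGS clients):
  from typing import Callable

  Provider = Callable[[str, int], list[dict]]

  def tiered_search(query: str, providers: list[Provider], num_results: int = 8) -> list[dict]:
      # Try Tavily first, then Serper, then the free DDGS multi-engine fallback;
      # skip a tier on errors, missing API keys, or empty results
      for provider in providers:
          try:
              results = provider(query, num_results)
          except Exception:
              continue
          if results:
              return results
      return []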
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Change Agent class to use ActionProposal instead of OneShotAgentActionProposal
to support multiple prompt strategy types
- Widen display_thoughts parameter type from AssistantThoughts to ModelWithSummary
- Fix speak attribute access in agent_protocol_server with hasattr check
- Add type: ignore comments for intentional thoughts field overrides in strategies
- Remove unused OneShotAgentActionProposal import
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix 32+ flake8 E501 (line too long) errors by shortening descriptions
- Remove unused import in todo.py
- Fix test_todo.py argument order (config= keyword)
- Add type annotations to fix pyright errors where straightforward
- Add noqa comments for flake8 false positives in __init__.py
- Remove unused nonlocal declarations in main.py
- Run black and isort to fix formatting
- Update CLAUDE.md with improved linting commands
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, when a user selected "Once" or "Always" with feedback (via Tab),
the command was NOT executed because UserFeedbackProvided was raised before
checking the approval scope. This fix changes the architecture from
exception-based to return-value-based.
Changes:
- Add PermissionCheckResult class with allowed, scope, and feedback fields
- Change check_command() to return PermissionCheckResult instead of bool
- Update prompt_fn signature to return (ApprovalScope, feedback) tuple
- Add pending_user_feedback mechanism to EpisodicActionHistory
- Update execute() to handle feedback after successful command execution
- Feedback message explicitly states "Command executed successfully"
- Add on_auto_approve callback for displaying auto-approved commands
- Add comprehensive tests for approval/denial with feedback scenarios
Behavior:
- Once + feedback → Execute command, then send feedback to agent
- Always + feedback → Execute command, save permission, send feedback
- Deny + feedback → Don't execute, send feedback to agent
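Sketch of the new return type (field names from the list above; the ApprovalScope members mirror the Once/Always/Deny behavior and are assumptions):
  from dataclasses import dataclass
  from enum import Enum
  from typing import Optional

  class ApprovalScope(Enum):
      ONCE = "once"
      ALWAYS = "always"
      DENY = "deny"

  @dataclass
  class PermissionCheckResult:
      allowed: bool
      scope: Optional[ApprovalScope] = None
      feedback: Optional[str] = None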
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>