- Add 'playwright install chromium' step to Forge CI workflow
- Auto-detect default model from available API keys (ANTHROPIC_API_KEY,
OPENAI_API_KEY, GROQ_API_KEY) in direct_benchmark harness
- Prefer Claude > OpenAI > Groq; fall back to OpenAI if no keys are found
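A minimal sketch of that selection order, assuming only the env var names
listed above (the function name and model identifiers are illustrative):

    import os

    def detect_default_model() -> str:
        # Preference order from this change: Claude > OpenAI > Groq;
        # with no keys present, default to OpenAI anyway.
        if os.environ.get("ANTHROPIC_API_KEY"):
            return "claude"
        if os.environ.get("OPENAI_API_KEY"):
            return "openai"
        if os.environ.get("GROQ_API_KEY"):
            return "groq"
        return "openai"  # no keys found: fall back to OpenAI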
Integrate standard AI agent benchmarks into the direct_benchmark infrastructure
using a plugin-based adapter pattern:
- Add BenchmarkAdapter base class with setup(), load_challenges(), and
evaluate() (see the sketch after this list)
- Implement GAIAAdapter for the GAIA benchmark (requires HF token)
- Implement SWEBenchAdapter for SWE-bench (requires Docker)
- Implement AgentBenchAdapter for AgentBench multi-environment benchmark
- Extend HarnessConfig with benchmark options (--benchmark, --benchmark-split, etc.)
- Modify ParallelExecutor to use adapter's evaluate() for external benchmarks
- Fix runner to record finish step (was being skipped, breaking answer extraction)
- Add optional benchmarks dependency group with datasets and huggingface-hub
- Increase default benchmark timeout to 900s
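A minimal sketch of the adapter interface named above; the three method
names come from this change, while signatures and types are assumptions:

    from abc import ABC, abstractmethod
    from typing import Any

    class BenchmarkAdapter(ABC):
        """Plugin-style adapter mapping an external benchmark onto the
        direct_benchmark challenge lifecycle (sketch only)."""

        @abstractmethod
        def setup(self) -> None:
            """Prepare the benchmark, e.g. download datasets or verify
            prerequisites such as an HF token or a running Docker daemon."""

        @abstractmethod
        def load_challenges(self, split: str | None = None) -> list[dict[str, Any]]:
            """Return benchmark tasks (optionally filtered by split/subset)
            converted into the harness's internal challenge format."""

        @abstractmethod
        def evaluate(self, challenge: dict[str, Any], answer: str) -> bool:
            """Score the agent's answer with the benchmark's own grading;
            ParallelExecutor calls this instead of the built-in evaluator."""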
Usage:
poetry run direct-benchmark run \
--benchmark agent-bench \
--benchmark-subset dbbench \
--strategies one_shot \
--models claude
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update black version to match pre-commit hook (24.10.0) and reformat
all files with the new version.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add missing strategies (lats, multi_agent_debate) to PromptStrategyName
- Fix method override signatures for reasoning_effort parameter
- Fix Pydantic Field() overload issues with helper function
- Fix BeautifulSoup Tag type narrowing in web_fetch.py
- Fix Optional member access in playwright_browser.py and rewoo.py
- Convert hasattr patterns to getattr for proper type narrowing
- Add proper type casts for Literal types
- Fix file storage path type conversions
- Exclude legacy challenges/ from pyright checking
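The hasattr-to-getattr conversion follows the usual pattern for pyright
narrowing; an illustrative example, not the project's actual code (the
attribute name is borrowed from the reasoning_effort item above, and the
str type is an assumption):

    from typing import Any

    def get_reasoning_effort(provider: Any) -> str | None:
        # Before: `if hasattr(provider, "reasoning_effort"): ...` leaves the
        # attribute typed as Unknown for pyright.
        # After: getattr() with a None default yields a value the checker
        # can narrow with an isinstance() check.
        effort = getattr(provider, "reasoning_effort", None)
        return effort if isinstance(effort, str) else None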
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update .flake8 config to exclude workspace directories and ignore E203
- Fix import sorting (isort) across multiple files
- Fix code formatting (black) across multiple files
- Remove unused imports and fix line length issues (flake8)
- Fix f-strings without placeholders and unused variables
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive sub-agent spawning infrastructure that enables prompt
strategies to coordinate multiple agents for advanced reasoning patterns.
New files:
- forge/agent/execution_context.py: ExecutionContext, ResourceBudget,
SubAgentHandle, and AgentFactory protocol for sub-agent lifecycle
- agent_factory/default_factory.py: DefaultAgentFactory implementation
- prompt_strategies/lats.py: Language Agent Tree Search using MCTS
with sub-agents for action expansion and evaluation
- prompt_strategies/multi_agent_debate.py: Multi-agent debate with
proposal, critique, and consensus phases
Key changes:
- BaseMultiStepPromptStrategy gains spawn_sub_agent(), run_sub_agent(),
spawn_and_run(), and run_parallel() methods
- Agent class accepts optional ExecutionContext and injects it into strategies
- Sub-agents enabled by default (enable_sub_agents=True)
- Resource limits: max_depth=5, max_sub_agents=25, max_cycles=25
All 7 strategies now available in benchmark:
one_shot, rewoo, plan_execute, reflexion, tree_of_thoughts, lats, multi_agent_debate
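A minimal sketch of the resource budget: the class name and the three
limits come from this change; the helper method is hypothetical.

    from dataclasses import dataclass

    @dataclass
    class ResourceBudget:
        max_depth: int = 5        # how deep sub-agent trees may nest
        max_sub_agents: int = 25  # total sub-agents spawned under one root
        max_cycles: int = 25      # execution cycles allowed per agent

        def allows_spawn(self, depth: int, spawned: int) -> bool:
            # Hypothetical guard: spawning is permitted only while both the
            # nesting depth and the spawn count remain within budget.
            return depth < self.max_depth and spawned < self.max_sub_agents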
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable agents to execute shell commands during benchmarks by setting
execute_local_commands=True and using denylist mode to block dangerous
commands (rm, sudo, chmod, kill, etc.) while allowing safe operations.
Also add an ExecutePython challenge to test code execution capability.
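A rough sketch of the denylist check; the blocked commands are the ones
named above, while the helper and exact matching rules are assumptions:

    import shlex

    DENYLIST = {"rm", "sudo", "chmod", "kill"}  # subset named in this change

    def is_command_allowed(command_line: str) -> bool:
        # Denylist mode: allow everything except commands whose executable
        # appears on the blocked list.
        tokens = shlex.split(command_line)
        if not tokens:
            return False
        executable = tokens[0].rsplit("/", 1)[-1]  # strip any path prefix
        return executable not in DENYLIST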
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Merge forge/, original_autogpt/, and direct_benchmark/ into a single Poetry
project to eliminate cross-project path dependency issues.
Changes:
- Create classic/pyproject.toml with merged dependencies from all three projects
- Remove individual pyproject.toml and poetry.lock files from subdirectories
- Update all CLAUDE.md files to reflect commands run from classic/ root
- Update all README.md files with new installation and usage instructions
All packages are now included via the packages directive:
- forge/forge (core agent framework)
- original_autogpt/autogpt (AutoGPT agent)
- direct_benchmark/direct_benchmark (benchmark harness)
CLI entry points preserved: autogpt, serve, direct-benchmark
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When a challenge times out but the agent's solution would have passed
evaluation, this is now clearly indicated:
- Completion blocks show "TIMEOUT (would have passed)" in yellow
- Recent completions panel shows hourglass icon + "would pass" suffix
- Summary table has new "Would Pass" column
- Final summary shows "+N would pass" count
- Success rate includes "would pass" challenges
The evaluator still runs on timed-out challenges to calculate the score,
but success remains False. This gives visibility into near-misses that
just needed more time.
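A simplified sketch of the bookkeeping; the success/would-pass semantics
come from this change, while field and function names are assumptions:

    from dataclasses import dataclass

    @dataclass
    class ChallengeResult:
        timed_out: bool
        success: bool = False
        would_pass: bool = False  # timed out, but evaluation would have passed
        score: float = 0.0

    def apply_evaluation(result: ChallengeResult, passed: bool, score: float) -> None:
        # The evaluator still runs on timed-out challenges to produce a score,
        # but success stays False; the near-miss is surfaced as would_pass.
        result.score = score
        if result.timed_out:
            result.would_pass = passed
        else:
            result.success = passed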
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, the evaluator would run on all results including timed-out
challenges. If the agent happened to write a working solution before
timing out, evaluation would pass and override success to True, resulting
in contradictory output showing both PASS and "timed out".
Now we skip evaluation for timed-out challenges - they cannot pass.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The TicTacToe challenge and several others use pytest-based test files for
evaluation. Without pytest installed in the benchmark virtualenv,
these evaluations were silently failing.
Root cause: test.py imports pytest but the package wasn't a dependency,
causing a ModuleNotFoundError in the evaluation subprocess.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Include config:challenge:attempt and timestamp in completion block
header for easier debugging and log correlation.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove selenium.py and test_selenium.py
- Add playwright_browser.py with WebPlaywrightComponent
- Update web component exports to use Playwright
- Update dependencies in pyproject.toml/poetry.lock
- Minor agent and reflexion strategy improvements
- Update CLAUDE.md documentation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, cost was hardcoded to 0.0. Now the harness extracts the cumulative
cost from MultiProvider.get_incurred_cost() after each step execution.
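A minimal sketch of where the cost is read; only
MultiProvider.get_incurred_cost() comes from this change, the surrounding
loop and names are illustrative:

    from typing import Any

    async def run_steps(agent: Any, provider: Any, max_steps: int) -> float:
        total_cost = 0.0
        for _ in range(max_steps):
            await agent.do_step()  # hypothetical step method
            # Cumulative cost so far, replacing the previous hardcoded 0.0.
            total_cost = provider.get_incurred_cost()
        return total_cost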
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Benchmarks now automatically save progress and resume from where they
left off. State is persisted to .benchmark_state.json in the reports directory.
Features:
- Auto-resume: runs skip already-completed challenges
- --fresh: clear all state and start over
- --retry-failures: re-run only failed challenges
- --reset-strategy/model/challenge: selective resets
- `state show/clear/reset` subcommands for state management
- Config mismatch detection with auto-reset
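A rough sketch of consulting the persisted state on resume; the file name
comes from this change, while the JSON layout and helper are assumptions:

    import json
    from pathlib import Path

    STATE_FILE = ".benchmark_state.json"  # lives in the reports directory

    def load_completed_runs(reports_dir: Path) -> set[str]:
        # Run keys already completed, so a resumed run can skip them;
        # the schema here is assumed, not the harness's actual layout.
        state_path = reports_dir / STATE_FILE
        if not state_path.exists():
            return set()
        state = json.loads(state_path.read_text())
        return set(state.get("completed", []))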
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add --ci flag that disables Rich Live display while preserving
completion blocks. Auto-detects CI environment via CI env var or
non-TTY stdout. Prints progress every 10 completions for visibility.
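The auto-detection amounts to a check like the following (sketch; the
function name is illustrative):

    import os
    import sys

    def use_plain_output(ci_flag: bool) -> bool:
        # Disable the Rich Live display when --ci is passed, when a CI
        # environment variable is set, or when stdout is not a TTY.
        return ci_flag or bool(os.environ.get("CI")) or not sys.stdout.isatty()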
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Calculate max columns based on terminal width (up to 10)
- Reduce panel width from 35 to 30 chars to fit more panels
- Wider terminals can now show more parallel runs side-by-side
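The column calculation reduces to something like this; the 30-char panel
width and 10-column cap come from this change, the helper is illustrative:

    import shutil

    PANEL_WIDTH = 30   # reduced from 35
    MAX_COLUMNS = 10

    def max_panel_columns() -> int:
        # Fit as many fixed-width panels as the terminal allows, capped at 10.
        terminal_width = shutil.get_terminal_size().columns
        return max(1, min(MAX_COLUMNS, terminal_width // PANEL_WIDTH))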
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fixes:
- Use run_key (config:challenge) instead of just config_name for tracking
active runs, allowing multiple challenges from the same config to run in parallel
- Add asyncio.sleep(0) yields so multiple tasks can acquire the semaphore
and start before any of them proceeds with work
- Always print completion blocks (not just failures) for visibility
This should properly show 8/8 active runs when running with --parallel 8.
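A simplified sketch of the scheduling pattern; run_key and the sleep(0)
yield come from this change, the surrounding names are illustrative:

    import asyncio

    async def run_one(semaphore: asyncio.Semaphore, config_name: str,
                      challenge: str, active: set[str]) -> None:
        run_key = f"{config_name}:{challenge}"  # not just config_name
        async with semaphore:
            active.add(run_key)
            # Yield to the event loop so sibling tasks can also acquire the
            # semaphore and register as active before any heavy work begins.
            await asyncio.sleep(0)
            try:
                ...  # execute the challenge
            finally:
                active.discard(run_key)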
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
UI improvements:
- Multi-column layout: each active config gets its own panel showing
challenge name and step history (last 6 steps with status)
- Copy-paste completion blocks: when a challenge finishes (especially
failures), prints a detailed block with all steps for easy debugging
- Configurable logging: suppresses noisy LLM provider warnings unless
--debug flag is set
- Pass debug flag through harness to UI
Example active runs panel:
┌─ one_shot/claude ─┬─ rewoo/claude ────┐
│ ReadFile │ WriteFile │
│ ✓ #1 read_file │ ✓ #1 think │
│ ✓ #2 write_file │ ✓ #2 plan │
│ ● step 3: ... │ ● step 3: ... │
└───────────────────┴───────────────────┘
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add step callback to AgentRunner for real-time step logging
- BenchmarkUI now shows:
- Active runs with current step info
- Recent steps panel with colored config prefixes
- Proper Live display refresh (implements __rich_console__)
- Each config gets a distinct color for easy identification
- Verbose mode prints step logs immediately with config prefix
- Fix Live display not updating (pass UI object, not rendered content)
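The Live fix follows Rich's renderable protocol: implement __rich_console__
on the UI object and hand the object itself to Live, so every refresh
re-renders current state. A generic sketch (not the project's BenchmarkUI):

    import time

    from rich.console import Console, ConsoleOptions, RenderResult
    from rich.live import Live
    from rich.panel import Panel

    class StatusView:
        def __init__(self) -> None:
            self.message = "starting..."

        def __rich_console__(self, console: Console,
                             options: ConsoleOptions) -> RenderResult:
            # Re-rendered on every Live refresh, so updates become visible.
            yield Panel(self.message, title="active runs")

    view = StatusView()
    with Live(view, refresh_per_second=4):  # pass the object, not rendered text
        view.message = "step 1 complete"
        time.sleep(1)  # give the display a moment to refresh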
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove old benchmark/ folder with agbenchmark framework
- Move challenges to direct_benchmark/challenges/
- Move analysis tools (analyze_reports.py, analyze_failures.py) to direct_benchmark/
- Move challenges_already_beaten.json to direct_benchmark/
- Update CI workflow to use direct_benchmark
- Update CLAUDE.md files with new benchmarking instructions
- Add benchmarking section to original_autogpt/CLAUDE.md
The direct_benchmark harness directly instantiates agents without HTTP
server overhead, enabling parallel execution with an asyncio semaphore.
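The parallel execution follows the standard asyncio pattern of bounding
concurrency with a semaphore; a generic sketch, not the harness's actual code:

    import asyncio
    from typing import Awaitable, Callable

    async def run_all(challenges: list[str],
                      run_challenge: Callable[[str], Awaitable[None]],
                      parallel: int = 8) -> None:
        semaphore = asyncio.Semaphore(parallel)

        async def bounded(challenge: str) -> None:
            async with semaphore:  # at most `parallel` agents run at once
                await run_challenge(challenge)

        await asyncio.gather(*(bounded(c) for c in challenges))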
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>