Docker containers cannot have their mount bindings updated after creation.
When running benchmarks or multiple agent instances, the same container name
could be reused with a different workspace directory, causing the container
to still reference the OLD mount path. This resulted in "python: can't open
file '/workspace/temp*.py'" errors.
The fix: remove existing containers before creating new ones to ensure fresh
mount bindings to the current workspace directory.
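A minimal sketch of the fix using the docker-py SDK (image name and helper are illustrative):

```python
import docker
from docker.errors import NotFound

def create_workspace_container(client: docker.DockerClient,
                               name: str, workspace_dir: str):
    # Mount bindings are fixed at creation time, so a leftover container
    # with this name may still point at a stale workspace path.
    try:
        client.containers.get(name).remove(force=True)
    except NotFound:
        pass  # no previous container; nothing to clean up
    # Recreate with a fresh bind to the *current* workspace directory.
    return client.containers.run(
        "python:3-alpine",  # illustrative image
        name=name,
        detach=True,
        tty=True,
        volumes={workspace_dir: {"bind": "/workspace", "mode": "rw"}},
    )
```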
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When prompts encourage parallel tool execution and the LLM makes multiple
tool calls simultaneously, the Anthropic API requires a tool_result block
for EACH tool_use. Previously, we only created one tool result for the first
tool call, causing "tool_use ids were found without tool_result blocks" errors.
This fix:
- Adds _make_result_messages() to create results for ALL tool calls
- Maps tool names to their outputs from parallel execution results
- Handles errors per-tool from the _errors list
- Falls back gracefully when results are missing
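A hedged sketch of the per-tool result construction (internal types simplified; the attribute and dict shapes are assumptions):

```python
def _make_result_messages(tool_calls, outputs_by_name, errors_by_name):
    """Build one tool_result block per tool_use, as the Anthropic API requires."""
    blocks = []
    for call in tool_calls:  # every tool_use, not just the first
        error = errors_by_name.get(call.name)
        output = outputs_by_name.get(call.name)
        if error is not None:
            content, is_error = str(error), True  # per-tool error reporting
        elif output is not None:
            content, is_error = str(output), False
        else:
            # Graceful fallback when a parallel result went missing.
            content, is_error = "No result captured for this tool call.", True
        blocks.append({
            "type": "tool_result",
            "tool_use_id": call.id,
            "content": content,
            "is_error": is_error,
        })
    # All tool_result blocks belong in a single user message.
    return [{"role": "user", "content": blocks}]
```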
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update black version to match pre-commit hook (24.10.0) and reformat
all files with the new version.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add missing strategies (lats, multi_agent_debate) to PromptStrategyName
- Fix method override signatures for reasoning_effort parameter
- Fix Pydantic Field() overload issues with helper function
- Fix BeautifulSoup Tag type narrowing in web_fetch.py
- Fix Optional member access in playwright_browser.py and rewoo.py
- Convert hasattr patterns to getattr for proper type narrowing (sketch below)
- Add proper type casts for Literal types
- Fix file storage path type conversions
- Exclude legacy challenges/ from pyright checking
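The hasattr-to-getattr conversion follows this pattern (names are illustrative):

```python
# Before: pyright cannot narrow the attribute type through hasattr()
if hasattr(proposal, "speak") and proposal.speak:
    say(proposal.speak)  # attribute still unknown to the type checker

# After: getattr with a default yields a local variable that narrows cleanly
speak = getattr(proposal, "speak", None)
if speak is not None:
    say(speak)
```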
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Consolidate classic poetry-install hooks into single hook using classic/
- Update isort hook to work with consolidated project structure
- Simplify flake8 hooks to use single classic/.flake8 config
- Consolidate pyright hooks into single hook for classic/
- Add direct_benchmark to hook coverage
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update .flake8 config to exclude workspace directories and ignore E203
- Fix import sorting (isort) across multiple files
- Fix code formatting (black) across multiple files
- Remove unused imports and fix line length issues (flake8)
- Fix f-strings without placeholders and unused variables
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive sub-agent spawning infrastructure that enables prompt
strategies to coordinate multiple agents for advanced reasoning patterns.
New files:
- forge/agent/execution_context.py: ExecutionContext, ResourceBudget,
SubAgentHandle, and AgentFactory protocol for sub-agent lifecycle
- agent_factory/default_factory.py: DefaultAgentFactory implementation
- prompt_strategies/lats.py: Language Agent Tree Search using MCTS
with sub-agents for action expansion and evaluation
- prompt_strategies/multi_agent_debate.py: Multi-agent debate with
proposal, critique, and consensus phases
Key changes:
- BaseMultiStepPromptStrategy gains spawn_sub_agent(), run_sub_agent(),
spawn_and_run(), and run_parallel() methods (usage sketch below)
- Agent class accepts optional ExecutionContext and injects it into strategies
- Sub-agents enabled by default (enable_sub_agents=True)
- Resource limits: max_depth=5, max_sub_agents=25, max_cycles=25
All 7 strategies now available in benchmark:
one_shot, rewoo, plan_execute, reflexion, tree_of_thoughts, lats, multi_agent_debate
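A hedged sketch of how a strategy might use the new helpers (exact signatures may differ; task strings are illustrative):

```python
class MyStrategy(BaseMultiStepPromptStrategy):
    async def evaluate_candidates(self, candidates: list[str]) -> list:
        # Fan out one sub-agent per candidate, bounded by the resource
        # limits above (max_depth=5, max_sub_agents=25, max_cycles=25).
        return await self.run_parallel(
            [f"Evaluate this candidate action: {c}" for c in candidates]
        )

    async def summarize(self, context: str):
        # spawn_and_run() combines spawn_sub_agent() and run_sub_agent().
        return await self.spawn_and_run(f"Summarize this plan: {context}")
```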
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update all classic CI workflows to use the single consolidated
pyproject.toml at classic/ instead of individual project directories.
Changes:
- classic-autogpt-ci.yml: Run from classic/, update cache key and test paths
- classic-forge-ci.yml: Run from classic/, update cache key and test paths
- classic-benchmark-ci.yml: Run from classic/, use direct-benchmark command
- classic-python-checks.yml: Simplify to single job (no matrix needed)
- classic-autogpts-ci.yml: Update to use direct-benchmark for smoke tests
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable agents to execute shell commands during benchmarks by setting
execute_local_commands=True and using denylist mode to block dangerous
commands (rm, sudo, chmod, kill, etc.) while allowing safe operations.
Also adds ExecutePython challenge to test code execution capability.
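A minimal sketch of the denylist check (list abbreviated; helper name is illustrative):

```python
import shlex

DENYLIST = {"rm", "sudo", "chmod", "kill"}  # abbreviated; the real list is longer

def is_command_allowed(command_line: str) -> bool:
    tokens = shlex.split(command_line)
    # Denylist mode: everything is allowed except known-dangerous executables.
    return bool(tokens) and tokens[0] not in DENYLIST
```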
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Merge forge/, original_autogpt/, and direct_benchmark/ into a single Poetry
project to eliminate cross-project path dependency issues.
Changes:
- Create classic/pyproject.toml with merged dependencies from all three projects
- Remove individual pyproject.toml and poetry.lock files from subdirectories
- Update all CLAUDE.md files to reflect commands run from classic/ root
- Update all README.md files with new installation and usage instructions
All packages are now included via the packages directive:
- forge/forge (core agent framework)
- original_autogpt/autogpt (AutoGPT agent)
- direct_benchmark/direct_benchmark (benchmark harness)
CLI entry points preserved: autogpt, serve, direct-benchmark
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When a challenge times out but the agent's solution would have passed
evaluation, this is now clearly indicated:
- Completion blocks show "TIMEOUT (would have passed)" in yellow
- Recent completions panel shows hourglass icon + "would pass" suffix
- Summary table has new "Would Pass" column
- Final summary shows "+N would pass" count
- Success rate includes "would pass" challenges
The evaluator still runs on timed-out challenges to calculate the score,
but success remains False. This gives visibility into near-misses that
just needed more time.
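The reporting rule is roughly (field and helper names are illustrative):

```python
def finalize(result, score: float, threshold: float) -> None:
    # The evaluator has already run, even for timed-out challenges.
    passed = score >= threshold
    if result.timed_out:
        result.success = False       # timeouts never count as real passes
        result.would_pass = passed   # but near-misses are surfaced in the UI
    else:
        result.success = passed
```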
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, the evaluator would run on all results including timed-out
challenges. If the agent happened to write a working solution before
timing out, evaluation would pass and override success=True, resulting
in contradictory output showing both PASS and "timed out".
Now we skip evaluation for timed-out challenges - they cannot pass.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The TicTacToe and other challenges use pytest-based test files for
evaluation. Without pytest installed in the benchmark virtualenv,
these evaluations were silently failing.
Root cause: test.py imports pytest, but the package wasn't a dependency,
causing a ModuleNotFoundError in the evaluation subprocess.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Include config:challenge:attempt and timestamp in completion block
header for easier debugging and log correlation.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove selenium.py and test_selenium.py
- Add playwright_browser.py with WebPlaywrightComponent
- Update web component exports to use Playwright
- Update dependencies in pyproject.toml/poetry.lock
- Minor agent and reflexion strategy improvements
- Update CLAUDE.md documentation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, cost was hardcoded to 0.0. Now the cumulative cost is extracted
from MultiProvider.get_incurred_cost() after each step execution.
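Roughly (wiring simplified):

```python
async def run_step(agent, provider) -> float:
    await agent.do_step()  # placeholder for the real step call
    # MultiProvider aggregates spend across its underlying LLM providers;
    # report the cumulative figure instead of a hardcoded 0.0.
    return provider.get_incurred_cost()
```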
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Benchmarks now automatically save progress and resume from where they
left off. State is persisted to .benchmark_state.json in the reports dir.
Features:
- Auto-resume: runs skip already-completed challenges
- --fresh: clear all state and start over
- --retry-failures: re-run only failed challenges
- --reset-strategy/model/challenge: selective resets
- `state show/clear/reset` subcommands for state management
- Config mismatch detection with auto-reset
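A sketch of the resume loop (state shape is illustrative; the runner is passed in as a placeholder):

```python
import json
from pathlib import Path

def run_with_resume(reports_dir: str, planned_runs: list[str], run_fn) -> None:
    state_file = Path(reports_dir) / ".benchmark_state.json"
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    completed = set(state.get("completed", []))

    for run_key in planned_runs:      # e.g. "one_shot:claude:ReadFile"
        if run_key in completed:
            continue                  # auto-resume: skip finished work
        run_fn(run_key)
        completed.add(run_key)
        state["completed"] = sorted(completed)
        state_file.write_text(json.dumps(state))  # persist after every run
```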
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add --ci flag that disables Rich Live display while preserving
completion blocks. Auto-detects CI environment via CI env var or
non-TTY stdout. Prints progress every 10 completions for visibility.
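The detection logic is roughly:

```python
import os
import sys

def live_display_disabled(ci_flag: bool) -> bool:
    # --ci flag, a CI env var, or non-TTY stdout all disable Rich Live.
    return ci_flag or bool(os.environ.get("CI")) or not sys.stdout.isatty()
```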
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Calculate max columns based on terminal width (up to 10)
- Reduce panel width from 35 to 30 chars to fit more panels
- Wider terminals can now show more parallel runs side-by-side
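The column calculation is roughly:

```python
import shutil

PANEL_WIDTH = 30  # was 35

def max_columns() -> int:
    width = shutil.get_terminal_size().columns
    return max(1, min(width // PANEL_WIDTH, 10))  # cap at 10 panels
```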
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fixes:
- Use run_key (config:challenge) instead of just config_name for tracking
active runs - this allows multiple challenges from the same config to run in parallel
- Add asyncio.sleep(0) yields to let multiple tasks acquire the semaphore
and start before any of them proceeds with work
- Always print completion blocks (not just failures) for visibility
This should properly show 8/8 active runs when running with --parallel 8.
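A sketch of the start-up yield (runner and registry are illustrative):

```python
import asyncio

active: set[str] = set()

async def run_one(sem: asyncio.Semaphore, run_key: str, run_fn) -> None:
    async with sem:
        active.add(run_key)  # run_key = "config:challenge"
        # Yield once so sibling tasks can also acquire the semaphore and
        # register as active before any of them starts real work.
        await asyncio.sleep(0)
        try:
            await run_fn(run_key)
        finally:
            active.discard(run_key)
```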
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
UI improvements:
- Multi-column layout: each active config gets its own panel showing
challenge name and step history (last 6 steps with status)
- Copy-paste completion blocks: when a challenge finishes (especially
failures), prints a detailed block with all steps for easy debugging
- Configurable logging: suppresses noisy LLM provider warnings unless
--debug flag is set
- Pass debug flag through harness to UI
Example active runs panel:
┌─ one_shot/claude ─┬─ rewoo/claude ────┐
│ ReadFile          │ WriteFile         │
│ ✓ #1 read_file    │ ✓ #1 think        │
│ ✓ #2 write_file   │ ✓ #2 plan         │
│ ● step 3: ...     │ ● step 3: ...     │
└───────────────────┴───────────────────┘
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add step callback to AgentRunner for real-time step logging
- BenchmarkUI now shows:
  - Active runs with current step info
  - Recent steps panel with colored config prefixes
- Proper Live display refresh (implements __rich_console__)
- Each config gets a distinct color for easy identification
- Verbose mode prints step logs immediately with config prefix
- Fix Live display not updating (pass UI object, not rendered content)
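The Live fix hinges on Rich re-rendering whatever object is passed to Live on every refresh; a condensed sketch:

```python
from rich.live import Live
from rich.panel import Panel

class BenchmarkUI:
    def __init__(self) -> None:
        self.steps: list[str] = []

    def __rich_console__(self, console, options):
        # Re-evaluated on every refresh, so appended steps show up live.
        yield Panel("\n".join(self.steps[-6:]), title="Recent steps")

ui = BenchmarkUI()
with Live(ui, refresh_per_second=4):  # pass the UI object, not rendered output
    ...  # step callbacks append to ui.steps
```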
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove vcrpy and pytest-recording dependencies
- Remove tests/vcr/ directory and vcr_cassettes submodule
- Remove .gitmodules (only had cassette submodule)
- Simplify CI workflow - no more cassette checkout/push/PAT_REVIEW
- Tests requiring API keys now skip if not set (fork PRs)
- Update CLAUDE.md files to remove cassette references
- Fix broken agbenchmark path in pyproject.toml
Security improvement: removes need for PAT with cross-repo write access.
Fork PRs will have API-dependent tests skipped (GitHub protects secrets).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove old benchmark/ folder with agbenchmark framework
- Move challenges to direct_benchmark/challenges/
- Move analysis tools (analyze_reports.py, analyze_failures.py) to direct_benchmark/
- Move challenges_already_beaten.json to direct_benchmark/
- Update CI workflow to use direct_benchmark
- Update CLAUDE.md files with new benchmarking instructions
- Add benchmarking section to original_autogpt/CLAUDE.md
The direct_benchmark harness directly instantiates agents without HTTP
server overhead, enabling parallel execution with asyncio semaphore.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add NONINTERACTIVE_MODE env var support to AppConfig for disabling
user interaction during automated runs
- Benchmark harness now sets NONINTERACTIVE_MODE=True when starting agents
- Add agent configuration logging at server startup (model, strategy, etc.)
- Harness logs env vars being passed to agent for verification
- Add --agent-output flag to show full agent server output for debugging
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add model comparison support to test harness (claude, openai, gpt5, opus presets)
- Add --models, --smart-llm, --fast-llm, --list-models CLI args
- Add real-time logging with timestamps and progress indicators
- Fix success parsing bug: read results[0].success instead of non-existent metrics.success
- Fix agbenchmark TestResult validation: use exception typename when value is empty
- Fix WebArena challenge validation: use strings instead of integers in instantiation_dict
- Fix Agent type annotations: create AnyActionProposal union for all prompt strategies
- Add pytest integration tests for the strategy benchmark harness
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This PR improves the Langfuse tracing implementation in the chat feature
by adopting the v3 SDK patterns, resulting in cleaner code and better
observability.
### Changes 🏗️
- **Simplified Langfuse client usage**: Replace manual client
initialization with `langfuse.get_client()` global singleton
- **Use v3 context managers**: Switch to
`start_as_current_observation()` and `propagate_attributes()` for
automatic trace propagation
- **Auto-instrument OpenAI calls**: Use `langfuse.openai` wrapper for
automatic LLM call tracing instead of manual generation tracking
- **Add `@observe` decorators**: All chat tools now have
`@observe(as_type="tool")` decorators for automatic tool execution
tracing:
- `add_understanding`
- `view_agent_output` (renamed from `agent_output`)
- `create_agent`
- `edit_agent`
- `find_agent`
- `find_block`
- `find_library_agent`
- `get_doc_page`
- `run_agent`
- `run_block`
- `search_docs`
- **Remove manual trace lifecycle**: Eliminated the verbose `finally`
block that manually ended traces/generations
- **Rename tool**: `agent_output` → `view_agent_output` for clarity
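
A condensed sketch of the v3 pattern (model name and tool body are placeholders):

```python
from langfuse import get_client, observe
from langfuse.openai import openai  # auto-instruments OpenAI calls

langfuse = get_client()  # global singleton

@observe(as_type="tool")
def find_block(query: str) -> list[dict]:
    # Recorded as a nested "tool" observation under the current trace.
    ...

def chat_turn(user_message: str):
    with langfuse.start_as_current_observation(as_type="span", name="chat-turn"):
        # LLM calls inside this context are traced automatically; no manual
        # generation objects and no finally-block trace cleanup needed.
        return openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_message}],
        )
```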
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verified chat feature works with Langfuse tracing enabled
- [x] Confirmed traces appear correctly in Langfuse dashboard with tool
spans
- [x] Tested tool execution flows show up as nested observations
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
No configuration changes required - uses existing Langfuse environment
variables.
When tool calls fail validation, the error messages now include:
- What arguments were actually provided
- The expected parameter schema with types and required/optional indicators
This helps LLMs understand and fix their mistakes when retrying,
rather than just being told a parameter is missing.
Example improved error:
  Invalid function call for write_file: 'contents' is a required property
  You provided: {"filename": "story.txt"}
  Expected parameters: {"filename": string (required), "contents": string (required)}
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add WebFetchComponent for fast HTTP-based page fetching without browser
overhead. Uses trafilatura for intelligent content extraction.
Commands:
- fetch_webpage: Extract main content as text/markdown/xml
  - Removes navigation, ads, boilerplate automatically
  - Extracts page metadata (title, description, author, date)
  - Extracts and lists page links
  - Much faster than Selenium-based read_webpage
- fetch_raw_html: Get raw HTML for structure inspection
  - Optional truncation for large pages
Features:
- Trafilatura-powered content extraction (best-in-class accuracy)
- Automatic link extraction with relative URL resolution
- Page metadata extraction (OG tags, meta tags)
- Configurable timeout, max content length, max links
- Proper error handling for timeouts and HTTP errors
- 19 comprehensive tests
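A sketch of the extraction path (component wiring omitted):

```python
import trafilatura

def fetch_webpage(url: str, fmt: str = "markdown") -> str | None:
    downloaded = trafilatura.fetch_url(url)  # plain HTTP, no browser overhead
    if downloaded is None:
        return None                          # timeout / HTTP error path
    return trafilatura.extract(
        downloaded,
        output_format=fmt,                   # "txt", "markdown", or "xml"
        include_links=True,
    )
```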
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add generate_block_docs.py script that introspects block code to
generate markdown
- Support manual content preservation via <!-- MANUAL: --> markers
- Add migrate_block_docs.py to preserve existing manual content from git
HEAD
- Add CI workflow (docs-block-sync.yml) to fail if docs drift from code
- Add Claude PR review workflow (docs-claude-review.yml) for doc changes
- Add manual LLM enhancement workflow (docs-enhance.yml)
- Add GitBook configuration (.gitbook.yaml, SUMMARY.md)
- Fix non-deterministic category ordering (categories is a set)
- Add comprehensive test suite (32 tests)
- Generate docs for 444 blocks with 66 preserved manual sections
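A hedged sketch of manual-section preservation; the closing-marker form is an assumption, since only the opening `<!-- MANUAL: -->` marker is specified above:

```python
import re

MANUAL_RE = re.compile(
    r"<!-- MANUAL: (?P<name>.*?) -->(?P<body>.*?)<!-- END MANUAL -->",
    re.DOTALL,
)

def preserve_manual_sections(old_doc: str, new_doc: str) -> str:
    saved = {m["name"]: m["body"] for m in MANUAL_RE.finditer(old_doc)}

    def restore(match: re.Match) -> str:
        name = match["name"]
        body = saved.get(name, match["body"])  # keep the human-written text
        return f"<!-- MANUAL: {name} -->{body}<!-- END MANUAL -->"

    return MANUAL_RE.sub(restore, new_doc)
```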
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
### Changes 🏗️
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Extensively test code generation for the docs pages
---
> [!NOTE]
> Introduces an automated documentation pipeline for blocks and integrates it into CI.
>
> - Adds `scripts/generate_block_docs.py` (+ tests) to introspect blocks and generate `docs/integrations/**`, preserving `<!-- MANUAL: -->` sections
> - New CI workflows: **docs-block-sync** (fails if docs drift), **docs-claude-review** (AI review for block/docs PRs), and **docs-enhance** (optional LLM improvements)
> - Updates existing Claude workflows to use `CLAUDE_CODE_OAUTH_TOKEN` instead of `ANTHROPIC_API_KEY`
> - Fixes typos and improves descriptions and links across numerous backend blocks to standardize docs output
> - Commits initial generated docs, including `docs/integrations/README.md` and many provider/category pages
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Extract _get_tool_error_message helper method
- Replace 20+ levels of nesting with a simple for loop
- Improve readability of tool_result construction
- Update benchmark poetry.lock
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace basic DuckDuckGo-only search with a modern tiered system:
1. Tavily (primary) - AI-optimized results with content extraction
   - AI-generated answer summaries
   - Relevance scoring
   - Full page content extraction via search_and_extract command
2. Serper (secondary) - Fast, cheap Google SERP results
   - $0.30-1.00 per 1K queries
   - Real Google results without scraping
3. DDGS multi-engine (fallback) - Free, no API key required
   - Automatic fallback chain: DuckDuckGo → Bing → Brave → Google → etc.
   - 8 search backends supported
Key changes:
- Upgrade duckduckgo-search to ddgs v9.10 (renamed successor package)
- Add Tavily and Serper API integrations
- Implement automatic provider selection and fallback chain
- Add search_and_extract command for research with content extraction
- Add TAVILY_API_KEY and SERPER_API_KEY to env templates
- Update benchmark httpx constraint for ddgs compatibility
- 23 comprehensive tests for all providers and fallback scenarios
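A sketch of the selection order (search functions stubbed for illustration):

```python
import os

def tavily_search(query: str) -> list[dict]: ...   # stub
def serper_search(query: str) -> list[dict]: ...   # stub
def ddgs_search(query: str) -> list[dict]: ...     # stub

def pick_search_provider():
    if os.environ.get("TAVILY_API_KEY"):
        return tavily_search   # 1. AI-optimized results + extraction
    if os.environ.get("SERPER_API_KEY"):
        return serper_search   # 2. fast, cheap Google SERP
    return ddgs_search         # 3. free multi-engine fallback
```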
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Change Agent class to use ActionProposal instead of OneShotAgentActionProposal
to support multiple prompt strategy types
- Widen display_thoughts parameter type from AssistantThoughts to ModelWithSummary
- Fix speak attribute access in agent_protocol_server with hasattr check
- Add type: ignore comments for intentional thoughts field overrides in strategies
- Remove unused OneShotAgentActionProposal import
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix 32+ flake8 E501 (line too long) errors by shortening descriptions
- Remove unused import in todo.py
- Fix test_todo.py argument order (config= keyword)
- Add type annotations to fix pyright errors where straightforward
- Add noqa comments for flake8 false positives in __init__.py
- Remove unused nonlocal declarations in main.py
- Run black and isort to fix formatting
- Update CLAUDE.md with improved linting commands
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, when a user selected "Once" or "Always" with feedback (via Tab),
the command was NOT executed because UserFeedbackProvided was raised before
checking the approval scope. This fix changes the architecture from
exception-based to return-value-based.
Changes:
- Add PermissionCheckResult class with allowed, scope, and feedback fields
- Change check_command() to return PermissionCheckResult instead of bool
- Update prompt_fn signature to return (ApprovalScope, feedback) tuple
- Add pending_user_feedback mechanism to EpisodicActionHistory
- Update execute() to handle feedback after successful command execution
- Feedback message explicitly states "Command executed successfully"
- Add on_auto_approve callback for displaying auto-approved commands
- Add comprehensive tests for approval/denial with feedback scenarios
Behavior:
- Once + feedback → Execute command, then send feedback to agent
- Always + feedback → Execute command, save permission, send feedback
- Deny + feedback → Don't execute, send feedback to agent
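A condensed sketch of the new flow (surrounding types simplified):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ApprovalScope(Enum):
    ONCE = "once"
    ALWAYS = "always"
    DENY = "deny"

@dataclass
class PermissionCheckResult:
    allowed: bool
    scope: Optional[ApprovalScope] = None
    feedback: Optional[str] = None

def check_command(command: str, prompt_fn) -> PermissionCheckResult:
    # prompt_fn returns (ApprovalScope, feedback) - no exception is raised
    # for feedback anymore, so approval and feedback can coexist.
    scope, feedback = prompt_fn(command)
    allowed = scope in (ApprovalScope.ONCE, ApprovalScope.ALWAYS)
    return PermissionCheckResult(allowed=allowed, scope=scope, feedback=feedback)
```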
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds a custom Rich-based interactive selector for the command approval
workflow. Features include:
- Arrow key navigation for selecting approval options
- Tab to add context to any selection (e.g., "Once + also check file x")
- Dedicated inline feedback option with shadow placeholder text
- Quick select with number keys 1-5
- Works within existing asyncio event loop (no prompt_toolkit dependency)
Also adds UIProvider abstraction pattern for future UI implementations.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The strategy was stuck in a loop because it tracked plan steps but never
advanced them - the record_step_success() method existed but was never
called by the agent's execution loop.
Fix by using a _pending_step_advance flag to track when an action has
been proposed. On the next parse_response_content() call, advance the
previous step before processing the new response. This keeps step
tracking self-contained in the strategy without requiring agent changes.
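A condensed sketch of the flag pattern (parsing and plan plumbing simplified):

```python
class PlanTrackingStrategy:
    def __init__(self) -> None:
        self._pending_step_advance = False

    def parse_response_content(self, response):
        if self._pending_step_advance:
            # The previously proposed action has executed by now, so the
            # plan pointer can safely advance before handling this response.
            self.record_step_success()
            self._pending_step_advance = False
        proposal = self._parse(response)   # placeholder for actual parsing
        self._pending_step_advance = True  # advance on the *next* call
        return proposal
```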
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement four new prompt strategies based on research papers:
- ReWOO: Reasoning Without Observation (5x token efficiency)
- Plan-and-Execute: Separate planning from execution phases
- Reflexion: Verbal reinforcement learning with episodic memory
- Tree of Thoughts: Deliberate problem solving with tree search
Each strategy extends a new BaseMultiStepPromptStrategy base class
with shared utilities. Strategies are selectable via PROMPT_STRATEGY
environment variable or config.prompt_strategy setting.
Fix a JSONSchema generation issue where Optional/Union types created
anyOf schemas without a direct type field - resolved by storing
plan/phase state in strategy instances rather than in the ActionProposal.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove openai_functions config option - native tool calling is now always enabled
- Remove use_functions_api from BaseAgentConfiguration and prompt strategy
- Add use_prefill config to disable prefill for Anthropic (prefill + tools incompatible)
- Update anthropic dependency to ^0.45.0 for tools API support
- Simplify prompt strategy to always expect tool_calls from LLM response
This fixes the N/A command loop bug where models would output "N/A" as a
command name when function calling was disabled. With native tool calling
always enabled, models are forced to pick from valid tools only.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>