AutoGPT

mirror of https://github.com/Significant-Gravitas/AutoGPT.git synced 2026-04-08 03:00:28 -04:00

Author	SHA1	Message	Date
Zamil Majdy	8b25e62959	feat(backend,frontend): add explicit safe mode toggles for HITL and sensitive actions (#11756) ## Summary This PR introduces two explicit safe mode toggles for controlling agent execution behavior, providing clearer and more granular control over when agents should pause for human review. ### Key Changes New Safe Mode Settings: - `human_in_the_loop_safe_mode` (bool, default `true`) - Controls whether human-in-the-loop (HITL) blocks pause for review - `sensitive_action_safe_mode` (bool, default `false`) - Controls whether sensitive action blocks pause for review New Computed Properties on LibraryAgent: - `has_human_in_the_loop` - Indicates if agent contains HITL blocks - `has_sensitive_action` - Indicates if agent contains sensitive action blocks Block Changes: - Renamed `requires_human_review` to `is_sensitive_action` on blocks for clarity - Blocks marked as `is_sensitive_action=True` pause only when `sensitive_action_safe_mode=True` - HITL blocks pause when `human_in_the_loop_safe_mode=True` Frontend Changes: - Two separate toggles in Agent Settings based on block types present - Toggle visibility based on `has_human_in_the_loop` and `has_sensitive_action` computed properties - Settings cog hidden if neither toggle applies - Proper state management for both toggles with defaults AI-Generated Agent Behavior: - AI-generated agents set `sensitive_action_safe_mode=True` by default - This ensures sensitive actions are reviewed for AI-generated content ## Changes Backend: - `backend/data/graph.py` - Updated `GraphSettings` with two boolean toggles (non-optional with defaults), added `has_sensitive_action` computed property - `backend/data/block.py` - Renamed `requires_human_review` to `is_sensitive_action`, updated review logic - `backend/data/execution.py` - Updated `ExecutionContext` with both safe mode fields - `backend/api/features/library/model.py` - Added `has_human_in_the_loop` and `has_sensitive_action` to `LibraryAgent` -
`backend/api/features/library/db.py` - Updated to use `sensitive_action_safe_mode` parameter - `backend/executor/utils.py` - Simplified execution context creation Frontend: - `useAgentSafeMode.ts` - Rewritten to support two independent toggles - `AgentSettingsModal.tsx` - Shows two separate toggles - `SelectedSettingsView.tsx` - Shows two separate toggles - Regenerated API types with new schema ## Test Plan - [x] All backend tests pass (Python 3.11, 3.12, 3.13) - [x] All frontend tests pass - [x] Backend format and lint pass - [x] Frontend format and lint pass - [x] Pre-commit hooks pass --------- Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>	2026-01-21 00:56:02 +00:00
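The pause rules described in the commit above can be sketched as a small decision function. This is an illustration only: the two setting names and their defaults come from the commit message, while `SafeModeSettings` and `should_pause_for_review` are hypothetical names, not the repository's actual classes.

```python
from dataclasses import dataclass


@dataclass
class SafeModeSettings:
    # Field names and defaults follow the commit message.
    human_in_the_loop_safe_mode: bool = True
    sensitive_action_safe_mode: bool = False


def should_pause_for_review(
    settings: SafeModeSettings,
    *,
    is_hitl_block: bool,
    is_sensitive_action: bool,
) -> bool:
    """Hypothetical helper: decide whether a block pauses for human review."""
    # HITL blocks pause only while the HITL toggle is on.
    if is_hitl_block and settings.human_in_the_loop_safe_mode:
        return True
    # Sensitive-action blocks pause only while the sensitive-action toggle is on.
    if is_sensitive_action and settings.sensitive_action_safe_mode:
        return True
    return False


# Per the commit, AI-generated agents turn the sensitive-action toggle on.
ai_generated_defaults = SafeModeSettings(sensitive_action_safe_mode=True)
```

With the default settings a sensitive-action block runs without pausing, while the same block under `ai_generated_defaults` pauses for review.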
Zamil Majdy	35a13e3df5	fix(backend): Use explicit schema qualification for pgvector types (#11805) ## Summary - Fix intermittent "type 'vector' does not exist" errors when using PgBouncer in transaction mode - The issue was that `SET search_path` and the actual query could run on different backend connections - Use explicit schema qualification (`{schema}.vector`, `OPERATOR({schema}.<=>)`) instead of relying on search_path ## Test plan - [x] Tested vector type cast on local: `'[1,2,3]'::platform.vector` works - [x] Tested OPERATOR syntax on local: `OPERATOR(platform.<=>)` works - [x] Tested on dev via kubectl exec: both work correctly - [ ] Deploy to dev and verify backfill_missing_embeddings endpoint no longer errors ## Related Issues Fixes: AUTOGPT-SERVER-763, AUTOGPT-SERVER-764 --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 22:18:16 +00:00
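A minimal sketch of the schema-qualification idea from the commit above: the query text itself names the schema for both the `vector` type and the `<=>` operator, so it does not depend on a `SET search_path` that may have run on a different pooled connection. The function name, table name, and column names are assumptions made for illustration; only the `{schema}.vector` and `OPERATOR({schema}.<=>)` forms come from the commit message.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_similarity_query(schema: str, table: str = "embeddings") -> str:
    """Hypothetical helper: build a nearest-neighbour query without search_path."""
    # Identifiers are interpolated into SQL, so accept only plain names.
    for name in (schema, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe SQL identifier: {name!r}")
    return (
        f"SELECT id FROM {schema}.{table} "
        f"ORDER BY embedding OPERATOR({schema}.<=>) $1::{schema}.vector "
        f"LIMIT $2"
    )


print(build_similarity_query("platform"))
```

The vector value and the limit stay as bind parameters (`$1`, `$2`); only the schema and table identifiers are formatted into the string, which is why they are validated first.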
Mewael Tsegay Desta	2169b433c9	feat(backend/blocks): add ConcatenateListsBlock (#11567) ## Description This PR implements a new block `ConcatenateListsBlock` that concatenates multiple lists into a single list. This addresses the "good first issue" for implementing a list concatenation block in the platform/blocks area. The block takes a list of lists as input and combines all elements in order into a single concatenated list. This is useful for workflows that need to merge data from multiple sources or combine results from different operations. ### Changes 🏗️ - Added `ConcatenateListsBlock` class in `autogpt_platform/backend/backend/blocks/data_manipulation.py` - Input: `lists: List[List[Any]]` - accepts a list of lists to concatenate - Output: `concatenated_list: List[Any]` - returns a single concatenated list - Error output: `error: str` - provides clear error messages for invalid input types - Block ID: `3cf9298b-5817-4141-9d80-7c2cc5199c8e` - Category: `BlockCategory.BASIC` (consistent with other list manipulation blocks) - Added comprehensive test suite in `autogpt_platform/backend/test/blocks/test_concatenate_lists.py` - Tests using built-in `test_input`/`test_output` validation - Manual test cases covering edge cases (empty lists, single list, empty input) - Error handling tests for invalid input types - Category consistency verification - All tests passing - Implementation details: - Uses `extend()` method for efficient list concatenation - Preserves element order from all input lists - Runtime type validation: Explicitly checks `isinstance(lst, list)` before calling `extend()` to prevent: - Strings being iterated character-by-character (e.g., `extend("abc")` → `['a', 'b', 'c']`) - Non-iterable types causing `TypeError` (e.g., `extend(1)`) - Clear error messages indicating which index has invalid input - Handles edge cases: empty lists, empty input, single list, None values - Follows existing block
patterns and conventions ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Run `poetry run pytest test/blocks/test_concatenate_lists.py -v` - all tests pass - [x] Verified block can be imported and instantiated - [x] Tested with built-in test cases (4 test scenarios) - [x] Tested manual edge cases (empty lists, single list, empty input) - [x] Tested error handling for invalid input types - [x] Verified category is `BASIC` for consistency - [x] Verified no linting errors - [x] Confirmed block follows same patterns as other blocks in `data_manipulation.py` #### Code Quality: - [x] Code follows existing patterns and conventions - [x] Type hints are properly used - [x] Documentation strings are clear and descriptive - [x] Runtime type validation implemented - [x] Error handling with clear error messages - [x] No linting errors - [x] Prisma client generated successfully ### Testing Test Results: ``` test/blocks/test_concatenate_lists.py::test_concatenate_lists_block_builtin_tests PASSED test/blocks/test_concatenate_lists.py::test_concatenate_lists_manual PASSED ============================== 2 passed in 8.35s ============================== ``` Test Coverage: - Basic concatenation: `[[1, 2, 3], [4, 5, 6]]` → `[1, 2, 3, 4, 5, 6]` - Mixed types: `[["a", "b"], ["c"], ["d", "e", "f"]]` → `["a", "b", "c", "d", "e", "f"]` - Empty list handling: `[[1, 2], []]` → `[1, 2]` - Empty input: `[]` → `[]` - Single list: `[[1, 2, 3]]` → `[1, 2, 3]` - Error handling: Invalid input types (strings, non-lists) produce clear error messages - Category verification: Confirmed `BlockCategory.BASIC` for consistency ### Review Feedback Addressed - Category Consistency: Changed from `BlockCategory.DATA` to `BlockCategory.BASIC` to match other list manipulation blocks (`AddToListBlock`, `FindInListBlock`, etc.) 
- Type Robustness: Added explicit runtime validation with `isinstance(lst, list)` check before calling `extend()` to prevent: - Strings being iterated character-by-character - Non-iterable types causing `TypeError` - Error Handling: Added `error` output field with clear, descriptive error messages indicating which index has invalid input - Test Coverage: Added test case for error handling with invalid input types ### Related Issues - Addresses: "Implement block to concatenate lists" (good first issue, platform/blocks, hacktoberfest) ### Notes - This is a straightforward data manipulation block that doesn't require external dependencies - The block will be automatically discovered by the block loading system - No database or configuration changes required - Compatible with existing workflow system - All review feedback has been addressed and incorporated <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Adds a new list utility and updates docs. > > - New block: `ConcatenateListsBlock` in `backend/blocks/data_manipulation.py` > - Input `lists: List[List[Any]]`; outputs `concatenated_list` or `error` > - Skips `None` entries; emits error for non-list items; preserves order > - Docs: Adds "Concatenate Lists" section to `docs/integrations/basic.md` and links it in `docs/integrations/README.md` > - Contributor guide: New `docs/CLAUDE.md` with manual doc section guidelines > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit `4f56dd86c2`. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co> Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 18:04:12 +00:00
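The concatenation and validation behaviour described in the commit above can be sketched as a plain function. This is not the block's actual code: `concatenate_lists` and its error wording are assumptions, and the real block emits `concatenated_list` or `error` outputs rather than returning a tuple.

```python
from typing import Any, List, Optional, Tuple


def concatenate_lists(lists: List[Any]) -> Tuple[List[Any], Optional[str]]:
    """Hypothetical helper: return (concatenated_list, error)."""
    result: List[Any] = []
    for index, item in enumerate(lists):
        if item is None:
            # The commit says None entries are skipped.
            continue
        if not isinstance(item, list):
            # Without this check, extend("abc") would add 'a', 'b', 'c'
            # and extend(1) would raise TypeError.
            kind = type(item).__name__
            return [], f"Invalid input at index {index}: expected a list, got {kind}"
        result.extend(item)  # preserves element order
    return result, None


print(concatenate_lists([[1, 2, 3], [4, 5, 6]]))  # ([1, 2, 3, 4, 5, 6], None)
```

The inputs in the test coverage list above (`[[1, 2], []]`, `[]`, `[[1, 2, 3]]`) all pass through the same loop with no special cases.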
Nicholas Tindle	326554d89a	style(classic): update black to 24.10.0 and reformat Update black version to match pre-commit hook (24.10.0) and reformat all files with the new version. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 10:51:54 -06:00
Nicholas Tindle	5e22a1888a	chore: add classic benchmark reports and workspaces to gitignore Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 10:42:55 -06:00
Nicholas Tindle	a4d7b0142f	fix(classic): resolve all pyright type errors - Add missing strategies (lats, multi_agent_debate) to PromptStrategyName - Fix method override signatures for reasoning_effort parameter - Fix Pydantic Field() overload issues with helper function - Fix BeautifulSoup Tag type narrowing in web_fetch.py - Fix Optional member access in playwright_browser.py and rewoo.py - Convert hasattr patterns to getattr for proper type narrowing - Add proper type casts for Literal types - Fix file storage path type conversions - Exclude legacy challenges/ from pyright checking Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 10:41:53 -06:00
Nicholas Tindle	fa0b7029dd	fix(platform): make chat credentials type selection deterministic (#11795) ## Background When using chat to run blocks/agents that support multiple credential types (e.g., GitHub blocks support both `api_key` and `oauth2`), users reported that the credentials setup UI would randomly show either "Add API key" or "Connect account (OAuth)" - seemingly at random between requests or server restarts. ## Root Cause The bug was in how the backend selected which credential type to return when building the missing credentials response: ```python cred_type = next(iter(field_info.supported_types), "api_key") ``` The problem is that `supported_types` is a frozenset. When you call `iter()` on a frozenset and take `next()`, the iteration order is non-deterministic due to Python's hash randomization. This means: - `frozenset({'api_key', 'oauth2'})` could iterate as either `['api_key', 'oauth2']` or `['oauth2', 'api_key']` - The order varies between Python process restarts and sometimes between requests - This caused the UI to randomly show different credential options ### Changes 🏗️ Backend (`utils.py`, `run_block.py`, `run_agent.py`): - Added `_serialize_missing_credential()` helper that uses `sorted()` for deterministic ordering - Added `build_missing_credentials_from_graph()` and `build_missing_credentials_from_field_info()` utilities - Now returns both `type` (first sorted type, for backwards compat) and `types` (full array with ALL supported types) Frontend (`helpers.ts`, `ChatCredentialsSetup.tsx`, `useChatMessage.ts`): - Updated to read the `types` array from backend response - Changed `credentialType` (single) to `credentialTypes` (array) throughout the chat credentials flow - Passes all supported types to `CredentialsInput` via `credentials_types` schema field ### Result Now `useCredentials.ts` correctly sets both `supportsApiKey=true` AND `supportsOAuth2=true` when both are supported, ensuring: 1.
Deterministic behavior - no more random type selection 2. All saved credentials shown - credentials of any supported type appear in the selection list ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Verified GitHub block shows consistent credential options across page reloads - [x] Verified both OAuth and API key credentials appear in selection when user has both saved - [x] Verified backend returns `types: ["api_key", "oauth2"]` array (checked via Python REPL) --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: Nicholas Tindle <ntindle@users.noreply.github.com>	2026-01-20 16:19:57 +00:00
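The fix described in the commit above swaps "first element of a frozenset" for "first element after sorting". A minimal sketch, assuming a plain dict response; `serialize_missing_credential` is a hypothetical stand-in for the repository's `_serialize_missing_credential()`, whose real signature is not shown in the log.

```python
from typing import Dict, FrozenSet, List, Union


def serialize_missing_credential(
    supported_types: FrozenSet[str],
) -> Dict[str, Union[str, List[str]]]:
    """Hypothetical helper: stable `type` plus the full `types` list."""
    # sorted() gives the same order on every run, unlike iterating the set,
    # whose order for strings can change with hash randomization.
    types = sorted(supported_types)
    return {
        # First sorted type, kept for backwards compatibility; the
        # "api_key" fallback mirrors the default in the original one-liner.
        "type": types[0] if types else "api_key",
        "types": types,
    }


print(serialize_missing_credential(frozenset({"oauth2", "api_key"})))
# {'type': 'api_key', 'types': ['api_key', 'oauth2']}
```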
Abhimanyu Yadav	c20ca47bb0	feat(frontend): enhance RunGraph and RunInputDialog components with loading states and improved UI (#11808) ### Changes 🏗️ - Enhanced UI for the Run Graph button with improved loading states and animations - Added color-coded edges in the flow editor based on output data types - Improved the layout of the Run Input Dialog with a two-column grid design - Refined the styling of flow editor controls with consistent icon sizes and colors - Updated tutorial icons with better color and size customization - Fixed credential field display to show provider name with "credential" suffix - Optimized draft saving by excluding node position changes to prevent excessive saves when dragging nodes ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Verified that the Run Graph button shows proper loading states - [x] Confirmed that edges display correct colors based on data types - [x] Tested the Run Input Dialog layout with various input configurations - [x] Checked that flow editor controls display consistently - [x] Verified that tutorial icons render properly - [x] Confirmed credential fields show proper provider names - [x] Tested that dragging nodes doesn't trigger unnecessary draft saves	2026-01-20 15:50:23 +00:00
Abhimanyu Yadav	7756e2d12d	refactor(frontend): refactor credentials input with unified CredentialsGroupedView component (#11801) ### Changes 🏗️ - Refactored the credentials input handling in the RunInputDialog to use the shared CredentialsGroupedView component - Moved CredentialsGroupedView from agent library to a shared component location for reuse - Fixed source name handling in edge creation to properly handle tool source names - Improved node output UI by replacing custom expand/collapse with Accordion component - Fixed timing of hardcoded values synchronization with handle IDs to ensure proper loading - Enabled NEW_FLOW_EDITOR and BUILDER_VIEW_SWITCH feature flags by default ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Verified credentials input works in both agent run dialog and builder run dialog - [x] Confirmed node output accordion works correctly - [x] Tested flow editor with tools to ensure source name handling works properly - [x] Verified hardcoded values sync correctly with handle IDs #### For configuration changes: - [x] `.env.default` is updated or already compatible with my changes - [x] `docker-compose.yml` is updated or already compatible with my changes - [x] I have included a list of my configuration changes in the PR description (under Changes)	2026-01-20 12:20:25 +00:00
Nicholas Tindle	7d6375f59c	style(classic): fix flake8 line length issue Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 01:25:00 -06:00
Nicholas Tindle	aeec0ce509	chore: add test.db to gitignore	2026-01-20 01:24:22 -06:00
Nicholas Tindle	b32bfcaac5	chore: remove test.db from tracking	2026-01-20 01:24:00 -06:00
Nicholas Tindle	5373a6eb6e	style(classic): fix code formatting with black Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 01:23:51 -06:00
Nicholas Tindle	98cde46ccb	style(classic): fix import sorting with isort Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 01:23:33 -06:00
Nicholas Tindle	bd10da10d9	ci: update pre-commit hooks for consolidated classic Poetry project - Consolidate classic poetry-install hooks into single hook using classic/ - Update isort hook to work with consolidated project structure - Simplify flake8 hooks to use single classic/.flake8 config - Consolidate pyright hooks into single hook for classic/ - Add direct_benchmark to hook coverage Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 01:21:50 -06:00
Nicholas Tindle	60fdee1345	fix(classic): resolve linting and formatting issues for CI compliance - Update .flake8 config to exclude workspace directories and ignore E203 - Fix import sorting (isort) across multiple files - Fix code formatting (black) across multiple files - Remove unused imports and fix line length issues (flake8) - Fix f-strings without placeholders and unused variables Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 01:16:38 -06:00
Nicholas Tindle	6f2783468c	feat(classic): add sub-agent architecture and LATS/multi-agent debate strategies Add comprehensive sub-agent spawning infrastructure that enables prompt strategies to coordinate multiple agents for advanced reasoning patterns. New files: - forge/agent/execution_context.py: ExecutionContext, ResourceBudget, SubAgentHandle, and AgentFactory protocol for sub-agent lifecycle - agent_factory/default_factory.py: DefaultAgentFactory implementation - prompt_strategies/lats.py: Language Agent Tree Search using MCTS with sub-agents for action expansion and evaluation - prompt_strategies/multi_agent_debate.py: Multi-agent debate with proposal, critique, and consensus phases Key changes: - BaseMultiStepPromptStrategy gains spawn_sub_agent(), run_sub_agent(), spawn_and_run(), and run_parallel() methods - Agent class accepts optional ExecutionContext and injects it into strategies - Sub-agents enabled by default (enable_sub_agents=True) - Resource limits: max_depth=5, max_sub_agents=25, max_cycles=25 All 7 strategies now available in benchmark: one_shot, rewoo, plan_execute, reflexion, tree_of_thoughts, lats, multi_agent_debate Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 01:01:28 -06:00
Nicholas Tindle	c1031b286d	ci(classic): update CI workflows for consolidated Poetry project Update all classic CI workflows to use the single consolidated pyproject.toml at classic/ instead of individual project directories. Changes: - classic-autogpt-ci.yml: Run from classic/, update cache key and test paths - classic-forge-ci.yml: Run from classic/, update cache key and test paths - classic-benchmark-ci.yml: Run from classic/, use direct-benchmark command - classic-python-checks.yml: Simplify to single job (no matrix needed) - classic-autogpts-ci.yml: Update to use direct-benchmark for smoke tests Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 00:53:50 -06:00
Nicholas Tindle	b849eafb7f	feat(direct_benchmark): enable shell command execution with safety denylist Enable agents to execute shell commands during benchmarks by setting execute_local_commands=True and using denylist mode to block dangerous commands (rm, sudo, chmod, kill, etc.) while allowing safe operations. Also adds ExecutePython challenge to test code execution capability. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 00:52:06 -06:00
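A rough sketch of denylist-mode command filtering as described in the commit above. The four command names are the ones the message lists; the rest of the real denylist is not shown in the log and is not reproduced here. `is_command_allowed` is a hypothetical name, and checking only the first token is a simplifying assumption.

```python
import shlex
from typing import FrozenSet

# Only the commands named in the commit message; the "etc." is left unfilled.
DENYLIST: FrozenSet[str] = frozenset({"rm", "sudo", "chmod", "kill"})


def is_command_allowed(command_line: str, denylist: FrozenSet[str] = DENYLIST) -> bool:
    """Hypothetical check: allow a command unless its executable is denied."""
    try:
        tokens = shlex.split(command_line)
    except ValueError:
        # Unbalanced quotes and similar: refuse rather than guess.
        return False
    if not tokens:
        return False
    # Compare the basename so "/bin/rm" is treated like "rm".
    executable = tokens[0].rsplit("/", 1)[-1]
    return executable not in denylist


print(is_command_allowed("ls -la"))    # True
print(is_command_allowed("rm -rf /"))  # False
```

This sketch inspects only the first word, so chained input such as `ls; rm x` would slip through; a real filter would need to account for shell operators.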
Nicholas Tindle	572c3f5e0d	refactor(classic): consolidate Poetry projects into single pyproject.toml Merge forge/, original_autogpt/, and direct_benchmark/ into a single Poetry project to eliminate cross-project path dependency issues. Changes: - Create classic/pyproject.toml with merged dependencies from all three projects - Remove individual pyproject.toml and poetry.lock files from subdirectories - Update all CLAUDE.md files to reflect commands run from classic/ root - Update all README.md files with new installation and usage instructions All packages are now included via the packages directive: - forge/forge (core agent framework) - original_autogpt/autogpt (AutoGPT agent) - direct_benchmark/direct_benchmark (benchmark harness) CLI entry points preserved: autogpt, serve, direct-benchmark Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 00:49:56 -06:00
Nicholas Tindle	89003a585d	feat(direct_benchmark): show "would have passed" for timed-out challenges When a challenge times out but the agent's solution would have passed evaluation, this is now clearly indicated: - Completion blocks show "TIMEOUT (would have passed)" in yellow - Recent completions panel shows hourglass icon + "would pass" suffix - Summary table has new "Would Pass" column - Final summary shows "+N would pass" count - Success rate includes "would pass" challenges The evaluator still runs on timed-out challenges to calculate the score, but success remains False. This gives visibility into near-misses that just needed more time. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 00:30:00 -06:00
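The "would have passed" rule in the commit above keeps two facts separate: whether the evaluator passed, and whether the run timed out. A minimal sketch under that reading; the class and function names are hypothetical, and the plain `TIMEOUT` and `FAIL` labels are assumptions (only `TIMEOUT (would have passed)` is quoted in the message).

```python
from dataclasses import dataclass


@dataclass
class ChallengeResult:
    timed_out: bool
    evaluation_passed: bool  # the evaluator still runs on timed-out challenges

    @property
    def success(self) -> bool:
        # A timed-out challenge never counts as a success.
        return self.evaluation_passed and not self.timed_out

    @property
    def would_have_passed(self) -> bool:
        # Near miss: the solution was correct but the run hit the time limit.
        return self.timed_out and self.evaluation_passed


def status_label(result: ChallengeResult) -> str:
    if result.success:
        return "PASS"
    if result.would_have_passed:
        return "TIMEOUT (would have passed)"
    if result.timed_out:
        return "TIMEOUT"
    return "FAIL"


print(status_label(ChallengeResult(timed_out=True, evaluation_passed=True)))
# TIMEOUT (would have passed)
```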
Nicholas Tindle	0e65785228	fix(direct_benchmark): don't mark timed-out challenges as passed Previously, the evaluator would run on all results including timed-out challenges. If the agent happened to write a working solution before timing out, evaluation would pass and override success=True, resulting in contradictory output showing both PASS and "timed out". Now we skip evaluation for timed-out challenges - they cannot pass. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 00:25:41 -06:00
Nicholas Tindle	f07dff1cdd	fix(direct_benchmark): add pytest dependency for challenge evaluation The TicTacToe and other challenges use pytest-based test files for evaluation. Without pytest installed in the benchmark virtualenv, these evaluations were silently failing. Root cause: test.py imports pytest but the package wasn't a dependency, causing ModuleNotFoundError during evaluation subprocess. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 00:21:12 -06:00
Nicholas Tindle	00e02a4696	feat(direct_benchmark): add run ID to completion blocks Include config:challenge:attempt and timestamp in completion block header for easier debugging and log correlation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 00:14:23 -06:00
Nicholas Tindle	634bff8277	refactor(forge): replace Selenium with Playwright for web browsing - Remove selenium.py and test_selenium.py - Add playwright_browser.py with WebPlaywrightComponent - Update web component exports to use Playwright - Update dependencies in pyproject.toml/poetry.lock - Minor agent and reflexion strategy improvements - Update CLAUDE.md documentation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 23:57:17 -06:00
Nicholas Tindle	d591f36c7b	fix(direct_benchmark): track cost from LLM provider Previously cost was hardcoded to 0.0. Now extracts cumulative cost from MultiProvider.get_incurred_cost() after each step execution. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 23:37:12 -06:00
Nicholas Tindle	a347bed0b1	feat(direct_benchmark): add incremental resume and selective reset Benchmarks now automatically save progress and resume from where they left off. State is persisted to .benchmark_state.json in reports dir. Features: - Auto-resume: runs skip already-completed challenges - --fresh: clear all state and start over - --retry-failures: re-run only failed challenges - --reset-strategy/model/challenge: selective resets - `state show/clear/reset` subcommands for state management - Config mismatch detection with auto-reset Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 23:32:27 -06:00
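Auto-resume as described in the commit above amounts to recording finished runs in a state file and skipping them next time. The file name `.benchmark_state.json` comes from the message; the JSON layout and the three function names below are assumptions for illustration, not the harness's real format or API.

```python
import json
from pathlib import Path
from typing import List, Set

STATE_FILE = ".benchmark_state.json"  # name taken from the commit message


def load_completed(reports_dir: Path) -> Set[str]:
    """Hypothetical loader: the {"completed": [...]} layout is assumed."""
    path = reports_dir / STATE_FILE
    if not path.exists():
        return set()
    return set(json.loads(path.read_text())["completed"])


def mark_completed(reports_dir: Path, run_key: str) -> None:
    """Record one finished run so a later invocation can skip it."""
    completed = load_completed(reports_dir)
    completed.add(run_key)
    path = reports_dir / STATE_FILE
    path.write_text(json.dumps({"completed": sorted(completed)}))


def pending_runs(all_runs: List[str], reports_dir: Path, fresh: bool = False) -> List[str]:
    """Skip finished runs unless `fresh` (mirroring --fresh) ignores the state."""
    done = set() if fresh else load_completed(reports_dir)
    return [run for run in all_runs if run not in done]
```

Calling `pending_runs(runs, reports_dir)` after a few `mark_completed` calls returns only the unfinished runs, while `fresh=True` returns all of them.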
Nicholas Tindle	4eeb6ee2b0	feat(direct_benchmark): add CI mode for non-interactive environments Add --ci flag that disables Rich Live display while preserving completion blocks. Auto-detects CI environment via CI env var or non-TTY stdout. Prints progress every 10 completions for visibility. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 23:21:10 -06:00
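The CI auto-detection in the commit above combines three signals: the explicit `--ci` flag, the `CI` environment variable, and whether stdout is a terminal. A small sketch; `is_ci_mode` is a hypothetical name, and treating any non-empty `CI` value as true is an assumption.

```python
import os
import sys
from typing import Mapping, Optional, TextIO


def is_ci_mode(
    ci_flag: bool = False,
    environ: Optional[Mapping[str, str]] = None,
    stdout: Optional[TextIO] = None,
) -> bool:
    """Hypothetical check: should the live terminal display be disabled?"""
    environ = os.environ if environ is None else environ
    stdout = sys.stdout if stdout is None else stdout
    if ci_flag:
        return True  # explicit --ci flag
    if environ.get("CI"):
        return True  # assumption: any non-empty CI value counts
    return not stdout.isatty()  # piped or redirected output


print(is_ci_mode(ci_flag=True))  # True
```

Passing `environ` and `stdout` as parameters keeps the check testable without touching the real process environment.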
Nicholas Tindle	7db962b9f9	feat(direct_benchmark): dynamic column layout up to 10 wide - Calculate max columns based on terminal width (up to 10) - Reduced panel width from 35 to 30 chars to fit more - Wider terminals can now show more parallel runs side-by-side Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 23:15:16 -06:00
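The column calculation in the commit above can be approximated as terminal width divided by panel width, capped at 10. The constants 30 and 10 come from the message; the exact formula (for example, how borders are counted) is not shown there, so the integer division below is an assumption.

```python
import shutil
from typing import Optional

PANEL_WIDTH = 30   # characters per panel, per the commit message
MAX_COLUMNS = 10   # upper bound, per the commit message


def max_columns(terminal_width: Optional[int] = None) -> int:
    """Hypothetical layout helper: how many panels fit side by side."""
    if terminal_width is None:
        terminal_width = shutil.get_terminal_size().columns
    # At least one column, at most MAX_COLUMNS.
    return max(1, min(MAX_COLUMNS, terminal_width // PANEL_WIDTH))


print(max_columns(120))  # 4
```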
Nicholas Tindle	9108b21541	fix(direct_benchmark): parallel execution and always show completion blocks Fixes: - Use run_key (config:challenge) instead of just config_name for tracking active runs - allows multiple challenges from same config to run in parallel - Add asyncio.sleep(0) yields to let multiple tasks acquire semaphore and start before any proceed with work - Always print completion blocks (not just failures) for visibility This should properly show 8/8 active runs when running with --parallel 8. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 23:13:56 -06:00
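The parallel-execution fix in the commit above rests on two ideas: track active runs by a per-challenge key, and yield with `asyncio.sleep(0)` right after acquiring the semaphore so other tasks can start too. A self-contained sketch; `run_all` and its `work` callback are hypothetical names, not the harness's API.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Set, Tuple


async def run_all(
    run_keys: List[str],
    parallel: int,
    work: Callable[[str], Awaitable[str]],
) -> Tuple[Dict[str, str], int]:
    """Run `work` for each key with bounded concurrency; also report the peak."""
    semaphore = asyncio.Semaphore(parallel)
    active: Set[str] = set()  # keyed by "config:challenge", not config alone
    peak = 0

    async def run_one(run_key: str) -> str:
        nonlocal peak
        async with semaphore:
            active.add(run_key)
            peak = max(peak, len(active))
            # Yield once so other tasks can acquire the semaphore and start
            # before this one proceeds with its work.
            await asyncio.sleep(0)
            try:
                return await work(run_key)
            finally:
                active.discard(run_key)

    results = await asyncio.gather(*(run_one(key) for key in run_keys))
    return dict(zip(run_keys, results)), peak
```

With eight keys and `parallel=8`, the reported peak is 8, which matches the "8/8 active runs" the commit aims for; with a smaller `parallel` the peak never exceeds that limit.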
Nicholas Tindle	ffe9325296	feat(direct_benchmark): multi-panel UI with copy-paste completion blocks UI improvements: - Multi-column layout: each active config gets its own panel showing challenge name and step history (last 6 steps with status) - Copy-paste completion blocks: when a challenge finishes (especially failures), prints a detailed block with all steps for easy debugging - Configurable logging: suppresses noisy LLM provider warnings unless --debug flag is set - Pass debug flag through harness to UI Example active runs panel: ┌─ one_shot/claude ─┬─ rewoo/claude ────┐ │ ReadFile │ WriteFile │ │ ✓ #1 read_file │ ✓ #1 think │ │ ✓ #2 write_file │ ✓ #2 plan │ │ ● step 3: ... │ ● step 3: ... │ └───────────────────┴───────────────────┘ Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 23:10:34 -06:00
Nicholas Tindle	0a616d9267	feat(direct_benchmark): add step-level logging with colored prefixes - Add step callback to AgentRunner for real-time step logging - BenchmarkUI now shows: - Active runs with current step info - Recent steps panel with colored config prefixes - Proper Live display refresh (implements __rich_console__) - Each config gets a distinct color for easy identification - Verbose mode prints step logs immediately with config prefix - Fix Live display not updating (pass UI object, not rendered content) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 23:02:20 -06:00
Nicholas Tindle	ab95077e5b	refactor(forge): remove VCR cassettes, use real API calls with skip for forks - Remove vcrpy and pytest-recording dependencies - Remove tests/vcr/ directory and vcr_cassettes submodule - Remove .gitmodules (only had cassette submodule) - Simplify CI workflow - no more cassette checkout/push/PAT_REVIEW - Tests requiring API keys now skip if not set (fork PRs) - Update CLAUDE.md files to remove cassette references - Fix broken agbenchmark path in pyproject.toml Security improvement: removes need for PAT with cross-repo write access. Fork PRs will have API-dependent tests skipped (GitHub protects secrets). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 22:51:57 -06:00
Nicholas Tindle	e477150979	Merge branch 'dev' into make-old-work	2026-01-19 22:30:46 -06:00
Nicholas Tindle	804430e243	refactor(classic): migrate from agbenchmark to direct_benchmark harness - Remove old benchmark/ folder with agbenchmark framework - Move challenges to direct_benchmark/challenges/ - Move analysis tools (analyze_reports.py, analyze_failures.py) to direct_benchmark/ - Move challenges_already_beaten.json to direct_benchmark/ - Update CI workflow to use direct_benchmark - Update CLAUDE.md files with new benchmarking instructions - Add benchmarking section to original_autogpt/CLAUDE.md The direct_benchmark harness directly instantiates agents without HTTP server overhead, enabling parallel execution with asyncio semaphore. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 22:29:51 -06:00
Nicholas Tindle	acb320d32d	feat(classic): add noninteractive mode env var and benchmark config logging - Add NONINTERACTIVE_MODE env var support to AppConfig for disabling user interaction during automated runs - Benchmark harness now sets NONINTERACTIVE_MODE=True when starting agents - Add agent configuration logging at server startup (model, strategy, etc.) - Harness logs env vars being passed to agent for verification - Add --agent-output flag to show full agent server output for debugging Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 19:40:24 -06:00
Nicholas Tindle	32f68d5999	feat(classic): add failure analysis tool and improve benchmark output Benchmark improvements: - Add analyze_failures.py for pattern detection and failure analysis - Add informative step output: tool name, args, result status, cost - Add --all and --matrix flags for comprehensive model/strategy testing - Add --analyze-only and --no-analyze flags for flexible analysis control - Auto-run failure analysis after benchmarks with markdown export - Fix directory creation bug in ReportManager (add parents=True) Prompt strategy enhancements: - Implement full plan_execute, reflexion, rewoo, tree_of_thoughts strategies - Add PROMPT_STRATEGY env var support for strategy selection - Add extended thinking support for Anthropic models - Add reasoning effort support for OpenAI o-series models LLM provider improvements: - Add thinking_budget_tokens config for Anthropic extended thinking - Add reasoning_effort config for OpenAI reasoning models - Improve error feedback for LLM self-correction Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 18:58:41 -06:00
Nicholas Tindle	49f56b4e8d	feat(classic): enhance strategy benchmark harness with model comparison and bug fixes - Add model comparison support to test harness (claude, openai, gpt5, opus presets) - Add --models, --smart-llm, --fast-llm, --list-models CLI args - Add real-time logging with timestamps and progress indicators - Fix success parsing bug: read results[0].success instead of non-existent metrics.success - Fix agbenchmark TestResult validation: use exception typename when value is empty - Fix WebArena challenge validation: use strings instead of integers in instantiation_dict - Fix Agent type annotations: create AnyActionProposal union for all prompt strategies - Add pytest integration tests for the strategy benchmark harness Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 18:07:14 -06:00
Swifty	bc75d70e7d	refactor(backend): Improve Langfuse tracing with v3 SDK patterns and @observe decorators (#11803) This PR improves the Langfuse tracing implementation in the chat feature by adopting the v3 SDK patterns, resulting in cleaner code and better observability. ### Changes 🏗️ - Simplified Langfuse client usage: Replace manual client initialization with `langfuse.get_client()` global singleton - Use v3 context managers: Switch to `start_as_current_observation()` and `propagate_attributes()` for automatic trace propagation - Auto-instrument OpenAI calls: Use `langfuse.openai` wrapper for automatic LLM call tracing instead of manual generation tracking - Add `@observe` decorators: All chat tools now have `@observe(as_type="tool")` decorators for automatic tool execution tracing: - `add_understanding` - `view_agent_output` (renamed from `agent_output`) - `create_agent` - `edit_agent` - `find_agent` - `find_block` - `find_library_agent` - `get_doc_page` - `run_agent` - `run_block` - `search_docs` - Remove manual trace lifecycle: Eliminated the verbose `finally` block that manually ended traces/generations - Rename tool: `agent_output` → `view_agent_output` for clarity ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Verified chat feature works with Langfuse tracing enabled - [x] Confirmed traces appear correctly in Langfuse dashboard with tool spans - [x] Tested tool execution flows show up as nested observations #### For configuration changes: - [x] `.env.default` is updated or already compatible with my changes - [x] `docker-compose.yml` is updated or already compatible with my changes - [x] I have included a list of my configuration changes in the PR description (under Changes) No configuration changes required - uses existing Langfuse environment variables.	2026-01-19 20:56:51 +00:00
Nicholas Tindle	bead811e73	docs(classic): add workspace, settings, and permissions documentation Document the layered configuration system including: - Workspace structure (.autogpt/ directory layout) - Settings location (environment variables, workspace YAML, agent YAML) - Permission system (check order, pattern syntax, approval scopes) - Default security behavior Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 12:17:10 -06:00
Nicholas Tindle	013f728ebf	feat(forge): improve tool call error feedback for LLM self-correction When tool calls fail validation, the error messages now include: - What arguments were actually provided - The expected parameter schema with types and required/optional indicators This helps LLMs understand and fix their mistakes when retrying, rather than just being told a parameter is missing. Example improved error: Invalid function call for write_file: 'contents' is a required property You provided: {"filename": 'story.txt'} Expected parameters: {"filename": string (required), "contents": string (required)} Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 11:49:17 -06:00
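The improved error message in the commit above can be approximated with a small formatter. This is a sketch under stated assumptions, not the code from forge: the function names `describe_parameter` and `format_tool_call_error` are hypothetical, and the schema is assumed to be a JSON-Schema-style dict with `properties` and `required` keys.

```python
import json
from typing import Any, Dict, Set


def describe_parameter(name: str, spec: Dict[str, Any], required: Set[str]) -> str:
    """Render one parameter as: "name": type (required|optional)."""
    kind = spec.get("type", "any")
    flag = "required" if name in required else "optional"
    return f'"{name}": {kind} ({flag})'


def format_tool_call_error(
    tool_name: str,
    error: str,
    provided: Dict[str, Any],
    schema: Dict[str, Any],
) -> str:
    """Build a message that shows what was sent and what was expected."""
    required = set(schema.get("required", []))
    expected = ", ".join(
        describe_parameter(name, spec, required)
        for name, spec in schema.get("properties", {}).items()
    )
    return (
        f"Invalid function call for {tool_name}: {error}\n"
        f"You provided: {json.dumps(provided)}\n"
        f"Expected parameters: {{{expected}}}"
    )
```

Calling it with the `write_file` case from the commit message yields a three-line message: the validation error, the arguments the model sent, and the expected schema, which gives the model enough context to retry with the missing `contents` argument.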
Nicholas Tindle	cda9572acd	feat(forge): add lightweight web fetch component Add WebFetchComponent for fast HTTP-based page fetching without browser overhead. Uses trafilatura for intelligent content extraction. Commands: - fetch_webpage: Extract main content as text/markdown/xml - Removes navigation, ads, boilerplate automatically - Extracts page metadata (title, description, author, date) - Extracts and lists page links - Much faster than Selenium-based read_webpage - fetch_raw_html: Get raw HTML for structure inspection - Optional truncation for large pages Features: - Trafilatura-powered content extraction (best-in-class accuracy) - Automatic link extraction with relative URL resolution - Page metadata extraction (OG tags, meta tags) - Configurable timeout, max content length, max links - Proper error handling for timeouts and HTTP errors - 19 comprehensive tests Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 01:04:22 -06:00
Nicholas Tindle	c1a1767034	feat(docs): Add block documentation auto-generation system (#11707) - Add generate_block_docs.py script that introspects block code to generate markdown - Support manual content preservation via <!-- MANUAL: --> markers - Add migrate_block_docs.py to preserve existing manual content from git HEAD - Add CI workflow (docs-block-sync.yml) to fail if docs drift from code - Add Claude PR review workflow (docs-claude-review.yml) for doc changes - Add manual LLM enhancement workflow (docs-enhance.yml) - Add GitBook configuration (.gitbook.yaml, SUMMARY.md) - Fix non-deterministic category ordering (categories is a set) - Add comprehensive test suite (32 tests) - Generate docs for 444 blocks with 66 preserved manual sections 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Extensively test code generation for the docs pages --- > [!NOTE] > Introduces an automated documentation pipeline for blocks and integrates it into CI. > > - Adds `scripts/generate_block_docs.py` (+ tests) to introspect blocks and generate `docs/integrations/`, preserving `<!-- MANUAL: -->` sections > - New CI workflows: docs-block-sync (fails if docs drift), docs-claude-review (AI review for block/docs PRs), and docs-enhance (optional LLM improvements) > - Updates existing Claude workflows to use `CLAUDE_CODE_OAUTH_TOKEN` instead of `ANTHROPIC_API_KEY` > - Improves numerous block descriptions/typos and links across backend blocks to standardize docs output > - Commits initial generated docs including `docs/integrations/README.md` and many provider/category pages > > Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit `631e53e0f6`.	2026-01-19 07:03:19 +00:00
Nicholas Tindle	e0784f8f6b	refactor(forge): simplify deeply nested error handling in Anthropic provider - Extract _get_tool_error_message helper method - Replace 20+ levels of nesting with simple for loop - Improve readability of tool_result construction - Update benchmark poetry.lock Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 00:15:33 -06:00
Nicholas Tindle	3040f39136	feat(forge): modernize web search with tiered provider system Replace basic DuckDuckGo-only search with a modern tiered system: 1. Tavily (primary) - AI-optimized results with content extraction - AI-generated answer summaries - Relevance scoring - Full page content extraction via search_and_extract command 2. Serper (secondary) - Fast, cheap Google SERP results - $0.30-1.00 per 1K queries - Real Google results without scraping 3. DDGS multi-engine (fallback) - Free, no API key required - Automatic fallback chain: DuckDuckGo → Bing → Brave → Google → etc. - 8 search backends supported Key changes: - Upgrade duckduckgo-search to ddgs v9.10 (renamed successor package) - Add Tavily and Serper API integrations - Implement automatic provider selection and fallback chain - Add search_and_extract command for research with content extraction - Add TAVILY_API_KEY and SERPER_API_KEY to env templates - Update benchmark httpx constraint for ddgs compatibility - 23 comprehensive tests for all providers and fallback scenarios Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 00:06:42 -06:00
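The tiered fallback described in the commit above (Tavily, then Serper, then DDGS) follows a common "try each provider in order" pattern. The sketch below illustrates that pattern only: `search_with_fallback` and the provider callables are hypothetical stand-ins, and no real Tavily, Serper, or ddgs API is called.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A provider is modelled as a plain callable from query to result list.
SearchFn = Callable[[str], List[str]]


def search_with_fallback(
    query: str,
    providers: Sequence[Tuple[str, Optional[SearchFn]]],
) -> Tuple[str, List[str]]:
    """Return (provider_name, results) from the first provider that succeeds.

    A provider entry of None models a tier that is not configured
    (for example, a missing API key) and is skipped.
    """
    errors: List[str] = []
    for name, fn in providers:
        if fn is None:
            continue
        try:
            results = fn(query)
        except Exception as exc:  # any provider failure falls through to the next tier
            errors.append(f"{name}: {exc}")
            continue
        if results:
            return name, results
        errors.append(f"{name}: no results")
    raise RuntimeError("all search providers failed: " + "; ".join(errors))
```

Ordering the `providers` sequence from the highest-quality paid tier down to the free one keeps the selection policy in data rather than in branching code, so adding or removing a tier does not change the loop.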
Nicholas Tindle	515504c604	fix(classic): resolve pyright type errors in original_autogpt - Change Agent class to use ActionProposal instead of OneShotAgentActionProposal to support multiple prompt strategy types - Widen display_thoughts parameter type from AssistantThoughts to ModelWithSummary - Fix speak attribute access in agent_protocol_server with hasattr check - Add type: ignore comments for intentional thoughts field overrides in strategies - Remove unused OneShotAgentActionProposal import Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-18 23:53:23 -06:00
Nicholas Tindle	18edeaeaf4	fix(classic): fix linting and formatting errors across codebase - Fix 32+ flake8 E501 (line too long) errors by shortening descriptions - Remove unused import in todo.py - Fix test_todo.py argument order (config= keyword) - Add type annotations to fix pyright errors where straightforward - Add noqa comments for flake8 false positives in __init__.py - Remove unused nonlocal declarations in main.py - Run black and isort to fix formatting - Update CLAUDE.md with improved linting commands Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-18 23:37:28 -06:00
Nicholas Tindle	44182aff9c	feat(classic): add strategy benchmark test harness for CI - Add test_prompt_strategies.py harness to compare prompt strategies - Add pytest wrapper (test_strategy_benchmark.py) for CI integration - Fix serve command (remove invalid --port flag, use AP_SERVER_PORT env) - Fix test category (interface -> general) - Add aiohttp-retry dependency for agbenchmark - Add pytest markers: slow, integration, requires_agent Usage: poetry run python agbenchmark_config/test_prompt_strategies.py --quick poetry run pytest tests/integration/test_strategy_benchmark.py -v Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-18 23:36:19 -06:00
Nicholas Tindle	864c5a7846	fix(classic): approve+feedback now executes command then sends feedback Previously, when a user selected "Once" or "Always" with feedback (via Tab), the command was NOT executed because UserFeedbackProvided was raised before checking the approval scope. This fix changes the architecture from exception-based to return-value-based. Changes: - Add PermissionCheckResult class with allowed, scope, and feedback fields - Change check_command() to return PermissionCheckResult instead of bool - Update prompt_fn signature to return (ApprovalScope, feedback) tuple - Add pending_user_feedback mechanism to EpisodicActionHistory - Update execute() to handle feedback after successful command execution - Feedback message explicitly states "Command executed successfully" - Add on_auto_approve callback for displaying auto-approved commands - Add comprehensive tests for approval/denial with feedback scenarios Behavior: - Once + feedback → Execute command, then send feedback to agent - Always + feedback → Execute command, save permission, send feedback - Deny + feedback → Don't execute, send feedback to agent Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-18 22:32:43 -06:00
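The move from exception-based to return-value-based approval in the commit above can be sketched as follows. The names `PermissionCheckResult`, `ApprovalScope`, `check_command`, `prompt_fn`, and `execute` come from the commit message; the enum members, the simplified signatures, and the message wording are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class ApprovalScope(Enum):
    # Member names are assumed; the commit only mentions Once, Always, Deny.
    ONCE = "once"
    ALWAYS = "always"
    DENY = "deny"


@dataclass
class PermissionCheckResult:
    allowed: bool
    scope: ApprovalScope
    feedback: Optional[str] = None


PromptFn = Callable[[], Tuple[ApprovalScope, Optional[str]]]


def check_command(prompt_fn: PromptFn) -> PermissionCheckResult:
    """Ask the user and return a result object instead of raising."""
    scope, feedback = prompt_fn()
    return PermissionCheckResult(
        allowed=scope is not ApprovalScope.DENY, scope=scope, feedback=feedback
    )


def execute(command: Callable[[], str], prompt_fn: PromptFn) -> List[str]:
    """Run the command if approved, then forward any feedback to the agent."""
    result = check_command(prompt_fn)
    messages: List[str] = []
    if result.allowed:
        messages.append(command())  # the command runs before feedback is delivered
        if result.feedback:
            messages.append(
                f"Command executed successfully. User feedback: {result.feedback}"
            )
    elif result.feedback:
        messages.append(f"Command denied. User feedback: {result.feedback}")
    return messages
```

Because the feedback travels inside the return value, the approval scope is always inspected first, which avoids the earlier problem where raising on feedback skipped execution for "Once" and "Always".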
Nicholas Tindle	699fffb1a8	feat(classic): add Rich interactive selector for command approval Adds a custom Rich-based interactive selector for the command approval workflow. Features include: - Arrow key navigation for selecting approval options - Tab to add context to any selection (e.g., "Once + also check file x") - Dedicated inline feedback option with shadow placeholder text - Quick select with number keys 1-5 - Works within existing asyncio event loop (no prompt_toolkit dependency) Also adds UIProvider abstraction pattern for future UI implementations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-18 21:49:43 -06:00

7960 Commits