Commit Graph

7786 Commits

Author SHA1 Message Date
Nicholas Tindle
634bff8277 refactor(forge): replace Selenium with Playwright for web browsing
- Remove selenium.py and test_selenium.py
- Add playwright_browser.py with WebPlaywrightComponent
- Update web component exports to use Playwright
- Update dependencies in pyproject.toml/poetry.lock
- Minor agent and reflexion strategy improvements
- Update CLAUDE.md documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 23:57:17 -06:00
Nicholas Tindle
d591f36c7b fix(direct_benchmark): track cost from LLM provider
Previously cost was hardcoded to 0.0. Now extracts cumulative cost
from MultiProvider.get_incurred_cost() after each step execution.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 23:37:12 -06:00
Nicholas Tindle
a347bed0b1 feat(direct_benchmark): add incremental resume and selective reset
Benchmarks now automatically save progress and resume from where they
left off. State is persisted to .benchmark_state.json in reports dir.

Features:
- Auto-resume: runs skip already-completed challenges
- --fresh: clear all state and start over
- --retry-failures: re-run only failed challenges
- --reset-strategy/model/challenge: selective resets
- `state show/clear/reset` subcommands for state management
- Config mismatch detection with auto-reset

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 23:32:27 -06:00
Nicholas Tindle
4eeb6ee2b0 feat(direct_benchmark): add CI mode for non-interactive environments
Add --ci flag that disables Rich Live display while preserving
completion blocks. Auto-detects CI environment via CI env var or
non-TTY stdout. Prints progress every 10 completions for visibility.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 23:21:10 -06:00
Nicholas Tindle
7db962b9f9 feat(direct_benchmark): dynamic column layout up to 10 wide
- Calculate max columns based on terminal width (up to 10)
- Reduced panel width from 35 to 30 chars to fit more
- Wider terminals can now show more parallel runs side-by-side

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 23:15:16 -06:00
Nicholas Tindle
9108b21541 fix(direct_benchmark): parallel execution and always show completion blocks
Fixes:
- Use run_key (config:challenge) instead of just config_name for tracking
  active runs - allows multiple challenges from same config to run in parallel
- Add asyncio.sleep(0) yields to let multiple tasks acquire semaphore
  and start before any proceed with work
- Always print completion blocks (not just failures) for visibility

This should properly show 8/8 active runs when running with --parallel 8.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 23:13:56 -06:00
Nicholas Tindle
ffe9325296 feat(direct_benchmark): multi-panel UI with copy-paste completion blocks
UI improvements:
- Multi-column layout: each active config gets its own panel showing
  challenge name and step history (last 6 steps with status)
- Copy-paste completion blocks: when a challenge finishes (especially
  failures), prints a detailed block with all steps for easy debugging
- Configurable logging: suppresses noisy LLM provider warnings unless
  --debug flag is set
- Pass debug flag through harness to UI

Example active runs panel:
┌─ one_shot/claude ─┬─ rewoo/claude ────┐
│ ReadFile          │ WriteFile         │
│   ✓ #1 read_file  │   ✓ #1 think      │
│   ✓ #2 write_file │   ✓ #2 plan       │
│   ● step 3: ...   │   ● step 3: ...   │
└───────────────────┴───────────────────┘

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 23:10:34 -06:00
Nicholas Tindle
0a616d9267 feat(direct_benchmark): add step-level logging with colored prefixes
- Add step callback to AgentRunner for real-time step logging
- BenchmarkUI now shows:
  - Active runs with current step info
  - Recent steps panel with colored config prefixes
  - Proper Live display refresh (implements __rich_console__)
- Each config gets a distinct color for easy identification
- Verbose mode prints step logs immediately with config prefix
- Fix Live display not updating (pass UI object, not rendered content)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 23:02:20 -06:00
Nicholas Tindle
ab95077e5b refactor(forge): remove VCR cassettes, use real API calls with skip for forks
- Remove vcrpy and pytest-recording dependencies
- Remove tests/vcr/ directory and vcr_cassettes submodule
- Remove .gitmodules (only had cassette submodule)
- Simplify CI workflow - no more cassette checkout/push/PAT_REVIEW
- Tests requiring API keys now skip if not set (fork PRs)
- Update CLAUDE.md files to remove cassette references
- Fix broken agbenchmark path in pyproject.toml

Security improvement: removes need for PAT with cross-repo write access.
Fork PRs will have API-dependent tests skipped (GitHub protects secrets).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 22:51:57 -06:00
Nicholas Tindle
e477150979 Merge branch 'dev' into make-old-work 2026-01-19 22:30:46 -06:00
Nicholas Tindle
804430e243 refactor(classic): migrate from agbenchmark to direct_benchmark harness
- Remove old benchmark/ folder with agbenchmark framework
- Move challenges to direct_benchmark/challenges/
- Move analysis tools (analyze_reports.py, analyze_failures.py) to direct_benchmark/
- Move challenges_already_beaten.json to direct_benchmark/
- Update CI workflow to use direct_benchmark
- Update CLAUDE.md files with new benchmarking instructions
- Add benchmarking section to original_autogpt/CLAUDE.md

The direct_benchmark harness directly instantiates agents without HTTP
server overhead, enabling parallel execution with asyncio semaphore.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 22:29:51 -06:00
Nicholas Tindle
acb320d32d feat(classic): add noninteractive mode env var and benchmark config logging
- Add NONINTERACTIVE_MODE env var support to AppConfig for disabling
  user interaction during automated runs
- Benchmark harness now sets NONINTERACTIVE_MODE=True when starting agents
- Add agent configuration logging at server startup (model, strategy, etc.)
- Harness logs env vars being passed to agent for verification
- Add --agent-output flag to show full agent server output for debugging

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 19:40:24 -06:00
Nicholas Tindle
32f68d5999 feat(classic): add failure analysis tool and improve benchmark output
Benchmark improvements:
- Add analyze_failures.py for pattern detection and failure analysis
- Add informative step output: tool name, args, result status, cost
- Add --all and --matrix flags for comprehensive model/strategy testing
- Add --analyze-only and --no-analyze flags for flexible analysis control
- Auto-run failure analysis after benchmarks with markdown export
- Fix directory creation bug in ReportManager (add parents=True)

Prompt strategy enhancements:
- Implement full plan_execute, reflexion, rewoo, tree_of_thoughts strategies
- Add PROMPT_STRATEGY env var support for strategy selection
- Add extended thinking support for Anthropic models
- Add reasoning effort support for OpenAI o-series models

LLM provider improvements:
- Add thinking_budget_tokens config for Anthropic extended thinking
- Add reasoning_effort config for OpenAI reasoning models
- Improve error feedback for LLM self-correction

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 18:58:41 -06:00
Nicholas Tindle
49f56b4e8d feat(classic): enhance strategy benchmark harness with model comparison and bug fixes
- Add model comparison support to test harness (claude, openai, gpt5, opus presets)
- Add --models, --smart-llm, --fast-llm, --list-models CLI args
- Add real-time logging with timestamps and progress indicators
- Fix success parsing bug: read results[0].success instead of non-existent metrics.success
- Fix agbenchmark TestResult validation: use exception typename when value is empty
- Fix WebArena challenge validation: use strings instead of integers in instantiation_dict
- Fix Agent type annotations: create AnyActionProposal union for all prompt strategies
- Add pytest integration tests for the strategy benchmark harness

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 18:07:14 -06:00
Swifty
bc75d70e7d refactor(backend): Improve Langfuse tracing with v3 SDK patterns and @observe decorators (#11803)
<!-- Clearly explain the need for these changes: -->

This PR improves the Langfuse tracing implementation in the chat feature
by adopting the v3 SDK patterns, resulting in cleaner code and better
observability.

### Changes 🏗️

- **Simplified Langfuse client usage**: Replace manual client
initialization with `langfuse.get_client()` global singleton
- **Use v3 context managers**: Switch to
`start_as_current_observation()` and `propagate_attributes()` for
automatic trace propagation
- **Auto-instrument OpenAI calls**: Use `langfuse.openai` wrapper for
automatic LLM call tracing instead of manual generation tracking
- **Add `@observe` decorators**: All chat tools now have
`@observe(as_type="tool")` decorators for automatic tool execution
tracing:
  - `add_understanding`
  - `view_agent_output` (renamed from `agent_output`)
  - `create_agent`
  - `edit_agent`
  - `find_agent`
  - `find_block`
  - `find_library_agent`
  - `get_doc_page`
  - `run_agent`
  - `run_block`
  - `search_docs`
- **Remove manual trace lifecycle**: Eliminated the verbose `finally`
block that manually ended traces/generations
- **Rename tool**: `agent_output` → `view_agent_output` for clarity

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Verified chat feature works with Langfuse tracing enabled
- [x] Confirmed traces appear correctly in Langfuse dashboard with tool
spans
  - [x] Tested tool execution flows show up as nested observations

#### For configuration changes:

- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)

No configuration changes required - uses existing Langfuse environment
variables.
2026-01-19 20:56:51 +00:00
Nicholas Tindle
bead811e73 docs(classic): add workspace, settings, and permissions documentation
Document the layered configuration system including:
- Workspace structure (.autogpt/ directory layout)
- Settings location (environment variables, workspace YAML, agent YAML)
- Permission system (check order, pattern syntax, approval scopes)
- Default security behavior

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 12:17:10 -06:00
Nicholas Tindle
013f728ebf feat(forge): improve tool call error feedback for LLM self-correction
When tool calls fail validation, the error messages now include:
- What arguments were actually provided
- The expected parameter schema with types and required/optional indicators

This helps LLMs understand and fix their mistakes when retrying,
rather than just being told a parameter is missing.

Example improved error:
  Invalid function call for write_file: 'contents' is a required property
  You provided: {"filename": 'story.txt'}
  Expected parameters: {"filename": string (required), "contents": string (required)}

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 11:49:17 -06:00
Nicholas Tindle
cda9572acd feat(forge): add lightweight web fetch component
Add WebFetchComponent for fast HTTP-based page fetching without browser
overhead. Uses trafilatura for intelligent content extraction.

Commands:
- fetch_webpage: Extract main content as text/markdown/xml
  - Removes navigation, ads, boilerplate automatically
  - Extracts page metadata (title, description, author, date)
  - Extracts and lists page links
  - Much faster than Selenium-based read_webpage

- fetch_raw_html: Get raw HTML for structure inspection
  - Optional truncation for large pages

Features:
- Trafilatura-powered content extraction (best-in-class accuracy)
- Automatic link extraction with relative URL resolution
- Page metadata extraction (OG tags, meta tags)
- Configurable timeout, max content length, max links
- Proper error handling for timeouts and HTTP errors
- 19 comprehensive tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 01:04:22 -06:00
Nicholas Tindle
c1a1767034 feat(docs): Add block documentation auto-generation system (#11707)
- Add generate_block_docs.py script that introspects block code to
generate markdown
- Support manual content preservation via <!-- MANUAL: --> markers
- Add migrate_block_docs.py to preserve existing manual content from git
HEAD
- Add CI workflow (docs-block-sync.yml) to fail if docs drift from code
- Add Claude PR review workflow (docs-claude-review.yml) for doc changes
- Add manual LLM enhancement workflow (docs-enhance.yml)
- Add GitBook configuration (.gitbook.yaml, SUMMARY.md)
- Fix non-deterministic category ordering (categories is a set)
- Add comprehensive test suite (32 tests)
- Generate docs for 444 blocks with 66 preserved manual sections

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

<!-- Clearly explain the need for these changes: -->

### Changes 🏗️

<!-- Concisely describe all of the changes made in this pull request:
-->

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [x] Extensively test code generation for the docs pages



<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Introduces an automated documentation pipeline for blocks and
integrates it into CI.
> 
> - Adds `scripts/generate_block_docs.py` (+ tests) to introspect blocks
and generate `docs/integrations/**`, preserving `<!-- MANUAL: -->`
sections
> - New CI workflows: **docs-block-sync** (fails if docs drift),
**docs-claude-review** (AI review for block/docs PRs), and
**docs-enhance** (optional LLM improvements)
> - Updates existing Claude workflows to use `CLAUDE_CODE_OAUTH_TOKEN`
instead of `ANTHROPIC_API_KEY`
> - Improves numerous block descriptions/typos and links across backend
blocks to standardize docs output
> - Commits initial generated docs including
`docs/integrations/README.md` and many provider/category pages
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
631e53e0f6. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 07:03:19 +00:00
Nicholas Tindle
e0784f8f6b refactor(forge): simplify deeply nested error handling in Anthropic provider
- Extract _get_tool_error_message helper method
- Replace 20+ levels of nesting with simple for loop
- Improve readability of tool_result construction
- Update benchmark poetry.lock

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 00:15:33 -06:00
Nicholas Tindle
3040f39136 feat(forge): modernize web search with tiered provider system
Replace basic DuckDuckGo-only search with a modern tiered system:

1. Tavily (primary) - AI-optimized results with content extraction
   - AI-generated answer summaries
   - Relevance scoring
   - Full page content extraction via search_and_extract command

2. Serper (secondary) - Fast, cheap Google SERP results
   - $0.30-1.00 per 1K queries
   - Real Google results without scraping

3. DDGS multi-engine (fallback) - Free, no API key required
   - Automatic fallback chain: DuckDuckGo → Bing → Brave → Google → etc.
   - 8 search backends supported

Key changes:
- Upgrade duckduckgo-search to ddgs v9.10 (renamed successor package)
- Add Tavily and Serper API integrations
- Implement automatic provider selection and fallback chain
- Add search_and_extract command for research with content extraction
- Add TAVILY_API_KEY and SERPER_API_KEY to env templates
- Update benchmark httpx constraint for ddgs compatibility
- 23 comprehensive tests for all providers and fallback scenarios

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 00:06:42 -06:00
Nicholas Tindle
515504c604 fix(classic): resolve pyright type errors in original_autogpt
- Change Agent class to use ActionProposal instead of OneShotAgentActionProposal
  to support multiple prompt strategy types
- Widen display_thoughts parameter type from AssistantThoughts to ModelWithSummary
- Fix speak attribute access in agent_protocol_server with hasattr check
- Add type: ignore comments for intentional thoughts field overrides in strategies
- Remove unused OneShotAgentActionProposal import

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 23:53:23 -06:00
Nicholas Tindle
18edeaeaf4 fix(classic): fix linting and formatting errors across codebase
- Fix 32+ flake8 E501 (line too long) errors by shortening descriptions
- Remove unused import in todo.py
- Fix test_todo.py argument order (config= keyword)
- Add type annotations to fix pyright errors where straightforward
- Add noqa comments for flake8 false positives in __init__.py
- Remove unused nonlocal declarations in main.py
- Run black and isort to fix formatting
- Update CLAUDE.md with improved linting commands

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 23:37:28 -06:00
Nicholas Tindle
44182aff9c feat(classic): add strategy benchmark test harness for CI
- Add test_prompt_strategies.py harness to compare prompt strategies
- Add pytest wrapper (test_strategy_benchmark.py) for CI integration
- Fix serve command (remove invalid --port flag, use AP_SERVER_PORT env)
- Fix test category (interface -> general)
- Add aiohttp-retry dependency for agbenchmark
- Add pytest markers: slow, integration, requires_agent

Usage:
  poetry run python agbenchmark_config/test_prompt_strategies.py --quick
  poetry run pytest tests/integration/test_strategy_benchmark.py -v

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 23:36:19 -06:00
Nicholas Tindle
864c5a7846 fix(classic): approve+feedback now executes command then sends feedback
Previously, when a user selected "Once" or "Always" with feedback (via Tab),
the command was NOT executed because UserFeedbackProvided was raised before
checking the approval scope. This fix changes the architecture from
exception-based to return-value-based.

Changes:
- Add PermissionCheckResult class with allowed, scope, and feedback fields
- Change check_command() to return PermissionCheckResult instead of bool
- Update prompt_fn signature to return (ApprovalScope, feedback) tuple
- Add pending_user_feedback mechanism to EpisodicActionHistory
- Update execute() to handle feedback after successful command execution
- Feedback message explicitly states "Command executed successfully"
- Add on_auto_approve callback for displaying auto-approved commands
- Add comprehensive tests for approval/denial with feedback scenarios

Behavior:
- Once + feedback → Execute command, then send feedback to agent
- Always + feedback → Execute command, save permission, send feedback
- Deny + feedback → Don't execute, send feedback to agent

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 22:32:43 -06:00
Nicholas Tindle
699fffb1a8 feat(classic): add Rich interactive selector for command approval
Adds a custom Rich-based interactive selector for the command approval
workflow. Features include:
- Arrow key navigation for selecting approval options
- Tab to add context to any selection (e.g., "Once + also check file x")
- Dedicated inline feedback option with shadow placeholder text
- Quick select with number keys 1-5
- Works within existing asyncio event loop (no prompt_toolkit dependency)

Also adds UIProvider abstraction pattern for future UI implementations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 21:49:43 -06:00
Nicholas Tindle
f0641c2d26 fix(classic): auto-advance plan steps in Plan-Execute strategy
The strategy was stuck in a loop because it tracked plan steps but never
advanced them - the record_step_success() method existed but was never
called by the agent's execution loop.

Fix by using a _pending_step_advance flag to track when an action has
been proposed. On the next parse_response_content() call, advance the
previous step before processing the new response. This keeps step
tracking self-contained in the strategy without requiring agent changes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 21:14:16 -06:00
Nicholas Tindle
94b6f74c95 feat(classic): add multiple prompt strategies for agent reasoning
Implement four new prompt strategies based on research papers:

- ReWOO: Reasoning Without Observation (5x token efficiency)
- Plan-and-Execute: Separate planning from execution phases
- Reflexion: Verbal reinforcement learning with episodic memory
- Tree of Thoughts: Deliberate problem solving with tree search

Each strategy extends a new BaseMultiStepPromptStrategy base class
with shared utilities. Strategies are selectable via PROMPT_STRATEGY
environment variable or config.prompt_strategy setting.

Fix JSONSchema generation issue where Optional/Union types created
anyOf schemas without direct type field - resolved by storing
plan/phase state in strategy instances rather than ActionProposal.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 20:33:10 -06:00
Nicholas Tindle
46aabab3ea feat(classic): upgrade to Python 3.12+ with CI testing on 3.12, 3.13, 3.14
- Update Python version constraint from ^3.10 to ^3.12 in all pyproject.toml
- Update classifiers to reflect Python 3.12, 3.13, 3.14 support
- Update dependencies for Python 3.13+ compatibility:
  - chromadb: ^0.4.10 -> ^1.4.0
  - numpy: >=1.26.0,<2.0.0 -> >=2.0.0
  - watchdog: 4.0.0 -> ^6.0.0
  - spacy: ^3.0.0 -> ^3.8.0 (numpy 2.x compatibility)
  - en-core-web-sm model: 3.7.1 -> 3.8.0
  - httpx (benchmark): ^0.24.0 -> ^0.27.0
- Update tool configuration:
  - Black target-version: py310 -> py312
  - Pyright pythonVersion: 3.10 -> 3.12
- Update Dockerfiles to use Python 3.12
- Update CI workflows to test on Python 3.12, 3.13, and 3.14
- Regenerate all poetry.lock files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 20:25:11 -06:00
Nicholas Tindle
0a65df5102 fix(classic): always use native tool calling, fix N/A command loop
- Remove openai_functions config option - native tool calling is now always enabled
- Remove use_functions_api from BaseAgentConfiguration and prompt strategy
- Add use_prefill config to disable prefill for Anthropic (prefill + tools incompatible)
- Update anthropic dependency to ^0.45.0 for tools API support
- Simplify prompt strategy to always expect tool_calls from LLM response

This fixes the N/A command loop bug where models would output "N/A" as a
command name when function calling was disabled. With native tool calling
always enabled, models are forced to pick from valid tools only.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 19:54:40 -06:00
Nicholas Tindle
6fbd208fe3 chore: ignore .claude/settings.local.json in all directories
Update gitignore to use glob pattern for settings.local.json files
in any .claude directory. Also untrack the existing file.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 18:54:42 -06:00
Nicholas Tindle
8fc174ca87 refactor(classic): simplify log format by removing timestamps
Remove asctime from log formats since terminal output already has
timestamps from the logging infrastructure. Makes logs cleaner
and easier to read.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 18:52:47 -06:00
Nicholas Tindle
cacc89790f feat(classic): improve AutoGPT configuration and setup
Environment loading:
- Search for .env in multiple locations (cwd, ~/.autogpt, ~/.config/autogpt)
- Allows running autogpt from any directory
- Document search order in .env.template

Setup simplification:
- Remove interactive AI settings revision (was broken/unused)
- Simplify to just printing current settings
- Clean up unused imports

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 18:52:38 -06:00
Nicholas Tindle
b9113bee02 feat(classic): enhance existing components with new capabilities
CodeExecutorComponent:
- Add timeout and env_vars parameters to execution commands
- Add execute_shell_popen for streaming output
- Improve error handling with CodeTimeoutError

FileManagerComponent:
- Add file_info, file_search, file_copy, file_move commands
- Add directory_create, directory_list_tree commands
- Better path validation and error messages

GitOperationsComponent:
- Add git_log, git_show, git_branch commands
- Add git_stash, git_stash_pop, git_stash_list commands
- Add git_cherry_pick, git_revert, git_reset commands
- Add git_remote, git_fetch, git_pull, git_push commands

UserInteractionComponent:
- Add ask_multiple_choice for structured options
- Add notify_user for non-blocking notifications
- Add confirm_action for yes/no confirmations

WebSearchComponent:
- Minor error handling improvements

WebSeleniumComponent:
- Add get_page_content, execute_javascript commands
- Add take_element_screenshot command
- Add wait_for_element, scroll_page commands
- Improve element interaction reliability

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 18:52:27 -06:00
Nicholas Tindle
3f65da03e7 feat(classic): add new exception types for enhanced error handling
Add specialized exception classes for better error reporting:
- CodeTimeoutError: For code execution timeouts
- HTTPError: For HTTP request failures with status code/URL
- DataProcessingError: For JSON/CSV processing errors

Each exception includes helpful hints for users.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 18:52:10 -06:00
Nicholas Tindle
9e96d11b2d feat(classic): add utility components for agent capabilities
Add 6 new utility components to expand agent functionality:

- ArchiveHandlerComponent: ZIP/TAR archive operations (create, extract, list)
- ClipboardComponent: In-memory clipboard for copy/paste operations
- DataProcessorComponent: CSV/JSON data manipulation and analysis
- HTTPClientComponent: HTTP requests (GET, POST, PUT, DELETE)
- MathUtilsComponent: Mathematical calculations and statistics
- TextUtilsComponent: Text processing (regex, diff, encoding, hashing)

All components follow the forge component pattern with:
- CommandProvider for exposing commands
- DirectiveProvider for resources/best practices
- Comprehensive parameter validation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 18:50:52 -06:00
Nicholas Tindle
4c264b7ae9 feat(classic): add TodoComponent with LLM-powered decomposition
Add a task management component modeled after Claude Code's TodoWrite:
- TodoItem with recursive sub_items for hierarchical task structure
- todo_write: atomic list replacement with sub-items support
- todo_read: retrieve current todos with nested structure
- todo_clear: clear all todos
- todo_decompose: use smart LLM to break down tasks into sub-steps

Features:
- Hierarchical task tracking with independent status per sub-item
- MessageProvider shows todos in LLM context with proper indentation
- DirectiveProvider adds best practices for task management
- Graceful fallback when LLM provider not configured

Integrates with:
- original_autogpt Agent (full LLM decomposition support)
- ForgeAgent (basic task tracking, no decomposition)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 18:49:48 -06:00
Nicholas Tindle
0adbc0bd05 fix(classic): update CI for removed frontend and helper scripts
Remove references to deleted files (./run, cli.py, setup.py, frontend/)
from CI workflows. Replace ./run agent start with direct poetry commands
to start agent servers in background.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 17:41:11 -06:00
Nicholas Tindle
8f3291bc92 feat(classic): add workspace permissions system for agent commands
Add a layered permission system that controls agent command execution:

- Create autogpt.yaml in .autogpt/ folder with default allow/deny rules
- File operations in workspace allowed by default
- Sensitive files (.env, .key, .pem) blocked by default
- Dangerous shell commands (sudo, rm -rf) blocked by default
- Interactive prompts for unknown commands (y=agent, Y=workspace, n=deny)
- Agent-specific permissions stored in .autogpt/agents/{id}/permissions.yaml

Files added:
- forge/forge/config/workspace_settings.py - Pydantic models for settings
- forge/forge/permissions.py - CommandPermissionManager with pattern matching

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 17:39:33 -06:00
Nicholas Tindle
7a20de880d chore: add .autogpt/ to gitignore
The .autogpt/ directory is where AutoGPT stores agent data when running
from any directory. This should not be committed to version control.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 17:02:47 -06:00
Nicholas Tindle
ef8a6d2528 feat(classic): make AutoGPT installable and runnable from any directory
Add --workspace option to CLI that defaults to current working directory,
allowing users to run `autogpt` from any folder. Agent data is now stored
in `.autogpt/` subdirectory of the workspace instead of a hardcoded path.

Changes:
- Add -w/--workspace CLI option to run and serve commands
- Remove dependency on forge package location for PROJECT_ROOT
- Update config to use workspace instead of project_root
- Store agent data in .autogpt/ within workspace directory
- Update pyproject.toml files with proper PyPI metadata
- Fix outdated tests to match current implementation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 17:00:36 -06:00
Nicholas Tindle
fd66be2aaa chore(classic): remove unneeded files and add CLAUDE.md docs
- Remove deprecated Flutter frontend (replaced by autogpt_platform)
- Remove shell scripts (run, setup, autogpt.sh, etc.)
- Remove tutorials (outdated)
- Remove CLI-USAGE.md and FORGE-QUICKSTART.md
- Add CLAUDE.md files for Claude Code guidance

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 16:17:35 -06:00
Nicholas Tindle
ae2cc97dc4 feat(classic): add modern Anthropic models and fix deprecated API
- Add Claude 3.5 v2, Claude 4 Sonnet, Claude 4 Opus, and Claude 4.5 Opus models
- Add rolling aliases (CLAUDE_SONNET, CLAUDE_OPUS, CLAUDE_HAIKU)
- Fix deprecated beta.tools.messages.create API call to use standard messages.create
- Update anthropic SDK from ^0.25.1 to >=0.40,<1.0

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 16:15:16 -06:00
Nicholas Tindle
1b56ff13d9 test 2026-01-18 15:32:10 -06:00
Zamil Majdy
f31c160043 feat(platform): add endedAt field and fix execution analytics timestamps (#11759)
## Summary

This PR adds proper execution end time tracking and fixes timestamp
handling throughout the execution analytics system.

### Key Changes

1. **Added `endedAt` field to database schema** - Executions now have a
dedicated field for tracking when they finish
2. **Fixed timestamp nullable handling** - `started_at` and `ended_at`
are now properly nullable in types
3. **Fixed chart aggregation** - Reduced threshold from ≥3 to ≥1
executions per day
4. **Improved timestamp display** - Moved timestamps to expandable
details section in analytics table
5. **Fixed nullable timestamp bugs** - Updated all frontend code to
handle null timestamps correctly

## Problem Statement

### Issue 1: Missing Execution End Times
Previously, executions used `updatedAt` (last DB update) as a proxy for
"end time". This broke when adding correctness scores retroactively -
the end time would change to whenever the score was added, not when the
execution actually finished.

### Issue 2: Chart Shows Only One Data Point
The accuracy trends chart showed only one data point despite having
executions across multiple days. Root cause: aggregation required ≥3
executions per day.

### Issue 3: Incorrect Type Definitions
Manually maintained types defined `started_at` and `ended_at` as
non-nullable `Date`, contradicting reality where QUEUED executions
haven't started yet.

## Solution

### Database Schema (`schema.prisma`)
```prisma
model AgentGraphExecution {
  // ...
  startedAt DateTime?
  endedAt   DateTime?  // NEW FIELD
  // ...
}
```

### Execution Lifecycle
- **QUEUED**: `startedAt = null`, `endedAt = null` (not started)
- **RUNNING**: `startedAt = set`, `endedAt = null` (in progress)  
- **COMPLETED/FAILED/TERMINATED**: `startedAt = set`, `endedAt = set`
(finished)

### Migration Strategy
```sql
-- Add endedAt column
ALTER TABLE "AgentGraphExecution" ADD COLUMN "endedAt" TIMESTAMP(3);

-- Backfill ONLY terminal executions (prevents marking RUNNING executions as ended)
UPDATE "AgentGraphExecution"
SET "endedAt" = "updatedAt"
WHERE "endedAt" IS NULL
  AND "executionStatus" IN ('COMPLETED', 'FAILED', 'TERMINATED');
```

## Changes by Component

### Backend

**`schema.prisma`**
- Added `endedAt` field to `AgentGraphExecution`

**`execution.py`**
- Made `started_at` and `ended_at` optional with Field descriptions
- Updated `from_db()` to use `endedAt` instead of `updatedAt`
- `update_graph_execution_stats()` sets `endedAt` when status becomes
terminal

**`execution_analytics_routes.py`**
- Removed `created_at`/`updated_at` from `ExecutionAnalyticsResult` (DB
metadata, not execution data)
- Kept only `started_at`/`ended_at` (actual execution runtime)
- Made settings global (avoid recreation)
- Moved OpenAI key validation to `_process_batch` (only check when LLM
actually runs)

**`analytics.py`**
- Fixed aggregation: `COUNT(*) >= 1` (was 3) - include all days with ≥1
execution
- Uses `createdAt` for chart grouping (when execution was queued)

**`late_execution_monitor.py`**
- Handle optional `started_at` with fallback to `datetime.min` for
sorting
- Display "Not started" when `started_at` is null

### Frontend

**Type Definitions**
- Fixed manually maintained `types.ts`: `started_at: Date | null` (was
non-nullable)
- Generated types were already correct

**Analytics Components**
- `AnalyticsResultsTable.tsx`: Show only `started_at`/`ended_at` in
2-column expandable grid
- `ExecutionAnalyticsForm.tsx`: Added filter explanation UI

**Monitoring Components** - Fixed null handling bugs:
- `OldAgentLibraryView.tsx`: Handle null in reduce function
- `agent-runs-selector-list.tsx`: Safe sorting with `?.getTime() ?? 0`
- `AgentFlowList.tsx`: Filter/sort with null checks
- `FlowRunsStatus.tsx`: Filter null timestamps
- `FlowRunsTimeline.tsx`: Filter executions with null timestamps before
rendering
- `monitoring/page.tsx`: Safe sorting
- `ActivityItem.tsx`: Fallback to "recently" for null timestamps

## Benefits

 **Accurate End Times**: `endedAt` is frozen when execution finishes,
not updated later
 **Type Safety**: Nullable types match reality, exposing real bugs  
 **Better UX**: Chart shows all days with data (not just days with ≥3
executions)
 **Bug Fixes**: 7+ frontend components now handle null timestamps
correctly
 **Documentation**: Field descriptions explain when timestamps are null

## Testing

### Backend
```bash
cd autogpt_platform/backend
poetry run format  #  All checks passed
poetry run lint    #  All checks passed
```

### Frontend  
```bash
cd autogpt_platform/frontend
pnpm format        #  All checks passed
pnpm lint          #  All checks passed
pnpm types         #  All type errors fixed
```

### Test Data Generation
Created script to generate 35 test executions across 7 days with
correctness scores:
```bash
poetry run python scripts/generate_test_analytics_data.py
```

## Migration Notes

⚠️ **Important**: The migration only backfills `endedAt` for executions
with terminal status (COMPLETED, FAILED, TERMINATED). Active executions
(QUEUED, RUNNING) correctly keep `endedAt = null`.

## Breaking Changes

None - this is backward compatible:
- `endedAt` is nullable, existing code that doesn't use it is unaffected
- Frontend already used generated types which were correct
- Migration safely backfills historical data

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Introduces explicit execution end-time tracking and normalizes
timestamp handling across backend and frontend.
> 
> - Adds `endedAt` to `AgentGraphExecution` (schema + migration);
backfills terminal executions; sets `endedAt` on terminal status updates
> - Makes `GraphExecutionMeta.started_at/ended_at` optional; updates
`from_db()` to use DB `endedAt`; exposes timestamps in
`ExecutionAnalyticsResult`
> - Moves OpenAI key validation into batch processing; instantiates
`Settings` once
> - Accuracy trends: reduce daily aggregation threshold to `>= 1`;
optional historical series
> - Monitoring/analytics UI: results table shows/export
`started_at`/`ended_at`; adds chart filter explainer
> - Frontend null-safety: update types (`Date | null`) and fix
sorting/filtering/rendering for nullable timestamps across monitoring
and library views
> - Late execution monitor: safe sorting/display when `started_at` is
null
> - OpenAPI specs updated for new/nullable fields
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
1d987ca6e5. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
2026-01-16 21:44:24 +00:00
Nicholas Tindle
06550a87eb feat(backend): add missed default credentials (#11760)
### Changes 🏗️

**Fixed missing default credentials and provider name mismatch in the
credentials store:**

1. **Provider name correction** (`credentials_store.py:97-103`)
- Changed `provider="unreal"` → `provider="unreal_speech"` to match the
existing `unreal_speech_api_key` setting and block usage
- Updated title from "Use Credits for Unreal" → "Use Credits for Unreal
Speech" for clarity

2. **Added missing OpenWeatherMap credentials**
(`credentials_store.py:219-226`)
- New `openweathermap_credentials` definition with `APIKeyCredentials`
- Uses existing `settings.secrets.openweathermap_api_key` setting that
was previously defined but had no credential object
   - Added to `DEFAULT_CREDENTIALS` list

3. **Fixed credentials not exposed in `get_all_creds()`**
(`credentials_store.py:343-354`)
- Added `llama_api_credentials` conditional append (was defined but not
returned to users)
- Added `v0_credentials` conditional append (was defined but not
returned to users)
   - Added `openweathermap_credentials` conditional append

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verified provider name `unreal_speech` matches block usage in
`text_to_speech_block.py`
  - [x] Confirmed `openweathermap_api_key` setting exists in secrets
- [x] Confirmed `llama_api_key` and `v0_api_key` settings exist in
secrets

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Aligns backend credential definitions and exposes missing system
creds; updates frontend to hide new built-ins.
> 
> - Backend `credentials_store.py`:
>   - Corrects `provider` to `unreal_speech` and updates title
> - Adds `openweathermap_credentials`; includes in `DEFAULT_CREDENTIALS`
and `get_all_creds()` when key present
> - Ensures `llama_api_credentials` and `v0_credentials` are returned by
`get_all_creds()`
> - Frontend `integrations/page.tsx`:
> - Extends `hiddenCredentials` with IDs for `v0`, `webshare_proxy`, and
`openweathermap`
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
e7d46b76c6. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <ntindle@users.noreply.github.com>
2026-01-16 21:18:12 +00:00
Nicholas Tindle
088b9998dc fix(frontend): Fix flaky agent-activity tests by targeting correct agent (#11790)
This PR fixes flaky agent-activity Playwright tests that were failing
intermittently in CI.

Closes #11789

### Changes 🏗️

- **Navigate to specific agent by name**: Replace
`LibraryPage.clickFirstAgent(page)` with
`LibraryPage.navigateToAgentByName(page, "Test Agent")` to ensure we're
testing the correct agent rather than relying on the first agent in the
list
- **Add retry mechanism for async data loading**: Replace direct
visibility check with `expect(...).toPass({ timeout: 15000 })` pattern
to properly handle asynchronous agent data fetching
- **Increase timeout**: Extended timeout from 8000ms to 15000ms to
accommodate slower CI environments

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Verified the test file syntax is correct
- [x] Changes target the correct file
(`autogpt_platform/frontend/src/tests/agent-activity.spec.ts`)
- [x] The retry mechanism follows Playwright best practices using
`toPass()`

#### For configuration changes:

- [x] `.env.default` is updated or already compatible with my changes
(N/A - no config changes)
- [x] `docker-compose.yml` is updated or already compatible with my
changes (N/A - no config changes)
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**) (N/A - no config changes)

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <ntindle@users.noreply.github.com>
2026-01-16 20:33:47 +00:00
Nicholas Tindle
05c89fa5c0 feat(claude): add vercel-react-best-practices skill (#11777) 2026-01-16 09:40:58 -07:00
Swifty
8cc8295f14 feat(backend): add agent generator tools for chat copilot (#11781)
This PR adds the ability to create and edit agents from natural language
descriptions in the chat copilot.

### Changes 🏗️

- Added `agent_generator/` module with:
  - LLM client for OpenAI API calls
- Core generation logic for decomposing goals and generating agent JSON
  - Fixer module to correct common LLM generation errors
  - Validator to ensure generated agents are structurally valid
  - Prompts for goal decomposition and agent generation
  - Utility functions for blocks info and agent saving
- Added `CreateAgentTool` - creates new agents from natural language
descriptions
- Added `EditAgentTool` - edits existing agents using natural language
patches
- Added response models: `AgentPreviewResponse`, `AgentSavedResponse`,
`ClarificationNeededResponse`
- Registered new tools in the tools registry

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Run `poetry run format` to ensure code passes linting
- [x] Test creating an agent via chat with a natural language
description
  - [x] Test editing an existing agent via chat
2026-01-16 17:11:57 +01:00
Swifty
e55f05c7a8 feat(backend): add chat search tools and BM25 reranking (#11782)
This PR adds new chat tools for searching blocks and documentation,
along with BM25 reranking for improved search relevance.

### Changes 🏗️

**New Chat Tools:**
- `find_block` - Search for available blocks by name/description using
hybrid search
- `run_block` - Execute a block directly with provided inputs and
credentials
- `search_docs` - Search documentation with section-level granularity  
- `get_doc_page` - Retrieve full documentation page content

**Search Improvements:**
- Added BM25 reranking to hybrid search for better lexical relevance
- Documentation handler now chunks markdown by headings (##) for
finer-grained embeddings
- Section-based content IDs (`doc_path::section_index`) for precise doc
retrieval
- Startup embedding backfill in scheduler for immediate searchability

**Other Changes:**
- New response models for block and documentation search results
- Updated orphan cleanup to handle section-based doc embeddings
- Added `rank-bm25` dependency for BM25 scoring
- Removed max message limit check in chat service

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Run find_block tool to search for blocks (e.g., "current time")
  - [x] Run run_block tool to execute a found block
  - [x] Run search_docs tool to search documentation
  - [x] Run get_doc_page tool to retrieve full doc content
- [x] Verify BM25 reranking improves search relevance for exact term
matches
  - [x] Verify documentation sections are properly chunked and embedded

#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)

**Dependencies added:** `rank-bm25` for BM25 scoring algorithm
2026-01-16 16:18:10 +01:00