AutoGPT

mirror of https://github.com/Significant-Gravitas/AutoGPT.git synced 2026-02-06 21:05:13 -05:00

Author	SHA1	Message	Date
Nicholas Tindle	7668c17d9c	feat(platform): add User Workspace for persistent CoPilot file storage (#11867 ) Implements persistent User Workspace storage for CoPilot, enabling blocks to save and retrieve files across sessions. Files are stored in session-scoped virtual paths (`/sessions/{session_id}/`). Fixes SECRT-1833 ### Changes 🏗️ Database & Storage: - Add `UserWorkspace` and `UserWorkspaceFile` Prisma models - Implement `WorkspaceStorageBackend` abstraction (GCS for cloud, local filesystem for self-hosted) - Add `workspace_id` and `session_id` fields to `ExecutionContext` Backend API: - Add REST endpoints: `GET/POST /api/workspace/files`, `GET/DELETE /api/workspace/files/{id}`, `GET /api/workspace/files/{id}/download` - Add CoPilot tools: `list_workspace_files`, `read_workspace_file`, `write_workspace_file` - Integrate workspace storage into `store_media_file()` - returns `workspace://file-id` references Block Updates: - Refactor all file-handling blocks to use unified `ExecutionContext` parameter - Update media-generating blocks to persist outputs to workspace (AIImageGenerator, AIImageCustomizer, FluxKontext, TalkingHead, FAL video, Bannerbear, etc.) 
Frontend: - Render `workspace://` image references in chat via proxy endpoint - Add "AI cannot see this image" overlay indicator CoPilot Context Mapping: - Session = Agent (graph_id) = Run (graph_exec_id) - Files scoped to `/sessions/{session_id}/` ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [ ] I have tested my changes according to the test plan: - [ ] Create CoPilot session, generate image with AIImageGeneratorBlock - [ ] Verify image returns `workspace://file-id` (not base64) - [ ] Verify image renders in chat with visibility indicator - [ ] Verify workspace files persist across sessions - [ ] Test list/read/write workspace files via CoPilot tools - [ ] Test local storage backend for self-hosted deployments #### For configuration changes: - [x] `.env.default` is updated or already compatible with my changes - [x] `docker-compose.yml` is updated or already compatible with my changes - [x] I have included a list of my configuration changes in the PR description (under Changes) 🤖 Generated with [Claude Code](https://claude.ai/code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Medium Risk > Introduces a new persistent file-storage surface area (DB tables, storage backends, download API, and chat tools) and rewires `store_media_file()`/block execution context across many blocks, so regressions could impact file handling, access control, or storage costs. > > Overview > Adds a persistent per-user Workspace (new `UserWorkspace`/`UserWorkspaceFile` models plus `WorkspaceManager` + `WorkspaceStorageBackend` with GCS/local implementations) and wires it into the API via a new `/api/workspace/files/{file_id}/download` route (including header-sanitized `Content-Disposition`) and shutdown lifecycle hooks. 
> > Extends `ExecutionContext` to carry execution identity + `workspace_id`/`session_id`, updates executor tooling to clone node-specific contexts, and updates `run_block` (CoPilot) to create a session-scoped workspace and synthetic graph/run/node IDs. > > Refactors `store_media_file()` to require `execution_context` + `return_format` and to support `workspace://` references; migrates many media/file-handling blocks and related tests to the new API and to persist generated media as `workspace://...` (or fall back to data URIs outside CoPilot), and adds CoPilot chat tools for listing/reading/writing/deleting workspace files with safeguards against context bloat. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit `6abc70f793`. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Reinier van der Leer <pwuts@agpt.co>	2026-01-29 05:49:47 +00:00
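The commit above describes `workspace://file-id` references and session-scoped virtual paths under `/sessions/{session_id}/`. A minimal sketch of how such references and paths could be handled follows; `parse_workspace_ref` and `session_path` are hypothetical helper names for illustration, not functions confirmed by the PR.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse

WORKSPACE_SCHEME = "workspace"  # scheme taken from the PR text


def parse_workspace_ref(ref: str) -> Optional[str]:
    """Return the file id from a workspace://file-id reference, else None.

    Hypothetical helper: the real parsing code is not shown in the PR.
    """
    parsed = urlparse(ref)
    if parsed.scheme != WORKSPACE_SCHEME or not parsed.netloc:
        return None
    return parsed.netloc


def session_path(session_id: str, filename: str) -> str:
    """Build a virtual path scoped to /sessions/{session_id}/.

    Only the base name is kept, so "../" segments cannot leave the
    session directory (an assumed safeguard, not one the PR documents).
    """
    safe_name = posixpath.basename(filename)
    if safe_name in ("", ".", ".."):
        raise ValueError(f"invalid filename: {filename!r}")
    return f"/sessions/{session_id}/{safe_name}"
```

Anything that is not a `workspace://` reference (for example a data URI) yields `None`, which lets a caller fall back to its existing handling.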
Zamil Majdy	fb58827c61	feat(backend;frontend): Implement node-specific auto-approval, safety popup, and race condition fixes (#11810 ) ## Summary This PR implements comprehensive improvements to the human-in-the-loop (HITL) review system, including safety features, architectural changes, and bug fixes: ### Key Features - SECRT-1798: One-time safety popup - Shows informational popup before first run of AI-generated agents with sensitive actions/HITL blocks - SECRT-1795: Auto-approval toggle UX - Toggle in pending reviews panel to auto-approve future actions from the same node - Node-specific auto-approval - Changed from execution-specific to node-specific using special key pattern `auto_approve_{graph_exec_id}_{node_id}` - Consolidated approval checking - Merged `check_auto_approval` into `check_approval` using single OR query for better performance - Race condition prevention - Added execution status check before resuming to prevent duplicate execution when approving while graph is running - Parallel auto-approval creation - Uses `asyncio.gather` for better performance when creating multiple auto-approval records ## Changes ### Backend Architecture - `human_review.py`: - Added `check_approval()` function that checks both normal and auto-approval in single query - Added `create_auto_approval_record()` for node-specific auto-approval using special key pattern - Added `get_auto_approve_key()` helper to generate consistent auto-approval keys - `review/routes.py`: - Added execution status check before resuming to prevent race conditions - Refactored auto-approval record creation to use parallel execution with `asyncio.gather` - Removed obvious comments for cleaner code - `review/model.py`: Added `auto_approve_future_actions` field to `ReviewRequest` - `blocks/helpers/review.py`: Updated to use consolidated `check_approval` via database manager client - `executor/database.py`: Exposed `check_approval` through DatabaseManager RPC for block execution context - 
`data/block.py`: Fixed safe mode checks for sensitive action blocks ### Frontend - New `AIAgentSafetyPopup` component with localStorage-based one-time display - `PendingReviewsList`: - Replaced "Approve all future actions" button with toggle - Toggle resets data to original values and disables editing when enabled - Shows warning message explaining auto-approval behavior - `RunAgentModal`: Integrated safety popup before first run - `usePendingReviews`: Added polling for real-time badge updates - `FloatingSafeModeToggle` & `SafeModeToggle`: Simplified visibility logic - `local-storage.ts`: Added localStorage key for popup state tracking ### Bug Fixes - Fixed "Client is not connected to query engine" error by using database manager client pattern - Fixed race condition where approving reviews while graph is RUNNING could queue execution twice - Fixed migration to only drop FK constraint, not non-existent column - Fixed card data reset when auto-approve toggle changes ### Code Quality - Removed duplicate/obvious comments - Moved imports to top-level instead of local scope in tests - Used walrus operator for cleaner conditional assignments - Parallel execution for auto-approval record creation ## Test plan - [ ] Create an AI-generated agent with sensitive actions (e.g., email sending) - [ ] First run should show the safety popup before starting - [ ] Subsequent runs should not show the popup - [ ] Clear localStorage (`AI_AGENT_SAFETY_POPUP_SHOWN`) to verify popup shows again - [ ] Create an agent with human-in-the-loop blocks - [ ] Run it and verify the pending reviews panel appears - [ ] Enable the "Auto-approve all future actions" toggle - [ ] Verify editing is disabled and shows warning message - [ ] Click "Approve" and verify subsequent blocks from same node auto-approve - [ ] Verify auto-approval persists across multiple executions of same graph - [ ] Disable toggle and verify editing works normally - [ ] Verify "Reject" button still works regardless of toggle 
state - [ ] Test race condition: Approve reviews while graph is RUNNING (should skip resume) - [ ] Test race condition: Approve reviews while graph is REVIEW (should resume) - [ ] Verify pending reviews badge updates in real-time when new reviews are created	2026-01-25 04:05:25 +07:00
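The node-specific key pattern, the parallel record creation, and the status check before resuming can be sketched as below. The key format `auto_approve_{graph_exec_id}_{node_id}` and the name `get_auto_approve_key` come from the PR text; the signatures, the in-memory `store`, and `should_resume` are assumptions made for illustration.

```python
import asyncio
from typing import Dict, Iterable


def get_auto_approve_key(graph_exec_id: str, node_id: str) -> str:
    """Build the special key described in the PR (signature assumed)."""
    return f"auto_approve_{graph_exec_id}_{node_id}"


async def create_auto_approval_record(
    store: Dict[str, bool], graph_exec_id: str, node_id: str
) -> None:
    """Stand-in for a database write; `store` is a hypothetical in-memory table."""
    await asyncio.sleep(0)  # yield control, as a real DB call would
    store[get_auto_approve_key(graph_exec_id, node_id)] = True


async def approve_nodes(
    store: Dict[str, bool], graph_exec_id: str, node_ids: Iterable[str]
) -> None:
    """Create all auto-approval records concurrently with asyncio.gather."""
    await asyncio.gather(
        *(create_auto_approval_record(store, graph_exec_id, n) for n in node_ids)
    )


def should_resume(execution_status: str) -> bool:
    """Resume only from REVIEW; a RUNNING graph is left alone.

    Hypothetical helper mirroring the two race-condition cases in the test plan.
    """
    return execution_status == "REVIEW"
```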
Zamil Majdy	8b25e62959	feat(backend,frontend): add explicit safe mode toggles for HITL and sensitive actions (#11756 ) ## Summary This PR introduces two explicit safe mode toggles for controlling agent execution behavior, providing clearer and more granular control over when agents should pause for human review. ### Key Changes New Safe Mode Settings: - `human_in_the_loop_safe_mode` (bool, default `true`) - Controls whether human-in-the-loop (HITL) blocks pause for review - `sensitive_action_safe_mode` (bool, default `false`) - Controls whether sensitive action blocks pause for review New Computed Properties on LibraryAgent: - `has_human_in_the_loop` - Indicates if agent contains HITL blocks - `has_sensitive_action` - Indicates if agent contains sensitive action blocks Block Changes: - Renamed `requires_human_review` to `is_sensitive_action` on blocks for clarity - Blocks marked as `is_sensitive_action=True` pause only when `sensitive_action_safe_mode=True` - HITL blocks pause when `human_in_the_loop_safe_mode=True` Frontend Changes: - Two separate toggles in Agent Settings based on block types present - Toggle visibility based on `has_human_in_the_loop` and `has_sensitive_action` computed properties - Settings cog hidden if neither toggle applies - Proper state management for both toggles with defaults AI-Generated Agent Behavior: - AI-generated agents set `sensitive_action_safe_mode=True` by default - This ensures sensitive actions are reviewed for AI-generated content ## Changes Backend: - `backend/data/graph.py` - Updated `GraphSettings` with two boolean toggles (non-optional with defaults), added `has_sensitive_action` computed property - `backend/data/block.py` - Renamed `requires_human_review` to `is_sensitive_action`, updated review logic - `backend/data/execution.py` - Updated `ExecutionContext` with both safe mode fields - `backend/api/features/library/model.py` - Added `has_human_in_the_loop` and `has_sensitive_action` to `LibraryAgent` - 
`backend/api/features/library/db.py` - Updated to use `sensitive_action_safe_mode` parameter - `backend/executor/utils.py` - Simplified execution context creation Frontend: - `useAgentSafeMode.ts` - Rewritten to support two independent toggles - `AgentSettingsModal.tsx` - Shows two separate toggles - `SelectedSettingsView.tsx` - Shows two separate toggles - Regenerated API types with new schema ## Test Plan - [x] All backend tests pass (Python 3.11, 3.12, 3.13) - [x] All frontend tests pass - [x] Backend format and lint pass - [x] Frontend format and lint pass - [x] Pre-commit hooks pass --------- Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>	2026-01-21 00:56:02 +00:00
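The pause rules for the two toggles can be condensed into a small decision function. This is a sketch under stated assumptions: the field names and defaults come from the PR text, while the dataclass stand-in for `GraphSettings` and the helper `should_pause_for_review` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GraphSettings:
    """Stand-in for the real settings model; defaults follow the PR text."""

    human_in_the_loop_safe_mode: bool = True
    sensitive_action_safe_mode: bool = False


def should_pause_for_review(
    is_hitl_block: bool, is_sensitive_action: bool, settings: GraphSettings
) -> bool:
    """Return True when a block should wait for human review (hypothetical helper)."""
    if is_hitl_block and settings.human_in_the_loop_safe_mode:
        return True
    if is_sensitive_action and settings.sensitive_action_safe_mode:
        return True
    return False


# AI-generated agents enable the sensitive-action toggle, per the PR text.
ai_generated_settings = GraphSettings(sensitive_action_safe_mode=True)
```

With the defaults, a sensitive-action block runs without pausing while a HITL block waits for review; the AI-generated settings pause both.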
Reinier van der Leer	b01ea3fcbd	fix(backend/executor): Centralize `increment_runs` calls & make `add_graph_execution` more robust (#11764 ) [OPEN-2946: \[Scheduler\] Error executing graph <graph_id> after 19.83s: ClientNotConnectedError: Client is not connected to the query engine, you must call `connect()` before attempting to query data.](https://linear.app/autogpt/issue/OPEN-2946) - Follow-up to #11375 <sub>(broken `increment_runs` call)</sub> - Follow-up to #11380 <sub>(direct `get_graph_execution` call)</sub> ### Changes 🏗️ - Move `increment_runs` call from `scheduler._execute_graph` to `executor.utils.add_graph_execution` so it can be made through `DatabaseManager` - Add `increment_onboarding_runs` to `DatabaseManager` - Remove now-redundant `increment_onboarding_runs` calls in other places - Make `add_graph_execution` more resilient - Split up large try/except block - Fix direct `get_graph_execution` call ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - CI + a thorough review	2026-01-15 04:08:19 +00:00
Nicholas Tindle	47a3a5ef41	feat(backend,frontend): optional credentials flag for blocks at agent level (#11716 ) This feature allows agent makers to mark credential fields as optional. When credentials are not configured for an optional block, the block will be skipped during execution rather than causing a validation error. Use case: An agent with multiple notification channels (Discord, Twilio, Slack) where the user only needs to configure one - unconfigured channels are simply skipped. ### Changes 🏗️ #### Backend Data Model Changes: - `backend/data/graph.py`: Added `credentials_optional` property to `Node` model that reads from node metadata - `backend/data/execution.py`: Added `nodes_to_skip` field to `GraphExecutionEntry` model to track nodes that should be skipped Validation Changes: - `backend/executor/utils.py`: - Updated `_validate_node_input_credentials()` to return a tuple of `(credential_errors, nodes_to_skip)` - Nodes with `credentials_optional=True` and missing credentials are added to `nodes_to_skip` instead of raising validation errors - Updated `validate_graph_with_credentials()` to propagate `nodes_to_skip` set - Updated `validate_and_construct_node_execution_input()` to return `nodes_to_skip` - Updated `add_graph_execution()` to pass `nodes_to_skip` to execution entry Execution Changes: - `backend/executor/manager.py`: - Added skip logic in `_on_graph_execution()` dispatch loop - When a node is in `nodes_to_skip`, it is marked as `COMPLETED` without execution - No outputs are produced, so downstream nodes won't trigger #### Frontend Node Store: - `frontend/src/app/(platform)/build/stores/nodeStore.ts`: - Added `credentials_optional` to node metadata serialization in `convertCustomNodeToBackendNode()` - Added `getCredentialsOptional()` and `setCredentialsOptional()` helper methods Credential Field Component: - `frontend/src/components/renderers/input-renderer/fields/CredentialField/CredentialField.tsx`: - Added "Optional - skip block if not 
configured" switch toggle - Switch controls the `credentials_optional` metadata flag - Placeholder text updates based on optional state Credential Field Hook: - `frontend/src/components/renderers/input-renderer/fields/CredentialField/useCredentialField.ts`: - Added `disableAutoSelect` parameter - When credentials are optional, auto-selection of credentials is disabled Feature Flags: - `frontend/src/services/feature-flags/use-get-flag.ts`: Minor refactor (condition ordering) ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Build an agent using smart decision maker and down stream blocks to test this <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Introduces optional credentials across graph execution and UI, allowing nodes to be skipped (no outputs, no downstream triggers) when their credentials are not configured. > > - Backend > - Adds `Node.credentials_optional` (from node `metadata`) and computes required credential fields in `Graph.credentials_input_schema` based on usage. > - Validates credentials with `_validate_node_input_credentials` → returns `(errors, nodes_to_skip)`; plumbs `nodes_to_skip` through `validate_graph_with_credentials`, `_construct_starting_node_execution_input`, `validate_and_construct_node_execution_input`, and `add_graph_execution` into `GraphExecutionEntry`. > - Executor: dispatch loop skips nodes in `nodes_to_skip` (marks `COMPLETED`); `execute_node`/`on_node_execution` accept `nodes_to_skip`; `SmartDecisionMakerBlock.run` filters tool functions whose `_sink_node_id` is in `nodes_to_skip` and errors only if all tools are filtered. > - Models: `GraphExecutionEntry` gains `nodes_to_skip` field. Tests and snapshots updated accordingly. 
> > - Frontend > - Builder: credential field uses `custom/credential_field` with an "Optional – skip block if not configured" toggle; `nodeStore` persists `credentials_optional` and history; UI hides optional toggle in run dialogs. > - Run dialogs: compute required credentials from `credentials_input_schema.required`; allow selecting "None"; avoid auto-select for optional; filter out incomplete creds before execute. > - Minor schema/UI wiring updates (`uiSchema`, form context flags). > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit `5e01fd6a3e`. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: Nicholas Tindle <ntindle@users.noreply.github.com>	2026-01-09 14:11:35 +00:00
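The validation rule that returns `(credential_errors, nodes_to_skip)` can be sketched as follows. The `credentials_optional` metadata flag and the tuple shape come from the PR text; the simplified `Node` class, its `needs_credentials`/`has_credentials` fields, and the function name are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Node:
    """Simplified stand-in for the real Node model."""

    id: str
    needs_credentials: bool
    has_credentials: bool
    metadata: dict = field(default_factory=dict)

    @property
    def credentials_optional(self) -> bool:
        # The PR states this flag is read from node metadata.
        return bool(self.metadata.get("credentials_optional", False))


def validate_node_credentials(nodes: List[Node]) -> Tuple[Dict[str, str], Set[str]]:
    """Return (credential_errors, nodes_to_skip) for a list of nodes."""
    errors: Dict[str, str] = {}
    nodes_to_skip: Set[str] = set()
    for node in nodes:
        if not node.needs_credentials or node.has_credentials:
            continue  # nothing missing for this node
        if node.credentials_optional:
            nodes_to_skip.add(node.id)  # skipped at run time, not an error
        else:
            errors[node.id] = "credentials are required"
    return errors, nodes_to_skip
```

In the multi-channel notification use case from the PR, an unconfigured Slack node marked optional lands in `nodes_to_skip`, while an unconfigured required node still produces an error.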
Zamil Majdy	7b951c977e	feat(platform): implement graph-level Safe Mode toggle for HITL blocks (#11455 ) ## Summary This PR implements a graph-level Safe Mode toggle system for Human-in-the-Loop (HITL) blocks. When Safe Mode is ON (default), HITL blocks require manual review before proceeding. When OFF, they execute automatically. ## 🔧 Backend Changes - Database: Added `metadata` JSON column to `AgentGraph` table with migration - API: Updated `execute_graph` endpoint to accept `safe_mode` parameter - Execution: Enhanced execution context to use graph metadata as default with API override capability - Auto-detection: Automatically populate `has_human_in_the_loop` for graphs containing HITL blocks - Block Detection: HITL block ID: `8b2a7b3c-6e9d-4a5f-8c1b-2e3f4a5b6c7d` ## 🎨 Frontend Changes - Component: New `FloatingSafeModeToggle` with dual variants: - White variant: For library pages, integrates with action buttons - Black variant: For builders, floating positioned - Integration: Added toggles to both new/legacy builders and library pages - API Integration: Direct graph metadata updates via `usePutV1UpdateGraphVersion` - Query Management: React Query cache invalidation for consistent UI updates - Conditional Display: Toggle only appears when graph contains HITL blocks ## 🛠 Technical Implementation - Safe Mode ON (default): HITL blocks require manual review before proceeding - Safe Mode OFF: HITL blocks execute automatically without intervention - Priority: Backend API `safe_mode` parameter takes precedence over graph metadata - Detection: Auto-populates `has_human_in_the_loop` metadata field - Positioning: Proper z-index and responsive positioning for floating elements ## 🚧 Known Issues (Work in Progress) ### High Priority - [ ] Toggle state persistence: Always shows "ON" regardless of actual state - query invalidation issue - [ ] LibraryAgent metadata: Missing metadata field causing TypeScript errors - [ ] Tooltip z-index: Still covered by some UI elements despite 
high z-index ### Medium Priority - [ ] HITL detection: Logic needs improvement for reliable block detection - [ ] Error handling: Removing HITL blocks from graph causes save errors - [ ] TypeScript: Fix type mismatches between GraphModel and LibraryAgent ### Low Priority - [ ] Frontend API: Add `safe_mode` parameter to execution calls once OpenAPI is regenerated - [ ] Performance: Consider debouncing rapid toggle clicks ## 🧪 Test Plan - [ ] Verify toggle appears only when graph has HITL blocks - [ ] Test toggle persistence across page refreshes - [ ] Confirm API calls update graph metadata correctly - [ ] Validate execution behavior respects safe mode setting - [ ] Check styling consistency across builder and library contexts ## 🔗 Related - Addresses requirements for graph-level HITL configuration - Builds on existing FloatingReviewsPanel infrastructure - Integrates with existing graph metadata system 🤖 Generated with [Claude Code](https://claude.ai/code)	2025-12-02 09:55:55 +00:00
Zamil Majdy	3d08c22dd5	feat(platform): add Human In The Loop block with review workflow (#11380 ) ## Summary This PR implements a comprehensive Human In The Loop (HITL) block that allows agents to pause execution and wait for human approval/modification of data before continuing. https://github.com/user-attachments/assets/c027d731-17d3-494c-85ca-97c3bf33329c ## Key Features - Added WAITING_FOR_REVIEW status to AgentExecutionStatus enum - Created PendingHumanReview database table for storing review requests - Implemented HumanInTheLoopBlock that extracts input data and creates review entries - Added API endpoints at /api/executions/review for fetching and reviewing pending data - Updated execution manager to properly handle waiting status and resume after approval ## Frontend Components - PendingReviewCard for individual review handling - PendingReviewsList for multiple reviews - FloatingReviewsPanel for graph builder integration - Integrated review UI into 3 locations: legacy library, new library, and graph builder ## Technical Implementation - Added proper type safety throughout with SafeJson handling - Optimized database queries using count functions instead of full data fetching - Fixed imports to be top-level instead of local - All formatters and linters pass ## Test plan - [ ] Test Human In The Loop block creation in graph builder - [ ] Test block execution pauses and creates pending review - [ ] Test review UI appears in all 3 locations - [ ] Test data modification and approval workflow - [ ] Test rejection workflow - [ ] Test execution resumes after approval 🤖 Generated with [Claude Code](https://claude.ai/code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * New Features * Added Human-In-The-Loop review workflows to pause executions for human validation. * Users can approve or reject pending tasks, optionally editing submitted data and adding a message. 
* New "Waiting for Review" execution status with UI indicators across run lists, badges, and activity views. * Review management UI: pending review cards, list view, and a floating reviews panel for quick access. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude <noreply@anthropic.com>	2025-11-27 12:07:46 +07:00
Swifty	a66219fc1f	fix(platform): Remove un-runnable agents from schedule (#11374) Currently, when an agent fails validation during a scheduled run, we raise an error and then try again, regardless of why. This change removes the agent's schedule and notifies the user. ### Changes 🏗️ - add schedule_id to the GraphExecutionJobArgs - add agent_name to the GraphExecutionJobArgs - Delete schedule on GraphValidationError - Notify the user with a message that includes the agent name ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] I have ensured the scheduler tests work with these changes	2025-11-17 15:24:40 +00:00
Reinier van der Leer	d68dceb9c1	fix(backend/executor): Improve graph execution permission check (#11323 ) - Resolves #11316 - Durable fix to replace #11318 ### Changes 🏗️ - Expand graph execution permissions check - Don't require library membership for execution as sub-graph ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Can run sub-agent with non-latest graph version - [x] Can run sub-agent that is available in Marketplace but not added to Library	2025-11-05 17:13:41 +00:00
Zamil Majdy	5506d59da1	fix(backend/executor): make graph execution permission check version-agnostic (#11283 ) ## Summary Fix critical issue where pre-execution permission validation broke execution of graphs that reference older versions of sub-graphs. ## Problem The `validate_graph_execution_permissions` function was checking for the specific version of a graph in the user's library. This caused failures when: 1. A parent graph references an older version of a sub-graph 2. The user updates the sub-graph to a newer version 3. The older version is no longer in their library 4. Execution of the parent graph fails with `GraphNotInLibraryError` ## Root Cause In `backend/executor/utils.py` line 523, the function was checking for the exact version, but sub-graphs legitimately reference older versions that may no longer be in the library. ## Solution ### 1. Remove Version-Specific Check (backend/executor/utils.py) - Remove `graph_version=graph.version` parameter from validation call - Add explanatory comment about version-agnostic behavior - Now only checks that the graph ID exists in user's library (any version) ### 2. Enhance Documentation (backend/data/graph.py) - Update function docstring to explain version-agnostic behavior - Document that `None` (now default) allows execution of any version - Clarify this is important for sub-graph version compatibility ## Technical Details The `validate_graph_execution_permissions` function was already designed to handle version-agnostic checks when `graph_version=None`. 
By omitting the version parameter, we skip the version check and only verify: - Graph exists in user's library - Graph is not deleted/archived - User has execution permissions ## Impact - ✅ Parent graphs can execute even when they reference older sub-graph versions - ✅ Sub-graph updates don't break existing parent graphs - ✅ Maintains security: still checks library membership and permissions - ✅ No breaking changes: version-specific validation still available when needed ## Example Scenario Fixed 1. User creates parent graph that uses sub-graph v1 2. User updates sub-graph to v2 (v1 removed from library) 3. Parent graph still references sub-graph v1 4. Before: Execution fails with `GraphNotInLibraryError` 5. After: Execution succeeds (version-agnostic permission check) ## Testing - [x] Code formatting and linting passes - [x] Type checking passes - [x] No breaking changes to existing functionality - [x] Security still maintained through library membership checks ## Files Changed - `backend/executor/utils.py`: Remove version-specific permission check - `backend/data/graph.py`: Enhanced documentation for version-agnostic behavior Closes #[issue-number-if-applicable] Co-authored-by: Claude <noreply@anthropic.com>	2025-10-29 14:13:23 +00:00
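The difference between the version-specific and version-agnostic check can be illustrated with a small membership test. This is only a sketch: the library is modeled as a set of `(graph_id, version)` pairs and `is_graph_in_library` is a hypothetical helper, not the signature of `validate_graph_execution_permissions`.

```python
from typing import Optional, Set, Tuple


def is_graph_in_library(
    library: Set[Tuple[str, int]],
    graph_id: str,
    graph_version: Optional[int] = None,
) -> bool:
    """Check library membership; graph_version=None accepts any version."""
    if graph_version is None:
        return any(gid == graph_id for gid, _ in library)
    return (graph_id, graph_version) in library


# After the user updates the sub-graph, only v2 remains in the library.
library = {("sub-graph", 2)}
```

A parent that still references `sub-graph` v1 fails the version-specific check but passes the version-agnostic one, which matches the example scenario in the PR.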
Zamil Majdy	4922f88851	feat(backend/executor): Implement cascading stop for nested graph executions (#11277 ) ## Summary Fixes critical issue where child executions spawned by `AgentExecutorBlock` continue running after parent execution is stopped. Implements parent-child execution tracking and recursive cascading stop logic to ensure entire execution trees are terminated together. ## Background When a parent graph execution containing `AgentExecutorBlock` nodes is stopped, only the parent was terminated. Child executions continued running, leading to: - ❌ Orphaned child executions consuming credits - ❌ No user control over execution trees - ❌ Race conditions where children start after parent stops - ❌ Resource leaks from abandoned executions ## Core Changes ### 1. Database Schema (`schema.prisma` + migration) ```sql -- Add nullable parent tracking field ALTER TABLE "AgentGraphExecution" ADD COLUMN "parentGraphExecutionId" TEXT; -- Add self-referential foreign key with graceful deletion ALTER TABLE "AgentGraphExecution" ADD CONSTRAINT "AgentGraphExecution_parentGraphExecutionId_fkey" FOREIGN KEY ("parentGraphExecutionId") REFERENCES "AgentGraphExecution"("id") ON DELETE SET NULL ON UPDATE CASCADE; -- Add index for efficient child queries CREATE INDEX "AgentGraphExecution_parentGraphExecutionId_idx" ON "AgentGraphExecution"("parentGraphExecutionId"); ``` ### 2. Parent ID Propagation (`backend/blocks/agent.py`) ```python # Extract current graph execution ID and pass as parent to child execution = add_graph_execution( # ... other params parent_graph_exec_id=graph_exec_id, # NEW: Track parent relationship ) ``` ### 3. 
Data Layer (`backend/data/execution.py`) ```python async def get_child_graph_executions(parent_exec_id: str) -> list[GraphExecution]: """Get all child executions of a parent execution.""" children = await AgentGraphExecution.prisma().find_many( where={"parentGraphExecutionId": parent_exec_id, "isDeleted": False} ) return [GraphExecution.from_db(child) for child in children] ``` ### 4. Cascading Stop Logic (`backend/executor/utils.py`) ```python async def stop_graph_execution( user_id: str, graph_exec_id: str, wait_timeout: float = 15.0, cascade: bool = True, # NEW parameter ): # 1. Find all child executions if cascade: children = await _get_child_executions(graph_exec_id) # 2. Stop all children recursively in parallel if children: await asyncio.gather( *[stop_graph_execution(user_id, child.id, wait_timeout, True) for child in children], return_exceptions=True, # Don't fail parent if child fails ) # 3. Stop the parent execution # ... existing stop logic ``` ### 5. Race Condition Prevention (`backend/executor/manager.py`) ```python # Before executing queued child, check if parent was terminated if parent_graph_exec_id: parent_exec = get_db_client().get_graph_execution_meta(parent_graph_exec_id, user_id) if parent_exec and parent_exec.status == ExecutionStatus.TERMINATED: # Skip execution, mark child as terminated get_db_client().update_graph_execution_stats( graph_exec_id=graph_exec_id, status=ExecutionStatus.TERMINATED, ) return # Don't start orphaned child ``` ## How It Works ### Before (Broken) ``` User stops parent execution ↓ Parent terminates ✓ ↓ Child executions keep running ✗ ↓ User cannot stop children ✗ ``` ### After (Fixed) ``` User stops parent execution ↓ Query database for all children ↓ Recursively stop all children in parallel ↓ Wait for children to terminate ↓ Stop parent execution ↓ All executions in tree stopped ✓ ``` ### Race Prevention ``` Child in QUEUED status ↓ Parent stopped ↓ Child picked up by executor ↓ Pre-flight check: parent TERMINATED? 
↓ Yes → Skip execution, mark child TERMINATED ↓ Child never runs ✓ ``` ## Edge Cases Handled ✅ Deep nesting - Recursive cascading handles multi-level trees ✅ Queued children - Pre-flight check prevents execution ✅ Race conditions - Child spawned during stop operation ✅ Partial failures - `return_exceptions=True` continues on error ✅ Multiple children - Parallel stop via `asyncio.gather()` ✅ No parent - Backward compatible (nullable field) ✅ Already completed - Existing status check handles it ## Performance Impact - Stop operation: O(depth) with parallel execution vs O(1) before - Memory: +36 bytes per execution (one UUID reference) - Database: +1 query per tree level, indexed for efficiency ## API Changes (Backward Compatible) ### `stop_graph_execution()` - New Optional Parameter ```python # Before async def stop_graph_execution(user_id: str, graph_exec_id: str, wait_timeout: float = 15.0) # After async def stop_graph_execution(user_id: str, graph_exec_id: str, wait_timeout: float = 15.0, cascade: bool = True) ``` Default `cascade=True` means existing callers get the new behavior automatically. 
### `add_graph_execution()` - New Optional Parameter ```python async def add_graph_execution(..., parent_graph_exec_id: Optional[str] = None) ``` ## Security & Safety - ✅ User verification - Users can only stop their own executions (parent + children) - ✅ No cycles - Self-referential FK prevents infinite loops - ✅ Graceful degradation - Errors in child stops don't block parent stop - ✅ Rate limits - Existing execution rate limits still apply ## Testing Checklist ### Database Migration - [x] Migration runs successfully - [x] Prisma client regenerates without errors - [x] Existing tests pass ### Core Functionality - [ ] Manual test: Stop parent with running child → child stops - [ ] Manual test: Stop parent with queued child → child never starts - [ ] Unit test: Cascading stop with multiple children - [ ] Unit test: Deep nesting (3+ levels) - [ ] Integration test: Race condition prevention ## Breaking Changes None - All changes are backward compatible with existing code. ## Rollback Plan If issues arise: 1. Code rollback: Revert PR, redeploy 2. Database rollback: Drop column and constraints (non-destructive) --- Note: This branch contains additional unrelated changes from merging with `dev`. The core cascading stop feature involves only: - `schema.prisma` + migration - `backend/data/execution.py` - `backend/executor/utils.py` - `backend/blocks/agent.py` - `backend/executor/manager.py` All other file changes are from dev branch updates and not part of this feature. 
🤖 Generated with [Claude Code](https://claude.ai/code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * New Features * Nested graph executions: parent-child tracking and retrieval of child executions * Improvements * Cascading stop: stopping a parent optionally terminates child executions * Parent execution IDs propagated through runs and surfaced in logs * Per-user/graph concurrent execution limits enforced * Bug Fixes * Skip enqueuing children if parent is terminated; robust handling when parent-status checks fail * Tests * Updated tests to cover parent linkage in graph creation <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude <noreply@anthropic.com>	2025-10-29 11:11:22 +00:00
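The cascading stop and the pre-flight check described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the actual backend code: `Execution`, `EXECUTIONS`, `stop_execution`, and `should_run` are hypothetical names, and an in-memory dict stands in for the database.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Execution:
    """Hypothetical in-memory stand-in for a graph execution row."""
    id: str
    parent_id: Optional[str] = None
    status: str = "RUNNING"


# Hypothetical store; real code would query the database instead.
EXECUTIONS: dict[str, Execution] = {}


async def stop_execution(exec_id: str, cascade: bool = True) -> None:
    """Stop one execution and, when cascade is set, all of its descendants."""
    if cascade:
        children = [e.id for e in EXECUTIONS.values() if e.parent_id == exec_id]
        # Stop children in parallel. return_exceptions=True keeps one failing
        # child from aborting its siblings or blocking the parent stop.
        await asyncio.gather(
            *(stop_execution(child_id) for child_id in children),
            return_exceptions=True,
        )
    execution = EXECUTIONS[exec_id]
    # Already finished executions are left untouched.
    if execution.status in ("QUEUED", "RUNNING"):
        execution.status = "TERMINATED"


def should_run(execution: Execution) -> bool:
    """Pre-flight check: skip a child whose parent is already terminated."""
    if execution.parent_id is None:
        return True
    parent = EXECUTIONS.get(execution.parent_id)
    return parent is None or parent.status != "TERMINATED"
```

Because each level recurses into its own children, the depth of the recursion matches the depth of the execution tree, which is where the O(depth) cost mentioned above comes from.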
Zamil Majdy	de70ede54a	fix(backend): prevent execution of deleted agents and cleanup orphaned resources (#11243 ) ## Summary Fix critical bug where deleted agents continue running scheduled and triggered executions indefinitely, consuming credits without user control. ## Problem When agents are deleted from user libraries, their schedules and webhook triggers remain active, leading to: - ❌ Uncontrolled resource consumption - ❌ "Unknown agent" executions that charge credits - ❌ No way for users to stop orphaned executions - ❌ Accumulation of orphaned database records ## Solution ### 1. Prevention: Library Validation Before Execution - Add `is_graph_in_user_library()` function with efficient database queries - Validate graph accessibility before all executions in `validate_and_construct_node_execution_input()` - Use specific `GraphNotInLibraryError` for clear error handling ### 2. Cleanup: Remove Schedules & Webhooks on Deletion - Enhanced `delete_library_agent()` to clean up associated schedules and webhooks - Comprehensive cleanup functions for both scheduled and triggered executions - Proper database transaction handling ### 3. Error-Based Cleanup: Handle Existing Orphaned Resources - Catch `GraphNotInLibraryError` in scheduler and webhook handlers - Automatically clean up orphaned resources when execution fails - Graceful degradation without breaking existing workflows ### 4. 
Migration: Clean Up Historical Orphans - SQL migration to remove existing orphaned schedules and webhooks - Performance index for faster cleanup queries - Proper logging and error handling ## Key Changes ### Core Library Validation ```python # backend/data/graph.py - Single source of truth async def is_graph_in_user_library(graph_id: str, user_id: str, graph_version: Optional[int] = None) -> bool: where_clause = {"userId": user_id, "agentGraphId": graph_id, "isDeleted": False, "isArchived": False} if graph_version is not None: where_clause["agentGraphVersion"] = graph_version count = await LibraryAgent.prisma().count(where=where_clause) return count > 0 ``` ### Enhanced Agent Deletion ```python # backend/server/v2/library/db.py async def delete_library_agent(library_agent_id: str, user_id: str, soft_delete: bool = True) -> None: # ... existing deletion logic ... await _cleanup_schedules_for_graph(graph_id=graph_id, user_id=user_id) await _cleanup_webhooks_for_graph(graph_id=graph_id, user_id=user_id) ``` ### Execution Prevention ```python # backend/executor/utils.py if not await gdb.is_graph_in_user_library(graph_id=graph_id, user_id=user_id, graph_version=graph.version): raise GraphNotInLibraryError(f"Graph #{graph_id} is not accessible in your library") ``` ### Error-Based Cleanup ```python # backend/executor/scheduler.py & backend/server/integrations/router.py except GraphNotInLibraryError as e: logger.warning(f"Execution blocked for deleted/archived graph {graph_id}") await _cleanup_orphaned_resources_for_graph(graph_id, user_id) ``` ## Technical Implementation ### Database Efficiency - Use `count()` instead of `find_first()` for faster queries - Add performance index: `idx_library_agent_user_graph_active` - Follow existing `prisma.is_connected()` patterns ### Error Handling Hierarchy - `GraphNotInLibraryError`: Specific exception for deleted/archived graphs - `NotAuthorizedError`: Generic authorization errors (preserved for user ID mismatches) - Clear error 
messages for better debugging ### Code Organization - Single source of truth for library validation in `backend/data/graph.py` - Import from centralized location to avoid duplication - Top-level imports following codebase conventions ## Testing & Validation ### Functional Testing - ✅ Library validation prevents execution of deleted agents - ✅ Cleanup functions remove schedules and webhooks properly - ✅ Error-based cleanup handles orphaned resources gracefully - ✅ Migration removes existing orphaned records ### Integration Testing - ✅ All existing tests pass (including `test_store_listing_graph`) - ✅ No breaking changes to existing functionality - ✅ Proper error propagation and handling ### Performance Testing - ✅ Efficient database queries with proper indexing - ✅ Minimal overhead for normal execution flows - ✅ Cleanup operations don't impact performance ## Impact ### User Experience - 🎯 Immediate: Deleted agents stop running automatically - 🎯 Ongoing: No more unexpected credit charges from orphaned executions - 🎯 Cleanup: Historical orphaned resources are removed ### System Reliability - 🔒 Security: Users can only execute agents they have access to - 🧹 Cleanup: Automatic removal of orphaned database records - 📈 Performance: Efficient validation with minimal overhead ### Developer Experience - 🎯 Clear Errors: Specific exception types for better debugging - 🔧 Maintainable: Centralized library validation logic - 📚 Documented: Comprehensive error handling patterns ## Files Modified - `backend/data/graph.py` - Library validation function - `backend/server/v2/library/db.py` - Enhanced agent deletion with cleanup - `backend/executor/utils.py` - Execution validation and prevention - `backend/executor/scheduler.py` - Error-based cleanup for schedules - `backend/server/integrations/router.py` - Error-based cleanup for webhooks - `backend/util/exceptions.py` - Specific error type for deleted graphs - 
`migrations/20251023000000_cleanup_orphaned_schedules_and_webhooks/migration.sql` - Historical cleanup ## Breaking Changes None. All changes are backward compatible and preserve existing functionality. ## Follow-up Tasks - [ ] Monitor cleanup effectiveness in production - [ ] Consider adding metrics for orphaned resource detection - [ ] Potential optimization of cleanup batch operations 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>	2025-10-28 23:48:35 +00:00
Reinier van der Leer	04df981115	fix(backend): Fix structured logging for cloud environments (#11227 ) - Resolves #11226 ### Changes 🏗️ - Drop use of `CloudLoggingHandler` which docs state isn't for use in GKE - For cloud logging, output only structured log entries to `stdout` ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Test deploy to dev and check logs	2025-10-21 12:48:41 +00:00
Zamil Majdy	11d55f6055	fix(backend/executor): Avoid running direct query in executor (#11224 ) ## Summary - Fixes database connection warnings in executor logs: "Client is not connected to the query engine, you must call `connect()` before attempting to query data" - Implements resilient database client pattern already used elsewhere in the codebase - Adds caching to reduce database load for user context lookups ## Changes - Updated `get_user_context()` to check `prisma.is_connected()` and fall back to database manager client - Added `@cached(maxsize=1000, ttl_seconds=3600)` decorator for performance optimization - Updated database manager to expose `get_user_by_id` method ## Test plan - [x] Verify executor pods no longer show Prisma connection warnings - [x] Confirm user timezone is still correctly retrieved - [x] Test fallback behavior when Prisma is disconnected 🤖 Generated with [Claude Code](https://claude.ai/code) Co-authored-by: Claude <noreply@anthropic.com>	2025-10-21 08:46:40 +00:00
Zamil Majdy	4e1557e498	fix(backend): Add dynamic input pin support for Smart Decision Maker Block (#11082 ) ## Summary - Centralize dynamic field delimiters and helpers in backend/data/dynamic_fields.py. - Refactor SmartDecisionMaker: build function signatures with dynamic-field mapping and re-map tool outputs back to original dynamic names. - Deterministic retry loop with retry-only feedback to avoid polluting final conversation history. - Update executor/utils.py and data/graph.py to use centralized utilities. - Update and extend tests: dynamic-field E2E flow, mapping verification, output yielding, and retry validation; switch mocked llm_call to AsyncMock; align tool-name expectations. - Add a single-tool fallback in schema lookup to support mocked scenarios. ## Validation - Full backend test suite: 1125 passed, 88 skipped, 53 warnings (local). - Backend lint/format pass. ## Scope - Minimal and localized to SmartDecisionMaker and dynamic-field utilities; unrelated pyright warnings remain unchanged. ## Risks/Mitigations - Behavior is backward-compatible; dynamic-field constants are centralized and reused. - Output re-mapping only affects SmartDecisionMaker tool outputs and matches existing link naming conventions. ## Checklist - [x] Formatted and linted - [x] All updated tests pass locally - [x] No secrets introduced --------- Co-authored-by: Claude <noreply@anthropic.com>	2025-10-04 14:23:13 +00:00
Zamil Majdy	27fccdbf31	fix(backend/executor): Make graph execution status transitions atomic and enforce state machine (#10863 ) ## Summary - Fixed race condition issues in `update_graph_execution_stats` function - Implemented atomic status transitions using database-level constraints - Added state machine enforcement to prevent invalid status transitions - Eliminated code duplication and improved error handling ## Problem The `update_graph_execution_stats` function had race condition vulnerabilities where concurrent status updates could cause invalid transitions like RUNNING → QUEUED. The function was not durable and could result in executions moving backwards in their lifecycle, causing confusion and potential system inconsistencies. ## Root Cause Analysis 1. Race Conditions: The function used a broad OR clause that allowed updates from multiple source statuses without validating the specific transition 2. No Atomicity: No atomic check to ensure the status hadn't changed between read and write operations 3. Missing State Machine: No enforcement of valid state transitions according to execution lifecycle rules ## Solution Implementation ### 1. Atomic Status Transitions - Use database-level atomicity by including the current allowed source statuses in the WHERE clause during updates - This ensures only valid transitions can occur at the database level ### 2. State Machine Enforcement Define valid transitions as a module constant `VALID_STATUS_TRANSITIONS`: - `INCOMPLETE` → `QUEUED`, `RUNNING`, `FAILED`, `TERMINATED` - `QUEUED` → `RUNNING`, `FAILED`, `TERMINATED` - `RUNNING` → `COMPLETED`, `TERMINATED`, `FAILED` - `TERMINATED` → `RUNNING` (for resuming halted execution) - `COMPLETED` and `FAILED` are terminal states with no allowed transitions ### 3. 
Improved Error Handling - Early validation with clear error messages for invalid parameters - Graceful handling when transitions fail - return current state instead of None - Proper logging of invalid transition attempts ### 4. Code Quality Improvements - Eliminated code duplication in fetch logic - Added proper type hints and casting - Made status transitions constant for better maintainability ## Benefits ✅ Prevents Invalid Regressions: No more RUNNING → QUEUED transitions ✅ Atomic Operations: Database-level consistency guarantees ✅ Clear Error Messages: Better debugging and monitoring ✅ Maintainable Code: Clean logic flow without duplication ✅ Race Condition Safe: Handles concurrent updates gracefully ## Test Plan - [x] Function imports and basic structure validation - [x] Code formatting and linting checks pass - [x] Type checking passes for modified files - [x] Pre-commit hooks validation ## Technical Details The key insight is using the database query itself to enforce valid transitions by filtering on allowed source statuses in the WHERE clause. This makes the operation truly atomic and eliminates the race condition window that existed in the previous implementation. 🤖 Generated with [Claude Code](https://claude.ai/code) --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>	2025-09-14 23:31:02 +00:00
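The "filter on allowed source statuses in the WHERE clause" idea from the commit above can be illustrated with a small sketch. The transition table mirrors the one listed in the description; everything else is an assumption made for illustration: the `executions` table, the `allowed_sources` and `transition` helpers, and the use of `sqlite3` in place of the project's Prisma client.

```python
import sqlite3

# Source status -> statuses it may move to (as listed in the description).
VALID_STATUS_TRANSITIONS: dict[str, set[str]] = {
    "INCOMPLETE": {"QUEUED", "RUNNING", "FAILED", "TERMINATED"},
    "QUEUED": {"RUNNING", "FAILED", "TERMINATED"},
    "RUNNING": {"COMPLETED", "TERMINATED", "FAILED"},
    "TERMINATED": {"RUNNING"},  # resuming a halted execution
    "COMPLETED": set(),  # terminal
    "FAILED": set(),  # terminal
}


def allowed_sources(target: str) -> list[str]:
    """Return every status from which `target` may legally be reached."""
    return sorted(
        src for src, targets in VALID_STATUS_TRANSITIONS.items() if target in targets
    )


def transition(conn: sqlite3.Connection, exec_id: str, target: str) -> str:
    """Attempt a status change and return the status stored afterwards.

    The check and the write are a single UPDATE statement, so a concurrent
    writer cannot slip in between reading the status and changing it.
    """
    sources = allowed_sources(target)
    if sources:
        placeholders = ", ".join("?" for _ in sources)
        conn.execute(
            f"UPDATE executions SET status = ? "
            f"WHERE id = ? AND status IN ({placeholders})",
            (target, exec_id, *sources),
        )
        conn.commit()
    row = conn.execute(
        "SELECT status FROM executions WHERE id = ?", (exec_id,)
    ).fetchone()
    if row is None:
        raise KeyError(exec_id)
    # On a rejected transition this is simply the unchanged current status.
    return row[0]
```

With this shape, an attempted RUNNING → QUEUED update matches zero rows and the caller gets the current state back rather than `None`.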
Krzysztof Czerwinski	cfc975d39b	feat(backend): Type for API block data response (#10763 ) Moving to auto-generated frontend types caused returned blocks data to no longer have proper typing. ### Changes 🏗️ - Add `BlockInfo` model and `get_info` function that returns it to the `Block` class, including costs - Move `BlockCost` and `BlockCostType` to `block.py` to prevent circular imports ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Endpoints using the new type work correctly Co-authored-by: Abhimanyu Yadav <122007096+Abhi1992002@users.noreply.github.com>	2025-09-06 04:21:48 +00:00
Reinier van der Leer	e16e69ca55	feat(library, executor): Make "Run Again" work with credentials (#10821 ) - Resolves [OPEN-2549: Make "Run again" work with credentials in `AgentRunDetailsView`](https://linear.app/autogpt/issue/OPEN-2549/make-run-again-work-with-credentials-in-agentrundetailsview) - Resolves #10237 ### Changes 🏗️ - feat(frontend/library): Make "Run Again" button work for runs with credentials - feat(backend/executor): Store passed-in credentials on `GraphExecution` ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - Go to `/library/agents/[id]` for an agent with credentials inputs - Run the agent manually - [x] -> runs successfully - [x] -> "Run again" shows among the action buttons on the newly created run - Click "Run again" - [x] -> runs successfully	2025-09-02 18:34:56 +00:00
Nicholas Tindle	2bb8e91040	feat(backend): Add user timezone support to backend (#10707 ) Co-authored-by: Swifty <craigswift13@gmail.com> resolve issue #10692 where scheduled time and actual run	2025-08-25 11:00:07 -05:00
Reinier van der Leer	ba65fee862	hotfix(backend/executor): Fix propagation of passed-in credentials to sub-agents (#10668 ) This should fix sub-agent execution issues with passed-in credentials after a crucial data path was removed in #10568. Additionally, some of the changes are to ensure the `credentials_input_schema` gets refreshed correctly when saving a new version of a graph in the builder. ### Changes 🏗️ - Include `graph_credentials_inputs` in `nodes_input_masks` passed into sub-agent execution - Fix credentials input schema in `update_graph` and `get_library_agent_by_graph_id` return - Improve error message on sub-graph validation failure ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Import agent with sub-agent(s) with required credentials inputs & run it -> should work	2025-08-18 16:42:28 +02:00
Zamil Majdy	a28b2cf04f	fix(backend/scheduler): Reconfigure scheduling setting & Add more logging on execution scheduling logic	2025-08-08 19:27:30 +07:00
Zamil Majdy	378d256b58	fix(backend): add graph validation before scheduling recurring jobs (#10568 ) ## Summary This PR addresses the recurring job validation failures by adding graph validation before scheduling jobs. Previously, validation errors only occurred at runtime during job execution, making it difficult to communicate errors to users for scheduled recurring jobs. ### Changes 🏗️ - Extract validation logic: Created `validate_and_construct_node_execution_input` wrapper function that centralizes graph fetching, credential mapping, and validation logic - Add pre-scheduling validation: Modified `add_graph_execution_schedule` to validate graphs before creating scheduled jobs - Make construct function private: Renamed `construct_node_execution_input` to `_construct_node_execution_input` to prevent direct usage and encourage use of the wrapper - Reduce code duplication: Eliminated duplicate validation logic between scheduler and execution paths - Improve scheduler lifecycle management: - Enhanced cleanup process with proper event loop shutdown sequence - Added graceful event loop thread termination with timeout - Fixed thread lifecycle management to prevent resource leaks - Add helper utilities: - Created `run_async` helper to reduce `asyncio.run_coroutine_threadsafe` boilerplate - Added `SCHEDULER_OPERATION_TIMEOUT_SECONDS` constant for consistent timeout handling across all scheduler operations ### Technical Details Validation Flow: The validation now happens in `add_graph_execution_schedule` before calling `scheduler.add_job()`, ensuring that: 1. Graph exists and is accessible to the user 2. All credentials are valid and available 3. Graph structure and node configurations are valid 4. Starting nodes are present and properly configured This uses the same validation logic as runtime execution, guaranteeing consistency. 
Scheduler Lifecycle Improvements: - Proper cleanup sequence: Event loop is stopped before thread termination - Thread management: Added global tracking of event loop thread for proper cleanup - Timeout consistency: All scheduler operations now use the same 300-second timeout - Resource management: Prevents potential memory leaks from unclosed event loops Code Quality Improvements: - DRY principle: `run_async` helper eliminates repeated `asyncio.run_coroutine_threadsafe` patterns - Single source of truth: All timeout values use `SCHEDULER_OPERATION_TIMEOUT_SECONDS` constant - Cleaner abstractions: Direct utility function calls instead of unnecessary wrapper methods ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Verified imports work correctly for both scheduler and utils modules - [x] Confirmed code passes all linting and type checking - [x] Validated that existing functionality remains intact - [x] Tested that validation logic is properly extracted and reused - [x] Verified scheduler cleanup process works correctly - [x] Confirmed thread lifecycle management improvements #### For configuration changes: - [x] `.env.example` is updated or already compatible with my changes - [x] `docker-compose.yml` is updated or already compatible with my changes - [x] I have included a list of my configuration changes in the PR description (under Changes) Note: No configuration changes were required for this fix. 
## Impact - Prevents runtime failures: Invalid graphs are caught before scheduling instead of failing silently during execution - Better error communication: Validation errors surface immediately when scheduling - Improved resource management: Proper event loop and thread cleanup prevents memory leaks - Enhanced maintainability: Single source of truth for validation logic and consistent timeout handling - Reduced code duplication: Eliminated ~30+ lines of duplicate code across validation and async execution patterns - Better developer experience: Cleaner code with helper functions and consistent patterns Resolves the TODO comment: "We need to communicate this error to the user somehow" in scheduler.py:107 Co-authored-by: Claude <noreply@anthropic.com>	2025-08-08 05:40:20 +00:00
Zamil Majdy	3fe88b6106	refactor(backend): Refactor log client and resource cleanup (#10558 ) ## Summary - Created centralized service client helpers with thread caching in `util/clients.py` - Refactored service client management to eliminate health checks and improve performance - Enhanced logging in process cleanup to include error details - Improved retry mechanisms and resource cleanup across the platform - Updated multiple services to use new centralized client patterns ## Key Changes ### New Centralized Client Factory (`util/clients.py`) - Added thread-cached factory functions for all major service clients: - Database managers (sync and async) - Scheduler client - Notification manager - Execution event bus (Redis-based) - RabbitMQ execution queue (sync and async) - Integration credentials store - All clients use `@thread_cached` decorator for performance optimization ### Service Client Improvements - Removed health checks: Eliminated unnecessary health check calls from `get_service_client()` to reduce startup overhead - Enhanced retry support: Database manager clients now use request retry by default - Better error handling: Improved error propagation and logging ### Enhanced Logging and Cleanup - Process termination logs: Added error details to termination messages in `util/process.py` - Retry mechanism updates: Improved retry logic with better error handling in `util/retry.py` - Resource cleanup: Better resource management across executors and monitoring services ### Updated Service Usage - Refactored 21+ files to use new centralized client patterns - Updated all executor, monitoring, and notification services - Maintained backward compatibility while improving performance ## Files Changed - Created: `backend/util/clients.py` - Centralized client factory with thread caching - Modified: 21 files across blocks, executor, monitoring, and utility modules - Key areas: Service client initialization, resource cleanup, retry mechanisms ## Test Plan - [x] Verify all 
existing tests pass - [x] Validate service startup and client initialization - [x] Test resource cleanup on process termination - [x] Confirm retry mechanisms work correctly - [x] Validate thread caching performance improvements - [x] Ensure no breaking changes to existing functionality ## Breaking Changes None - all changes maintain backward compatibility. ## Additional Notes This refactoring centralizes client management patterns that were scattered across the codebase, making them more consistent and performant through thread caching. The removal of health checks reduces startup time while maintaining reliability through improved retry mechanisms. 🤖 Generated with [Claude Code](https://claude.ai/code)	2025-08-06 13:53:01 +07:00
Reinier van der Leer	fa2d968458	fix(builder): Defer graph validation to backend (#10556 ) - Resolves #10553 ### Changes 🏗️ - Remove frontend graph validation in `useAgentGraph:saveAndRun(..)` - Remove now unused `ajv` dependency - Implement graph validation error propagation (backend->frontend) - Add `GraphValidationError` type in frontend and backend - Add `GraphModel.validate_graph_get_errors(..)` method - Fix error handling & propagation in frontend API request logic ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Saving & running a graph with missing required inputs gives a node-specific error - [x] Saving & running a graph with missing node credential inputs succeeds with passed-in credentials	2025-08-05 23:43:34 +00:00
Zamil Majdy	f9b255fb7a	feat(backend/executor): Avoid executor premature termination on inflight agent execution (#10552 ) There is no fully reliable way to retry an agent that has been terminated, so the safest way to avoid executing an agent incorrectly is to minimize the chance of an execution being terminated in the first place. The existing set of mechanisms that retry an agent on failure remains in place and has been improved; it serves as our best-effort reliability mechanism. ### Changes 🏗️ * Cap SIGINT & SIGTERM to be raised at most once, so the executor can handle the stop gracefully. * SIGINT & SIGTERM will stop the consumption of execution request messages, but not agent execution. * The executor process will only stop once all in-flight agent executions are completed or terminated. * Avoid retrying the agent stop command on AgentExecutorBlock on timeout. ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Run agent, send SIGTERM to the executor pod, execution should not be interrupted. - [x] Run agent, send SIGKILL to the executor pod, execution should be transferred to another pod.	2025-08-06 05:55:30 +07:00
Zamil Majdy	e5d3ebac08	feat(backend): Make Graph & Node Execution Stats Update Durable (#10529 ) Graph and node executions can fail for many reasons, and a failure sometimes disrupts stats tracking and produces inaccurate results. This PR aims to minimize such issues. ### Changes 🏗️ * Catch `BaseException` in the `time_measured` decorator so that `asyncio.CancelledError` is also caught. * Make sure node and graph stats updates are executed on cancellation and on exception. * Protect the graph execution stats update with the thread lock to avoid a race condition. ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Existing automated tests. --------- Co-authored-by: Claude <noreply@anthropic.com>	2025-08-04 21:33:52 +07:00
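Why `BaseException` matters for the commit above: since Python 3.8, `asyncio.CancelledError` derives from `BaseException`, so an `except Exception` clause never sees a cancellation. The sketch below shows one way to make sure a stats record is written on every exit path. It is an illustration only; `run_with_stats`, `recorded_stats`, `cancel_demo`, and the mapping of cancellation to a `TERMINATED` status are assumptions, not the project's actual `time_measured` decorator.

```python
import asyncio
import time

# Hypothetical in-memory stats sink; real code would persist these.
recorded_stats: list[dict] = []


async def run_with_stats(name, work):
    """Await work() and always record a final status, even on cancellation."""
    status = "COMPLETED"
    started = time.monotonic()
    try:
        return await work()
    except BaseException as exc:
        # CancelledError is a BaseException, so `except Exception` misses it.
        status = "TERMINATED" if isinstance(exc, asyncio.CancelledError) else "FAILED"
        raise
    finally:
        # Runs on success, failure, and cancellation alike.
        recorded_stats.append(
            {"name": name, "status": status, "walltime": time.monotonic() - started}
        )


async def cancel_demo() -> list[dict]:
    """Cancel a running task and return the stats recorded so far."""
    task = asyncio.create_task(run_with_stats("slow-node", lambda: asyncio.sleep(10)))
    await asyncio.sleep(0)  # give the task a chance to start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return recorded_stats
```

Re-raising after recording keeps cancellation semantics intact for the caller while still guaranteeing the stats update.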
Zamil Majdy	69d873debc	fix(backend): improve executor reliability and error handling (#10526 ) This PR improves the reliability of the executor system by addressing several race conditions and improving error handling throughout the execution pipeline. ### Changes 🏗️ - Consolidated exception handling: Now using `BaseException` to properly catch all types of interruptions including `CancelledError` and `SystemExit` - Atomic stats updates: Moved node execution stats updates to be atomic with graph stats updates to prevent race conditions - Improved cleanup handling: Added proper timeout handling (3600s) for stuck executions during cleanup - Fixed concurrent update race conditions: Node execution updates are now properly synchronized with graph execution updates - Better error propagation: Improved error type preservation and status management throughout the execution chain - Graph resumption support: Added proper handling for resuming terminated and failed graph executions - Removed deprecated methods: Removed `update_node_execution_stats` in favor of atomic updates ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Execute a graph with multiple nodes and verify stats are updated correctly - [x] Cancel a running graph execution and verify proper cleanup - [x] Simulate node failures and verify error propagation - [x] Test graph resumption after termination/failure - [x] Verify no race conditions in concurrent node execution updates #### For configuration changes: - [x] `.env.example` is updated or already compatible with my changes - [x] `docker-compose.yml` is updated or already compatible with my changes - [x] I have included a list of my configuration changes in the PR description (under Changes) 🤖 Generated with [Claude Code](https://claude.ai/code) --------- Co-authored-by: Claude <noreply@anthropic.com>	2025-08-02 
17:41:59 +07:00
Zamil Majdy	4d05a27388	feat(backend): Avoid executor over-consuming messages when it's fully occupied (#10449 ) When we run multiple instances of the executor, some of the executors can oversubscribe the messages and end up queuing the agent execution request instead of letting another executor handle the job. This change solves the problem. ### Changes 🏗️ * Reject execution request when the executor is full. * Improve `active_graph_runs` tracking for better horizontal scaling heuristics. ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: <!-- Put your test plan here: --> - [x] Manual graph execution & CI	2025-07-28 23:08:27 +00:00
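The "reject when full" behavior described in the commit above might look roughly like the sketch below. `ExecutorSlots`, `try_accept`, and `release` are hypothetical names used only for illustration; in a real consumer, a rejected request would be handed back to the message broker so another executor can pick it up.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutorSlots:
    """Hypothetical tracker for one executor's capacity."""
    pool_size: int
    active_graph_runs: set[str] = field(default_factory=set)

    def try_accept(self, graph_exec_id: str) -> bool:
        """Accept a request only if a slot is free; otherwise reject it."""
        if len(self.active_graph_runs) >= self.pool_size:
            # Full: reject instead of queuing locally, so the request can be
            # redelivered to an executor that has spare capacity.
            return False
        self.active_graph_runs.add(graph_exec_id)
        return True

    def release(self, graph_exec_id: str) -> None:
        """Free the slot once the execution has completed or terminated."""
        self.active_graph_runs.discard(graph_exec_id)
```

Keeping `active_graph_runs` accurate also gives an autoscaler a direct signal of how busy each executor is.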
Zamil Majdy	f4a179e5d6	feat(backend): Add thread safety to NodeExecutionProgress output handling (#10415 ) ## Summary - Add thread safety to NodeExecutionProgress class to prevent race conditions between graph executor and node executor threads - Fixes potential data corruption and lost outputs during concurrent access to shared output lists - Uses single global lock per node for minimal performance impact - Instead of blocking the node evaluation before adding another node evaluation, we move on to the next node, in case another node completes it. ## Changes - Added `threading.Lock` to NodeExecutionProgress class - Protected `add_output()` calls from node executor thread with lock - Protected `pop_output()` calls from graph executor thread with lock - Protected `_pop_done_task()` output checks with lock ## Problem Solved The `NodeExecutionProgress.output` dictionary was being accessed concurrently: - `add_output()` called from node executor thread (asyncio thread) - `pop_output()` called from graph executor thread (main thread) - Python lists are not thread-safe for concurrent append/pop operations - This could cause data corruption, index errors, and lost outputs ## Test Plan - [x] Existing executor tests pass - [x] No performance regression (operations are microsecond-level) - [x] Thread safety verified through code analysis ## Technical Details - Single `threading.Lock()` per NodeExecutionProgress instance (~64 bytes) - Lock acquisition time (~100-200ns) is minimal compared to list operations - Maintains order guarantees for same node_execution_id processing - No GIL contention issues as operations are very brief 🤖 Generated with [Claude Code](https://claude.ai/code) --------- Co-authored-by: Claude <noreply@anthropic.com>	2025-07-22 01:11:46 +00:00
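A minimal sketch of the locking pattern described in the commit above, assuming one lock per progress object that guards both the producer and the consumer side. `NodeProgress` is a simplified, hypothetical stand-in for `NodeExecutionProgress`, not the real class.

```python
import threading
from collections import defaultdict
from typing import Any, Optional


class NodeProgress:
    """Simplified stand-in: per-execution output lists guarded by one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._output: dict[str, list[Any]] = defaultdict(list)

    def add_output(self, node_exec_id: str, item: Any) -> None:
        """Called from the node executor thread."""
        with self._lock:
            self._output[node_exec_id].append(item)

    def pop_output(self, node_exec_id: str) -> Optional[Any]:
        """Called from the graph executor thread; None means nothing pending."""
        with self._lock:
            pending = self._output.get(node_exec_id)
            if not pending:
                return None
            # Pop from the front so outputs keep their original order.
            return pending.pop(0)
```

The critical sections are a single append or pop, so the lock is held only briefly and the consumer can move on to another node whenever `pop_output` returns `None`.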
Zamil Majdy	b6d6b865de	feat(backend): avoid using DatabaseManager when direct query is possible from the API layer (#10403 ) This PR reduces the dependency of the API layer on the database manager service by avoiding using DatabaseManager for credentials fetch when a direct query is possible from the API layer ### Changes 🏗️ * If Prisma is available, use the direct query. * Otherwise, utilize the database manager. ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: <!-- Put your test plan here: --> - [x] Run blocks with credentials like AiTextGeneratorBlock	2025-07-18 23:18:52 +00:00
Zamil Majdy	08b05621c1	feat(block;backend): Truncate execution update payload on large data & Improve ReadSpreadsheetBlock performance (#10395 ) ### Changes 🏗️ This PR introduces several key improvements to message handling, block functionality, and execution reliability: - Renamed CSV block to Spreadsheet block with enhanced CSV/Excel processing capabilities - Added message size limiting and truncation for Redis communication to prevent connection issues - Optimized FileReadBlock to yield content chunks instead of duplicated outputs for better performance - Improved execution termination handling with better timeout management and event publishing - Enhanced continuous retry decorator with async function support - Implemented payload truncation to prevent Redis connection issues from oversized messages ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Verified backend starts without errors - [x] Confirmed message truncation works for large payloads - [x] Tested spreadsheet block functionality with CSV and Excel files - [x] Validated execution termination improvements - [x] Checked FileReadBlock chunk processing #### For configuration changes: - [x] `.env.example` is updated or already compatible with my changes - [x] `docker-compose.yml` is updated or already compatible with my changes - [x] I have included a list of my configuration changes in the PR description (under Changes) 🤖 Generated with [Claude Code](https://claude.ai/code) --------- Co-authored-by: Claude <noreply@anthropic.com>	2025-07-17 16:04:33 +00:00
Zamil Majdy	4ffb99bfb0	feat(backend): Add block error rate monitoring and Discord alerts (#10332 ) ## Summary This PR adds a simple block error rate monitoring system that runs every 24 hours (configurable) and sends Discord alerts when blocks exceed the error rate threshold. ## Changes Made Modified Files: - `backend/executor/scheduler.py` - Added `report_block_error_rates` function and scheduled job - `backend/util/settings.py` - Added configuration options - `backend/.env.example` - Added environment variable examples - Refactor scheduled job logic in scheduler.py into separate files ## Configuration ```bash # Block Error Rate Monitoring BLOCK_ERROR_RATE_THRESHOLD=0.5 # 50% error rate threshold BLOCK_ERROR_RATE_CHECK_INTERVAL_SECS=86400 # 24 hours ``` ## How It Works 1. Scheduled Job: Runs every 24 hours (configurable via `BLOCK_ERROR_RATE_CHECK_INTERVAL_SECS`) 2. Error Rate Calculation: Queries last 24 hours of node executions and calculates error rates per block 3. Threshold Check: Alerts on blocks with ≥50% error rate (configurable via `BLOCK_ERROR_RATE_THRESHOLD`) 4. Discord Alert: Sends alert to Discord using existing `discord_system_alert` function 5. 
Manual Execution: Available via `execute_report_block_error_rates()` scheduler client method ## Alert Format ``` Block Error Rate Alert: 🚨 Block 'DeprecatedGPT3Block' has 75.0% error rate (75/100) in the last 24 hours 🚨 Block 'BrokenImageBlock' has 60.0% error rate (30/50) in the last 24 hours ``` ## Testing Can be tested manually via: ```python from backend.executor.scheduler import SchedulerClient client = SchedulerClient() result = client.execute_report_block_error_rates() ``` ## Implementation Notes - Follows the same pattern as `report_late_executions` function - Only checks blocks with ≥10 executions to avoid noise - Uses existing Discord notification infrastructure - Configurable threshold and check interval - Proper error handling and logging ## Test plan - [x] Verify configuration loads correctly - [x] Test error rate calculation with existing database - [x] Confirm Discord integration works - [x] Test manual execution via scheduler client - [x] Verify scheduled job runs correctly 🤖 Generated with [Claude Code](https://claude.ai/code) --------- Co-authored-by: Claude AI <claude@anthropic.com> Co-authored-by: Claude <noreply@anthropic.com>	2025-07-10 21:56:58 +00:00
Zamil Majdy	f1cc2afbda	feat(backend): improve stop graph execution reliability (#10293 ) ## Summary - Enhanced graph execution cancellation and cleanup mechanisms - Improved error handling and logging for graph execution lifecycle - Added timeout handling for graph termination with proper status updates - Exposed a new API for stopping graph based on only graph_id or user_id - Refactored logging metadata structure for better error tracking ## Key Changes ### Backend - Graph Execution Management: Enhanced `stop_graph_execution` with timeout-based waiting and proper status transitions - Execution Cleanup: Added proper cancellation waiting with timeout handling in executor manager - Logging Improvements: Centralized `LogMetadata` class and improved error logging consistency - API Enhancements: Added bulk graph execution stopping functionality - Error Handling: Better exception handling and status management for failed/cancelled executions ### Frontend - Status Safety: Added null safety checks for status chips to prevent runtime errors - Execution Control: Simplified stop execution request handling ## Test Plan - [x] Verify graph execution can be properly stopped and reaches terminal state - [x] Test timeout scenarios for stuck executions - [x] Validate proper cleanup of running node executions when graph is cancelled - [x] Check frontend status chips handle undefined statuses gracefully - [x] Test bulk execution stopping functionality 🤖 Generated with [Claude Code](https://claude.ai/code) --------- Co-authored-by: Claude <noreply@anthropic.com>	2025-07-02 21:21:26 +00:00
Reinier van der Leer	efa4b6d2a0	feat(platform/library): Triggered-agent support (#10167 ) This pull request adds support for setting up (webhook-)triggered agents in the Library. It contains changes throughout the entire stack to make everything work in the various phases of a triggered agent's lifecycle: setup, execution, updates, deletion. Setting up agents with webhook triggers was previously only possible in the Builder, limiting their use to the agent's creator only. To make it work in the Library, this change uses the previously introduced `AgentPreset` to store information on, instead of on the graph's nodes to which only a graph's creator has access. - Initial ticket: #10111 - Builds on #9786 ![screenshot of trigger setup screen in the library](https://github.com/user-attachments/assets/525b4e94-d799-4328-b5fa-f05d6a3a206a) ![screenshot of trigger edit screen in the library](https://github.com/user-attachments/assets/e67eb0bc-df70-4a75-bf95-1c31263ef0c9) ### Changes 🏗️ Frontend: - Amend the Library's `AgentRunDraftView` to handle creating and editing Presets - Add `hideIfSingleCredentialAvailable` parameter to `CredentialsInput` - Add multi-select support to `TypeBasedInput` - Add Presets section to `AgentRunsSelectorList` - Amend `AgentRunSummaryCard` for use for Presets - Add `AgentStatusChip` to display general agent status (for now: Active / Inactive / Error) - Add Preset loading logic and create/update/delete handlers logic to `AgentRunsPage` - Rename `IconClose` to `IconCross` API: - Add `LibraryAgent` properties `has_external_trigger`, `trigger_setup_info`, `credentials_input_schema` - Add `POST /library/agents/{library_agent_id}/setup_trigger` endpoint - Remove redundant parameters from `POST /library/presets/{preset_id}/execute` endpoint Backend: - Add `POST /library/agents/{library_agent_id}/setup_trigger` endpoint - Extract non-node-related logic from `on_node_activate` into `setup_webhook_for_block` - Add webhook-related logic to `update_preset` 
and `delete_preset` endpoints - Amend webhook infrastructure to work with AgentPresets - Add preset trigger support to webhook ingress endpoint - Amend executor stack to work with passed-in node input (`nodes_input_masks`, generalized from `node_credentials_input_map`) - Amend graph validation to work with passed-in node input - Add `AgentPreset`->`IntegrationWebhook` relation - Add `WebhookWithRelations` model - Change behavior of `BaseWebhooksManager.get_manual_webhook(..)` to avoid unnecessary changes of the webhook URL: ignore `events` to find matching webhook, and update `events` if necessary. - Fix & improve `AgentPreset` API, models, and DB logic - Add `isDeleted` filter to get/list queries - Add `user_id` attribute to `LibraryAgentPreset` model - Add separate `credentials` property to `LibraryAgentPreset` model - Fix `library_db.update_preset(..)` replacement of existing `InputPresets` - Make `library_db.update_preset(..)` more usage-friendly with separate parameters for updateable properties - Add `user_id` checks to various DB functions - Fix error handling in various endpoints - Fix cache race condition on `load_webhook_managers()` ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - Test existing functionality - [x] Auto-setup and -teardown of webhooks on save in the builder still works - [x] Running an agent normally from the Library still works - Test new functionality - [x] Setting up a trigger in the Library - [x] Updating a trigger in the Library - [x] Disabling and re-enabling a trigger in the Library - [x] Deleting a trigger in the Library - [x] Triggers set up in the Library result in a new run when the webhook receives a payload	2025-06-24 20:28:20 +00:00
Zamil Majdy	e701f41e66	feat(blocks): Enabling auto type conversion on block input schema mismatch for nested input (#10203 ) Since auto conversion is applied before nested input is merged in the block, auto conversion breaks for nested input. ### Changes 🏗️ * Enabling auto-type conversion on block input schema mismatch for nested input * Add batching feature for `CreateListBlock` * Increase default max_token size for LLM call ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: <!-- Put your test plan here: --> - [x] Run `AIStructuredResponseGeneratorBlock` with non-string prompt value (should be auto-converted).	2025-06-21 03:56:53 +07:00
Zamil Majdy	1e0a3d3c1b	feat(backend): Add request retry on block execution and RPC (#10183 ) Request on block execution can be throttled, and requests between services can sometimes break. The scope of this PR is to add an appropriate retry on those. ### Changes 🏗️ * Block request retry: Retry on throttled status code only (504, 429, etc). * RPC request retry: Retry connection issues (ConnectError, Timeout, etc). * Truncate logging on executor/utils. ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: <!-- Put your test plan here: --> - [x] Manual graph execution	2025-06-17 21:03:46 +00:00
Zamil Majdy	97e72cb485	feat(backend): Make execution engine async-first (#10138 ) This change introduces async execution for blocks and the execution engine. Parallelism will be achieved through single-process asynchronous execution instead of process concurrency. ### Changes 🏗️ * Support async execution for the graph executor * Removed process creation for node execution * Update all blocks to support async executions ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: <!-- Put your test plan here: --> - [x] Manual graph executions, tested many of the impacted blocks.	2025-06-17 09:38:24 +00:00
Zamil Majdy	c109b676b8	fix(block): Invalid block input error on falsy non-null value	2025-06-13 00:39:55 -07:00
Zamil Majdy	210d457ecd	feat(executor): Improve execution ordering to allow depth-first execution (#10142 ) Allowing depth-first execution will unlock lower processing latency and a better sense of progress. <img width="950" alt="image" src="https://github.com/user-attachments/assets/e2a0e11a-8bc5-4a65-a10d-b5b6c6383354" /> ### Changes 🏗️ * Prioritize adding a new execution over processing execution output * Make sure to enqueue each node once when processing output instead of draining a single node and moving on. ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: <!-- Put your test plan here: --> - [x] Run company follower count finder agent. --------- Co-authored-by: Swifty <craigswift13@gmail.com>	2025-06-10 12:41:31 +00:00
Zamil Majdy	2647417e9f	feat(executor;frontend): Move output processing step from node executor to graph executor & simplify input beads calculation (#10066 ) Goal: Allow parallel runs within a single node. Currently, we prevent this to avoid unexpected ordering of the execution. ### Changes 🏗️ #### Executor changes We decoupled the node execution output processing, which is responsible for deciding the next executions, from the node executor code. Currently, `execute_node` does two big things: * Runs the block’s execute(...) (which yields outputs). * Immediately enqueues the next nodes based on those outputs. This PR makes: * execute_node(node_exec) -> stream of (output_name, data). That purely runs the block and yields each output as soon as it’s available. * Move _enqueue_next_nodes into the graph executor. So the next execution is handled serially by the graph executor to avoid concurrency issues. #### UI changes The change on the executor also fixes the behavior of the execution update to the UI: we now report the execution output to the UI as soon as it is available, not when the node execution is fully completed. This, however, broke the bead calculation logic, which assumes that execution updates never overlap. So the change in this PR makes the bead calculation take overlapping / duplicated execution updates into account, and simplifies the overall calculation logic. ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: <!-- Put your test plan here: --> - [x] Execute this agent and observe its concurrency ordering <img width="1424" alt="image" src="https://github.com/user-attachments/assets/0fe8259f-9091-4ecc-b824-ce8e8819c2d2" />	2025-06-05 16:10:50 +00:00
Zamil Majdy	4b70e778d2	feat(backend): Add nested dynamic pin-name support (#10082 ) Suppose we have a pin with list[list[int]] type, and we want to directly insert a new value at the first index of the first list, e.g. list[0][0] = X, through a dynamic pin. This will be translated into list_$_0_$_0, which the system does not currently support. ### Changes 🏗️ Add support for nested dynamic pins for list, object, and dict. ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: <!-- Put your test plan here: --> - [x] lots of unit tests - [x] Tried inserting the value directly on the `value` nested field on Google Sheets Write block. <img width="371" alt="image" src="https://github.com/user-attachments/assets/0a5e7213-b0e0-4fce-9e89-b39f7a583582" />	2025-06-04 16:32:32 +00:00
Reinier van der Leer	0726a00fb7	fix(backend): Include sub-graphs in graph-level credentials support (#9862 ) The Library Agent credentials UX (#9789) currently doesn't work for sub-graphs. ### Changes 🏗️ - Include sub-graphs in generating `Graph.credentials_input_schema` - Propagate `node_credentials_input_map` into `AgentExecutionBlock` executions - Fix: also apply `node_credentials_input_map` in `_enqueue_next_nodes` ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - Import a graph with sub-graphs that need credentials - Run this agent from the Library - [x] -> Should work	2025-05-07 17:28:39 +00:00
Zamil Majdy	475c5a5cc3	fix(backend): Avoid executing any agent with zero balance (#9901 ) ### Changes 🏗️ * Avoid executing any agent with a zero balance. * Make node execution count global across agents for a single user. ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: <!-- Put your test plan here: --> - [x] Run agents by tweaking the `execution_cost_count_threshold` & `execution_cost_per_threshold` values.	2025-05-01 15:11:38 +00:00
Zamil Majdy	86d5cfe60b	feat(backend): Support flexible RPC client (#9842 ) Using sync code in an async route often introduces event-loop-blocking code that impacts stability. The current RPC system only provides a synchronous client to call the service endpoints. The scope of this PR is to provide an entirely decoupled signature between client and server, allowing the client to mix & match async & sync options in the client code without changing the async/sync nature of the server. ### Changes 🏗️ * Add support for flexible async/sync RPC client. * Migrate scheduler client to all-async client. ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Scheduler route test. - [x] Modified service_test.py - [x] Run normal agent executions	2025-05-01 04:38:06 +00:00
Nicholas Tindle	04c4340ee3	feat(frontend,backend): user spending admin dashboard (#9751 ) <!-- Clearly explain the need for these changes: --> We need a way to refund people who spend money on agents without making manual db actions ### Changes 🏗️ - Adds a bunch for refunding users - Adds reasons and admin id for actions - Add admin to db manager - Add UI for this for the admin panel - Clean up pagination controls <!-- Concisely describe all of the changes made in this pull request: --> ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: <!-- Put your test plan here: --> - [x] Test by importing dev db as baseline - [x] Add transactions on top for "refund", and make sure all existing transactions work --------- Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>	2025-04-29 17:39:25 +00:00
Zamil Majdy	c783f64b33	fix(backend): Handle add execution API request failure (#9838 ) There are cases where the publishing agent execution is failing, making the agent execution appear to be stuck in a queue, but the execution has never been in a queue in the first place. ### Changes 🏗️ On publishing failure, we set the graph & starting node execution status to FAILED and let the UI bubble up the error so the user can try again. ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Normal add execution flow	2025-04-18 18:35:43 +00:00
Reinier van der Leer	417d7732af	feat(platform/library): Add credentials UX on `/library/agents/[id]` (#9789 ) - Resolves #9771 - ... in a non-persistent way, so it won't work for webhook-triggered agents For webhooks: #9541 ### Changes 🏗️ Frontend: - Add credentials inputs in Library "New run" screen (based on `graph.credentials_input_schema`) - Refactor `CredentialsInput` and `useCredentials` to not rely on XYFlow context - Unsplit lists of saved credentials in `CredentialsProvider` state - Move logic that was being executed at component render to `useEffect` hooks in `CredentialsInput` Backend: - Implement logic to aggregate credentials input requirements to one per provider per graph - Add `BaseGraph.credentials_input_schema` (JSON schema) computed field Underlying added logic: - `BaseGraph._credentials_input_schema` - makes a `BlockSchema` from a graph's aggregated credentials inputs - `BaseGraph.aggregate_credentials_inputs()` - aggregates a graph's nodes' credentials inputs using `CredentialsFieldInfo.combine(..)` - `BlockSchema.get_credentials_fields_info() -> dict[str, CredentialsFieldInfo]` - `CredentialsFieldInfo` model (created from `_CredentialsFieldSchemaExtra`) - Implement logic to inject explicitly passed credentials into graph execution - Add `credentials_inputs` parameter to `execute_graph` endpoint - Add `graph_credentials_input` parameter to `.executor.utils.add_graph_execution(..)` - Implement `.executor.utils.make_node_credentials_input_map(..)` - Amend `.executor.utils.construct_node_execution_input` - Add `GraphExecutionEntry.node_credentials_input_map` attribute - Amend validation to allow injecting credentials - Amend `GraphModel._validate_graph(..)` - Amend `.executor.utils._validate_node_input_credentials` - Add `node_credentials_map` parameter to `ExecutionManager.add_execution(..)` - Amend execution validation to handle side-loaded credentials - Add `GraphExecutionEntry.node_execution_map` attribute - Add mechanism to inject passed 
credentials into node execution data - Add credentials injection mechanism to node execution queueing logic in `Executor._on_graph_execution(..)` - Replace boilerplate logic in `v1.execute_graph` endpoint with call to existing `.executor.utils.add_graph_execution(..)` - Replace calls to `.server.routers.v1.execute_graph` with `add_graph_execution` Also: - Address tech debt in `GraphModel._validate_graph(..)` - Fix type checking in `BaseGraph._generate_schema(..)` #### TODO - [ ] ~~Make "Run again" work with credentials in `AgentRunDetailsView`~~ - [ ] Prohibit saving a graph if it has nodes with missing discriminator value for discriminated credentials inputs ### Checklist 📋 #### For code changes: - [ ] I have clearly listed my changes in the PR description - [ ] I have made a test plan - [ ] I have tested my changes according to the test plan: <!-- Put your test plan here: --> - [ ] ...	2025-04-18 14:27:13 +00:00
Zamil Majdy	bb92226f5d	feat(backend): Remove RPC service from Agent Executor (#9804 ) Currently the execution task is not properly distributed between executors because we need to send the execution request to the execution server. The execution manager now accepts the execution request from the message queue. Thus, we can remove the synchronous RPC system from this service, let the system focus on executing the agent, and not spare any process for the HTTP API interface. This will also reduce the risk of the execution service being too busy to accept any add execution requests. ### Changes 🏗️ * Remove the RPC system in Agent Executor * Allow the cancellation of an execution that is still waiting in the queue (by preventing it from being executed). * Make a unified helper for adding an execution request to the system and move other execution-related helper functions into `executor/utils.py`. * Remove non-db connections (redis / rabbitmq) in Database Manager and let clients manage these themselves. ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Existing CI, some agent runs	2025-04-11 19:03:47 +00:00
Zamil Majdy	26984a7338	feat(backend): Add capability to charge based on block execution count (#9661 ) Blocks that are not defined in the block cost are pretty much free. The lack of cost control makes it hard to control their quota. The scope of this change is to provide a way to charge any execution based on the number of blocks being executed in real time. ### Changes 🏗️ * Add execution charge logic based on the number of blocks executed, controlled by these two configurations: * `execution_cost_count_threshold`: We will charge the execution based on the multiple of this number. * `execution_cost_per_threshold`: The amount we are charging on its threshold multiple. * Make the charging logic part of the graph execution logic (as opposed to node level) so it's being done serially and an insufficient fund error is guaranteed to stop the graph execution. * Moved cost calculation logic into backend/executor/util.py ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Execute graph with configured threshold & cost and test the balance being deducted on that. - [x] Existing cost calculation is still being done without any issue. - [x] Low balance stops the whole graph execution.	2025-03-24 07:26:33 +00:00
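
The count-based charging described in the last commit above (26984a7338) can be illustrated with a minimal sketch. This is not the code from the PR: `ChargeConfig`, `execution_usage_cost`, and `run_blocks` are hypothetical names, the default values are made up, and it assumes a charge is taken exactly when the running block count reaches a multiple of `execution_cost_count_threshold`. Only the two setting names come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class ChargeConfig:
    # Field names mirror the two settings named in the commit message;
    # the default values are placeholders, not the project's defaults.
    execution_cost_count_threshold: int = 100
    execution_cost_per_threshold: int = 1


def execution_usage_cost(execution_count: int, config: ChargeConfig) -> int:
    """Amount to charge when the `execution_count`-th block execution starts.

    Assumption: a charge applies only on exact multiples of the threshold.
    """
    threshold = config.execution_cost_count_threshold
    if threshold <= 0 or execution_count <= 0:
        return 0
    if execution_count % threshold != 0:
        return 0
    return config.execution_cost_per_threshold


def run_blocks(total_blocks: int, balance: int, config: ChargeConfig) -> tuple[int, int]:
    """Charge serially at graph level; stop once a charge exceeds the balance.

    Returns (blocks_executed, remaining_balance).
    """
    executed = 0
    for count in range(1, total_blocks + 1):
        cost = execution_usage_cost(count, config)
        if cost > balance:
            break  # insufficient funds: the whole graph execution stops here
        balance -= cost
        executed += 1
    return executed, balance
```

Because the loop is serial, a failed charge halts everything after it, which is the property the commit message attributes to charging at graph level rather than per node.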

49 Commits