Commit Graph

49 Commits

Author SHA1 Message Date
Nicholas Tindle
7668c17d9c feat(platform): add User Workspace for persistent CoPilot file storage (#11867)
Implements persistent User Workspace storage for CoPilot, enabling
blocks to save and retrieve files across sessions. Files are stored in
session-scoped virtual paths (`/sessions/{session_id}/`).

Fixes SECRT-1833

### Changes 🏗️

**Database & Storage:**
- Add `UserWorkspace` and `UserWorkspaceFile` Prisma models
- Implement `WorkspaceStorageBackend` abstraction (GCS for cloud, local
filesystem for self-hosted)
- Add `workspace_id` and `session_id` fields to `ExecutionContext`

**Backend API:**
- Add REST endpoints: `GET/POST /api/workspace/files`, `GET/DELETE
/api/workspace/files/{id}`, `GET /api/workspace/files/{id}/download`
- Add CoPilot tools: `list_workspace_files`, `read_workspace_file`,
`write_workspace_file`
- Integrate workspace storage into `store_media_file()` - returns
`workspace://file-id` references

**Block Updates:**
- Refactor all file-handling blocks to use unified `ExecutionContext`
parameter
- Update media-generating blocks to persist outputs to workspace
(AIImageGenerator, AIImageCustomizer, FluxKontext, TalkingHead, FAL
video, Bannerbear, etc.)

**Frontend:**
- Render `workspace://` image references in chat via proxy endpoint
- Add "AI cannot see this image" overlay indicator

**CoPilot Context Mapping:**
- Session = Agent (graph_id) = Run (graph_exec_id)
- Files scoped to `/sessions/{session_id}/`

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Create CoPilot session, generate image with AIImageGeneratorBlock
  - [ ] Verify image returns `workspace://file-id` (not base64)
  - [ ] Verify image renders in chat with visibility indicator
  - [ ] Verify workspace files persist across sessions
  - [ ] Test list/read/write workspace files via CoPilot tools
  - [ ] Test local storage backend for self-hosted deployments

#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)

🤖 Generated with [Claude Code](https://claude.ai/code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Introduces a new persistent file-storage surface area (DB tables,
storage backends, download API, and chat tools) and rewires
`store_media_file()`/block execution context across many blocks, so
regressions could impact file handling, access control, or storage
costs.
> 
> **Overview**
> Adds a **persistent per-user Workspace** (new
`UserWorkspace`/`UserWorkspaceFile` models plus `WorkspaceManager` +
`WorkspaceStorageBackend` with GCS/local implementations) and wires it
into the API via a new `/api/workspace/files/{file_id}/download` route
(including header-sanitized `Content-Disposition`) and shutdown
lifecycle hooks.
> 
> Extends `ExecutionContext` to carry execution identity +
`workspace_id`/`session_id`, updates executor tooling to clone
node-specific contexts, and updates `run_block` (CoPilot) to create a
session-scoped workspace and synthetic graph/run/node IDs.
> 
> Refactors `store_media_file()` to require `execution_context` +
`return_format` and to support `workspace://` references; migrates many
media/file-handling blocks and related tests to the new API and to
persist generated media as `workspace://...` (or fall back to data URIs
outside CoPilot), and adds CoPilot chat tools for
listing/reading/writing/deleting workspace files with safeguards against
context bloat.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
6abc70f793. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Reinier van der Leer <pwuts@agpt.co>
2026-01-29 05:49:47 +00:00
Zamil Majdy
fb58827c61 feat(backend;frontend): Implement node-specific auto-approval, safety popup, and race condition fixes (#11810)
## Summary

This PR implements comprehensive improvements to the human-in-the-loop
(HITL) review system, including safety features, architectural changes,
and bug fixes:

### Key Features
- **SECRT-1798: One-time safety popup** - Shows informational popup
before first run of AI-generated agents with sensitive actions/HITL
blocks
- **SECRT-1795: Auto-approval toggle UX** - Toggle in pending reviews
panel to auto-approve future actions from the same node
- **Node-specific auto-approval** - Changed from execution-specific to
node-specific using special key pattern
`auto_approve_{graph_exec_id}_{node_id}`
- **Consolidated approval checking** - Merged `check_auto_approval` into
`check_approval` using single OR query for better performance
- **Race condition prevention** - Added execution status check before
resuming to prevent duplicate execution when approving while graph is
running
- **Parallel auto-approval creation** - Uses `asyncio.gather` for better
performance when creating multiple auto-approval records

## Changes

### Backend Architecture
- **`human_review.py`**: 
- Added `check_approval()` function that checks both normal and
auto-approval in single query
- Added `create_auto_approval_record()` for node-specific auto-approval
using special key pattern
- Added `get_auto_approve_key()` helper to generate consistent
auto-approval keys
- **`review/routes.py`**: 
- Added execution status check before resuming to prevent race
conditions
- Refactored auto-approval record creation to use parallel execution
with `asyncio.gather`
  - Removed obvious comments for cleaner code
- **`review/model.py`**: Added `auto_approve_future_actions` field to
`ReviewRequest`
- **`blocks/helpers/review.py`**: Updated to use consolidated
`check_approval` via database manager client
- **`executor/database.py`**: Exposed `check_approval` through
DatabaseManager RPC for block execution context
- **`data/block.py`**: Fixed safe mode checks for sensitive action
blocks

### Frontend
- **New `AIAgentSafetyPopup`** component with localStorage-based
one-time display
- **`PendingReviewsList`**: 
  - Replaced "Approve all future actions" button with toggle
- Toggle resets data to original values and disables editing when
enabled
  - Shows warning message explaining auto-approval behavior
- **`RunAgentModal`**: Integrated safety popup before first run
- **`usePendingReviews`**: Added polling for real-time badge updates
- **`FloatingSafeModeToggle` & `SafeModeToggle`**: Simplified visibility
logic
- **`local-storage.ts`**: Added localStorage key for popup state
tracking

### Bug Fixes
- Fixed "Client is not connected to query engine" error by using
database manager client pattern
- Fixed race condition where approving reviews while graph is RUNNING
could queue execution twice
- Fixed migration to only drop FK constraint, not non-existent column
- Fixed card data reset when auto-approve toggle changes

### Code Quality
- Removed duplicate/obvious comments
- Moved imports to top-level instead of local scope in tests
- Used walrus operator for cleaner conditional assignments
- Parallel execution for auto-approval record creation

## Test plan
- [ ] Create an AI-generated agent with sensitive actions (e.g., email
sending)
- [ ] First run should show the safety popup before starting
- [ ] Subsequent runs should not show the popup
- [ ] Clear localStorage (`AI_AGENT_SAFETY_POPUP_SHOWN`) to verify popup
shows again
- [ ] Create an agent with human-in-the-loop blocks
- [ ] Run it and verify the pending reviews panel appears
- [ ] Enable the "Auto-approve all future actions" toggle
- [ ] Verify editing is disabled and shows warning message
- [ ] Click "Approve" and verify subsequent blocks from same node
auto-approve
- [ ] Verify auto-approval persists across multiple executions of same
graph
- [ ] Disable toggle and verify editing works normally
- [ ] Verify "Reject" button still works regardless of toggle state
- [ ] Test race condition: Approve reviews while graph is RUNNING
(should skip resume)
- [ ] Test race condition: Approve reviews while graph is REVIEW (should
resume)
- [ ] Verify pending reviews badge updates in real-time when new reviews
are created
2026-01-25 04:05:25 +07:00
Zamil Majdy
8b25e62959 feat(backend,frontend): add explicit safe mode toggles for HITL and sensitive actions (#11756)
## Summary

This PR introduces two explicit safe mode toggles for controlling agent
execution behavior, providing clearer and more granular control over
when agents should pause for human review.

### Key Changes

**New Safe Mode Settings:**
- **`human_in_the_loop_safe_mode`** (bool, default `true`) - Controls
whether human-in-the-loop (HITL) blocks pause for review
- **`sensitive_action_safe_mode`** (bool, default `false`) - Controls
whether sensitive action blocks pause for review

**New Computed Properties on LibraryAgent:**
- `has_human_in_the_loop` - Indicates if agent contains HITL blocks
- `has_sensitive_action` - Indicates if agent contains sensitive action
blocks

**Block Changes:**
- Renamed `requires_human_review` to `is_sensitive_action` on blocks for
clarity
- Blocks marked as `is_sensitive_action=True` pause only when
`sensitive_action_safe_mode=True`
- HITL blocks pause when `human_in_the_loop_safe_mode=True`

**Frontend Changes:**
- Two separate toggles in Agent Settings based on block types present
- Toggle visibility based on `has_human_in_the_loop` and
`has_sensitive_action` computed properties
- Settings cog hidden if neither toggle applies
- Proper state management for both toggles with defaults

**AI-Generated Agent Behavior:**
- AI-generated agents set `sensitive_action_safe_mode=True` by default
- This ensures sensitive actions are reviewed for AI-generated content

## Changes

**Backend:**
- `backend/data/graph.py` - Updated `GraphSettings` with two boolean
toggles (non-optional with defaults), added `has_sensitive_action`
computed property
- `backend/data/block.py` - Renamed `requires_human_review` to
`is_sensitive_action`, updated review logic
- `backend/data/execution.py` - Updated `ExecutionContext` with both
safe mode fields
- `backend/api/features/library/model.py` - Added
`has_human_in_the_loop` and `has_sensitive_action` to `LibraryAgent`
- `backend/api/features/library/db.py` - Updated to use
`sensitive_action_safe_mode` parameter
- `backend/executor/utils.py` - Simplified execution context creation

**Frontend:**
- `useAgentSafeMode.ts` - Rewritten to support two independent toggles
- `AgentSettingsModal.tsx` - Shows two separate toggles
- `SelectedSettingsView.tsx` - Shows two separate toggles
- Regenerated API types with new schema

## Test Plan

- [x] All backend tests pass (Python 3.11, 3.12, 3.13)
- [x] All frontend tests pass
- [x] Backend format and lint pass
- [x] Frontend format and lint pass
- [x] Pre-commit hooks pass

---------

Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
2026-01-21 00:56:02 +00:00
Reinier van der Leer
b01ea3fcbd fix(backend/executor): Centralize increment_runs calls & make add_graph_execution more robust (#11764)
[OPEN-2946: \[Scheduler\] Error executing graph <graph_id> after 19.83s:
ClientNotConnectedError: Client is not connected to the query engine,
you must call `connect()` before attempting to query
data.](https://linear.app/autogpt/issue/OPEN-2946)

- Follow-up to #11375
  <sub>(broken `increment_runs` call)</sub>
- Follow-up to #11380
  <sub>(direct `get_graph_execution` call)</sub>

### Changes 🏗️

- Move `increment_runs` call from `scheduler._execute_graph` to
`executor.utils.add_graph_execution` so it can be made through
`DatabaseManager`
  - Add `increment_onboarding_runs` to `DatabaseManager`
- Remove now-redundant `increment_onboarding_runs` calls in other places
- Make `add_graph_execution` more resilient
  - Split up large try/except block
  - Fix direct `get_graph_execution` call

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - CI + a thorough review
2026-01-15 04:08:19 +00:00
Nicholas Tindle
47a3a5ef41 feat(backend,frontend): optional credentials flag for blocks at agent level (#11716)
This feature allows agent makers to mark credential fields as optional.
When credentials are not configured for an optional block, the block
will be skipped during execution rather than causing a validation error.

**Use case:** An agent with multiple notification channels (Discord,
Twilio, Slack) where the user only needs to configure one - unconfigured
channels are simply skipped.

### Changes 🏗️

#### Backend

**Data Model Changes:**
- `backend/data/graph.py`: Added `credentials_optional` property to
`Node` model that reads from node metadata
- `backend/data/execution.py`: Added `nodes_to_skip` field to
`GraphExecutionEntry` model to track nodes that should be skipped

**Validation Changes:**
- `backend/executor/utils.py`:
- Updated `_validate_node_input_credentials()` to return a tuple of
`(credential_errors, nodes_to_skip)`
- Nodes with `credentials_optional=True` and missing credentials are
added to `nodes_to_skip` instead of raising validation errors
- Updated `validate_graph_with_credentials()` to propagate
`nodes_to_skip` set
- Updated `validate_and_construct_node_execution_input()` to return
`nodes_to_skip`
- Updated `add_graph_execution()` to pass `nodes_to_skip` to execution
entry

**Execution Changes:**
- `backend/executor/manager.py`:
  - Added skip logic in `_on_graph_execution()` dispatch loop
- When a node is in `nodes_to_skip`, it is marked as `COMPLETED` without
execution
  - No outputs are produced, so downstream nodes won't trigger

#### Frontend

**Node Store:**
- `frontend/src/app/(platform)/build/stores/nodeStore.ts`:
- Added `credentials_optional` to node metadata serialization in
`convertCustomNodeToBackendNode()`
- Added `getCredentialsOptional()` and `setCredentialsOptional()` helper
methods

**Credential Field Component:**
-
`frontend/src/components/renderers/input-renderer/fields/CredentialField/CredentialField.tsx`:
  - Added "Optional - skip block if not configured" switch toggle
  - Switch controls the `credentials_optional` metadata flag
  - Placeholder text updates based on optional state

**Credential Field Hook:**
-
`frontend/src/components/renderers/input-renderer/fields/CredentialField/useCredentialField.ts`:
  - Added `disableAutoSelect` parameter
- When credentials are optional, auto-selection of credentials is
disabled

**Feature Flags:**
- `frontend/src/services/feature-flags/use-get-flag.ts`: Minor refactor
(condition ordering)

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Build an agent using smart decision maker and down stream blocks
to test this

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Introduces optional credentials across graph execution and UI,
allowing nodes to be skipped (no outputs, no downstream triggers) when
their credentials are not configured.
> 
> - Backend
> - Adds `Node.credentials_optional` (from node `metadata`) and computes
required credential fields in `Graph.credentials_input_schema` based on
usage.
> - Validates credentials with `_validate_node_input_credentials` →
returns `(errors, nodes_to_skip)`; plumbs `nodes_to_skip` through
`validate_graph_with_credentials`,
`_construct_starting_node_execution_input`,
`validate_and_construct_node_execution_input`, and `add_graph_execution`
into `GraphExecutionEntry`.
> - Executor: dispatch loop skips nodes in `nodes_to_skip` (marks
`COMPLETED`); `execute_node`/`on_node_execution` accept `nodes_to_skip`;
`SmartDecisionMakerBlock.run` filters tool functions whose
`_sink_node_id` is in `nodes_to_skip` and errors only if all tools are
filtered.
> - Models: `GraphExecutionEntry` gains `nodes_to_skip` field. Tests and
snapshots updated accordingly.
> 
> - Frontend
> - Builder: credential field uses `custom/credential_field` with an
"Optional – skip block if not configured" toggle; `nodeStore` persists
`credentials_optional` and history; UI hides optional toggle in run
dialogs.
> - Run dialogs: compute required credentials from
`credentials_input_schema.required`; allow selecting "None"; avoid
auto-select for optional; filter out incomplete creds before execute.
>   - Minor schema/UI wiring updates (`uiSchema`, form context flags).
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
5e01fd6a3e. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <ntindle@users.noreply.github.com>
2026-01-09 14:11:35 +00:00
Zamil Majdy
7b951c977e feat(platform): implement graph-level Safe Mode toggle for HITL blocks (#11455)
## Summary

This PR implements a graph-level Safe Mode toggle system for
Human-in-the-Loop (HITL) blocks. When Safe Mode is ON (default), HITL
blocks require manual review before proceeding. When OFF, they execute
automatically.

## 🔧 Backend Changes

- **Database**: Added `metadata` JSON column to `AgentGraph` table with
migration
- **API**: Updated `execute_graph` endpoint to accept `safe_mode`
parameter
- **Execution**: Enhanced execution context to use graph metadata as
default with API override capability
- **Auto-detection**: Automatically populate `has_human_in_the_loop` for
graphs containing HITL blocks
- **Block Detection**: HITL block ID:
`8b2a7b3c-6e9d-4a5f-8c1b-2e3f4a5b6c7d`

## 🎨 Frontend Changes

- **Component**: New `FloatingSafeModeToggle` with dual variants:
  - **White variant**: For library pages, integrates with action buttons
  - **Black variant**: For builders, floating positioned  
- **Integration**: Added toggles to both new/legacy builders and library
pages
- **API Integration**: Direct graph metadata updates via
`usePutV1UpdateGraphVersion`
- **Query Management**: React Query cache invalidation for consistent UI
updates
- **Conditional Display**: Toggle only appears when graph contains HITL
blocks

## 🛠 Technical Implementation

- **Safe Mode ON** (default): HITL blocks require manual review before
proceeding
- **Safe Mode OFF**: HITL blocks execute automatically without
intervention
- **Priority**: Backend API `safe_mode` parameter takes precedence over
graph metadata
- **Detection**: Auto-populates `has_human_in_the_loop` metadata field
- **Positioning**: Proper z-index and responsive positioning for
floating elements

## 🚧 Known Issues (Work in Progress)

### High Priority
- [ ] **Toggle state persistence**: Always shows "ON" regardless of
actual state - query invalidation issue
- [ ] **LibraryAgent metadata**: Missing metadata field causing
TypeScript errors
- [ ] **Tooltip z-index**: Still covered by some UI elements despite
high z-index

### Medium Priority  
- [ ] **HITL detection**: Logic needs improvement for reliable block
detection
- [ ] **Error handling**: Removing HITL blocks from graph causes save
errors
- [ ] **TypeScript**: Fix type mismatches between GraphModel and
LibraryAgent

### Low Priority
- [ ] **Frontend API**: Add `safe_mode` parameter to execution calls
once OpenAPI is regenerated
- [ ] **Performance**: Consider debouncing rapid toggle clicks

## 🧪 Test Plan

- [ ] Verify toggle appears only when graph has HITL blocks
- [ ] Test toggle persistence across page refreshes  
- [ ] Confirm API calls update graph metadata correctly
- [ ] Validate execution behavior respects safe mode setting
- [ ] Check styling consistency across builder and library contexts

## 🔗 Related

- Addresses requirements for graph-level HITL configuration
- Builds on existing FloatingReviewsPanel infrastructure
- Integrates with existing graph metadata system

🤖 Generated with [Claude Code](https://claude.ai/code)
2025-12-02 09:55:55 +00:00
Zamil Majdy
3d08c22dd5 feat(platform): add Human In The Loop block with review workflow (#11380)
## Summary
This PR implements a comprehensive Human In The Loop (HITL) block that
allows agents to pause execution and wait for human
approval/modification of data before continuing.



https://github.com/user-attachments/assets/c027d731-17d3-494c-85ca-97c3bf33329c


## Key Features
- Added WAITING_FOR_REVIEW status to AgentExecutionStatus enum
- Created PendingHumanReview database table for storing review requests
- Implemented HumanInTheLoopBlock that extracts input data and creates
review entries
- Added API endpoints at /api/executions/review for fetching and
reviewing pending data
- Updated execution manager to properly handle waiting status and resume
after approval

## Frontend Components
- PendingReviewCard for individual review handling
- PendingReviewsList for multiple reviews
- FloatingReviewsPanel for graph builder integration
- Integrated review UI into 3 locations: legacy library, new library,
and graph builder

## Technical Implementation
- Added proper type safety throughout with SafeJson handling
- Optimized database queries using count functions instead of full data
fetching
- Fixed imports to be top-level instead of local
- All formatters and linters pass

## Test plan
- [ ] Test Human In The Loop block creation in graph builder
- [ ] Test block execution pauses and creates pending review
- [ ] Test review UI appears in all 3 locations
- [ ] Test data modification and approval workflow
- [ ] Test rejection workflow
- [ ] Test execution resumes after approval

🤖 Generated with [Claude Code](https://claude.ai/code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added Human-In-The-Loop review workflows to pause executions for human
validation.
* Users can approve or reject pending tasks, optionally editing
submitted data and adding a message.
* New "Waiting for Review" execution status with UI indicators across
run lists, badges, and activity views.
* Review management UI: pending review cards, list view, and a floating
reviews panel for quick access.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-11-27 12:07:46 +07:00
Swifty
a66219fc1f fix(platform): Remove un-runnable agents from schedule (#11374)
Currently when an agent fails validation during a scheduled run, we
raise an error then try again, regardless of why.

This change removed the agent schedule and notifies the user 

### Changes 🏗️

- add schedule_id to the GraphExecutionJobArgs
- add agent_name to the GraphExecutionJobArgs
- Delete schedule on GraphValidationError
- Notify the user with a message that include the agent name

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [x] I have ensured the scheduler tests work with these changes
2025-11-17 15:24:40 +00:00
Reinier van der Leer
d68dceb9c1 fix(backend/executor): Improve graph execution permission check (#11323)
- Resolves #11316
- Durable fix to replace #11318

### Changes 🏗️

- Expand graph execution permissions check
  - Don't require library membership for execution as sub-graph

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Can run sub-agent with non-latest graph version
- [x] Can run sub-agent that is available in Marketplace but not added
to Library
2025-11-05 17:13:41 +00:00
Zamil Majdy
5506d59da1 fix(backend/executor): make graph execution permission check version-agnostic (#11283)
## Summary
Fix critical issue where pre-execution permission validation broke
execution of graphs that reference older versions of sub-graphs.

## Problem
The `validate_graph_execution_permissions` function was checking for the
specific version of a graph in the user's library. This caused failures
when:
1. A parent graph references an older version of a sub-graph  
2. The user updates the sub-graph to a newer version
3. The older version is no longer in their library
4. Execution of the parent graph fails with `GraphNotInLibraryError`

## Root Cause
In `backend/executor/utils.py` line 523, the function was checking for
the exact version, but sub-graphs legitimately reference older versions
that may no longer be in the library.

## Solution

### 1. Remove Version-Specific Check (backend/executor/utils.py)
- Remove `graph_version=graph.version` parameter from validation call
- Add explanatory comment about version-agnostic behavior
- Now only checks that the graph ID exists in user's library (any
version)

### 2. Enhance Documentation (backend/data/graph.py)  
- Update function docstring to explain version-agnostic behavior
- Document that `None` (now default) allows execution of any version
- Clarify this is important for sub-graph version compatibility

## Technical Details
The `validate_graph_execution_permissions` function was already designed
to handle version-agnostic checks when `graph_version=None`. By omitting
the version parameter, we skip the version check and only verify:
- Graph exists in user's library  
- Graph is not deleted/archived
- User has execution permissions

## Impact
-  Parent graphs can execute even when they reference older sub-graph
versions
-  Sub-graph updates don't break existing parent graphs  
-  Maintains security: still checks library membership and permissions
-  No breaking changes: version-specific validation still available
when needed

## Example Scenario Fixed
1. User creates parent graph that uses sub-graph v1
2. User updates sub-graph to v2 (v1 removed from library)  
3. Parent graph still references sub-graph v1
4. **Before**: Execution fails with `GraphNotInLibraryError`
5. **After**: Execution succeeds (version-agnostic permission check)

## Testing
- [x] Code formatting and linting passes
- [x] Type checking passes
- [x] No breaking changes to existing functionality
- [x] Security still maintained through library membership checks

## Files Changed
- `backend/executor/utils.py`: Remove version-specific permission check
- `backend/data/graph.py`: Enhanced documentation for version-agnostic
behavior

Closes #[issue-number-if-applicable]

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-29 14:13:23 +00:00
Zamil Majdy
4922f88851 feat(backend/executor): Implement cascading stop for nested graph executions (#11277)
## Summary
Fixes critical issue where child executions spawned by
`AgentExecutorBlock` continue running after parent execution is stopped.
Implements parent-child execution tracking and recursive cascading stop
logic to ensure entire execution trees are terminated together.

## Background
When a parent graph execution containing `AgentExecutorBlock` nodes is
stopped, only the parent was terminated. Child executions continued
running, leading to:
-  Orphaned child executions consuming credits
-  No user control over execution trees  
-  Race conditions where children start after parent stops
-  Resource leaks from abandoned executions

## Core Changes

### 1. Database Schema (`schema.prisma` + migration)
```sql
-- Add nullable parent tracking field
ALTER TABLE "AgentGraphExecution" ADD COLUMN "parentGraphExecutionId" TEXT;

-- Add self-referential foreign key with graceful deletion
ALTER TABLE "AgentGraphExecution" ADD CONSTRAINT "AgentGraphExecution_parentGraphExecutionId_fkey" 
  FOREIGN KEY ("parentGraphExecutionId") REFERENCES "AgentGraphExecution"("id") 
  ON DELETE SET NULL ON UPDATE CASCADE;

-- Add index for efficient child queries
CREATE INDEX "AgentGraphExecution_parentGraphExecutionId_idx" 
  ON "AgentGraphExecution"("parentGraphExecutionId");
```

### 2. Parent ID Propagation (`backend/blocks/agent.py`)
```python
# Extract current graph execution ID and pass as parent to child
execution = add_graph_execution(
    # ... other params
    parent_graph_exec_id=graph_exec_id,  # NEW: Track parent relationship
)
```

### 3. Data Layer (`backend/data/execution.py`)
```python
async def get_child_graph_executions(parent_exec_id: str) -> list[GraphExecution]:
    """Get all child executions of a parent execution."""
    children = await AgentGraphExecution.prisma().find_many(
        where={"parentGraphExecutionId": parent_exec_id, "isDeleted": False}
    )
    return [GraphExecution.from_db(child) for child in children]
```

### 4. Cascading Stop Logic (`backend/executor/utils.py`)
```python
async def stop_graph_execution(
    user_id: str,
    graph_exec_id: str,
    wait_timeout: float = 15.0,
    cascade: bool = True,  # NEW parameter
):
    # 1. Find all child executions
    if cascade:
        children = await _get_child_executions(graph_exec_id)
        
        # 2. Stop all children recursively in parallel
        if children:
            await asyncio.gather(
                *[stop_graph_execution(user_id, child.id, wait_timeout, True) 
                  for child in children],
                return_exceptions=True,  # Don't fail parent if child fails
            )
    
    # 3. Stop the parent execution
    # ... existing stop logic
```

### 5. Race Condition Prevention (`backend/executor/manager.py`)
```python
# Before executing queued child, check if parent was terminated
if parent_graph_exec_id:
    parent_exec = get_db_client().get_graph_execution_meta(parent_graph_exec_id, user_id)
    if parent_exec and parent_exec.status == ExecutionStatus.TERMINATED:
        # Skip execution, mark child as terminated
        get_db_client().update_graph_execution_stats(
            graph_exec_id=graph_exec_id,
            status=ExecutionStatus.TERMINATED,
        )
        return  # Don't start orphaned child
```

## How It Works

### Before (Broken)
```
User stops parent execution
    ↓
Parent terminates ✓
    ↓
Child executions keep running ✗
    ↓
User cannot stop children ✗
```

### After (Fixed)
```
User stops parent execution
    ↓
Query database for all children
    ↓
Recursively stop all children in parallel
    ↓
Wait for children to terminate
    ↓
Stop parent execution
    ↓
All executions in tree stopped ✓
```

### Race Prevention
```
Child in QUEUED status
    ↓
Parent stopped
    ↓
Child picked up by executor
    ↓
Pre-flight check: parent TERMINATED?
    ↓
Yes → Skip execution, mark child TERMINATED
    ↓
Child never runs ✓
```

## Edge Cases Handled
 **Deep nesting** - Recursive cascading handles multi-level trees  
 **Queued children** - Pre-flight check prevents execution  
 **Race conditions** - Child spawned during stop operation  
 **Partial failures** - `return_exceptions=True` continues on error  
 **Multiple children** - Parallel stop via `asyncio.gather()`  
 **No parent** - Backward compatible (nullable field)  
 **Already completed** - Existing status check handles it  

## Performance Impact
- **Stop operation**: O(depth) with parallel execution vs O(1) before
- **Memory**: +36 bytes per execution (one UUID reference)
- **Database**: +1 query per tree level, indexed for efficiency

## API Changes (Backward Compatible)

### `stop_graph_execution()` - New Optional Parameter
```python
# Before
async def stop_graph_execution(user_id: str, graph_exec_id: str, wait_timeout: float = 15.0)

# After  
async def stop_graph_execution(user_id: str, graph_exec_id: str, wait_timeout: float = 15.0, cascade: bool = True)
```
**Default `cascade=True`** means existing callers get the new behavior
automatically.

### `add_graph_execution()` - New Optional Parameter
```python
async def add_graph_execution(..., parent_graph_exec_id: Optional[str] = None)
```

## Security & Safety
-  **User verification** - Users can only stop their own executions
(parent + children)
-  **No cycles** - Self-referential FK prevents infinite loops  
-  **Graceful degradation** - Errors in child stops don't block parent
stop
-  **Rate limits** - Existing execution rate limits still apply

## Testing Checklist

### Database Migration
- [x] Migration runs successfully  
- [x] Prisma client regenerates without errors
- [x] Existing tests pass

### Core Functionality  
- [ ] Manual test: Stop parent with running child → child stops
- [ ] Manual test: Stop parent with queued child → child never starts
- [ ] Unit test: Cascading stop with multiple children
- [ ] Unit test: Deep nesting (3+ levels)
- [ ] Integration test: Race condition prevention

## Breaking Changes
**None** - All changes are backward compatible with existing code.

## Rollback Plan
If issues arise:
1. **Code rollback**: Revert PR, redeploy
2. **Database rollback**: Drop column and constraints (non-destructive)

---

**Note**: This branch contains additional unrelated changes from merging
with `dev`. The core cascading stop feature involves only:
- `schema.prisma` + migration
- `backend/data/execution.py` 
- `backend/executor/utils.py`
- `backend/blocks/agent.py`
- `backend/executor/manager.py`

All other file changes are from dev branch updates and not part of this
feature.

🤖 Generated with [Claude Code](https://claude.ai/code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Nested graph executions: parent-child tracking and retrieval of child
executions

* **Improvements**
* Cascading stop: stopping a parent optionally terminates child
executions
  * Parent execution IDs propagated through runs and surfaced in logs
  * Per-user/graph concurrent execution limits enforced

* **Bug Fixes**
* Skip enqueuing children if parent is terminated; robust handling when
parent-status checks fail

* **Tests**
  * Updated tests to cover parent linkage in graph creation
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-29 11:11:22 +00:00
Zamil Majdy
de70ede54a fix(backend): prevent execution of deleted agents and cleanup orphaned resources (#11243)
## Summary
Fix critical bug where deleted agents continue running scheduled and
triggered executions indefinitely, consuming credits without user
control.

## Problem
When agents are deleted from user libraries, their schedules and webhook
triggers remain active, leading to:
-  Uncontrolled resource consumption 
-  "Unknown agent" executions that charge credits
-  No way for users to stop orphaned executions
-  Accumulation of orphaned database records

## Solution

### 1. Prevention: Library Validation Before Execution
- Add `is_graph_in_user_library()` function with efficient database
queries
- Validate graph accessibility before all executions in
`validate_and_construct_node_execution_input()`
- Use specific `GraphNotInLibraryError` for clear error handling

### 2. Cleanup: Remove Schedules & Webhooks on Deletion
- Enhanced `delete_library_agent()` to clean up associated schedules and
webhooks
- Comprehensive cleanup functions for both scheduled and triggered
executions
- Proper database transaction handling

### 3. Error-Based Cleanup: Handle Existing Orphaned Resources
- Catch `GraphNotInLibraryError` in scheduler and webhook handlers
- Automatically clean up orphaned resources when execution fails
- Graceful degradation without breaking existing workflows

### 4. Migration: Clean Up Historical Orphans
- SQL migration to remove existing orphaned schedules and webhooks
- Performance index for faster cleanup queries
- Proper logging and error handling

## Key Changes

### Core Library Validation
```python
# backend/data/graph.py - Single source of truth
async def is_graph_in_user_library(graph_id: str, user_id: str, graph_version: Optional[int] = None) -> bool:
    where_clause = {"userId": user_id, "agentGraphId": graph_id, "isDeleted": False, "isArchived": False}
    if graph_version is not None:
        where_clause["agentGraphVersion"] = graph_version
    count = await LibraryAgent.prisma().count(where=where_clause)
    return count > 0
```

### Enhanced Agent Deletion
```python
# backend/server/v2/library/db.py
async def delete_library_agent(library_agent_id: str, user_id: str, soft_delete: bool = True) -> None:
    # ... existing deletion logic ...
    await _cleanup_schedules_for_graph(graph_id=graph_id, user_id=user_id)
    await _cleanup_webhooks_for_graph(graph_id=graph_id, user_id=user_id)
```

### Execution Prevention
```python
# backend/executor/utils.py
if not await gdb.is_graph_in_user_library(graph_id=graph_id, user_id=user_id, graph_version=graph.version):
    raise GraphNotInLibraryError(f"Graph #{graph_id} is not accessible in your library")
```

### Error-Based Cleanup
```python
# backend/executor/scheduler.py & backend/server/integrations/router.py
except GraphNotInLibraryError as e:
    logger.warning(f"Execution blocked for deleted/archived graph {graph_id}")
    await _cleanup_orphaned_resources_for_graph(graph_id, user_id)
```

## Technical Implementation

### Database Efficiency
- Use `count()` instead of `find_first()` for faster queries
- Add performance index: `idx_library_agent_user_graph_active`
- Follow existing `prisma.is_connected()` patterns

### Error Handling Hierarchy
- **`GraphNotInLibraryError`**: Specific exception for deleted/archived
graphs
- **`NotAuthorizedError`**: Generic authorization errors (preserved for
user ID mismatches)
- Clear error messages for better debugging

### Code Organization
- Single source of truth for library validation in
`backend/data/graph.py`
- Import from centralized location to avoid duplication
- Top-level imports following codebase conventions

## Testing & Validation

### Functional Testing
-  Library validation prevents execution of deleted agents
-  Cleanup functions remove schedules and webhooks properly  
-  Error-based cleanup handles orphaned resources gracefully
-  Migration removes existing orphaned records

### Integration Testing
-  All existing tests pass (including `test_store_listing_graph`)
-  No breaking changes to existing functionality
-  Proper error propagation and handling

### Performance Testing
-  Efficient database queries with proper indexing
-  Minimal overhead for normal execution flows
-  Cleanup operations don't impact performance

## Impact

### User Experience
- 🎯 **Immediate**: Deleted agents stop running automatically
- 🎯 **Ongoing**: No more unexpected credit charges from orphaned
executions
- 🎯 **Cleanup**: Historical orphaned resources are removed

### System Reliability
- 🔒 **Security**: Users can only execute agents they have access to
- 🧹 **Cleanup**: Automatic removal of orphaned database records
- 📈 **Performance**: Efficient validation with minimal overhead

### Developer Experience
- 🎯 **Clear Errors**: Specific exception types for better debugging
- 🔧 **Maintainable**: Centralized library validation logic
- 📚 **Documented**: Comprehensive error handling patterns

## Files Modified
- `backend/data/graph.py` - Library validation function
- `backend/server/v2/library/db.py` - Enhanced agent deletion with
cleanup
- `backend/executor/utils.py` - Execution validation and prevention
- `backend/executor/scheduler.py` - Error-based cleanup for schedules
- `backend/server/integrations/router.py` - Error-based cleanup for
webhooks
- `backend/util/exceptions.py` - Specific error type for deleted graphs
-
`migrations/20251023000000_cleanup_orphaned_schedules_and_webhooks/migration.sql`
- Historical cleanup

## Breaking Changes
None. All changes are backward compatible and preserve existing
functionality.

## Follow-up Tasks
- [ ] Monitor cleanup effectiveness in production
- [ ] Consider adding metrics for orphaned resource detection
- [ ] Potential optimization of cleanup batch operations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-28 23:48:35 +00:00
Reinier van der Leer
04df981115 fix(backend): Fix structured logging for cloud environments (#11227)
- Resolves #11226

### Changes 🏗️

- Drop use of `CloudLoggingHandler` which docs state isn't for use in
GKE
- For cloud logging, output only structured log entries to `stdout`

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Test deploy to dev and check logs
2025-10-21 12:48:41 +00:00
Zamil Majdy
11d55f6055 fix(backend/executor): Avoid running direct query in executor (#11224)
## Summary
- Fixes database connection warnings in executor logs: "Client is not
connected to the query engine, you must call `connect()` before
attempting to query data"
- Implements resilient database client pattern already used elsewhere in
the codebase
- Adds caching to reduce database load for user context lookups

## Changes
- Updated `get_user_context()` to check `prisma.is_connected()` and fall
back to database manager client
- Added `@cached(maxsize=1000, ttl_seconds=3600)` decorator for
performance optimization
- Updated database manager to expose `get_user_by_id` method

## Test plan
- [x] Verify executor pods no longer show Prisma connection warnings
- [x] Confirm user timezone is still correctly retrieved
- [x] Test fallback behavior when Prisma is disconnected

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-21 08:46:40 +00:00
Zamil Majdy
4e1557e498 fix(backend): Add dynamic input pin support for Smart Decision Maker Block (#11082)
## Summary

- Centralize dynamic field delimiters and helpers in
backend/data/dynamic_fields.py.
- Refactor SmartDecisionMaker: build function signatures with
dynamic-field mapping and re-map tool outputs back to original dynamic
names.
- Deterministic retry loop with retry-only feedback to avoid polluting
final conversation history.
- Update executor/utils.py and data/graph.py to use centralized
utilities.
- Update and extend tests: dynamic-field E2E flow, mapping verification,
output yielding, and retry validation; switch mocked llm_call to
AsyncMock; align tool-name expectations.
- Add a single-tool fallback in schema lookup to support mocked
scenarios.

## Validation

- Full backend test suite: 1125 passed, 88 skipped, 53 warnings (local).
- Backend lint/format pass.

## Scope

- Minimal and localized to SmartDecisionMaker and dynamic-field
utilities; unrelated pyright warnings remain unchanged.

## Risks/Mitigations

- Behavior is backward-compatible; dynamic-field constants are
centralized and reused.
- Output re-mapping only affects SmartDecisionMaker tool outputs and
matches existing link naming conventions.

## Checklist

- [x] Formatted and linted
- [x] All updated tests pass locally
- [x] No secrets introduced

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-04 14:23:13 +00:00
Zamil Majdy
27fccdbf31 fix(backend/executor): Make graph execution status transitions atomic and enforce state machine (#10863)
## Summary
- Fixed race condition issues in `update_graph_execution_stats` function
- Implemented atomic status transitions using database-level constraints
- Added state machine enforcement to prevent invalid status transitions
- Eliminated code duplication and improved error handling

## Problem
The `update_graph_execution_stats` function had race condition
vulnerabilities where concurrent status updates could cause invalid
transitions like RUNNING → QUEUED. The function was not durable and
could result in executions moving backwards in their lifecycle, causing
confusion and potential system inconsistencies.

## Root Cause Analysis
1. **Race Conditions**: The function used a broad OR clause that allowed
updates from multiple source statuses without validating the specific
transition
2. **No Atomicity**: No atomic check to ensure the status hadn't changed
between read and write operations
3. **Missing State Machine**: No enforcement of valid state transitions
according to execution lifecycle rules

## Solution Implementation

### 1. Atomic Status Transitions
- Use database-level atomicity by including the current allowed source
statuses in the WHERE clause during updates
- This ensures only valid transitions can occur at the database level

### 2. State Machine Enforcement
Define valid transitions as a module constant
`VALID_STATUS_TRANSITIONS`:
- `INCOMPLETE` → `QUEUED`, `RUNNING`, `FAILED`, `TERMINATED`
- `QUEUED` → `RUNNING`, `FAILED`, `TERMINATED`  
- `RUNNING` → `COMPLETED`, `TERMINATED`, `FAILED`
- `TERMINATED` → `RUNNING` (for resuming halted execution)
- `COMPLETED` and `FAILED` are terminal states with no allowed
transitions

### 3. Improved Error Handling
- Early validation with clear error messages for invalid parameters
- Graceful handling when transitions fail - return current state instead
of None
- Proper logging of invalid transition attempts

### 4. Code Quality Improvements
- Eliminated code duplication in fetch logic
- Added proper type hints and casting
- Made status transitions constant for better maintainability

## Benefits
 **Prevents Invalid Regressions**: No more RUNNING → QUEUED transitions
 **Atomic Operations**: Database-level consistency guarantees  
 **Clear Error Messages**: Better debugging and monitoring  
 **Maintainable Code**: Clean logic flow without duplication  
 **Race Condition Safe**: Handles concurrent updates gracefully  

## Test Plan
- [x] Function imports and basic structure validation
- [x] Code formatting and linting checks pass
- [x] Type checking passes for modified files
- [x] Pre-commit hooks validation

## Technical Details
The key insight is using the database query itself to enforce valid
transitions by filtering on allowed source statuses in the WHERE clause.
This makes the operation truly atomic and eliminates the race condition
window that existed in the previous implementation.

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
2025-09-14 23:31:02 +00:00
Krzysztof Czerwinski
cfc975d39b feat(backend): Type for API block data response (#10763)
Moving to auto-generated frontend types caused returned blocks data to
no longer have proper typing.

### Changes 🏗️

- Add `BlockInfo` model and `get_info` function that returns it to the
`Block` class, including costs
- Move `BlockCost` and `BlockCostType` to `block.py` to prevent circular
imports

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Endpoints using the new type work correctly

Co-authored-by: Abhimanyu Yadav <122007096+Abhi1992002@users.noreply.github.com>
2025-09-06 04:21:48 +00:00
Reinier van der Leer
e16e69ca55 feat(library, executor): Make "Run Again" work with credentials (#10821)
- Resolves [OPEN-2549: Make "Run again" work with credentials in
`AgentRunDetailsView`](https://linear.app/autogpt/issue/OPEN-2549/make-run-again-work-with-credentials-in-agentrundetailsview)
- Resolves #10237

### Changes 🏗️

- feat(frontend/library): Make "Run Again" button work for runs with
credentials
- feat(backend/executor): Store passed-in credentials on
`GraphExecution`

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - Go to `/library/agents/[id]` for an agent with credentials inputs
  - Run the agent manually
    - [x] -> runs successfully
- [x] -> "Run again" shows among the action buttons on the newly created
run
  - Click "Run again"
    - [x] -> runs successfully
2025-09-02 18:34:56 +00:00
Nicholas Tindle
2bb8e91040 feat(backend): Add user timezone support to backend (#10707)
Co-authored-by: Swifty <craigswift13@gmail.com>
resolve issue #10692 where scheduled time and actual run
2025-08-25 11:00:07 -05:00
Reinier van der Leer
ba65fee862 hotfix(backend/executor): Fix propagation of passed-in credentials to sub-agents (#10668)
This should fix sub-agent execution issues with passed-in credentials after a crucial data path was removed in #10568.

Additionally, some of the changes are to ensure the `credentials_input_schema` gets refreshed correctly when saving a new version of a graph in the builder.

### Changes 🏗️

- Include `graph_credentials_inputs` in `nodes_input_masks` passed into sub-agent execution
- Fix credentials input schema in `update_graph` and `get_library_agent_by_graph_id` return
- Improve error message on sub-graph validation failure

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Import agent with sub-agent(s) with required credentials inputs & run it -> should work
2025-08-18 16:42:28 +02:00
Zamil Majdy
a28b2cf04f fix(backend/scheduler): Reconfigure scheduling setting & Add more logging on execution scheduling logic 2025-08-08 19:27:30 +07:00
Zamil Majdy
378d256b58 fix(backend): add graph validation before scheduling recurring jobs (#10568)
## Summary

This PR addresses the recurring job validation failures by adding graph
validation before scheduling jobs. Previously, validation errors only
occurred at runtime during job execution, making it difficult to
communicate errors to users for scheduled recurring jobs.

### Changes 🏗️

- **Extract validation logic**: Created
`validate_and_construct_node_execution_input` wrapper function that
centralizes graph fetching, credential mapping, and validation logic
- **Add pre-scheduling validation**: Modified
`add_graph_execution_schedule` to validate graphs before creating
scheduled jobs
- **Make construct function private**: Renamed
`construct_node_execution_input` to `_construct_node_execution_input` to
prevent direct usage and encourage use of the wrapper
- **Reduce code duplication**: Eliminated duplicate validation logic
between scheduler and execution paths
- **Improve scheduler lifecycle management**:
  - Enhanced cleanup process with proper event loop shutdown sequence
  - Added graceful event loop thread termination with timeout
  - Fixed thread lifecycle management to prevent resource leaks
- **Add helper utilities**: 
- Created `run_async` helper to reduce
`asyncio.run_coroutine_threadsafe` boilerplate
- Added `SCHEDULER_OPERATION_TIMEOUT_SECONDS` constant for consistent
timeout handling across all scheduler operations

### Technical Details

**Validation Flow:**
The validation now happens in `add_graph_execution_schedule` before
calling `scheduler.add_job()`, ensuring that:
1. Graph exists and is accessible to the user
2. All credentials are valid and available
3. Graph structure and node configurations are valid
4. Starting nodes are present and properly configured

This uses the same validation logic as runtime execution, guaranteeing
consistency.

**Scheduler Lifecycle Improvements:**
- **Proper cleanup sequence**: Event loop is stopped before thread
termination
- **Thread management**: Added global tracking of event loop thread for
proper cleanup
- **Timeout consistency**: All scheduler operations now use the same
300-second timeout
- **Resource management**: Prevents potential memory leaks from unclosed
event loops

**Code Quality Improvements:**
- **DRY principle**: `run_async` helper eliminates repeated
`asyncio.run_coroutine_threadsafe` patterns
- **Single source of truth**: All timeout values use
`SCHEDULER_OPERATION_TIMEOUT_SECONDS` constant
- **Cleaner abstractions**: Direct utility function calls instead of
unnecessary wrapper methods

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verified imports work correctly for both scheduler and utils
modules
  - [x] Confirmed code passes all linting and type checking
  - [x] Validated that existing functionality remains intact
  - [x] Tested that validation logic is properly extracted and reused
  - [x] Verified scheduler cleanup process works correctly
  - [x] Confirmed thread lifecycle management improvements

#### For configuration changes:
- [x] `.env.example` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)

*Note: No configuration changes were required for this fix.*

## Impact

- **Prevents runtime failures**: Invalid graphs are caught before
scheduling instead of failing silently during execution
- **Better error communication**: Validation errors surface immediately
when scheduling
- **Improved resource management**: Proper event loop and thread cleanup
prevents memory leaks
- **Enhanced maintainability**: Single source of truth for validation
logic and consistent timeout handling
- **Reduced code duplication**: Eliminated ~30+ lines of duplicate code
across validation and async execution patterns
- **Better developer experience**: Cleaner code with helper functions
and consistent patterns

Resolves the TODO comment: "We need to communicate this error to the
user somehow" in scheduler.py:107

Co-authored-by: Claude <noreply@anthropic.com>
2025-08-08 05:40:20 +00:00
Zamil Majdy
3fe88b6106 refactor(backend): Refactor log client and resource cleanup (#10558)
## Summary
- Created centralized service client helpers with thread caching in
`util/clients.py`
- Refactored service client management to eliminate health checks and
improve performance
- Enhanced logging in process cleanup to include error details
- Improved retry mechanisms and resource cleanup across the platform
- Updated multiple services to use new centralized client patterns

## Key Changes
### New Centralized Client Factory (`util/clients.py`)
- Added thread-cached factory functions for all major service clients:
  - Database managers (sync and async)
  - Scheduler client
  - Notification manager
  - Execution event bus (Redis-based)
  - RabbitMQ execution queue (sync and async)
  - Integration credentials store
- All clients use `@thread_cached` decorator for performance
optimization

### Service Client Improvements
- **Removed health checks**: Eliminated unnecessary health check calls
from `get_service_client()` to reduce startup overhead
- **Enhanced retry support**: Database manager clients now use request
retry by default
- **Better error handling**: Improved error propagation and logging

### Enhanced Logging and Cleanup
- **Process termination logs**: Added error details to termination
messages in `util/process.py`
- **Retry mechanism updates**: Improved retry logic with better error
handling in `util/retry.py`
- **Resource cleanup**: Better resource management across executors and
monitoring services

### Updated Service Usage
- Refactored 21+ files to use new centralized client patterns
- Updated all executor, monitoring, and notification services
- Maintained backward compatibility while improving performance

## Files Changed
- **Created**: `backend/util/clients.py` - Centralized client factory
with thread caching
- **Modified**: 21 files across blocks, executor, monitoring, and
utility modules
- **Key areas**: Service client initialization, resource cleanup, retry
mechanisms

## Test Plan
- [x] Verify all existing tests pass
- [x] Validate service startup and client initialization  
- [x] Test resource cleanup on process termination
- [x] Confirm retry mechanisms work correctly
- [x] Validate thread caching performance improvements
- [x] Ensure no breaking changes to existing functionality

## Breaking Changes
None - all changes maintain backward compatibility.

## Additional Notes
This refactoring centralizes client management patterns that were
scattered across the codebase, making them more consistent and
performant through thread caching. The removal of health checks reduces
startup time while maintaining reliability through improved retry
mechanisms.

🤖 Generated with [Claude Code](https://claude.ai/code)
2025-08-06 13:53:01 +07:00
Reinier van der Leer
fa2d968458 fix(builder): Defer graph validation to backend (#10556)
- Resolves #10553

### Changes 🏗️

- Remove frontend graph validation in `useAgentGraph:saveAndRun(..)`
  - Remove now unused `ajv` dependency
- Implement graph validation error propagation (backend->frontend)
  - Add `GraphValidationError` type in frontend and backend
  - Add `GraphModel.validate_graph_get_errors(..)` method
  - Fix error handling & propagation in frontend API request logic

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Saving & running a graph with missing required inputs gives a
node-specific error
- [x] Saving & running a graph with missing node credential inputs
succeeds with passed-in credentials
2025-08-05 23:43:34 +00:00
Zamil Majdy
f9b255fb7a feat(backend/executor): Avoid executor premature termination on inflight agent execution (#10552)
There is no 100% accurate way of retrying an agent that has been
terminated. And the safest way to avoid executing an agent wrong is
minimizing the chance of an agent execution being terminated. A whole
set of mechanism to make sure the agent is retried on failure is still
in place and improved, this is used as our best-effort reliability
mechanism.

### Changes 🏗️

* Cap SIGINT & SIGTERM to be raised at most once, so the executor can
gracefully handle the stopping.
* SIGINT & SIGTERM will stop the execution request message consumption,
but not agent execution.
* Executor process will only stop if all the in-flight agent executions
are completed or terminated.
* Avoid retrying the agent stop command on AgentExecutorBlock on
timeout.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
- [x] Run agent, send SIGTERM to the executor pod, execution should not
be interrupted.
- [x] Run agent, send SIGKILL to the executor pod, execution should be
transferred to another pod.
2025-08-06 05:55:30 +07:00
Zamil Majdy
e5d3ebac08 feat(backend): Make Graph & Node Execution Stats Update Durable (#10529)
Graph and Node execution can fail due to so many reasons, sometimes this
messes up the stats tracking, giving an inaccurate result. The scope of
this PR is to minimize such issues.

### Changes 🏗️

* Catch BaseException on time_measured decorator to catch
asyncio.CancelledError
* Make sure update node & graph stats are executed on cancellation &
exception.
* Protect graph execution stats update under the thread lock to avoid
race condition.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [x] Existing automated tests.

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-08-04 21:33:52 +07:00
Zamil Majdy
69d873debc fix(backend): improve executor reliability and error handling (#10526)
This PR improves the reliability of the executor system by addressing
several race conditions and improving error handling throughout the
execution pipeline.

### Changes 🏗️

- **Consolidated exception handling**: Now using `BaseException` to
properly catch all types of interruptions including `CancelledError` and
`SystemExit`
- **Atomic stats updates**: Moved node execution stats updates to be
atomic with graph stats updates to prevent race conditions
- **Improved cleanup handling**: Added proper timeout handling (3600s)
for stuck executions during cleanup
- **Fixed concurrent update race conditions**: Node execution updates
are now properly synchronized with graph execution updates
- **Better error propagation**: Improved error type preservation and
status management throughout the execution chain
- **Graph resumption support**: Added proper handling for resuming
terminated and failed graph executions
- **Removed deprecated methods**: Removed `update_node_execution_stats`
in favor of atomic updates

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Execute a graph with multiple nodes and verify stats are updated
correctly
  - [x] Cancel a running graph execution and verify proper cleanup
  - [x] Simulate node failures and verify error propagation
  - [x] Test graph resumption after termination/failure
  - [x] Verify no race conditions in concurrent node execution updates

#### For configuration changes:
- [x] `.env.example` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-08-02 17:41:59 +07:00
Zamil Majdy
4d05a27388 feat(backend): Avoid executor over-consuming messages when it's fully occupied (#10449)
When we run multiple instances of the executor, some of the executors
can oversubscribe the messages and end up queuing the agent execution
request instead of letting another executor handle the job. This change
solves the problem.

### Changes 🏗️

* Reject execution request when the executor is full.
* Improve `active_graph_runs` tracking for better horizontal scaling
heuristics.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [x] Manual graph execution & CI
2025-07-28 23:08:27 +00:00
Zamil Majdy
f4a179e5d6 feat(backend): Add thread safety to NodeExecutionProgress output handling (#10415)
## Summary
- Add thread safety to NodeExecutionProgress class to prevent race
conditions between graph executor and node executor threads
- Fixes potential data corruption and lost outputs during concurrent
access to shared output lists
- Uses single global lock per node for minimal performance impact
- Instead of blocking the node evaluation before adding another node
evaluation, we move on to the next node, in case another node completes
it.

## Changes
- Added `threading.Lock` to NodeExecutionProgress class
- Protected `add_output()` calls from node executor thread with lock
- Protected `pop_output()` calls from graph executor thread with lock
- Protected `_pop_done_task()` output checks with lock

## Problem Solved
The `NodeExecutionProgress.output` dictionary was being accessed
concurrently:
- `add_output()` called from node executor thread (asyncio thread) 
- `pop_output()` called from graph executor thread (main thread)
- Python lists are not thread-safe for concurrent append/pop operations
- This could cause data corruption, index errors, and lost outputs

## Test Plan
- [x] Existing executor tests pass
- [x] No performance regression (operations are microsecond-level)
- [x] Thread safety verified through code analysis

## Technical Details
- Single `threading.Lock()` per NodeExecutionProgress instance (~64
bytes)
- Lock acquisition time (~100-200ns) is minimal compared to list
operations
- Maintains order guarantees for same node_execution_id processing
- No GIL contention issues as operations are very brief

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-07-22 01:11:46 +00:00
Zamil Majdy
b6d6b865de feat(backend): avoid using DatabaseManager when direct query is possible from the API layer (#10403)
This PR reduces the dependency of the API layer on the database manager
service by avoiding using DatabaseManager for credentials fetch when a
direct query is possible from the API layer

### Changes 🏗️

* If Prisma is available, use the direct query.
* Otherwise, utilize the database manager.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [x] Run blocks with credentials like AiTextGeneratorBlock
2025-07-18 23:18:52 +00:00
Zamil Majdy
08b05621c1 feat(block;backend): Truncate execution update payload on large data & Improve ReadSpreadsheetBlock performance (#10395)
### Changes 🏗️

This PR introduces several key improvements to message handling, block
functionality, and execution reliability:

- **Renamed CSV block to Spreadsheet block** with enhanced CSV/Excel
processing capabilities
- **Added message size limiting and truncation** for Redis communication
to prevent connection issues
- **Optimized FileReadBlock** to yield content chunks instead of
duplicated outputs for better performance
- **Improved execution termination handling** with better timeout
management and event publishing
- **Enhanced continuous retry decorator** with async function support
- **Implemented payload truncation** to prevent Redis connection issues
from oversized messages

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Verified backend starts without errors
  - [x] Confirmed message truncation works for large payloads
  - [x] Tested spreadsheet block functionality with CSV and Excel files
  - [x] Validated execution termination improvements
  - [x] Checked FileReadBlock chunk processing

#### For configuration changes:
- [x] `.env.example` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-07-17 16:04:33 +00:00
Zamil Majdy
4ffb99bfb0 feat(backend): Add block error rate monitoring and Discord alerts (#10332)
## Summary

This PR adds a simple block error rate monitoring system that runs every
24 hours (configurable) and sends Discord alerts when blocks exceed the
error rate threshold.

## Changes Made

**Modified Files:**
- `backend/executor/scheduler.py` - Added `report_block_error_rates`
function and scheduled job
- `backend/util/settings.py` - Added configuration options
- `backend/.env.example` - Added environment variable examples
- Refactor scheduled job logics in scheduler.py into seperate files

## Configuration

```bash
# Block Error Rate Monitoring
BLOCK_ERROR_RATE_THRESHOLD=0.5  # 50% error rate threshold
BLOCK_ERROR_RATE_CHECK_INTERVAL_SECS=86400  # 24 hours
```

## How It Works

1. **Scheduled Job**: Runs every 24 hours (configurable via
`BLOCK_ERROR_RATE_CHECK_INTERVAL_SECS`)
2. **Error Rate Calculation**: Queries last 24 hours of node executions
and calculates error rates per block
3. **Threshold Check**: Alerts on blocks with ≥50% error rate
(configurable via `BLOCK_ERROR_RATE_THRESHOLD`)
4. **Discord Alert**: Sends alert to Discord using existing
`discord_system_alert` function
5. **Manual Execution**: Available via
`execute_report_block_error_rates()` scheduler client method

## Alert Format

```
Block Error Rate Alert:
🚨 Block 'DeprecatedGPT3Block' has 75.0% error rate (75/100) in the last 24 hours
🚨 Block 'BrokenImageBlock' has 60.0% error rate (30/50) in the last 24 hours
```

## Testing

Can be tested manually via:
```python
from backend.executor.scheduler import SchedulerClient
client = SchedulerClient()
result = client.execute_report_block_error_rates()
```

## Implementation Notes

- Follows the same pattern as `report_late_executions` function
- Only checks blocks with ≥10 executions to avoid noise
- Uses existing Discord notification infrastructure
- Configurable threshold and check interval
- Proper error handling and logging

## Test plan

- [x] Verify configuration loads correctly
- [x] Test error rate calculation with existing database
- [x] Confirm Discord integration works
- [x] Test manual execution via scheduler client
- [x] Verify scheduled job runs correctly

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude AI <claude@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-07-10 21:56:58 +00:00
Zamil Majdy
f1cc2afbda feat(backend): improve stop graph execution reliability (#10293)
## Summary
- Enhanced graph execution cancellation and cleanup mechanisms
- Improved error handling and logging for graph execution lifecycle
- Added timeout handling for graph termination with proper status
updates
- Exposed a new API for stopping graph based on only graph_id or user_id
- Refactored logging metadata structure for better error tracking

## Key Changes
### Backend
- **Graph Execution Management**: Enhanced `stop_graph_execution` with
timeout-based waiting and proper status transitions
- **Execution Cleanup**: Added proper cancellation waiting with timeout
handling in executor manager
- **Logging Improvements**: Centralized `LogMetadata` class and improved
error logging consistency
- **API Enhancements**: Added bulk graph execution stopping
functionality
- **Error Handling**: Better exception handling and status management
for failed/cancelled executions

### Frontend
- **Status Safety**: Added null safety checks for status chips to
prevent runtime errors
- **Execution Control**: Simplified stop execution request handling

## Test Plan
- [x] Verify graph execution can be properly stopped and reaches
terminal state
- [x] Test timeout scenarios for stuck executions
- [x] Validate proper cleanup of running node executions when graph is
cancelled
- [x] Check frontend status chips handle undefined statuses gracefully
- [x] Test bulk execution stopping functionality

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-07-02 21:21:26 +00:00
Reinier van der Leer
efa4b6d2a0 feat(platform/library): Triggered-agent support (#10167)
This pull request adds support for setting up (webhook-)triggered agents
in the Library. It contains changes throughout the entire stack to make
everything work in the various phases of a triggered agent's lifecycle:
setup, execution, updates, deletion.

Setting up agents with webhook triggers was previously only possible in
the Builder, limiting their use to the agent's creator only. To make it
work in the Library, this change uses the previously introduced
`AgentPreset` to store information on, instead of on the graph's nodes
to which only a graph's creator has access.

- Initial ticket: #10111
- Builds on #9786

![screenshot of trigger setup screen in the
library](https://github.com/user-attachments/assets/525b4e94-d799-4328-b5fa-f05d6a3a206a)
![screenshot of trigger edit screen in the
library](https://github.com/user-attachments/assets/e67eb0bc-df70-4a75-bf95-1c31263ef0c9)

### Changes 🏗️

Frontend:
- Amend the Library's `AgentRunDraftView` to handle creating and editing
Presets
- Add `hideIfSingleCredentialAvailable` parameter to `CredentialsInput`
  - Add multi-select support to `TypeBasedInput`
- Add Presets section to `AgentRunsSelectorList`
  - Amend `AgentRunSummaryCard` for use for Presets
- Add `AgentStatusChip` to display general agent status (for now: Active
/ Inactive / Error)
- Add Preset loading logic and create/update/delete handlers logic to
`AgentRunsPage`
- Rename `IconClose` to `IconCross`

API:
- Add `LibraryAgent` properties `has_external_trigger`,
`trigger_setup_info`, `credentials_input_schema`
- Add `POST /library/agents/{library_agent_id}/setup_trigger` endpoint
- Remove redundant parameters from `POST
/library/presets/{preset_id}/execute` endpoint

Backend:
- Add `POST /library/agents/{library_agent_id}/setup_trigger` endpoint
- Extract non-node-related logic from `on_node_activate` into
`setup_webhook_for_block`
- Add webhook-related logic to `update_preset` and `delete_preset`
endpoints
- Amend webhook infrastructure to work with AgentPresets
  - Add preset trigger support to webhook ingress endpoint
- Amend executor stack to work with passed-in node input
(`nodes_input_masks`, generalized from `node_credentials_input_map`)
    - Amend graph validation to work with passed-in node input
  - Add `AgentPreset`->`IntegrationWebhook` relation
    - Add `WebhookWithRelations` model
- Change behavior of `BaseWebhooksManager.get_manual_webhook(..)` to
avoid unnecessary changes of the webhook URL: ignore `events` to find
matching webhook, and update `events` if necessary.
- Fix & improve `AgentPreset` API, models, and DB logic
  - Add `isDeleted` filter to get/list queries
  - Add `user_id` attribute to `LibraryAgentPreset` model
  - Add separate `credentials` property to `LibraryAgentPreset` model
- Fix `library_db.update_preset(..)` replacement of existing
`InputPresets`
- Make `library_db.update_preset(..)` more usage-friendly with separate
parameters for updateable properties
- Add `user_id` checks to various DB functions
- Fix error handling in various endpoints
- Fix cache race condition on `load_webhook_managers()`

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - Test existing functionality
- [x] Auto-setup and -teardown of webhooks on save in the builder still
works
    - [x] Running an agent normally from the Library still works
  - Test new functionality
    - [x] Setting up a trigger in the Library
    - [x] Updating a trigger in the Library
    - [x] Disabling and re-enabling a trigger in the Library
    - [x] Deleting a trigger in the Library
- [x] Triggers set up in the Library result in a new run when the
webhook receives a payload
2025-06-24 20:28:20 +00:00
Zamil Majdy
e701f41e66 feat(blocks): Enabling auto type conversion on block input schema mismatch for nested input (#10203)
Since auto conversion is applied before merging nested input in the
block, it breaks the auto conversion break.

### Changes 🏗️

* Enabling auto-type conversion on block input schema mismatch for
nested input
* Add batching feature for `CreateListBlock`
* Increase default max_token size for LLM call

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
- [x] Run `AIStructuredResponseGeneratorBlock` with non-string prompt
value (should be auto-converted).
2025-06-21 03:56:53 +07:00
Zamil Majdy
1e0a3d3c1b feat(backend): Add request retry on block execution and RPC (#10183)
Request on block execution can be throttled, and requests between
services can sometimes break. The scope of this PR is to add an
appropriate retry on those.

### Changes 🏗️

* Block request retry: Retry on throttled status code only (504, 429,
etc).
* RPC request retry: Retry connection issues (ConnectError, Timeout,
etc).
* Truncate logging on executor/utils.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [x] Manual graph execution
2025-06-17 21:03:46 +00:00
Zamil Majdy
97e72cb485 feat(backend): Make execution engine async-first (#10138)
This change introduced async execution for blocks and the execution
engine. Paralellism will be achieved through a single process
asynchronous execution instead of process concurrency.

### Changes 🏗️

* Support async execution for the graph executor
* Removed process creation for node execution
* Update all blocks to support async executions

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [x] Manual graph executions, tested many of the impacted blocks.
2025-06-17 09:38:24 +00:00
Zamil Majdy
c109b676b8 fix(block): Invalid block input error on falsy non-null value 2025-06-13 00:39:55 -07:00
Zamil Majdy
210d457ecd feat(executor): Improve execution ordering to allow depth-first execution (#10142)
Allowing depth-first execution will unlock faster processing latency and
a better sense of progress.

<img width="950" alt="image"
src="https://github.com/user-attachments/assets/e2a0e11a-8bc5-4a65-a10d-b5b6c6383354"
/>


### Changes 🏗️

* Prioritize adding a new execution over processing execution output
* Make sure to enqueue each node once when processing output instead of
draining a single node and move on.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [x] Run company follower count finder agent.

---------

Co-authored-by: Swifty <craigswift13@gmail.com>
2025-06-10 12:41:31 +00:00
Zamil Majdy
2647417e9f feat(executor;frontend): Move output processing step from node executor to graph executor & simplify input beads calculation (#10066)
**Goal**: Allow parallel runs within a single node. Currently, we
prevent this to avoid unexpected ordering of the execution.

### Changes 🏗️

#### Executor changes

We decoupled the node execution output processing, which is responsible
for deciding the next executions from the node executor code.

Currently, `execute_node` does two big things:
* Runs the block’s execute(...) (which yields outputs).
* immediately enqueues the next nodes based on those outputs.

This PR makes:
* execute_node(node_exec) -> stream of (output_name, data). That purely
runs the block and yields each output as soon as it’s available.
* Move _enqueue_next_nodes into the graph executor. So the next
execution is handled serially by the graph executor to avoid concurrency
issues.

#### UI changes

The change on the executor also fixes the behavior of the execution
update to the UI We will report the execution output to the UI as soon
as it is available, not when the node execution is fully completed.
This, however, broke the bread calculation logic that assumes each
execution update will never overlap. So the change in this PR makes the
bead calculation take the overlap / duplicated execution update into
account, and simplify the overall calculation logic.


### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [x] Execute this agent and observe its concurrency ordering
  
<img width="1424" alt="image"
src="https://github.com/user-attachments/assets/0fe8259f-9091-4ecc-b824-ce8e8819c2d2"
/>
2025-06-05 16:10:50 +00:00
Zamil Majdy
4b70e778d2 feat(backend): Add nested dynamic pin-name support (#10082)
Suppose we have pint with list[list[int]] type, and we want directly
insert the a new value inside the first index of the first list e.g:
list[0][0] = X through a dynamic pin, this will be translated into
list_$_0_$_0, and the system does not currently support this.

### Changes 🏗️

Add support for nested dynamic pins for list, object, and dict.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [x] lots of unit tests
- [x] Tried inserting the value directly on the `value` nested field on
Google Sheets Write block.
<img width="371" alt="image"
src="https://github.com/user-attachments/assets/0a5e7213-b0e0-4fce-9e89-b39f7a583582"
/>
2025-06-04 16:32:32 +00:00
Reinier van der Leer
0726a00fb7 fix(backend): Include sub-graphs in graph-level credentials support (#9862)
The Library Agent credentials UX (#9789) currently doesn't work for
sub-graphs.

### Changes 🏗️

- Include sub-graphs in generating `Graph.credentials_input_schema`
- Propagate `node_credentials_input_map` into `AgentExecutionBlock`
executions
- Fix: also apply `node_credentials_input_map` in `_enqueue_next_nodes`

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - Import a graph with sub-graphs that need credentials
  - Run this agent from the Library
  - [x] -> Should work
2025-05-07 17:28:39 +00:00
Zamil Majdy
475c5a5cc3 fix(backend): Avoid executing any agent with zero balance (#9901)
### Changes 🏗️

* Avoid executing any agent with a zero balance.
* Make node execution count global across agents for a single user.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
- [x] Run agents by tweaking the `execution_cost_count_threshold` &
`execution_cost_per_threshold` values.
2025-05-01 15:11:38 +00:00
Zamil Majdy
86d5cfe60b feat(backend): Support flexible RPC client (#9842)
Using sync code in the async route often introduces a blocking
event-loop code that impacts stability.

The current RPC system only provides a synchronous client to call the
service endpoints.
The scope of this PR is to provide an entirely decoupled signature
between client and server, allowing the client can mix & match async &
sync options on the client code while not changing the async/sync nature
of the server.

### Changes 🏗️

* Add support for flexible async/sync RPC client.
* Migrate scheduler client to all-async client.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Scheduler route test.
  - [x] Modified service_test.py
  - [x] Run normal agent executions
2025-05-01 04:38:06 +00:00
Nicholas Tindle
04c4340ee3 feat(frontend,backend): user spending admin dashboard (#9751)
<!-- Clearly explain the need for these changes: -->
We need a way to refund people who spend money on agents wihout making
manual db actions

### Changes 🏗️
- Adds a bunch for refunding users
- Adds reasons and admin id for actions
- Add admin to db manager
- Add UI for this for the admin panel
- Clean up pagination controls
<!-- Concisely describe all of the changes made in this pull request:
-->

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [x] Test by importing dev db as baseline
- [x] Add transactions on top for "refund", and make sure all existing
transactions work

---------

Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
2025-04-29 17:39:25 +00:00
Zamil Majdy
c783f64b33 fix(backend): Handle add execution API request failure (#9838)
There are cases where the publishing agent execution is failing, making
the agent execution appear to be stuck in a queue, but the execution has
never been in a queue in the first place.

### Changes 🏗️

On publishing failure, we set the graph & starting node execution status
to FAILED and let the UI bubble up the error so the user can try again.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Normal add execution flow
2025-04-18 18:35:43 +00:00
Reinier van der Leer
417d7732af feat(platform/library): Add credentials UX on /library/agents/[id] (#9789)
- Resolves #9771
- ... in a non-persistent way, so it won't work for webhook-triggered
agents
    For webhooks: #9541

### Changes 🏗️

Frontend:
- Add credentials inputs in Library "New run" screen (based on
`graph.credentials_input_schema`)
- Refactor `CredentialsInput` and `useCredentials` to not rely on XYFlow
context

- Unsplit lists of saved credentials in `CredentialsProvider` state

- Move logic that was being executed at component render to `useEffect`
hooks in `CredentialsInput`

Backend:
- Implement logic to aggregate credentials input requirements to one per
provider per graph
- Add `BaseGraph.credentials_input_schema` (JSON schema) computed field
    Underlying added logic:
- `BaseGraph._credentials_input_schema` - makes a `BlockSchema` from a
graph's aggregated credentials inputs
- `BaseGraph.aggregate_credentials_inputs()` - aggregates a graph's
nodes' credentials inputs using `CredentialsFieldInfo.combine(..)`
- `BlockSchema.get_credentials_fields_info() -> dict[str,
CredentialsFieldInfo]`
- `CredentialsFieldInfo` model (created from
`_CredentialsFieldSchemaExtra`)

- Implement logic to inject explicitly passed credentials into graph
execution
  - Add `credentials_inputs` parameter to `execute_graph` endpoint
- Add `graph_credentials_input` parameter to
`.executor.utils.add_graph_execution(..)`
  - Implement `.executor.utils.make_node_credentials_input_map(..)`
  - Amend `.executor.utils.construct_node_execution_input`
  - Add `GraphExecutionEntry.node_credentials_input_map` attribute
  - Amend validation to allow injecting credentials
    - Amend `GraphModel._validate_graph(..)`
    - Amend `.executor.utils._validate_node_input_credentials`
- Add `node_credentials_map` parameter to
`ExecutionManager.add_execution(..)`
    - Amend execution validation to handle side-loaded credentials
    - Add `GraphExecutionEntry.node_execution_map` attribute
- Add mechanism to inject passed credentials into node execution data
- Add credentials injection mechanism to node execution queueing logic
in `Executor._on_graph_execution(..)`

- Replace boilerplate logic in `v1.execute_graph` endpoint with call to
existing `.executor.utils.add_graph_execution(..)`
- Replace calls to `.server.routers.v1.execute_graph` with
`add_graph_execution`

Also:
- Address tech debt in `GraphModel._validate_gaph(..)`
- Fix type checking in `BaseGraph._generate_schema(..)`

#### TODO
- [ ] ~~Make "Run again" work with credentials in
`AgentRunDetailsView`~~
- [ ] Prohibit saving a graph if it has nodes with missing discriminator
value for discriminated credentials inputs

### Checklist 📋

#### For code changes:
- [ ] I have clearly listed my changes in the PR description
- [ ] I have made a test plan
- [ ] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [ ] ...
2025-04-18 14:27:13 +00:00
Zamil Majdy
bb92226f5d feat(backend): Remove RPC service from Agent Executor (#9804)
Currently the execution task is not properly distributed between
executors because we need to send the execution request to the execution
server.

The execution manager now accepts the execution request from the message
queue. Thus, we can remove the synchronous RPC system from this service,
let the system focus on executing the agent, and not spare any process
for the HTTP API interface.

This will also reduce the risk of the execution service being too busy
and not able to accept any add execution requests.

### Changes 🏗️

* Remove the RPC system in Agent Executor
* Allow the cancellation of the execution that is still waiting in the
queue (by avoiding it from being executed).
* Make a unified helper for adding an execution request to the system
and move other execution-related helper functions into
`executor/utils.py`.
* Remove non-db connections (redis / rabbitmq) in Database Manager and
let the client manage this by themselves.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Existing CI, some agent runs
2025-04-11 19:03:47 +00:00
Zamil Majdy
26984a7338 feat(backend): Add capability to charge based on block execution count (#9661)
Blocks that are not defined in the block cost are pretty much free. The
lack of cost control makes it hard to control its quota. The scope of
this change is providing a way to charge any executions based on the
number of block being executed in real-time.

### Changes 🏗️

* Add execution charge logic based on the number of blocks executed,
controlled by these two configurations:
* `execution_cost_count_threshold`: We will charge the execution based
on the multiple of this number.
* `execution_cost_per_threshold`: The amount we are charging on its
threshold multiple.
* Make charging logic on the graph execution logic (as opposed to node
level) so it's being done serially and insufficient fund error is
guaranteed to stop the graph execution.
* Moved cost calculation logic into backend/executor/util.py

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Execute graph with configured threshold & cost and test the
balance being deducted on that.
  - [x] Existing cost calculation is still being done without any issue.
  - [x] Low balance stop the whole graph execution.
2025-03-24 07:26:33 +00:00