Commit Graph

923 Commits

Author SHA1 Message Date
Nicholas Tindle
c1c371bcf3 Add total upcoming execution runs to diagnostics
Backend now calculates and returns the total number of scheduled execution runs in the next hour and 24 hours, not just unique schedules. The frontend displays these new metrics in the diagnostics admin panel. The OpenAPI schema is updated to reflect the new fields.
2025-11-03 19:39:14 -06:00
Nicholas Tindle
6a72440005 Add admin endpoints for bulk stopping and cleanup of executions
Introduces backend and frontend support for stopping all long-running executions and cleaning up all stuck queued executions via new admin endpoints. Updates diagnostics logic to ensure both cancel signals and DB status updates are performed, adds corresponding API routes, and enhances the admin UI to expose these bulk actions. Also updates the sidebar icon for diagnostics.
2025-11-03 19:24:44 -06:00
Nicholas Tindle
1403c8f2de Improve failed executions error extraction and counting
Extract error messages from the stats JSON field in failed executions details. Update the admin diagnostics route to always count the actual number of failed executions within the specified time window, ensuring accurate pagination.
2025-11-03 18:37:01 -06:00
Nicholas Tindle
6068ed3516 Add admin diagnostics for agent schedules
Introduces backend endpoints and models for schedule diagnostics, including orphaned schedule detection, listing, and bulk cleanup. Updates the frontend to display schedule health metrics and a new schedules table with management actions. OpenAPI spec is updated to document the new endpoints and models.
2025-11-03 18:21:27 -06:00
Nicholas Tindle
53a6de9fdb feat(admin): Enhance diagnostics with comprehensive execution monitoring and management
Add extensive diagnostic capabilities for on-call engineers to monitor and manage execution health.

Backend Enhancements:
- Add 18 diagnostic metrics covering failures, orphaned executions, stuck queued, throughput, and queue health
- Implement orphaned execution detection (>24h old, not in executor)
- Add stuck queued detection (QUEUED >1h, never started)
- Add long-running execution detection (RUNNING >24h)
- Monitor both execution and cancel RabbitMQ queues
- Track failure rates (1h, 24h) and execution throughput metrics

New Backend Endpoints (15 total):
- GET /admin/diagnostics/executions/orphaned - List orphaned executions
- GET /admin/diagnostics/executions/stuck-queued - List stuck queued executions
- GET /admin/diagnostics/executions/long-running - List long-running executions
- GET /admin/diagnostics/executions/failed - List failed executions with error messages
- POST /admin/diagnostics/executions/cleanup-all-orphaned - Cleanup all orphaned (operates on entire dataset)
- POST /admin/diagnostics/executions/requeue - Requeue single stuck execution
- POST /admin/diagnostics/executions/requeue-bulk - Requeue selected executions
- POST /admin/diagnostics/executions/requeue-all-stuck - Requeue all stuck queued (operates on entire dataset)

Execution Management:
- Dual-mode stop: Active executions (cancel signals) vs orphaned (direct DB cleanup)
- Intelligent Stop All: Auto-splits active/orphaned, executes in parallel
- Requeue functionality for stuck QUEUED executions with credit cost warnings
- Stop sends cancel signals to RabbitMQ for graceful termination
- Cleanup orphaned updates DB directly without cancel signals
- ALL endpoints operate on entire datasets (not limited to pagination)

Frontend Enhancements:
- 5-tab filtering interface: All, Orphaned, Stuck Queued, Long-Running, Failed
- Clickable alert cards (🟠 🔴 🟡) automatically switch to relevant tabs
- Tab badges show live counts from diagnostics metrics
- Age column displays execution duration (e.g., "245d 12h")
- Orange row highlighting for orphaned executions (>24h old)
- Error message column for failed executions with hover tooltips
- Click-to-copy for execution IDs and user IDs with visual feedback
- Status badge colors match library view (blue=RUNNING, yellow=QUEUED, red=FAILED)

Tab-Specific Actions:
- Stuck Queued: Cleanup All OR Requeue All buttons with cost warnings
- Stuck Queued per-row: 🟠 Cleanup OR 🔵 Requeue buttons
- Orphaned: Cleanup All (operates on ALL orphaned)
- Long-Running: Stop All (sends cancel signals)
- Failed: View-only with error details
- All: Stop All (intelligent split of active/orphaned)

Alert Cards:
- 🟠 Orphaned: Shows count with RUNNING/QUEUED breakdown, click to view
- 🔴 Failed (24h): Shows count with hourly rate, click to view
- 🟡 Long-Running: Shows count with oldest execution age, click to view

Updated Diagnostic Info Card:
- Color-coded explanations for each execution type
- When to cleanup vs requeue vs stop
- Credit cost implications clearly documented
- Queue health thresholds explained

Provides ~70% coverage of on-call guide requirements for troubleshooting execution issues, orphaned database records, and system health monitoring.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-03 16:57:49 -06:00
Nicholas Tindle
cdd501c031 Merge branch 'dev' into claude/admin-user-management-011CULzkwgiPXZYcvCeozofC 2025-11-03 13:03:39 -06:00
Krzysztof Czerwinski
f97e19f418 hotfix: Patch onboarding (#11299)
### Changes 🏗️

- Prevent removing progress of user onboarding tasks by merging arrays
on the backend instead of replacing them
- New endpoint for onboarding reset

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Tasks are not being reset
  - [x] `/onboarding/reset` works
2025-11-01 10:19:55 +01:00
Reinier van der Leer
42b9facd4a hotfix(backend/scheduler): Bump apscheduler to DST-fixed version 3.11.1 (#11294)
- #11273

- Bump `apscheduler` to v3.11.1 which contains a fix for the issue

- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] "It's a rather ugly solution but the test proves that it works."
~the maintainer
  - [x] CI passes
2025-10-31 23:09:28 +01:00
Nicholas Tindle
834617d221 hotfix(backend): Clarify prompt requirements for list generation for our friend claude (#11293) 2025-10-31 12:28:05 -05:00
Lluis Agusti
e6fb649ced Merge 'master' into 'dev' 2025-10-30 20:05:55 +07:00
Zamil Majdy
2f8cdf62ba feat(backend): Standardize error handling with BlockSchemaInput & BlockSchemaOutput base class (#11257)
<!-- Clearly explain the need for these changes: -->

This PR addresses the need for consistent error handling across all
blocks in the AutoGPT platform. Previously, each block had to manually
define an `error` field in their output schema, leading to code
duplication and potential inconsistencies. Some blocks might forget to
include the error field, making error handling unpredictable.

### Changes 🏗️

<!-- Concisely describe all of the changes made in this pull request:
-->

- **Created `BlockSchemaOutput` base class**: New base class that
extends `BlockSchema` with a standardized `error` field
- **Created `BlockSchemaInput` base class**: Added for consistency and
future extensibility
- **Updated 140+ block implementations**: Changed all block `Output`
classes from `class Output(BlockSchema):` to `class
Output(BlockSchemaOutput):`
- **Removed manual error field definitions**: Eliminated hundreds of
duplicate `error: str = SchemaField(...)` definitions
- **Updated type annotations**: Changed `Block[BlockSchema,
BlockSchema]` to `Block[BlockSchemaInput, BlockSchemaOutput]` throughout
the codebase
- **Fixed imports**: Added `BlockSchemaInput` and `BlockSchemaOutput`
imports to all relevant files
- **Maintained backward compatibility**: Updated `EmptySchema` to
inherit from `BlockSchemaOutput`

**Key Benefits:**
- Consistent error handling across all blocks
- Reduced code duplication (removed ~200 lines of repetitive error field
definitions)
- Type safety improvements with distinct input/output schema types
- Blocks can still override error field with more specific descriptions
when needed

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
- [x] Verified `poetry run format` passes (all linting, formatting, and
type checking)
- [x] Tested block instantiation works correctly (MediaDurationBlock,
UnrealTextToSpeechBlock)
- [x] Confirmed error fields are automatically present in all updated
blocks
- [x] Verified block loading system works (successfully loads 353+
blocks)
  - [x] Tested backward compatibility with EmptySchema
- [x] Confirmed blocks can still override error field with custom
descriptions
  - [x] Validated core schema inheritance chain works correctly

#### For configuration changes:

- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)

*Note: No configuration changes were needed for this refactoring.*

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Lluis Agusti <hi@llu.lu>
Co-authored-by: Ubbe <hi@ubbe.dev>
2025-10-30 12:28:08 +00:00
seer-by-sentry[bot]
3dc5208f71 feat(backend): Increase max_field_size in aiohttp requests (#11261)
### Changes 🏗️

- Increased `max_field_size` in `aiohttp.ClientSession` to 16KB to
handle servers with large headers (e.g., long CSP headers).

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [x]  Add unit test that checks it can now parse headers over 8k size

---------

Co-authored-by: seer-by-sentry[bot] <157164994+seer-by-sentry[bot]@users.noreply.github.com>
Co-authored-by: Swifty <craigswift13@gmail.com>
Co-authored-by: Ubbe <hi@ubbe.dev>
2025-10-30 10:41:22 +00:00
seer-by-sentry[bot]
4140331731 fix(blocks/llm): Validate LLM summary responses are strings (#11275)
### Changes 🏗️

- Added validation to ensure that the `summary` and `final_summary`
returned by the LLM are strings.
- Raises a `ValueError` if the LLM returns a list or other non-string
type, providing a descriptive error message to aid debugging.

Fixes
[AUTOGPT-SERVER-6M4](https://sentry.io/organizations/significant-gravitas/issues/6978480131/).
The issue was that: LLM returned list of strings instead of single
string summary, causing `_combine_summaries` to fail on `join`.

This fix was generated by Seer in Sentry, triggered by Craig Swift. 👁️
Run ID: 2230933

Not quite right? [Click here to continue debugging with
Seer.](https://sentry.io/organizations/significant-gravitas/issues/6978480131/?seerDrawer=true)

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
- [x] Added a unit test to verify that a ValueError is raised when the
LLM returns a list instead of a string for summary or final_summary.

---------

Co-authored-by: seer-by-sentry[bot] <157164994+seer-by-sentry[bot]@users.noreply.github.com>
Co-authored-by: Swifty <craigswift13@gmail.com>
2025-10-30 09:52:50 +00:00
Swifty
594b1adcf7 fix(frontend): Fix marketplace sort by (#11284)
Marketplace sort by functionality was not working on the frontend. This
PR fixes it

### Changes 🏗️

- Add type hints for sort by
- Fix marketplace sort by drop downs


### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [x] tested locally
2025-10-30 08:46:11 +00:00
Swifty
a1ac109356 fix(backend): Further enhance sanitization of SQL raw queries (#11279)
### Changes 🏗️

Enhanced SQL query security in the store search functionality by
implementing proper parameterization to prevent SQL injection
vulnerabilities.

**Security Improvements:**
- Replaced string interpolation with PostgreSQL positional parameters
(`$1`, `$2`, etc.) for all user inputs
- Added ORDER BY whitelist validation to prevent injection via
`sorted_by` parameter
- Parameterized search term, creators array, category, and pagination
values
- Fixed variable naming conflict (`sql_where_clause` vs `where_clause`)

**Testing:**
- Added 4 comprehensive tests validating SQL injection prevention across
different attack vectors
- Tests verify that malicious input in search queries, filters, sorting,
and categories are safely handled
- All 10 tests in db_test.py pass successfully

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] All existing tests pass (10/10 tests passing)
  - [x] New security tests validate SQL injection prevention
  - [x] Verified parameterized queries handle malicious input safely
  - [x] Code formatting passes (`poetry run format`)

#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)

*Note: No configuration changes required for this security fix*
2025-10-29 15:21:27 +00:00
Zamil Majdy
5506d59da1 fix(backend/executor): make graph execution permission check version-agnostic (#11283)
## Summary
Fix critical issue where pre-execution permission validation broke
execution of graphs that reference older versions of sub-graphs.

## Problem
The `validate_graph_execution_permissions` function was checking for the
specific version of a graph in the user's library. This caused failures
when:
1. A parent graph references an older version of a sub-graph  
2. The user updates the sub-graph to a newer version
3. The older version is no longer in their library
4. Execution of the parent graph fails with `GraphNotInLibraryError`

## Root Cause
In `backend/executor/utils.py` line 523, the function was checking for
the exact version, but sub-graphs legitimately reference older versions
that may no longer be in the library.

## Solution

### 1. Remove Version-Specific Check (backend/executor/utils.py)
- Remove `graph_version=graph.version` parameter from validation call
- Add explanatory comment about version-agnostic behavior
- Now only checks that the graph ID exists in user's library (any
version)

### 2. Enhance Documentation (backend/data/graph.py)  
- Update function docstring to explain version-agnostic behavior
- Document that `None` (now default) allows execution of any version
- Clarify this is important for sub-graph version compatibility

## Technical Details
The `validate_graph_execution_permissions` function was already designed
to handle version-agnostic checks when `graph_version=None`. By omitting
the version parameter, we skip the version check and only verify:
- Graph exists in user's library  
- Graph is not deleted/archived
- User has execution permissions

## Impact
-  Parent graphs can execute even when they reference older sub-graph
versions
-  Sub-graph updates don't break existing parent graphs  
-  Maintains security: still checks library membership and permissions
-  No breaking changes: version-specific validation still available
when needed

## Example Scenario Fixed
1. User creates parent graph that uses sub-graph v1
2. User updates sub-graph to v2 (v1 removed from library)  
3. Parent graph still references sub-graph v1
4. **Before**: Execution fails with `GraphNotInLibraryError`
5. **After**: Execution succeeds (version-agnostic permission check)

## Testing
- [x] Code formatting and linting passes
- [x] Type checking passes
- [x] No breaking changes to existing functionality
- [x] Security still maintained through library membership checks

## Files Changed
- `backend/executor/utils.py`: Remove version-specific permission check
- `backend/data/graph.py`: Enhanced documentation for version-agnostic
behavior

Closes #[issue-number-if-applicable]

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-29 14:13:23 +00:00
Zamil Majdy
4922f88851 feat(backend/executor): Implement cascading stop for nested graph executions (#11277)
## Summary
Fixes critical issue where child executions spawned by
`AgentExecutorBlock` continue running after parent execution is stopped.
Implements parent-child execution tracking and recursive cascading stop
logic to ensure entire execution trees are terminated together.

## Background
When a parent graph execution containing `AgentExecutorBlock` nodes is
stopped, only the parent was terminated. Child executions continued
running, leading to:
-  Orphaned child executions consuming credits
-  No user control over execution trees  
-  Race conditions where children start after parent stops
-  Resource leaks from abandoned executions

## Core Changes

### 1. Database Schema (`schema.prisma` + migration)
```sql
-- Add nullable parent tracking field
ALTER TABLE "AgentGraphExecution" ADD COLUMN "parentGraphExecutionId" TEXT;

-- Add self-referential foreign key with graceful deletion
ALTER TABLE "AgentGraphExecution" ADD CONSTRAINT "AgentGraphExecution_parentGraphExecutionId_fkey" 
  FOREIGN KEY ("parentGraphExecutionId") REFERENCES "AgentGraphExecution"("id") 
  ON DELETE SET NULL ON UPDATE CASCADE;

-- Add index for efficient child queries
CREATE INDEX "AgentGraphExecution_parentGraphExecutionId_idx" 
  ON "AgentGraphExecution"("parentGraphExecutionId");
```

### 2. Parent ID Propagation (`backend/blocks/agent.py`)
```python
# Extract current graph execution ID and pass as parent to child
execution = add_graph_execution(
    # ... other params
    parent_graph_exec_id=graph_exec_id,  # NEW: Track parent relationship
)
```

### 3. Data Layer (`backend/data/execution.py`)
```python
async def get_child_graph_executions(parent_exec_id: str) -> list[GraphExecution]:
    """Get all child executions of a parent execution."""
    children = await AgentGraphExecution.prisma().find_many(
        where={"parentGraphExecutionId": parent_exec_id, "isDeleted": False}
    )
    return [GraphExecution.from_db(child) for child in children]
```

### 4. Cascading Stop Logic (`backend/executor/utils.py`)
```python
async def stop_graph_execution(
    user_id: str,
    graph_exec_id: str,
    wait_timeout: float = 15.0,
    cascade: bool = True,  # NEW parameter
):
    # 1. Find all child executions
    if cascade:
        children = await _get_child_executions(graph_exec_id)
        
        # 2. Stop all children recursively in parallel
        if children:
            await asyncio.gather(
                *[stop_graph_execution(user_id, child.id, wait_timeout, True) 
                  for child in children],
                return_exceptions=True,  # Don't fail parent if child fails
            )
    
    # 3. Stop the parent execution
    # ... existing stop logic
```

### 5. Race Condition Prevention (`backend/executor/manager.py`)
```python
# Before executing queued child, check if parent was terminated
if parent_graph_exec_id:
    parent_exec = get_db_client().get_graph_execution_meta(parent_graph_exec_id, user_id)
    if parent_exec and parent_exec.status == ExecutionStatus.TERMINATED:
        # Skip execution, mark child as terminated
        get_db_client().update_graph_execution_stats(
            graph_exec_id=graph_exec_id,
            status=ExecutionStatus.TERMINATED,
        )
        return  # Don't start orphaned child
```

## How It Works

### Before (Broken)
```
User stops parent execution
    ↓
Parent terminates ✓
    ↓
Child executions keep running ✗
    ↓
User cannot stop children ✗
```

### After (Fixed)
```
User stops parent execution
    ↓
Query database for all children
    ↓
Recursively stop all children in parallel
    ↓
Wait for children to terminate
    ↓
Stop parent execution
    ↓
All executions in tree stopped ✓
```

### Race Prevention
```
Child in QUEUED status
    ↓
Parent stopped
    ↓
Child picked up by executor
    ↓
Pre-flight check: parent TERMINATED?
    ↓
Yes → Skip execution, mark child TERMINATED
    ↓
Child never runs ✓
```

## Edge Cases Handled
 **Deep nesting** - Recursive cascading handles multi-level trees  
 **Queued children** - Pre-flight check prevents execution  
 **Race conditions** - Child spawned during stop operation  
 **Partial failures** - `return_exceptions=True` continues on error  
 **Multiple children** - Parallel stop via `asyncio.gather()`  
 **No parent** - Backward compatible (nullable field)  
 **Already completed** - Existing status check handles it  

## Performance Impact
- **Stop operation**: O(depth) with parallel execution vs O(1) before
- **Memory**: +36 bytes per execution (one UUID reference)
- **Database**: +1 query per tree level, indexed for efficiency

## API Changes (Backward Compatible)

### `stop_graph_execution()` - New Optional Parameter
```python
# Before
async def stop_graph_execution(user_id: str, graph_exec_id: str, wait_timeout: float = 15.0)

# After  
async def stop_graph_execution(user_id: str, graph_exec_id: str, wait_timeout: float = 15.0, cascade: bool = True)
```
**Default `cascade=True`** means existing callers get the new behavior
automatically.

### `add_graph_execution()` - New Optional Parameter
```python
async def add_graph_execution(..., parent_graph_exec_id: Optional[str] = None)
```

## Security & Safety
-  **User verification** - Users can only stop their own executions
(parent + children)
-  **No cycles** - Self-referential FK prevents infinite loops  
-  **Graceful degradation** - Errors in child stops don't block parent
stop
-  **Rate limits** - Existing execution rate limits still apply

## Testing Checklist

### Database Migration
- [x] Migration runs successfully  
- [x] Prisma client regenerates without errors
- [x] Existing tests pass

### Core Functionality  
- [ ] Manual test: Stop parent with running child → child stops
- [ ] Manual test: Stop parent with queued child → child never starts
- [ ] Unit test: Cascading stop with multiple children
- [ ] Unit test: Deep nesting (3+ levels)
- [ ] Integration test: Race condition prevention

## Breaking Changes
**None** - All changes are backward compatible with existing code.

## Rollback Plan
If issues arise:
1. **Code rollback**: Revert PR, redeploy
2. **Database rollback**: Drop column and constraints (non-destructive)

---

**Note**: This branch contains additional unrelated changes from merging
with `dev`. The core cascading stop feature involves only:
- `schema.prisma` + migration
- `backend/data/execution.py` 
- `backend/executor/utils.py`
- `backend/blocks/agent.py`
- `backend/executor/manager.py`

All other file changes are from dev branch updates and not part of this
feature.

🤖 Generated with [Claude Code](https://claude.ai/code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Nested graph executions: parent-child tracking and retrieval of child
executions

* **Improvements**
* Cascading stop: stopping a parent optionally terminates child
executions
  * Parent execution IDs propagated through runs and surfaced in logs
  * Per-user/graph concurrent execution limits enforced

* **Bug Fixes**
* Skip enqueuing children if parent is terminated; robust handling when
parent-status checks fail

* **Tests**
  * Updated tests to cover parent linkage in graph creation
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-29 11:11:22 +00:00
Zamil Majdy
5fb142c656 fix(backend/executor): ensure cluster lock release on all execution submission failures (#11281)
## Root Cause
During rolling deployment, execution
`97058338-052a-4528-87f4-98c88416bb7f` got stuck in QUEUED state
because:

1. Pod acquired cluster lock successfully during shutdown  
2. Subsequent setup operations failed (ThreadPoolExecutor shutdown,
resource exhaustion, etc.)
3. **No error handling existed** around the critical section after lock
acquisition
4. Cluster lock remained stuck in Redis for 5 minutes (TTL timeout)
5. Other pods couldn't acquire the lock, leaving execution permanently
queued

## The Fix

### Problem: Critical Section Not Protected
The original code had no error handling for the entire critical section
after successful lock acquisition:
```python
# Original code - no error handling after lock acquired
current_owner = cluster_lock.try_acquire()
if current_owner != self.executor_id:
    return  # didn't get lock
    
# CRITICAL SECTION - any failure here leaves lock stuck
self._execution_locks[graph_exec_id] = cluster_lock  # Could fail: memory
logger.info("Acquired cluster lock...")              # Could fail: logging  
cancel_event = threading.Event()                     # Could fail: resources
future = self.executor.submit(...)                   # Could fail: shutdown
self.active_graph_runs[...] = (future, cancel_event) # Could fail: memory
```

### Solution: Wrap Entire Critical Section  
Protect ALL operations after successful lock acquisition:
```python
# Fixed code - comprehensive error handling
current_owner = cluster_lock.try_acquire()
if current_owner != self.executor_id:
    return  # didn't get lock

# Wrap ENTIRE critical section after successful acquisition
try:
    self._execution_locks[graph_exec_id] = cluster_lock
    logger.info("Acquired cluster lock...")
    cancel_event = threading.Event()
    future = self.executor.submit(...)
    self.active_graph_runs[...] = (future, cancel_event)
except Exception as e:
    # Release cluster lock before requeue
    cluster_lock.release()
    del self._execution_locks[graph_exec_id] 
    _ack_message(reject=True, requeue=True)
    return
```

### Why This Comprehensive Approach Works
- **Complete protection**: Any failure in critical section → lock
released
- **Proper cleanup order**: Lock released → message requeued → another
pod can try
- **Uses existing infrastructure**: Leverages established
`_ack_message()` requeue logic
- **Handles all scenarios**: ThreadPoolExecutor shutdown, resource
exhaustion, memory issues, logging failures

## Protected Failure Scenarios
1. **Memory exhaustion**: `_execution_locks` assignment or
`active_graph_runs` assignment
2. **Resource exhaustion**: `threading.Event()` creation fails
3. **ThreadPoolExecutor shutdown**: `executor.submit()` with "cannot
schedule new futures after shutdown"
4. **Logging system failures**: `logger.info()` calls fail
5. **Any unexpected exceptions**: Network issues, disk problems, etc.

## Validation
-  All existing tests pass  
-  Maintains exact same success path behavior
-  Comprehensive error handling for all failure points
-  Minimal code change with maximum protection

## Impact
- **Eliminates stuck executions** during pod lifecycle events (rolling
deployments, scaling, crashes)
- **Faster recovery**: Immediate requeue vs 5-minute Redis TTL wait
- **Higher reliability**: Handles ANY failure in the critical section
- **Production-ready**: Comprehensive solution for distributed lock
management

This prevents the exact race condition that caused execution
`97058338-052a-4528-87f4-98c88416bb7f` to be stuck for >300 seconds,
plus many other potential failure scenarios.

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-29 08:56:24 +00:00
Pratyush Singh
e14594ff4a fix: handle oversized notifications by sending summary email (#11119) (#11130)
📨 Fix: Handle Oversized Notification Emails
Summary

This PR adds logic to detect and handle oversized notification emails
exceeding Postmark’s 5 MB limit. Instead of retrying indefinitely, the
system now sends a lightweight summary email with key stats and a
dashboard link.

Changes

Added size check in EmailSender.send_templated()

Sends summary email when payload > ~4.5 MB

Prevents infinite retries and queue clogging

Added logs for oversized detection

Fixes #11119

---------

Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
2025-10-29 00:57:13 +00:00
Zamil Majdy
de70ede54a fix(backend): prevent execution of deleted agents and cleanup orphaned resources (#11243)
## Summary
Fix critical bug where deleted agents continue running scheduled and
triggered executions indefinitely, consuming credits without user
control.

## Problem
When agents are deleted from user libraries, their schedules and webhook
triggers remain active, leading to:
-  Uncontrolled resource consumption 
-  "Unknown agent" executions that charge credits
-  No way for users to stop orphaned executions
-  Accumulation of orphaned database records

## Solution

### 1. Prevention: Library Validation Before Execution
- Add `is_graph_in_user_library()` function with efficient database
queries
- Validate graph accessibility before all executions in
`validate_and_construct_node_execution_input()`
- Use specific `GraphNotInLibraryError` for clear error handling

### 2. Cleanup: Remove Schedules & Webhooks on Deletion
- Enhanced `delete_library_agent()` to clean up associated schedules and
webhooks
- Comprehensive cleanup functions for both scheduled and triggered
executions
- Proper database transaction handling

### 3. Error-Based Cleanup: Handle Existing Orphaned Resources
- Catch `GraphNotInLibraryError` in scheduler and webhook handlers
- Automatically clean up orphaned resources when execution fails
- Graceful degradation without breaking existing workflows

### 4. Migration: Clean Up Historical Orphans
- SQL migration to remove existing orphaned schedules and webhooks
- Performance index for faster cleanup queries
- Proper logging and error handling

## Key Changes

### Core Library Validation
```python
# backend/data/graph.py - Single source of truth
async def is_graph_in_user_library(graph_id: str, user_id: str, graph_version: Optional[int] = None) -> bool:
    where_clause = {"userId": user_id, "agentGraphId": graph_id, "isDeleted": False, "isArchived": False}
    if graph_version is not None:
        where_clause["agentGraphVersion"] = graph_version
    count = await LibraryAgent.prisma().count(where=where_clause)
    return count > 0
```

### Enhanced Agent Deletion
```python
# backend/server/v2/library/db.py
async def delete_library_agent(library_agent_id: str, user_id: str, soft_delete: bool = True) -> None:
    # ... existing deletion logic ...
    await _cleanup_schedules_for_graph(graph_id=graph_id, user_id=user_id)
    await _cleanup_webhooks_for_graph(graph_id=graph_id, user_id=user_id)
```

### Execution Prevention
```python
# backend/executor/utils.py
if not await gdb.is_graph_in_user_library(graph_id=graph_id, user_id=user_id, graph_version=graph.version):
    raise GraphNotInLibraryError(f"Graph #{graph_id} is not accessible in your library")
```

### Error-Based Cleanup
```python
# backend/executor/scheduler.py & backend/server/integrations/router.py
except GraphNotInLibraryError as e:
    logger.warning(f"Execution blocked for deleted/archived graph {graph_id}")
    await _cleanup_orphaned_resources_for_graph(graph_id, user_id)
```

## Technical Implementation

### Database Efficiency
- Use `count()` instead of `find_first()` for faster queries
- Add performance index: `idx_library_agent_user_graph_active`
- Follow existing `prisma.is_connected()` patterns

### Error Handling Hierarchy
- **`GraphNotInLibraryError`**: Specific exception for deleted/archived
graphs
- **`NotAuthorizedError`**: Generic authorization errors (preserved for
user ID mismatches)
- Clear error messages for better debugging

### Code Organization
- Single source of truth for library validation in
`backend/data/graph.py`
- Import from centralized location to avoid duplication
- Top-level imports following codebase conventions

## Testing & Validation

### Functional Testing
-  Library validation prevents execution of deleted agents
-  Cleanup functions remove schedules and webhooks properly  
-  Error-based cleanup handles orphaned resources gracefully
-  Migration removes existing orphaned records

### Integration Testing
-  All existing tests pass (including `test_store_listing_graph`)
-  No breaking changes to existing functionality
-  Proper error propagation and handling

### Performance Testing
-  Efficient database queries with proper indexing
-  Minimal overhead for normal execution flows
-  Cleanup operations don't impact performance

## Impact

### User Experience
- 🎯 **Immediate**: Deleted agents stop running automatically
- 🎯 **Ongoing**: No more unexpected credit charges from orphaned
executions
- 🎯 **Cleanup**: Historical orphaned resources are removed

### System Reliability
- 🔒 **Security**: Users can only execute agents they have access to
- 🧹 **Cleanup**: Automatic removal of orphaned database records
- 📈 **Performance**: Efficient validation with minimal overhead

### Developer Experience
- 🎯 **Clear Errors**: Specific exception types for better debugging
- 🔧 **Maintainable**: Centralized library validation logic
- 📚 **Documented**: Comprehensive error handling patterns

## Files Modified
- `backend/data/graph.py` - Library validation function
- `backend/server/v2/library/db.py` - Enhanced agent deletion with
cleanup
- `backend/executor/utils.py` - Execution validation and prevention
- `backend/executor/scheduler.py` - Error-based cleanup for schedules
- `backend/server/integrations/router.py` - Error-based cleanup for
webhooks
- `backend/util/exceptions.py` - Specific error type for deleted graphs
-
`migrations/20251023000000_cleanup_orphaned_schedules_and_webhooks/migration.sql`
- Historical cleanup

## Breaking Changes
None. All changes are backward compatible and preserve existing
functionality.

## Follow-up Tasks
- [ ] Monitor cleanup effectiveness in production
- [ ] Consider adding metrics for orphaned resource detection
- [ ] Potential optimization of cleanup batch operations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-28 23:48:35 +00:00
Reinier van der Leer
5e5f45a713 fix(backend): Fix various warnings (#11252)
- Resolves #11251

This fixes all the warnings mentioned in #11251, reducing noise and
making our logs and error alerts more useful :)

### Changes 🏗️

- Remove "Block {block_name} has multiple credential inputs" warning
(not actually an issue)
- Rename `json` attribute of `MainCodeExecutionResult` to `json_data`;
retain serialized name through a field alias
- Replace `Path(regex=...)` with `Path(pattern=...)` in
`get_shared_execution` endpoint parameter config
- Change Uvicorn's WebSocket module to new Sans-I/O implementation for
WS server
- Disable Uvicorn's WebSocket module for REST server
- Remove deprecated `enable_cleanup_closed=True` argument in
`CloudStorageHandler` implementation
- Replace Prisma transaction timeout `int` argument with a `timedelta`
value
- Update Sentry SDK to latest version (v2.42.1)
- Broaden filter for cleanup warnings from indirect dependency `litellm`
- Fix handling of `MissingConfigError` in REST server endpoints

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - Check that the warnings are actually gone
- [x] Deploy to dev environment and run a graph; check for any warnings
  - Test WebSocket server
- [x] Run an agent in the Builder; make sure real-time execution updates
still work
2025-10-28 13:18:45 +00:00
seer-by-sentry[bot]
377657f8a1 fix(backend): Extract response from LLM response dictionary (#11262)
### Changes 🏗️

- Modifies the LLM block to extract the actual response from the
dictionary returned by the LLM, instead of yielding the entire
dictionary. This addresses
[AUTOGPT-SERVER-6EY](https://sentry.io/organizations/significant-gravitas/issues/6950850822/).

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
- [x] After applying the fix, I ran the agent that triggered the Sentry
error and confirmed that it now completes successfully without errors.

---------

Co-authored-by: seer-by-sentry[bot] <157164994+seer-by-sentry[bot]@users.noreply.github.com>
Co-authored-by: Swifty <craigswift13@gmail.com>
2025-10-28 08:43:29 +00:00
seer-by-sentry[bot]
ff71c940c9 fix(backend): Properly encode hostname in URL validation (#11259)
Fixes
[AUTOGPT-SERVER-6KZ](https://sentry.io/organizations/significant-gravitas/issues/6976926125/).
The issue was that: Redirect handling strips the URL scheme, causing
subsequent requests to fail validation and hit a 404.

- Ensures the hostname in the URL is properly IDNA-encoded after
validation.
- Reconstructs the netloc with the encoded hostname and preserves the
port if it exists.

This fix was generated by Seer in Sentry, triggered by Craig Swift. 👁️
Run ID: 2204774

Not quite right? [Click here to continue debugging with
Seer.](https://sentry.io/organizations/significant-gravitas/issues/6976926125/?seerDrawer=true)

### Changes 🏗️

**backend/util/request.py:**
- Fixed URL validation to properly preserve port numbers when
reconstructing netloc
- Ensures IDNA-encoded hostname is combined with port (if present)
before URL reconstruction

**Test Results:**
-  Tested request to https://www.target.com/ (original failing URL from
Sentry issue)
-  Status: 200, Content retrieved successfully (339,846 bytes)
-  Port preservation verified for URLs with explicit ports

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Tested request to https://www.target.com/ (original failing URL)
  - [x] Verified status code 200 and successful content retrieval
  - [x] Verified port preservation in URL validation

<details>
  <summary>Example test plan</summary>
  
  - [ ] Create from scratch and execute an agent with at least 3 blocks
- [ ] Import an agent from file upload, and confirm it executes
correctly
  - [ ] Upload agent to marketplace
- [ ] Import an agent from marketplace and confirm it executes correctly
  - [ ] Edit an agent from monitor, and confirm it executes correctly
</details>

#### For configuration changes:

- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)

<details>
  <summary>Examples of configuration changes</summary>

  - Changing ports
  - Adding new services that need to communicate with each other
  - Secrets or environment variable changes
  - New or infrastructure changes such as databases
</details>

Co-authored-by: seer-by-sentry[bot] <157164994+seer-by-sentry[bot]@users.noreply.github.com>
Co-authored-by: Swifty <craigswift13@gmail.com>
2025-10-28 08:43:14 +00:00
Bently
9db443960a feat(blocks/claude): Remove Claude 3.5 Sonnet and Haiku model (#11260)
Removes CLAUDE_3_5_SONNET and CLAUDE_3_5_HAIKU from LlmModel enum, model
metadata, and cost configuration since they are deprecated

  ### Checklist 📋

  #### For code changes:
  - [x] I have clearly listed my changes in the PR description
  - [x] I have made a test plan
  - [x] I have tested my changes according to the test plan:
  - [x] Verify the models are gone from the llm blocks
2025-10-27 16:49:02 +00:00
Swifty
b31d60276a fix(backend/store): Sanitize all sql terms (#11228)
Categories and Creators where not sanitized in the full text search

- apply sanitization to categories and creators

- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] run tests to check it still works
2025-10-27 13:16:37 +01:00
Swifty
7cbb1ed859 fix(backend/store): Sanitize all sql terms (#11228)
Categories and Creators where not sanitized in the full text search

### Changes 🏗️

- apply sanitization to categories and creators

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] run tests to check it still works
2025-10-27 12:59:05 +01:00
Toran Bruce Richards
b52e95e1fc fix(blocks): Add missing error output pins to all Firecrawl blocks (#11256)
Added error output pins to all Firecrawl blocks as standard on the
AutoGPT platform. The base block execution code already handles error
yielding, so no try-catch logic was needed.

- FirecrawlScrapeBlock: Added error output pin for scrape failures
- FirecrawlCrawlBlock: Added error output pin for crawl failures
- FirecrawlExtractBlock: Added error output pin for extraction failures
- FirecrawlMapBlock: Added error output pin for map failures
- FirecrawlSearchBlock: Added error output pin for search failures

Resolves #11253

<!-- Clearly explain the need for these changes: -->

### Changes 🏗️

<!-- Concisely describe all of the changes made in this pull request:
-->

### Checklist 📋

#### For code changes:
- [ ] I have clearly listed my changes in the PR description
- [ ] I have made a test plan
- [ ] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [ ] ...

<details>
  <summary>Example test plan</summary>
  
  - [ ] Create from scratch and execute an agent with at least 3 blocks
- [ ] Import an agent from file upload, and confirm it executes
correctly
  - [ ] Upload agent to marketplace
- [ ] Import an agent from marketplace and confirm it executes correctly
  - [ ] Edit an agent from monitor, and confirm it executes correctly
</details>

#### For configuration changes:

- [ ] `.env.default` is updated or already compatible with my changes
- [ ] `docker-compose.yml` is updated or already compatible with my
changes
- [ ] I have included a list of my configuration changes in the PR
description (under **Changes**)

<details>
  <summary>Examples of configuration changes</summary>

  - Changing ports
  - Adding new services that need to communicate with each other
  - Secrets or environment variable changes
  - New or infrastructure changes such as databases
</details>

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Toran Bruce Richards <Torantulino@users.noreply.github.com>
2025-10-27 08:36:28 +00:00
Reinier van der Leer
e06e7ff33f fix(backend): Implement graceful shutdown in AppService to prevent RPC errors (#11240)
We're currently seeing errors in the `DatabaseManager` while it's
shutting down, like:

```
WARNING [DatabaseManager] Termination request: SystemExit; 0 executing cleanup.
INFO [DatabaseManager]  Disconnecting Database...
INFO [PID-1|THREAD-29|DatabaseManager|Prisma-82fb1994-4b87-40c1-8869-fbd97bd33fc8] Releasing connection started...
INFO [PID-1|THREAD-29|DatabaseManager|Prisma-82fb1994-4b87-40c1-8869-fbd97bd33fc8] Releasing connection completed successfully.
INFO [DatabaseManager] Terminated.
ERROR POST /create_or_add_to_user_notification_batch failed: Failed to create or add to notification batch for user {user_id} and type AGENT_RUN: NoneType: None
```

This indicates two issues:
- The service doesn't wait for pending RPC calls to finish before
terminating
- We're using `logger.exception` outside an error handling context,
causing the confusing and not much useful `NoneType: None` to be printed
instead of error info

### Changes 🏗️

- Implement graceful shutdown in `AppService` so in-flight RPC calls can
finish
  - Add tests for graceful shutdown
  - Prevent `AppService` accepting new requests during shutdown
- Rework `AppService` lifecycle management; add support for async
`lifespan`
- Fix `AppService` endpoint error logging
- Improve logging in `AppProcess` and `AppService`

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- Deploy to Dev cluster, then `kubectl rollout restart` the different
services a few times
    - [x] -> `DatabaseManager` doesn't break on re-deployment
    - [x] -> `Scheduler` doesn't break on re-deployment
    - [x] -> `NotificationManager` doesn't break on re-deployment
2025-10-25 14:47:19 +00:00
Bently
f4ba02f2f1 feat(blocks/revid): Add cost configs for revid video blocks (#11242)
Updated block costs in `backend/backend/data/block_cost_config.py`:
  - **AIShortformVideoCreatorBlock**: Updated from 50 credits to 307
  - **AIAdMakerVideoCreatorBlock**: Added cost of 714 credits
  - **AIScreenshotToVideoAdBlock**: Added cost of 612 credits

  ### Checklist 📋

  #### For code changes:
  - [x] I have clearly listed my changes in the PR description
  - [x] I have made a test plan
  - [x] I have tested my changes according to the test plan:
- [x] Verify AIShortformVideoCreatorBlock costs 307 credits when
executed
- [x] Verify AIAdMakerVideoCreatorBlock costs 714 credits when executed
- [x] Verify AIScreenshotToVideoAdBlock costs 612 credits when executed
2025-10-24 18:35:37 +01:00
Bently
48ff225837 feat(blocks/revid): Add cost configs for revid video blocks (#11242)
Updated block costs in `backend/backend/data/block_cost_config.py`:
  - **AIShortformVideoCreatorBlock**: Updated from 50 credits to 307
  - **AIAdMakerVideoCreatorBlock**: Added cost of 714 credits
  - **AIScreenshotToVideoAdBlock**: Added cost of 612 credits

  ### Checklist 📋

  #### For code changes:
  - [x] I have clearly listed my changes in the PR description
  - [x] I have made a test plan
  - [x] I have tested my changes according to the test plan:
- [x] Verify AIShortformVideoCreatorBlock costs 307 credits when
executed
- [x] Verify AIAdMakerVideoCreatorBlock costs 714 credits when executed
- [x] Verify AIScreenshotToVideoAdBlock costs 612 credits when executed
2025-10-23 09:46:22 +00:00
Bently
a6a2f71458 Merge commit from fork
* Replace urllib with Requests in RSS block to prevent SSRF

* Format
2025-10-22 14:18:34 +01:00
Bently
788b861bb7 Merge commit from fork 2025-10-22 14:17:26 +01:00
Zamil Majdy
bb0b45d7f7 fix(backend): Make Jinja Error on TextFormatter as value error (#11236)
<!-- Clearly explain the need for these changes: -->

This PR converts Jinja2 TemplateError exceptions to ValueError in the
TextFormatter class to ensure proper error handling and HTTP status code
responses (400 instead of 500).

### Changes 🏗️

<!-- Concisely describe all of the changes made in this pull request:
-->

- Added import for `jinja2.exceptions.TemplateError` in
`backend/util/text.py:6`
- Wrapped template rendering in try-catch block in `format_string`
method (`backend/util/text.py:105-109`)
- Convert `TemplateError` to `ValueError` to ensure proper 400 HTTP
status code for client errors
- Added warning logging for template rendering errors before re-raising
as ValueError

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan: -->
- [x] Verified that invalid Jinja2 templates now raise ValueError
instead of TemplateError
  - [x] Confirmed that valid templates continue to work correctly
  - [x] Checked that warning logs are generated for template errors
  - [x] Validated that the exception chain is preserved with `from e`

#### For configuration changes:

- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
2025-10-22 09:38:02 +00:00
claude[bot]
86b9ccfe5e fix: Apply linting and formatting fixes
- Run ruff, isort, and black on Python files
- Run prettier on TypeScript files
- Remove unused LaunchDarklyIntegration import from metrics.py

Co-authored-by: Nicholas Tindle <ntindle@users.noreply.github.com>
2025-10-22 07:48:54 +00:00
claude[bot]
cc1a2cd829 feat(admin): Add execution management table with stop functionality
- Created backend/data/diagnostics.py following Option B data layer pattern
- Refactored diagnostics_admin_routes.py to use the new data layer
- Added endpoints for listing running executions with details
- Added endpoints for stopping executions (single and bulk)
- Created ExecutionsTable component with multi-select and stop buttons
- Integrated execution management table into diagnostics page

Co-authored-by: Nicholas Tindle <ntindle@users.noreply.github.com>
2025-10-22 07:09:34 +00:00
claude[bot]
3a8cbe3eb4 fix: Remove unnecessary agent count endpoints from diagnostics
- Removed get_total_agents_count() function
- Removed get_active_agents_count() function
- Updated AgentDiagnosticsResponse model to only include agents_with_active_executions
- Updated frontend to display only agents with active executions metric

Co-authored-by: Nicholas Tindle <ntindle@users.noreply.github.com>
2025-10-21 22:00:31 +00:00
Claude
632085528a feat(admin): Add system diagnostics page for execution and agent monitoring
Add new admin diagnostics page to improve on-call diagnostics with the following features:

Backend changes:
- Add ExecutionDiagnosticsResponse and AgentDiagnosticsResponse models
- Create diagnostics_admin_routes.py with endpoints for:
  - /admin/diagnostics/executions - Get running, queued (DB), and queued (RabbitMQ) execution counts
  - /admin/diagnostics/agents - Get total agents, active agents, and agents with active executions
- Register new diagnostics routes in rest_api.py
- Use Prisma for database queries and direct RabbitMQ connection for queue depth

Frontend changes:
- Add new /admin/diagnostics page with real-time metrics display
- Create DiagnosticsContent component with auto-refresh capability
- Add diagnostic metrics cards for:
  - Running executions
  - Queued executions (database)
  - Queued executions (RabbitMQ)
  - Total agents
  - Active agents
  - Agents with active executions
- Add "System Diagnostics" link to admin navigation sidebar
- Update TypeScript types for new API responses

This improves on-call diagnostics by providing visibility into:
- System load (running executions)
- Queue backlog (DB vs RabbitMQ comparison)
- Agent activity levels

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-21 20:55:08 +00:00
Reinier van der Leer
04df981115 fix(backend): Fix structured logging for cloud environments (#11227)
- Resolves #11226

### Changes 🏗️

- Drop use of `CloudLoggingHandler` which docs state isn't for use in
GKE
- For cloud logging, output only structured log entries to `stdout`

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Test deploy to dev and check logs
2025-10-21 12:48:41 +00:00
Swifty
d25997b4f2 Revert "Merge branch 'swiftyos/secrt-1709-store-provider-names-and-en… (#11225)
Changes to providers blocks to store in db

### Changes 🏗️

- revet change

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  <!-- Put your test plan here: -->
  - [x] I have reverted the merge
2025-10-21 09:12:00 +00:00
Zamil Majdy
11d55f6055 fix(backend/executor): Avoid running direct query in executor (#11224)
## Summary
- Fixes database connection warnings in executor logs: "Client is not
connected to the query engine, you must call `connect()` before
attempting to query data"
- Implements resilient database client pattern already used elsewhere in
the codebase
- Adds caching to reduce database load for user context lookups

## Changes
- Updated `get_user_context()` to check `prisma.is_connected()` and fall
back to database manager client
- Added `@cached(maxsize=1000, ttl_seconds=3600)` decorator for
performance optimization
- Updated database manager to expose `get_user_by_id` method

## Test plan
- [x] Verify executor pods no longer show Prisma connection warnings
- [x] Confirm user timezone is still correctly retrieved
- [x] Test fallback behavior when Prisma is disconnected

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-21 08:46:40 +00:00
Reinier van der Leer
3da595f599 fix(backend): Only try to initialize LaunchDarkly once (#11222)
We currently try to re-init the LaunchDarkly client every time a feature flag is checked.
This causes 5 second extra latency on the flag check when LD is down, such as now.
Since flag checks are performed on every block execution, this currently cripples the platform's executors.

- Follow-up to #11221

### Changes 🏗️

- Only try to init LaunchDarkly once
- Improve surrounding log statements in the `feature_flag` module

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - This is a critical hotfix; we'll see its effect once deployed
2025-10-21 08:46:07 +02:00
Reinier van der Leer
e5e60921a3 fix(backend): Handle LaunchDarkly init failure (#11221)
LaunchDarkly is currently down and it's keeping our executor pods from
spinning up.

### Changes 🏗️

- Wrap `LaunchDarklyIntegration` init in a try/except

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - We'll see if it works once it deploys
2025-10-21 07:53:40 +02:00
Copilot
90af8f8e1a feat(backend): Add language fallback for YouTube transcription block (#11057)
## Problem

The YouTube transcription block would fail when attempting to transcribe
videos that only had transcripts available in non-English languages.
Even when usable transcripts existed in other languages, the block would
raise a `NoTranscriptFound` error because it only requested English
transcripts.

**Example video that would fail:**
https://www.youtube.com/watch?v=3AMl5d2NKpQ (only has Hungarian
transcripts)

**Error message:**
```
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=3AMl5d2NKpQ! 
No transcripts were found for any of the requested language codes: ('en',)

For this video (3AMl5d2NKpQ) transcripts are available in the following languages:
(GENERATED) - hu ("Hungarian (auto-generated)")
```

## Solution

Implemented intelligent language fallback in the
`TranscribeYoutubeVideoBlock.get_transcript()` method:

1. **First**, tries to fetch English transcript (maintains backward
compatibility)
2. **If English unavailable**, lists all available transcripts and
selects the first one using this priority:
   - Manually created transcripts (any language)
   - Auto-generated transcripts (any language)
3. **Only fails** if no transcripts exist at all

**Example behavior:**
```python
# Before: Video with only Hungarian transcript
get_transcript("3AMl5d2NKpQ")  #  Raises NoTranscriptFound

# After: Video with only Hungarian transcript  
get_transcript("3AMl5d2NKpQ")  #  Returns Hungarian transcript
```

## Changes

- **Modified** `backend/blocks/youtube.py`: Added try-catch logic to
fallback to any available language when English is not found
- **Added** `test/blocks/test_youtube.py`: Comprehensive test suite
covering URL extraction, language fallback, transcript preferences, and
error handling (7 tests)
- **Updated** `docs/content/platform/blocks/youtube.md`: Documented the
language fallback behavior and transcript priority order

## Testing

-  All 7 new unit tests pass
-  Block integration test passes
-  Full test suite: 621 passed, 0 failed (no regressions)
-  Code formatting and linting pass

## Impact

This fix enables the YouTube transcription block to work with
international content while maintaining full backward compatibility:

-  Videos in any language can now be transcribed
-  English is still preferred when available
-  No breaking changes to existing functionality
-  Graceful degradation to available languages

Fixes #10637
Fixes https://linear.app/autogpt/issue/OPEN-2626

> [!WARNING]
>
> <details>
> <summary>Firewall rules blocked me from connecting to one or more
addresses (expand for details)</summary>
>
> #### I tried to connect to the following addresses, but was blocked by
firewall rules:
>
> - `www.youtube.com`
> - Triggering command:
`/home/REDACTED/.cache/pypoetry/virtualenvs/autogpt-platform-backend-Ajv4iu2i-py3.11/bin/python3`
(dns block)
>
> If you need me to access, download, or install something from one of
these locations, you can either:
>
> - Configure [Actions setup
steps](https://gh.io/copilot/actions-setup-steps) to set up my
environment, which run before the firewall is enabled
> - Add the appropriate URLs or hosts to the custom allowlist in this
repository's [Copilot coding agent
settings](https://github.com/Significant-Gravitas/AutoGPT/settings/copilot/coding_agent)
(admins only)
>
> </details>

<!-- START COPILOT CODING AGENT SUFFIX -->



<details>

<summary>Original prompt</summary>

> Issue Title: if theres only one lanague available for transcribe
youtube return that langage not an error
> Issue Description: `Could not retrieve a transcript for the video
https://www.youtube.com/watch?v=3AMl5d2NKpQ! This is most likely caused
by: No transcripts were found for any of the requested language codes:
('en',) For this video (3AMl5d2NKpQ) transcripts are available in the
following languages: (MANUALLY CREATED) None (GENERATED) - hu
("Hungarian (auto-generated)") (TRANSLATION LANGUAGES) None If you are
sure that the described cause is not responsible for this error and that
a transcript should be retrievable, please create an issue at
https://github.com/jdepoix/youtube-transcript-api/issues. Please add
which version of youtube_transcript_api you are using and provide the
information needed to replicate the error. Also make sure that there are
no open issues which already describe your problem!` you can use this
video to test:
[https://www.youtube.com/watch?v=3AMl5d2NKpQ\`](https://www.youtube.com/watch?v=3AMl5d2NKpQ%60)
> Fixes
https://linear.app/autogpt/issue/OPEN-2626/if-theres-only-one-lanague-available-for-transcribe-youtube-return
> 
> 
> Comment by User :
> This thread is for an agent session with githubcopilotcodingagent.
> 
> Comment by User :
> This thread is for an agent session with githubcopilotcodingagent.
> 
> Comment by User :
> This comment thread is synced to a corresponding [GitHub
issue](https://github.com/Significant-Gravitas/AutoGPT/issues/10637).
All replies are displayed in both locations.
> 
> 


</details>


<!-- START COPILOT CODING AGENT TIPS -->
---

 Let Copilot coding agent [set things up for
you](https://github.com/Significant-Gravitas/AutoGPT/issues/new?title=+Set+up+Copilot+instructions&body=Configure%20instructions%20for%20this%20repository%20as%20documented%20in%20%5BBest%20practices%20for%20Copilot%20coding%20agent%20in%20your%20repository%5D%28https://gh.io/copilot-coding-agent-tips%29%2E%0A%0A%3COnboard%20this%20repo%3E&assignees=copilot)
— coding agent works faster and does higher quality work when set up for
your repo.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: ntindle <8845353+ntindle@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
2025-10-21 02:31:33 +00:00
Nicholas Tindle
eba67e0a4b fix(platform/blocks): update linear oauth to use refresh tokens (#10998)
<!-- Clearly explain the need for these changes: -->

### Need 💡

This PR addresses Linear issue SECRT-1665, which mandates an update to
Linear's OAuth2 implementation. Linear is transitioning from long-lived
access tokens to short-lived access tokens with refresh tokens, with a
deadline of April 1, 2026. This change is crucial to ensure continued
integration with Linear and to support their new token management
system, including a migration path for existing long-lived tokens.

### Changes 🏗️

-   **`autogpt_platform/backend/backend/blocks/linear/_oauth.py`**:
- Implemented full support for refresh tokens, including HTTP Basic
Authentication for token refresh requests.
- Added `migrate_old_token()` method to exchange old long-lived access
tokens for new short-lived tokens with refresh tokens using Linear's
`/oauth/migrate_old_token` endpoint.
- Enhanced `get_access_token()` to automatically detect and attempt
migration for old tokens, and to refresh short-lived tokens when they
expire.
    -   Improved error handling and token expiration management.
- Updated `_request_tokens` to handle both authorization code and
refresh token flows, supporting Linear's recommended authentication
methods.
-   **`autogpt_platform/backend/backend/blocks/linear/_config.py`**:
- Updated `TEST_CREDENTIALS_OAUTH` mock data to include realistic
`access_token_expires_at` and `refresh_token` for testing the new token
lifecycle.
-   **`LINEAR_OAUTH_IMPLEMENTATION.md`**:
- Added documentation detailing the new Linear OAuth refresh token
implementation, including technical details, migration strategy, and
testing notes.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Verified OAuth URL generation and parameter encoding.
- [x] Confirmed HTTP Basic Authentication header creation for refresh
requests.
  - [x] Tested token expiration logic with a 5-minute buffer.
  - [x] Validated migration detection for old vs. new token types.
  - [x] Checked code syntax and import compatibility.

#### For configuration changes:

- [ ] `.env.default` is updated or already compatible with my changes
- [ ] `docker-compose.yml` is updated or already compatible with my
changes
- [ ] I have included a list of my configuration changes in the PR
description (under **Changes**)

---
Linear Issue: [SECRT-1665](https://linear.app/autogpt/issue/SECRT-1665)

<a
href="https://cursor.com/background-agent?bcId=bc-95f4c668-f7fa-4057-87e5-622ac81c0783"><picture><source
media="(prefers-color-scheme: dark)"
srcset="https://cursor.com/open-in-cursor-dark.svg"><source
media="(prefers-color-scheme: light)"
srcset="https://cursor.com/open-in-cursor-light.svg"><img alt="Open in
Cursor"
src="https://cursor.com/open-in-cursor.svg"></picture></a>&nbsp;<a
href="https://cursor.com/agents?id=bc-95f4c668-f7fa-4057-87e5-622ac81c0783"><picture><source
media="(prefers-color-scheme: dark)"
srcset="https://cursor.com/open-in-web-dark.svg"><source
media="(prefers-color-scheme: light)"
srcset="https://cursor.com/open-in-web-light.svg"><img alt="Open in Web"
src="https://cursor.com/open-in-web.svg"></picture></a>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <ntindle@users.noreply.github.com>
Co-authored-by: Bentlybro <Github@bentlybro.com>
2025-10-20 20:44:58 +00:00
Nicholas Tindle
47bb89caeb fix(backend): Disable LaunchDarkly integration in metrics.py (#11217) 2025-10-20 14:07:21 -05:00
Swifty
3988057032 Merge branch 'swiftyos/secrt-1712-remove-error-handling-form-store-routes' into dev 2025-10-18 12:28:25 +02:00
Swifty
a6c6e48f00 Merge branch 'swiftyos/open-2791-featplatform-add-easy-test-data-creation' into dev 2025-10-18 12:28:17 +02:00
Swifty
e72ce2f9e7 Merge branch 'swiftyos/secrt-1709-store-provider-names-and-env-vars-in-db' into dev 2025-10-18 12:27:58 +02:00
Swifty
bd7a79a920 Merge branch 'swiftyos/secrt-1706-improve-store-search' into dev 2025-10-18 12:27:31 +02:00
Swifty
d9035a233c Merge branch 'swiftyos/secrt-1709-store-provider-names-and-env-vars-in-db' of github.com:Significant-Gravitas/AutoGPT into swiftyos/secrt-1709-store-provider-names-and-env-vars-in-db 2025-10-17 17:20:27 +02:00