Files
AutoGPT/autogpt_platform/backend
Nicholas Tindle 53a6de9fdb feat(admin): Enhance diagnostics with comprehensive execution monitoring and management
Add extensive diagnostic capabilities for on-call engineers to monitor and manage execution health.

Backend Enhancements:
- Add 18 diagnostic metrics covering failures, orphaned executions, stuck queued, throughput, and queue health
- Implement orphaned execution detection (>24h old, not in executor)
- Add stuck queued detection (QUEUED >1h, never started)
- Add long-running execution detection (RUNNING >24h)
- Monitor both execution and cancel RabbitMQ queues
- Track failure rates (1h, 24h) and execution throughput metrics

New Backend Endpoints (15 total):
- GET /admin/diagnostics/executions/orphaned - List orphaned executions
- GET /admin/diagnostics/executions/stuck-queued - List stuck queued executions
- GET /admin/diagnostics/executions/long-running - List long-running executions
- GET /admin/diagnostics/executions/failed - List failed executions with error messages
- POST /admin/diagnostics/executions/cleanup-all-orphaned - Cleanup all orphaned (operates on entire dataset)
- POST /admin/diagnostics/executions/requeue - Requeue single stuck execution
- POST /admin/diagnostics/executions/requeue-bulk - Requeue selected executions
- POST /admin/diagnostics/executions/requeue-all-stuck - Requeue all stuck queued (operates on entire dataset)

Execution Management:
- Dual-mode stop: Active executions (cancel signals) vs orphaned (direct DB cleanup)
- Intelligent Stop All: Auto-splits active/orphaned, executes in parallel
- Requeue functionality for stuck QUEUED executions with credit cost warnings
- Stop sends cancel signals to RabbitMQ for graceful termination
- Cleanup orphaned updates DB directly without cancel signals
- ALL endpoints operate on entire datasets (not limited to pagination)

Frontend Enhancements:
- 5-tab filtering interface: All, Orphaned, Stuck Queued, Long-Running, Failed
- Clickable alert cards (🟠 🔴 🟡) automatically switch to relevant tabs
- Tab badges show live counts from diagnostics metrics
- Age column displays execution duration (e.g., "245d 12h")
- Orange row highlighting for orphaned executions (>24h old)
- Error message column for failed executions with hover tooltips
- Click-to-copy for execution IDs and user IDs with visual feedback
- Status badge colors match library view (blue=RUNNING, yellow=QUEUED, red=FAILED)

Tab-Specific Actions:
- Stuck Queued: Cleanup All OR Requeue All buttons with cost warnings
- Stuck Queued per-row: 🟠 Cleanup OR 🔵 Requeue buttons
- Orphaned: Cleanup All (operates on ALL orphaned)
- Long-Running: Stop All (sends cancel signals)
- Failed: View-only with error details
- All: Stop All (intelligent split of active/orphaned)

Alert Cards:
- 🟠 Orphaned: Shows count with RUNNING/QUEUED breakdown, click to view
- 🔴 Failed (24h): Shows count with hourly rate, click to view
- 🟡 Long-Running: Shows count with oldest execution age, click to view

Updated Diagnostic Info Card:
- Color-coded explanations for each execution type
- When to cleanup vs requeue vs stop
- Credit cost implications clearly documented
- Queue health thresholds explained

Provides ~70% coverage of on-call guide requirements for troubleshooting execution issues, orphaned database records, and system health monitoring.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-03 16:57:49 -06:00
..
2025-10-16 12:14:26 +02:00