From 72938590f285fd266b735864402a886e61c3162d Mon Sep 17 00:00:00 2001
From: Zamil Majdy <zamil.majdy@agpt.co>
Date: Mon, 18 Aug 2025 09:49:39 +0400
Subject: [PATCH] hotfix: reduce scheduler max_workers to match database pool
 size (#10665)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary
- Fixes scheduler pod crashes during peak scheduling periods (e.g.,
03:00:00)
- Reduces APScheduler ThreadPoolExecutor max_workers from 10 to 3
(matching scheduler_db_pool_size)
- Prevents event loop saturation that blocks health checks and causes
pod restarts

## Root Cause Analysis
During peak scheduling periods, multiple jobs execute simultaneously and
compete for the shared event loop through `run_async()`. This creates a
resource bottleneck where:

1. **ThreadPoolExecutor** runs up to 10 jobs concurrently
2. Each job calls `run_async()` which submits to the **same event loop**
that FastAPI health check needs
3. **Health check blocks** waiting for event loop availability
4. **Liveness probe fails** after 5 consecutive timeouts (50s)
5. **Pod gets killed** with SIGKILL (exit code 137)
6. **Executions orphaned** - created in DB but never published to
RabbitMQ

## Solution
Match `max_workers` to `scheduler_db_pool_size` (3) to prevent more
concurrent jobs than the system can handle without blocking critical
health checks.

## Evidence
- Pod restart at exactly 03:05:48 when executions
e47cd564-ed87-4a52-999b-40804c41537a and
eae69811-4c7c-4cd5-b084-41872293185b were created
- 7 scheduled jobs triggered simultaneously at 03:00:00
- Health check normally responds in 0.007s but times out during high
concurrency
- Exit code 137 indicates SIGKILL from liveness probe failure

## Test Plan
- [ ] Monitor scheduler pod stability during peak scheduling periods
- [ ] Verify no executions remain QUEUED without being published to
RabbitMQ
- [ ] Confirm health checks remain responsive under load
- [ ] Check that job execution still works correctly with reduced
concurrency

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
---
 autogpt_platform/backend/backend/executor/scheduler.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/autogpt_platform/backend/backend/executor/scheduler.py b/autogpt_platform/backend/backend/executor/scheduler.py
index 50b165045c..3d6ac59a3c 100644
--- a/autogpt_platform/backend/backend/executor/scheduler.py
+++ b/autogpt_platform/backend/backend/executor/scheduler.py
@@ -269,7 +269,9 @@ class Scheduler(AppService):
 
         self.scheduler = BackgroundScheduler(
             executors={
-                "default": ThreadPoolExecutor(max_workers=10),  # Max 10 concurrent jobs
+                "default": ThreadPoolExecutor(
+                    max_workers=self.db_pool_size()
+                ),  # Match DB pool size to prevent resource contention
             },
             job_defaults={
                 "coalesce": True,  # Skip redundant missed jobs - just run the latest