From 72938590f285fd266b735864402a886e61c3162d Mon Sep 17 00:00:00 2001 From: Zamil Majdy Date: Mon, 18 Aug 2025 09:49:39 +0400 Subject: [PATCH] hotfix: reduce scheduler max_workers to match database pool size (#10665) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary - Fixes scheduler pod crashes during peak scheduling periods (e.g., 03:00:00) - Reduces APScheduler ThreadPoolExecutor max_workers from 10 to 3 (matching scheduler_db_pool_size) - Prevents event loop saturation that blocks health checks and causes pod restarts ## Root Cause Analysis During peak scheduling periods, multiple jobs execute simultaneously and compete for the shared event loop through `run_async()`. This creates a resource bottleneck where: 1. **ThreadPoolExecutor** runs up to 10 jobs concurrently 2. Each job calls `run_async()` which submits to the **same event loop** that FastAPI health check needs 3. **Health check blocks** waiting for event loop availability 4. **Liveness probe fails** after 5 consecutive timeouts (50s) 5. **Pod gets killed** with SIGKILL (exit code 137) 6. **Executions orphaned** - created in DB but never published to RabbitMQ ## Solution Match `max_workers` to `scheduler_db_pool_size` (3) to prevent more concurrent jobs than the system can handle without blocking critical health checks. ## Evidence - Pod restart at exactly 03:05:48 when executions e47cd564-ed87-4a52-999b-40804c41537a and eae69811-4c7c-4cd5-b084-41872293185b were created - 7 scheduled jobs triggered simultaneously at 03:00:00 - Health check normally responds in 0.007s but times out during high concurrency - Exit code 137 indicates SIGKILL from liveness probe failure ## Test Plan - [ ] Monitor scheduler pod stability during peak scheduling periods - [ ] Verify no executions remain QUEUED without being published to RabbitMQ - [ ] Confirm health checks remain responsive under load - [ ] Check that job execution still works correctly with reduced concurrency 🤖 Generated with [Claude Code](https://claude.ai/code) --------- Co-authored-by: Claude --- autogpt_platform/backend/backend/executor/scheduler.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/autogpt_platform/backend/backend/executor/scheduler.py b/autogpt_platform/backend/backend/executor/scheduler.py index 50b165045c..3d6ac59a3c 100644 --- a/autogpt_platform/backend/backend/executor/scheduler.py +++ b/autogpt_platform/backend/backend/executor/scheduler.py @@ -269,7 +269,9 @@ class Scheduler(AppService): self.scheduler = BackgroundScheduler( executors={ - "default": ThreadPoolExecutor(max_workers=10), # Max 10 concurrent jobs + "default": ThreadPoolExecutor( + max_workers=self.db_pool_size() + ), # Match DB pool size to prevent resource contention }, job_defaults={ "coalesce": True, # Skip redundant missed jobs - just run the latest