hotfix: reduce scheduler max_workers to match database pool size (#10665)

mirror of https://github.com/Significant-Gravitas/AutoGPT.git synced 2026-01-08 06:44:05 -05:00

## Summary
- Fixes scheduler pod crashes during peak scheduling periods (e.g.,
03:00:00)
- Reduces APScheduler ThreadPoolExecutor max_workers from 10 to 3
(matching scheduler_db_pool_size)
- Prevents event loop saturation that blocks health checks and causes
pod restarts

## Root Cause Analysis
During peak scheduling periods, multiple jobs execute simultaneously and
compete for the shared event loop through `run_async()`. This creates a
resource bottleneck where:

1. **ThreadPoolExecutor** runs up to 10 jobs concurrently
2. Each job calls `run_async()` which submits to the **same event loop**
that FastAPI health check needs
3. **Health check blocks** waiting for event loop availability 
4. **Liveness probe fails** after 5 consecutive timeouts (50s)
5. **Pod gets killed** with SIGKILL (exit code 137)
6. **Executions orphaned** - created in DB but never published to
RabbitMQ

## Solution
Match `max_workers` to `scheduler_db_pool_size` (3) to prevent more
concurrent jobs than the system can handle without blocking critical
health checks.

## Evidence
- Pod restart at exactly 03:05:48 when executions
e47cd564-ed87-4a52-999b-40804c41537a and
eae69811-4c7c-4cd5-b084-41872293185b were created
- 7 scheduled jobs triggered simultaneously at 03:00:00
- Health check normally responds in 0.007s but times out during high
concurrency
- Exit code 137 indicates SIGKILL from liveness probe failure

## Test Plan
- [ ] Monitor scheduler pod stability during peak scheduling periods
- [ ] Verify no executions remain QUEUED without being published to
RabbitMQ
- [ ] Confirm health checks remain responsive under load
- [ ] Check that job execution still works correctly with reduced
concurrency

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>

This commit is contained in:

Zamil Majdy

2025-08-18 09:49:39 +04:00

committed by

GitHub

parent bf92e7dbc8

commit 72938590f2

1 changed files with 3 additions and 1 deletions

									
										4

autogpt_platform/backend/backend/executor/scheduler.py
									
												View File
												
				@@ -269,7 +269,9 @@ class Scheduler(AppService):

				        self.scheduler = BackgroundScheduler(

				            executors={

				                "default": ThreadPoolExecutor(max_workers=10),  # Max 10 concurrent jobs

				                "default": ThreadPoolExecutor(

				                    max_workers=self.db_pool_size()

				                ),  # Match DB pool size to prevent resource contention

				            },

				            job_defaults={

				                "coalesce": True,  # Skip redundant missed jobs - just run the latest

hotfix: reduce scheduler max_workers to match database pool size (#10665)

4 autogpt_platform/backend/backend/executor/scheduler.py Unescape Escape View File

4

autogpt_platform/backend/backend/executor/scheduler.py

View File