hotfix: reduce scheduler max_workers to match database pool size (#10665)

## Summary
- Fixes scheduler pod crashes during peak scheduling periods (e.g.,
03:00:00)
- Reduces APScheduler ThreadPoolExecutor max_workers from 10 to 3
(matching scheduler_db_pool_size)
- Prevents event loop saturation that blocks health checks and causes
pod restarts

## Root Cause Analysis
During peak scheduling periods, multiple jobs execute simultaneously and
compete for the shared event loop through `run_async()`. This creates a
resource bottleneck where:

1. **ThreadPoolExecutor** runs up to 10 jobs concurrently
2. Each job calls `run_async()` which submits to the **same event loop**
that FastAPI health check needs
3. **Health check blocks** waiting for event loop availability 
4. **Liveness probe fails** after 5 consecutive timeouts (50s)
5. **Pod gets killed** with SIGKILL (exit code 137)
6. **Executions orphaned** - created in DB but never published to
RabbitMQ

## Solution
Match `max_workers` to `scheduler_db_pool_size` (3) to prevent more
concurrent jobs than the system can handle without blocking critical
health checks.

## Evidence
- Pod restart at exactly 03:05:48 when executions
e47cd564-ed87-4a52-999b-40804c41537a and
eae69811-4c7c-4cd5-b084-41872293185b were created
- 7 scheduled jobs triggered simultaneously at 03:00:00
- Health check normally responds in 0.007s but times out during high
concurrency
- Exit code 137 indicates SIGKILL from liveness probe failure

## Test Plan
- [ ] Monitor scheduler pod stability during peak scheduling periods
- [ ] Verify no executions remain QUEUED without being published to
RabbitMQ
- [ ] Confirm health checks remain responsive under load
- [ ] Check that job execution still works correctly with reduced
concurrency

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
This commit is contained in:
Zamil Majdy
2025-08-18 09:49:39 +04:00
committed by GitHub
parent bf92e7dbc8
commit 72938590f2

View File

@@ -269,7 +269,9 @@ class Scheduler(AppService):
self.scheduler = BackgroundScheduler(
executors={
"default": ThreadPoolExecutor(max_workers=10), # Max 10 concurrent jobs
"default": ThreadPoolExecutor(
max_workers=self.db_pool_size()
), # Match DB pool size to prevent resource contention
},
job_defaults={
"coalesce": True, # Skip redundant missed jobs - just run the latest