## 🚨 Critical Deadlock Fix: Scheduler Pod Stuck for 3+ Hours

This PR resolves a critical production deadlock in which scheduler pods become completely unresponsive due to a CloudLoggingHandler locking issue.

## 📋 Incident Summary

**Affected Pod**: `autogpt-scheduler-server-6d7b89c4f9-mqp59`
- **Duration**: Stuck for 3+ hours (still ongoing)
- **Symptoms**: Health checks failing; pod appears completely dead
- **Impact**: No new job executions; system appears down
- **Root Cause**: CloudLoggingHandler deadlock combined with a gRPC timeout failure

## 🔍 Detailed Incident Analysis

### The Deadlock Chain

1. **Thread 58 (APScheduler Worker)**:
   - Completed a job successfully
   - Called `logger.info("Job executed successfully")`
   - CloudLoggingHandler acquired the lock at `logging/__init__.py:976`
   - Made a gRPC call to Google Cloud Logging
   - **Got stuck in a TCP black hole for 3+ hours**
2. **Thread 26 (FastAPI Health Check)**:
   - Tried to log the health check response
   - **Blocked at `logging/__init__.py:927` waiting for the same lock**
   - Health check never completes → Kubernetes thinks the pod is dead
3. **All Other Threads**: Similarly blocked on any logging attempt

### Why the gRPC Timeout Failed

The gRPC call had a 60-second timeout but has been stuck for 10,775+ seconds because:

- **TCP Black Hole**: Network packets silently dropped (firewall/load balancer timeout)
- **No Socket Timeout**: Python's default is `None` (infinite wait)
- **TCP Keepalive Disabled**: Dead connections hang forever
- **Kernel-Level Block**: The gRPC timeout can't interrupt the `socket.recv()` syscall

### Evidence from Thread Dump

```python
Thread 58: "ThreadPoolExecutor-0_1"
  _blocking (grpc/_channel.py:1162)
    timeout: 60          # ← Should have timed out
    deadline: 1755061203 # ← Expired 3 hours ago!
  emit (logging_v2/handlers/handlers.py:225)  # ← HOLDING LOCK
  handle (logging/__init__.py:978)            # ← After acquire()

Thread 26: "Thread-4 (__start_fastapi)"
  acquire (logging/__init__.py:927)  # ← BLOCKED waiting for lock
    self: <CloudLoggingHandler at 0x7a657280d550>  # ← Same instance!
```

## 🔧 The Fix

### Primary Solution

Replace the **blocking** `SyncTransport` with the **non-blocking** `BackgroundThreadTransport` (see the configuration sketch after the hardening list below):

```python
# BEFORE (Dangerous - blocks while holding lock)
transport=SyncTransport,

# AFTER (Safe - queues and returns immediately)
transport=BackgroundThreadTransport,
```

### Why BackgroundThreadTransport Solves It

1. **Non-blocking**: `emit()` returns immediately after queuing
2. **Lock Released**: No network I/O while holding the logging lock
3. **Isolated Failures**: A hung background thread doesn't affect the main app
4. **Better Performance**: Built-in batching and retry logic

### Additional Hardening

- **Socket Timeout**: A 30-second global timeout prevents infinite hangs
- **gRPC Keepalive**: Detects and closes dead connections faster
- **Comprehensive Comments**: Code comments explain the deadlock prevention
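For orientation, here is a minimal sketch of the kind of handler wiring this change implies, assuming the standard `google-cloud-logging` client. The logger name, timeout value, and example log call are illustrative; the actual change lives in `autogpt_libs/autogpt_libs/logging/config.py`, and the gRPC keepalive tuning mentioned above is not shown here.

```python
import logging
import socket

import google.cloud.logging
from google.cloud.logging_v2.handlers import CloudLoggingHandler
from google.cloud.logging_v2.handlers.transports import BackgroundThreadTransport

# Hardening: a global default timeout so plain socket I/O in this process
# cannot block forever (the value mirrors the PR's 30-second setting).
socket.setdefaulttimeout(30)

client = google.cloud.logging.Client()

# Non-blocking transport: emit() only enqueues the record; a background
# worker thread batches entries and ships them to Cloud Logging, so the
# logging module's handler lock is released immediately.
handler = CloudLoggingHandler(
    client,
    name="autogpt-scheduler",  # illustrative logger name
    transport=BackgroundThreadTransport,
)

root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

# Even if Cloud Logging is unreachable, this call returns immediately
# instead of hanging while holding the handler lock.
logging.getLogger(__name__).info("Job executed successfully")
```

`BackgroundThreadTransport` is also the transport the library uses by default for `CloudLoggingHandler`; the scheduler's config had explicitly selected `SyncTransport`, which this PR replaces.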
## 🧪 Technical Validation

### Before (SyncTransport)

```
log.info("message")
  ↓ acquire_lock() ✅
  ↓ gRPC_call() ❌ HANGS FOR HOURS
  ↓ [DEADLOCK - lock never released]
```

### After (BackgroundThreadTransport)

```
log.info("message")
  ↓ acquire_lock() ✅
  ↓ queue_message() ✅ Instant
  ↓ release_lock() ✅ Immediate
  ↓ [Background thread handles gRPC separately]
```

## 🚀 Impact & Benefits

**Immediate Impact**:
- ✅ Prevents CloudLoggingHandler deadlocks
- ✅ Health checks respond normally
- ✅ System remains observable during network issues
- ✅ Scheduler can continue processing jobs

**Long-term Benefits**:
- 📈 Better logging performance (batching + async)
- 🛡️ Resilient to network partitions and timeouts
- 🔍 Maintained observability during failures
- ⚡ No blocking I/O on critical application threads

## 📊 Files Changed

- `autogpt_libs/autogpt_libs/logging/config.py`: Transport change + socket hardening

## 🧪 Test Plan

- [x] Validate BackgroundThreadTransport import works
- [x] Confirm socket timeout configuration applies
- [x] Verify gRPC keepalive environment variables set
- [ ] Deploy to staging and verify no deadlocks under load
- [ ] Monitor Cloud Logging delivery remains reliable

## 🔍 Monitoring After Deploy

- Watch for any logging delivery delays (expected: minimal)
- Confirm health checks respond consistently
- Verify no more scheduler "hanging" incidents
- Monitor gRPC connection patterns in Cloud Logging metrics

## 🎯 Risk Assessment

- **Risk**: Very low - BackgroundThreadTransport is the recommended approach
- **Rollback**: Simple revert if any issues are observed
- **Testing**: Extensively used in production Google Cloud services

---

**This fixes a critical production stability issue affecting scheduler reliability and system observability.**

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
# AutoGPT Libs
This is a new project to store shared functionality across different services in the AutoGPT Platform (e.g. authentication).