fix(backend): resolve CloudLoggingHandler deadlock causing scheduler hangs (#10634)

## 🚨 Critical Deadlock Fix: Scheduler Pod Stuck for 3+ Hours This PR resolves a critical production deadlock where scheduler pods become completely unresponsive due to a CloudLoggingHandler locking issue. ## 📋 Incident Summary **Affected Pod**: `autogpt-scheduler-server-6d7b89c4f9-mqp59` - **Duration**: Stuck for 3+ hours (still ongoing) - **Symptoms**: Health checks failing, appears completely dead - **Impact**: No new job executions, system appears down - **Root Cause**: CloudLoggingHandler deadlock with gRPC timeout failure ## 🔍 Detailed Incident Analysis ### The Deadlock Chain 1. **Thread 58 (APScheduler Worker)**: - Completed job successfully - Called `logger.info("Job executed successfully")` - CloudLoggingHandler acquired lock at `logging/__init__.py:976` - Made gRPC call to Google Cloud Logging - **Got stuck in TCP black hole for 3+ hours** 2. **Thread 26 (FastAPI Health Check)**: - Tried to log health check response - **Blocked at `logging/__init__.py:927` waiting for same lock** - Health check never completes → Kubernetes thinks pod is dead 3. **All Other Threads**: Similarly blocked on any logging attempt ### Why gRPC Timeout Failed The gRPC call had a 60-second timeout but has been stuck for 10,775+ seconds because: - **TCP Black Hole**: Network packets silently dropped (firewall/load balancer timeout) - **No Socket Timeout**: Python default is `None` (infinite wait) - **TCP Keepalive Disabled**: Dead connections hang forever - **Kernel-Level Block**: gRPC timeout can't interrupt `socket.recv()` syscall ### Evidence from Thread Dump ```python Thread 58: "ThreadPoolExecutor-0_1" _blocking (grpc/_channel.py:1162) timeout: 60 # ← Should have timed out deadline: 1755061203 # ← Expired 3 hours ago\! emit (logging_v2/handlers/handlers.py:225) # ← HOLDING LOCK handle (logging/__init__.py:978) # ← After acquire() Thread 26: "Thread-4 (__start_fastapi)" acquire (logging/__init__.py:927) # ← BLOCKED waiting for lock self: <CloudLoggingHandler at 0x7a657280d550> # ← Same instance\! ``` ## 🔧 The Fix ### Primary Solution Replace **blocking** `SyncTransport` with **non-blocking** `BackgroundThreadTransport`: ```python # BEFORE (Dangerous - blocks while holding lock) transport=SyncTransport, # AFTER (Safe - queues and returns immediately) transport=BackgroundThreadTransport, ``` ### Why BackgroundThreadTransport Solves It 1. **Non-blocking**: `emit()` returns immediately after queuing 2. **Lock Released**: No network I/O while holding the logging lock 3. **Isolated Failures**: Background thread hangs don't affect main app 4. **Better Performance**: Built-in batching and retry logic ### Additional Hardening - **Socket Timeout**: 30-second global timeout prevents infinite hangs - **gRPC Keepalive**: Detects and closes dead connections faster - **Comprehensive Logging**: Comments explain the deadlock prevention ## 🧪 Technical Validation ### Before (SyncTransport) ``` log.info("message") ↓ acquire_lock() ✅ ↓ gRPC_call() ❌ HANGS FOR HOURS ↓ [DEADLOCK - lock never released] ``` ### After (BackgroundThreadTransport) ``` log.info("message") ↓ acquire_lock() ✅ ↓ queue_message() ✅ Instant ↓ release_lock() ✅ Immediate ↓ [Background thread handles gRPC separately] ``` ## 🚀 Impact & Benefits **Immediate Impact**: - ✅ Prevents CloudLoggingHandler deadlocks - ✅ Health checks respond normally - ✅ System remains observable during network issues - ✅ Scheduler can continue processing jobs **Long-term Benefits**: - 📈 Better logging performance (batching + async) - 🛡️ Resilient to network partitions and timeouts - 🔍 Maintained observability during failures - ⚡ No blocking I/O on critical application threads ## 📊 Files Changed - `autogpt_libs/autogpt_libs/logging/config.py`: Transport change + socket hardening ## 🧪 Test Plan - [x] Validate BackgroundThreadTransport import works - [x] Confirm socket timeout configuration applies - [x] Verify gRPC keepalive environment variables set - [ ] Deploy to staging and verify no deadlocks under load - [ ] Monitor Cloud Logging delivery remains reliable ## 🔍 Monitoring After Deploy - Watch for any logging delivery delays (expected: minimal) - Confirm health checks respond consistently - Verify no more scheduler "hanging" incidents - Monitor gRPC connection patterns in Cloud Logging metrics ## 🎯 Risk Assessment - **Risk**: Very Low - BackgroundThreadTransport is the recommended approach - **Rollback**: Simple revert if any issues observed - **Testing**: Extensively used in production Google Cloud services --- **This fixes a critical production stability issue affecting scheduler reliability and system observability.** 🤖 Generated with [Claude Code](https://claude.ai/code) --------- Co-authored-by: Claude <noreply@anthropic.com>
2026-01-09 23:28:07 -05:00 · 2025-08-13 17:23:09 +04:00
parent 41f500790f
commit 34dd218a91
1 changed files with 17 additions and 3 deletions
--- a/autogpt_platform/autogpt_libs/autogpt_libs/logging/config.py
+++ b/autogpt_platform/autogpt_libs/autogpt_libs/logging/config.py
@@ -1,6 +1,8 @@
 """Logging module for Auto-GPT."""

 import logging
+import os
+import socket
 import sys
 from pathlib import Path

@@ -10,6 +12,15 @@ from pydantic_settings import BaseSettings, SettingsConfigDict
 from .filters import BelowLevelFilter
 from .formatters import AGPTFormatter

+# Configure global socket timeout and gRPC keepalive to prevent deadlocks
+# This must be done at import time before any gRPC connections are established
+socket.setdefaulttimeout(30)  # 30-second socket timeout
+
+# Enable gRPC keepalive to detect dead connections faster
+os.environ.setdefault("GRPC_KEEPALIVE_TIME_MS", "30000")  # 30 seconds
+os.environ.setdefault("GRPC_KEEPALIVE_TIMEOUT_MS", "5000")  # 5 seconds
+os.environ.setdefault("GRPC_KEEPALIVE_PERMIT_WITHOUT_CALLS", "true")
+
 LOG_DIR = Path(__file__).parent.parent.parent.parent / "logs"
 LOG_FILE = "activity.log"
 DEBUG_LOG_FILE = "debug.log"
@@ -79,7 +90,6 @@ def configure_logging(force_cloud_logging: bool = False) -> None:
    Note: This function is typically called at the start of the application
    to set up the logging infrastructure.
    """
-
    config = LoggingConfig()
    log_handlers: list[logging.Handler] = []

@@ -105,13 +115,17 @@ def configure_logging(force_cloud_logging: bool = False) -> None:
    if config.enable_cloud_logging or force_cloud_logging:
        import google.cloud.logging
        from google.cloud.logging.handlers import CloudLoggingHandler
-        from google.cloud.logging_v2.handlers.transports.sync import SyncTransport
+        from google.cloud.logging_v2.handlers.transports import (
+            BackgroundThreadTransport,
+        )

        client = google.cloud.logging.Client()
+        # Use BackgroundThreadTransport to prevent blocking the main thread
+        # and deadlocks when gRPC calls to Google Cloud Logging hang
        cloud_handler = CloudLoggingHandler(
            client,
            name="autogpt_logs",
-            transport=SyncTransport,
+            transport=BackgroundThreadTransport,
        )
        cloud_handler.setLevel(config.level)
        log_handlers.append(cloud_handler)