mirror of
https://github.com/Significant-Gravitas/AutoGPT.git
synced 2026-01-09 23:28:07 -05:00
fix(backend): resolve CloudLoggingHandler deadlock causing scheduler hangs (#10634)
## 🚨 Critical Deadlock Fix: Scheduler Pod Stuck for 3+ Hours This PR resolves a critical production deadlock where scheduler pods become completely unresponsive due to a CloudLoggingHandler locking issue. ## 📋 Incident Summary **Affected Pod**: `autogpt-scheduler-server-6d7b89c4f9-mqp59` - **Duration**: Stuck for 3+ hours (still ongoing) - **Symptoms**: Health checks failing, appears completely dead - **Impact**: No new job executions, system appears down - **Root Cause**: CloudLoggingHandler deadlock with gRPC timeout failure ## 🔍 Detailed Incident Analysis ### The Deadlock Chain 1. **Thread 58 (APScheduler Worker)**: - Completed job successfully - Called `logger.info("Job executed successfully")` - CloudLoggingHandler acquired lock at `logging/__init__.py:976` - Made gRPC call to Google Cloud Logging - **Got stuck in TCP black hole for 3+ hours** 2. **Thread 26 (FastAPI Health Check)**: - Tried to log health check response - **Blocked at `logging/__init__.py:927` waiting for same lock** - Health check never completes → Kubernetes thinks pod is dead 3. **All Other Threads**: Similarly blocked on any logging attempt ### Why gRPC Timeout Failed The gRPC call had a 60-second timeout but has been stuck for 10,775+ seconds because: - **TCP Black Hole**: Network packets silently dropped (firewall/load balancer timeout) - **No Socket Timeout**: Python default is `None` (infinite wait) - **TCP Keepalive Disabled**: Dead connections hang forever - **Kernel-Level Block**: gRPC timeout can't interrupt `socket.recv()` syscall ### Evidence from Thread Dump ```python Thread 58: "ThreadPoolExecutor-0_1" _blocking (grpc/_channel.py:1162) timeout: 60 # ← Should have timed out deadline: 1755061203 # ← Expired 3 hours ago\! emit (logging_v2/handlers/handlers.py:225) # ← HOLDING LOCK handle (logging/__init__.py:978) # ← After acquire() Thread 26: "Thread-4 (__start_fastapi)" acquire (logging/__init__.py:927) # ← BLOCKED waiting for lock self: <CloudLoggingHandler at 0x7a657280d550> # ← Same instance\! ``` ## 🔧 The Fix ### Primary Solution Replace **blocking** `SyncTransport` with **non-blocking** `BackgroundThreadTransport`: ```python # BEFORE (Dangerous - blocks while holding lock) transport=SyncTransport, # AFTER (Safe - queues and returns immediately) transport=BackgroundThreadTransport, ``` ### Why BackgroundThreadTransport Solves It 1. **Non-blocking**: `emit()` returns immediately after queuing 2. **Lock Released**: No network I/O while holding the logging lock 3. **Isolated Failures**: Background thread hangs don't affect main app 4. **Better Performance**: Built-in batching and retry logic ### Additional Hardening - **Socket Timeout**: 30-second global timeout prevents infinite hangs - **gRPC Keepalive**: Detects and closes dead connections faster - **Comprehensive Logging**: Comments explain the deadlock prevention ## 🧪 Technical Validation ### Before (SyncTransport) ``` log.info("message") ↓ acquire_lock() ✅ ↓ gRPC_call() ❌ HANGS FOR HOURS ↓ [DEADLOCK - lock never released] ``` ### After (BackgroundThreadTransport) ``` log.info("message") ↓ acquire_lock() ✅ ↓ queue_message() ✅ Instant ↓ release_lock() ✅ Immediate ↓ [Background thread handles gRPC separately] ``` ## 🚀 Impact & Benefits **Immediate Impact**: - ✅ Prevents CloudLoggingHandler deadlocks - ✅ Health checks respond normally - ✅ System remains observable during network issues - ✅ Scheduler can continue processing jobs **Long-term Benefits**: - 📈 Better logging performance (batching + async) - 🛡️ Resilient to network partitions and timeouts - 🔍 Maintained observability during failures - ⚡ No blocking I/O on critical application threads ## 📊 Files Changed - `autogpt_libs/autogpt_libs/logging/config.py`: Transport change + socket hardening ## 🧪 Test Plan - [x] Validate BackgroundThreadTransport import works - [x] Confirm socket timeout configuration applies - [x] Verify gRPC keepalive environment variables set - [ ] Deploy to staging and verify no deadlocks under load - [ ] Monitor Cloud Logging delivery remains reliable ## 🔍 Monitoring After Deploy - Watch for any logging delivery delays (expected: minimal) - Confirm health checks respond consistently - Verify no more scheduler "hanging" incidents - Monitor gRPC connection patterns in Cloud Logging metrics ## 🎯 Risk Assessment - **Risk**: Very Low - BackgroundThreadTransport is the recommended approach - **Rollback**: Simple revert if any issues observed - **Testing**: Extensively used in production Google Cloud services --- **This fixes a critical production stability issue affecting scheduler reliability and system observability.** 🤖 Generated with [Claude Code](https://claude.ai/code) --------- Co-authored-by: Claude <noreply@anthropic.com>
This commit is contained in:
@@ -1,6 +1,8 @@
|
||||
"""Logging module for Auto-GPT."""
|
||||
|
||||
import logging
|
||||
import os
|
||||
import socket
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
@@ -10,6 +12,15 @@ from pydantic_settings import BaseSettings, SettingsConfigDict
|
||||
from .filters import BelowLevelFilter
|
||||
from .formatters import AGPTFormatter
|
||||
|
||||
# Configure global socket timeout and gRPC keepalive to prevent deadlocks
|
||||
# This must be done at import time before any gRPC connections are established
|
||||
socket.setdefaulttimeout(30) # 30-second socket timeout
|
||||
|
||||
# Enable gRPC keepalive to detect dead connections faster
|
||||
os.environ.setdefault("GRPC_KEEPALIVE_TIME_MS", "30000") # 30 seconds
|
||||
os.environ.setdefault("GRPC_KEEPALIVE_TIMEOUT_MS", "5000") # 5 seconds
|
||||
os.environ.setdefault("GRPC_KEEPALIVE_PERMIT_WITHOUT_CALLS", "true")
|
||||
|
||||
LOG_DIR = Path(__file__).parent.parent.parent.parent / "logs"
|
||||
LOG_FILE = "activity.log"
|
||||
DEBUG_LOG_FILE = "debug.log"
|
||||
@@ -79,7 +90,6 @@ def configure_logging(force_cloud_logging: bool = False) -> None:
|
||||
Note: This function is typically called at the start of the application
|
||||
to set up the logging infrastructure.
|
||||
"""
|
||||
|
||||
config = LoggingConfig()
|
||||
log_handlers: list[logging.Handler] = []
|
||||
|
||||
@@ -105,13 +115,17 @@ def configure_logging(force_cloud_logging: bool = False) -> None:
|
||||
if config.enable_cloud_logging or force_cloud_logging:
|
||||
import google.cloud.logging
|
||||
from google.cloud.logging.handlers import CloudLoggingHandler
|
||||
from google.cloud.logging_v2.handlers.transports.sync import SyncTransport
|
||||
from google.cloud.logging_v2.handlers.transports import (
|
||||
BackgroundThreadTransport,
|
||||
)
|
||||
|
||||
client = google.cloud.logging.Client()
|
||||
# Use BackgroundThreadTransport to prevent blocking the main thread
|
||||
# and deadlocks when gRPC calls to Google Cloud Logging hang
|
||||
cloud_handler = CloudLoggingHandler(
|
||||
client,
|
||||
name="autogpt_logs",
|
||||
transport=SyncTransport,
|
||||
transport=BackgroundThreadTransport,
|
||||
)
|
||||
cloud_handler.setLevel(config.level)
|
||||
log_handlers.append(cloud_handler)
|
||||
|
||||
Reference in New Issue
Block a user