fix(backend): resolve CloudLoggingHandler deadlock causing scheduler hangs (#10634)

## 🚨 Critical Deadlock Fix: Scheduler Pod Stuck for 3+ Hours

This PR resolves a critical production deadlock where scheduler pods
become completely unresponsive due to a CloudLoggingHandler locking
issue.

## 📋 Incident Summary

**Affected Pod**: `autogpt-scheduler-server-6d7b89c4f9-mqp59`
- **Duration**: Stuck for 3+ hours (still ongoing)
- **Symptoms**: Health checks failing, appears completely dead
- **Impact**: No new job executions, system appears down
- **Root Cause**: CloudLoggingHandler deadlock with gRPC timeout failure

## 🔍 Detailed Incident Analysis

### The Deadlock Chain
1. **Thread 58 (APScheduler Worker)**: 
   - Completed job successfully
   - Called `logger.info("Job executed successfully")`
   - CloudLoggingHandler acquired lock at `logging/__init__.py:976`
   - Made gRPC call to Google Cloud Logging
   - **Got stuck in TCP black hole for 3+ hours**

2. **Thread 26 (FastAPI Health Check)**:
   - Tried to log health check response
   - **Blocked at `logging/__init__.py:927` waiting for same lock**
   - Health check never completes → Kubernetes thinks pod is dead

3. **All Other Threads**: Similarly blocked on any logging attempt

### Why gRPC Timeout Failed
The gRPC call had a 60-second timeout but has been stuck for 10,775+
seconds because:
- **TCP Black Hole**: Network packets silently dropped (firewall/load
balancer timeout)
- **No Socket Timeout**: Python default is `None` (infinite wait)
- **TCP Keepalive Disabled**: Dead connections hang forever  
- **Kernel-Level Block**: gRPC timeout can't interrupt `socket.recv()`
syscall

### Evidence from Thread Dump
```python
Thread 58: "ThreadPoolExecutor-0_1" 
  _blocking (grpc/_channel.py:1162)
    timeout: 60                    # ← Should have timed out
    deadline: 1755061203          # ← Expired 3 hours ago\!
  emit (logging_v2/handlers/handlers.py:225)  # ← HOLDING LOCK
  handle (logging/__init__.py:978)           # ← After acquire()

Thread 26: "Thread-4 (__start_fastapi)"
  acquire (logging/__init__.py:927)          # ← BLOCKED waiting for lock
    self: <CloudLoggingHandler at 0x7a657280d550>  # ← Same instance\!
```

## 🔧 The Fix

### Primary Solution
Replace **blocking** `SyncTransport` with **non-blocking**
`BackgroundThreadTransport`:

```python
# BEFORE (Dangerous - blocks while holding lock)
transport=SyncTransport,

# AFTER (Safe - queues and returns immediately) 
transport=BackgroundThreadTransport,
```

### Why BackgroundThreadTransport Solves It
1. **Non-blocking**: `emit()` returns immediately after queuing
2. **Lock Released**: No network I/O while holding the logging lock
3. **Isolated Failures**: Background thread hangs don't affect main app
4. **Better Performance**: Built-in batching and retry logic

### Additional Hardening
- **Socket Timeout**: 30-second global timeout prevents infinite hangs
- **gRPC Keepalive**: Detects and closes dead connections faster
- **Comprehensive Logging**: Comments explain the deadlock prevention

## 🧪 Technical Validation

### Before (SyncTransport)
```
log.info("message") 
  ↓
acquire_lock() 
  ↓  
gRPC_call()  HANGS FOR HOURS
  ↓
[DEADLOCK - lock never released]
```

### After (BackgroundThreadTransport)  
```
log.info("message")
  ↓
acquire_lock() 
  ↓
queue_message()  Instant
  ↓
release_lock()  Immediate
  ↓
[Background thread handles gRPC separately]
```

## 🚀 Impact & Benefits

**Immediate Impact**:
-  Prevents CloudLoggingHandler deadlocks
-  Health checks respond normally  
-  System remains observable during network issues
-  Scheduler can continue processing jobs

**Long-term Benefits**:
- 📈 Better logging performance (batching + async)
- 🛡️ Resilient to network partitions and timeouts
- 🔍 Maintained observability during failures  
-  No blocking I/O on critical application threads

## 📊 Files Changed
- `autogpt_libs/autogpt_libs/logging/config.py`: Transport change +
socket hardening

## 🧪 Test Plan
- [x] Validate BackgroundThreadTransport import works
- [x] Confirm socket timeout configuration applies
- [x] Verify gRPC keepalive environment variables set
- [ ] Deploy to staging and verify no deadlocks under load
- [ ] Monitor Cloud Logging delivery remains reliable

## 🔍 Monitoring After Deploy
- Watch for any logging delivery delays (expected: minimal)
- Confirm health checks respond consistently  
- Verify no more scheduler "hanging" incidents
- Monitor gRPC connection patterns in Cloud Logging metrics

## 🎯 Risk Assessment
- **Risk**: Very Low - BackgroundThreadTransport is the recommended
approach
- **Rollback**: Simple revert if any issues observed
- **Testing**: Extensively used in production Google Cloud services

---

**This fixes a critical production stability issue affecting scheduler
reliability and system observability.**

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
This commit is contained in:
Zamil Majdy
2025-08-13 17:23:09 +04:00
committed by GitHub
parent 41f500790f
commit 34dd218a91

View File

@@ -1,6 +1,8 @@
"""Logging module for Auto-GPT."""
import logging
import os
import socket
import sys
from pathlib import Path
@@ -10,6 +12,15 @@ from pydantic_settings import BaseSettings, SettingsConfigDict
from .filters import BelowLevelFilter
from .formatters import AGPTFormatter
# Configure global socket timeout and gRPC keepalive to prevent deadlocks
# This must be done at import time before any gRPC connections are established
socket.setdefaulttimeout(30) # 30-second socket timeout
# Enable gRPC keepalive to detect dead connections faster
os.environ.setdefault("GRPC_KEEPALIVE_TIME_MS", "30000") # 30 seconds
os.environ.setdefault("GRPC_KEEPALIVE_TIMEOUT_MS", "5000") # 5 seconds
os.environ.setdefault("GRPC_KEEPALIVE_PERMIT_WITHOUT_CALLS", "true")
LOG_DIR = Path(__file__).parent.parent.parent.parent / "logs"
LOG_FILE = "activity.log"
DEBUG_LOG_FILE = "debug.log"
@@ -79,7 +90,6 @@ def configure_logging(force_cloud_logging: bool = False) -> None:
Note: This function is typically called at the start of the application
to set up the logging infrastructure.
"""
config = LoggingConfig()
log_handlers: list[logging.Handler] = []
@@ -105,13 +115,17 @@ def configure_logging(force_cloud_logging: bool = False) -> None:
if config.enable_cloud_logging or force_cloud_logging:
import google.cloud.logging
from google.cloud.logging.handlers import CloudLoggingHandler
from google.cloud.logging_v2.handlers.transports.sync import SyncTransport
from google.cloud.logging_v2.handlers.transports import (
BackgroundThreadTransport,
)
client = google.cloud.logging.Client()
# Use BackgroundThreadTransport to prevent blocking the main thread
# and deadlocks when gRPC calls to Google Cloud Logging hang
cloud_handler = CloudLoggingHandler(
client,
name="autogpt_logs",
transport=SyncTransport,
transport=BackgroundThreadTransport,
)
cloud_handler.setLevel(config.level)
log_handlers.append(cloud_handler)