# fix(backend): resolve CloudLoggingHandler deadlock causing scheduler hangs (#10634)

Commit 34dd218a91 by Zamil Majdy
## 🚨 Critical Deadlock Fix: Scheduler Pod Stuck for 3+ Hours

This PR resolves a critical production deadlock where scheduler pods
become completely unresponsive due to a CloudLoggingHandler locking
issue.

## 📋 Incident Summary

**Affected Pod**: `autogpt-scheduler-server-6d7b89c4f9-mqp59`
- **Duration**: Stuck for 3+ hours (still ongoing)
- **Symptoms**: Health checks failing, appears completely dead
- **Impact**: No new job executions, system appears down
- **Root Cause**: CloudLoggingHandler deadlock with gRPC timeout failure

## 🔍 Detailed Incident Analysis

### The Deadlock Chain
1. **Thread 58 (APScheduler Worker)**: 
   - Completed job successfully
   - Called `logger.info("Job executed successfully")`
   - CloudLoggingHandler acquired lock at `logging/__init__.py:976`
   - Made gRPC call to Google Cloud Logging
   - **Got stuck in a TCP black hole for 3+ hours**

2. **Thread 26 (FastAPI Health Check)**:
   - Tried to log health check response
   - **Blocked at `logging/__init__.py:927` waiting for same lock**
   - Health check never completes → Kubernetes thinks pod is dead

3. **All Other Threads**: Similarly blocked on any logging attempt (a minimal sketch of this lock interaction follows below)
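
To make the lock interaction concrete, here is a minimal, self-contained sketch (not the production code) of the same failure mode: a handler whose `emit()` does blocking I/O while `Handler.handle()` still holds the handler's lock starves every other thread that tries to log.

```python
import logging
import threading
import time

class BlockingTransportHandler(logging.Handler):
    """Stand-in for CloudLoggingHandler + SyncTransport: emit() performs
    slow network I/O while the handler's lock (taken by Handler.handle)
    is still held."""
    def emit(self, record):
        time.sleep(30)  # simulates the gRPC call stuck in a TCP black hole

logger = logging.getLogger("demo")
logger.addHandler(BlockingTransportHandler())
logger.setLevel(logging.INFO)

def scheduler_worker():
    # "Thread 58": takes the handler lock, then hangs inside emit().
    logger.info("Job executed successfully")

def health_check():
    # "Thread 26": blocks in Handler.acquire() waiting for the same lock,
    # so the health-check log line (and response) never completes.
    logger.info("health check ok")

threading.Thread(target=scheduler_worker, daemon=True).start()
time.sleep(0.1)  # let the worker grab the lock first
t = threading.Thread(target=health_check, daemon=True)
t.start()
t.join(timeout=2)
print("health check thread still blocked:", t.is_alive())  # → True
```

With a non-blocking transport, `emit()` only enqueues the record, so the lock is released almost immediately and the second thread proceeds.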

### Why gRPC Timeout Failed
The gRPC call had a 60-second timeout but has been stuck for 10,775+
seconds because:
- **TCP Black Hole**: Network packets silently dropped (firewall/load
balancer timeout)
- **No Socket Timeout**: Python's default is `None` (infinite wait; demonstrated below)
- **TCP Keepalive Disabled**: Dead connections hang forever  
- **Kernel-Level Block**: gRPC timeout can't interrupt `socket.recv()`
syscall
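
The "no socket timeout" point is easy to verify: CPython ships with no global socket timeout, so a blocking `recv()` on a dead connection waits indefinitely unless a timeout is set explicitly. A quick check:

```python
import socket

# CPython's global default socket timeout is None, i.e. blocking socket
# calls such as recv() wait forever unless a timeout is set on the socket.
print(socket.getdefaulttimeout())  # → None
```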

### Evidence from Thread Dump
```python
Thread 58: "ThreadPoolExecutor-0_1" 
  _blocking (grpc/_channel.py:1162)
    timeout: 60                    # ← Should have timed out
    deadline: 1755061203          # ← Expired 3 hours ago!
  emit (logging_v2/handlers/handlers.py:225)  # ← HOLDING LOCK
  handle (logging/__init__.py:978)           # ← After acquire()

Thread 26: "Thread-4 (__start_fastapi)"
  acquire (logging/__init__.py:927)          # ← BLOCKED waiting for lock
    self: <CloudLoggingHandler at 0x7a657280d550>  # ← Same instance!
```

## 🔧 The Fix

### Primary Solution
Replace **blocking** `SyncTransport` with **non-blocking**
`BackgroundThreadTransport`:

```python
# BEFORE (Dangerous - blocks while holding lock)
transport=SyncTransport,

# AFTER (Safe - queues and returns immediately) 
transport=BackgroundThreadTransport,
```
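
For context, here is a minimal sketch of how the handler ends up wired with the non-blocking transport. The google-cloud-logging imports and the `CloudLoggingHandler` arguments are real library APIs; the factory function, log name, and everything else are illustrative and not the exact contents of `config.py`.

```python
import google.cloud.logging
from google.cloud.logging_v2.handlers import CloudLoggingHandler
from google.cloud.logging_v2.handlers.transports import BackgroundThreadTransport

def build_cloud_handler(log_name: str = "autogpt-scheduler") -> CloudLoggingHandler:
    """Illustrative factory: BackgroundThreadTransport queues records and a
    background worker thread ships them to Cloud Logging, so emit() never
    performs network I/O while the stdlib logging lock is held."""
    client = google.cloud.logging.Client()
    return CloudLoggingHandler(
        client,
        name=log_name,
        transport=BackgroundThreadTransport,  # non-blocking (the fix)
        # transport=SyncTransport,            # blocking (the deadlock-prone setting)
    )
```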

### Why BackgroundThreadTransport Solves It
1. **Non-blocking**: `emit()` returns immediately after queuing
2. **Lock Released**: No network I/O while holding the logging lock
3. **Isolated Failures**: Background thread hangs don't affect main app
4. **Better Performance**: Built-in batching and retry logic

### Additional Hardening
- **Socket Timeout**: 30-second global timeout prevents infinite hangs (sketched after this list)
- **gRPC Keepalive**: Detects and closes dead connections faster
- **Inline Documentation**: Comments in the code explain the deadlock prevention
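
A sketch of the socket-side hardening, assuming it runs once during logging setup. The 30-second value comes from this PR description; the function name and placement are illustrative, and the exact gRPC keepalive mechanism used in `config.py` is not shown here.

```python
import socket

def harden_network_defaults(timeout_seconds: float = 30.0) -> None:
    """Illustrative setup hook: give every new socket a finite timeout so a
    TCP black hole surfaces as an exception instead of an infinite hang.
    gRPC keepalive is configured separately (typically via channel options
    such as "grpc.keepalive_time_ms"); that part is omitted here."""
    socket.setdefaulttimeout(timeout_seconds)

harden_network_defaults()
```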

## 🧪 Technical Validation

### Before (SyncTransport)
```
log.info("message") 
  ↓
acquire_lock() 
  ↓  
gRPC_call()   ← HANGS FOR HOURS
  ↓
[DEADLOCK - lock never released]
```

### After (BackgroundThreadTransport)  
```
log.info("message")
  ↓
acquire_lock() 
  ↓
queue_message()   ← Instant
  ↓
release_lock()   ← Immediate
  ↓
[Background thread handles gRPC separately]
```

## 🚀 Impact & Benefits

**Immediate Impact**:
- Prevents CloudLoggingHandler deadlocks
- Health checks respond normally
- System remains observable during network issues
- Scheduler can continue processing jobs

**Long-term Benefits**:
- 📈 Better logging performance (batching + async)
- 🛡️ Resilient to network partitions and timeouts
- 🔍 Maintained observability during failures  
- No blocking I/O on critical application threads

## 📊 Files Changed
- `autogpt_libs/autogpt_libs/logging/config.py`: Transport change +
socket hardening

## 🧪 Test Plan
- [x] Validate BackgroundThreadTransport import works (quick check sketched after this list)
- [x] Confirm socket timeout configuration applies
- [x] Verify gRPC keepalive environment variables set
- [ ] Deploy to staging and verify no deadlocks under load
- [ ] Monitor Cloud Logging delivery remains reliable
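
A quick, hedged sanity check for the first two items. It assumes the 30-second timeout is applied via the global socket default after the platform's logging setup has run; adjust if `config.py` scopes the timeout differently.

```python
import socket

# Item 1: the non-blocking transport can be imported.
from google.cloud.logging_v2.handlers.transports import BackgroundThreadTransport  # noqa: F401

# Item 2: after logging setup has run (however config.py exposes it),
# a finite socket timeout should be in effect.
print("socket default timeout:", socket.getdefaulttimeout())
```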

## 🔍 Monitoring After Deploy
- Watch for any logging delivery delays (expected: minimal)
- Confirm health checks respond consistently  
- Verify no more scheduler "hanging" incidents
- Monitor gRPC connection patterns in Cloud Logging metrics

## 🎯 Risk Assessment
- **Risk**: Very low; BackgroundThreadTransport is the recommended approach and the default transport for `CloudLoggingHandler`
- **Rollback**: Simple revert if any issues observed
- **Testing**: Extensively used in production Google Cloud services

---

**This fixes a critical production stability issue affecting scheduler
reliability and system observability.**

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
Committed: 2025-08-13 13:23:09 +00:00

# AutoGPT Libs

This is a new project to store shared functionality across different services in the AutoGPT Platform (e.g. authentication).