mirror of
https://github.com/Significant-Gravitas/AutoGPT.git
synced 2026-02-19 02:54:28 -05:00
## Summary Fix critical queue blocking issue where rate-limited user messages prevent other users' executions from being processed, causing the 135 late executions reported in production. ## Root Cause Analysis When a user exceeds `max_concurrent_graph_executions_per_user` (25), the executor uses `basic_nack(requeue=True)` which sends the message to the **FRONT** of the RabbitMQ queue. This creates an infinite blocking loop where: 1. Rate-limited message goes to front of queue 2. Gets processed, hits rate limit again 3. Goes back to front of queue 4. Blocks all other users' messages indefinitely ## Solution Implementation ### 🔧 Core Changes - **New setting**: `requeue_by_republishing` (default: `True`) in `backend/util/settings.py` - **Smart `_ack_message`**: Automatically uses republishing when `requeue=True` and setting enabled - **Efficient implementation**: Uses existing `self.run_client` connection instead of creating new ones - **Integration test**: Real RabbitMQ test validates queue ordering behavior ### 🔄 Technical Implementation **Before (blocking):** ```python basic_nack(delivery_tag, requeue=True) # Goes to FRONT of queue ❌ ``` **After (non-blocking):** ```python if requeue and self.config.requeue_by_republishing: # First: Republish to BACK of queue self.run_client.publish_message(...) # Then: Reject without requeue basic_nack(delivery_tag, requeue=False) ``` ### 📊 Impact - ✅ **Other users' executions no longer blocked** by rate-limited users - ✅ **Fair queue processing** - FIFO behavior maintained for all users - ✅ **Rate limiting still works** - just doesn't block others - ✅ **Configurable** - can revert to old behavior with `requeue_by_republishing=False` - ✅ **Zero performance impact** - uses existing connections ## Test Plan - **Integration test**: `test_requeue_integration.py` validates real RabbitMQ queue ordering - **Scenario testing**: Confirms rate-limited messages go to back of queue - **Cross-user validation**: Verifies other users' messages process correctly - **Setting test**: Confirms configuration loads with correct defaults ## Deployment Strategy This is a **hotfix** that can be deployed immediately: - **Backward compatible**: Old behavior available via config - **Safe default**: New behavior is safer than current state - **No breaking changes**: All existing functionality preserved - **Immediate relief**: Resolves production queue blocking ## Files Modified - `backend/executor/manager.py`: Enhanced `_ack_message` logic and `_requeue_message_to_back` method - `backend/util/settings.py`: Added `requeue_by_republishing` configuration field - `test_requeue_integration.py`: Integration test for queue ordering validation ## Related Issues Fixes the 135 late executions issue where messages were stuck in QUEUED state despite available executor capacity (583m/600m utilization). 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>
13 KiB
13 KiB