mirror of
https://github.com/Significant-Gravitas/AutoGPT.git
synced 2026-04-30 03:00:41 -04:00
## Why

Pre-launch scaling. Redis is currently a single-master pod: a real SPOF, and not horizontally scalable. To move it to a sharded Redis Cluster (via KubeBlocks in GKE), the backend has to speak the cluster protocol. Keeping both "standalone" and "cluster" code paths would mean local dev no longer reflects prod, so this PR goes **cluster-only**.

## What

- `backend.data.redis_client` now always constructs `RedisCluster` (sync) / `redis.asyncio.cluster.RedisCluster` (async). Type aliases `RedisClient` / `AsyncRedisClient` point at the cluster classes.
- `RedisCluster` uses the existing `REDIS_HOST` / `REDIS_PORT` as a startup node and auto-discovers peers via `CLUSTER SLOTS`.
- Classic Redis pub/sub is broadcast cluster-wide and redis-py's async `RedisCluster` has no `.pubsub()`; dedicated `get_redis_pubsub[_async]` helpers return plain `(Async)Redis` clients to the seed node. All pub/sub callers (`event_bus`, `notification_bus`, `copilot.pending_messages`) route through these helpers.
- `rate_limit.py` MULTI/EXEC pipelines are split per-counter: daily and weekly counters hash to different slots, which `RedisCluster` correctly rejects with `CrossSlotTransactionError`. Per-counter `INCRBY + EXPIRE` atomicity is preserved; the counters are logically independent budgets.
- `util/cache.py` shared-cache client is also `RedisCluster` now.
- Pre-existing mock-based unit tests updated; new `redis_client_test.py` covers the swap.

## Local dev

`docker-compose.platform.yml` now runs a **2-master Redis Cluster** (`redis` + `redis-2`, 16384 slots split 0-8191 / 8192-16383). A one-shot `redis-init` sidecar bootstraps it on first boot via raw `CLUSTER MEET` + `CLUSTER ADDSLOTSRANGE` (the bundled `redis-cli --cluster create` enforces a 3-node minimum).
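
As a rough illustration of the slot split above, the sketch below computes a key's hash slot as the Redis Cluster specification describes it (CRC16/XMODEM of the key, or of its `{hash tag}` when one is present, modulo 16384) and maps the slot to the local master that owns it. The key names and the `owning_master` helper are hypothetical and not taken from the backend code; only the 0-8191 / 8192-16383 split comes from the compose setup.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM (polynomial 0x1021, initial value 0), as used by Redis Cluster."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Return the cluster hash slot (0-16383) for a key, honouring hash tags."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty {...} section counts as a hash tag.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return crc16_xmodem(data) % 16384


def owning_master(key: str) -> str:
    """Hypothetical helper: which local compose service owns this key's slot."""
    return "redis" if key_slot(key) <= 8191 else "redis-2"


if __name__ == "__main__":
    # Hypothetical key names, for illustration only.
    for key in ("usage:daily:user-1", "usage:weekly:user-1",
                "{user-1}:daily", "{user-1}:weekly"):
        print(f"{key!r}: slot {key_slot(key)} on {owning_master(key)}")
```

Keys that share a hash tag, such as `{user-1}:daily` and `{user-1}:weekly`, always map to the same slot, which is the usual way to keep keys colocated when a multi-key transaction really is required.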

The 2-master setup deliberately catches cross-slot bugs on a laptop rather than in prod:

```
>>> ALL SMOKE TESTS PASS <<<
[sync] class: RedisCluster
[sync] 20 keys across slots: OK
[sync] colocated MULTI/EXEC: OK [5, 12, 1]
[sync] cross-slot MULTI/EXEC rejected as expected: CrossSlotTransactionError
[sync] EVAL single-key: OK
[sync] pub/sub (classic, broadcast): OK
[async] class: RedisCluster
[async] 15 keys across slots: OK
[async] colocated pipeline: OK
[async] pub/sub: OK
```

`rest_server` `/health` → 200, both shards have connected clients + keys distributed 19/19 under the smoke run. `executor` boots + connects to RabbitMQ + Redis cleanly.

For a 3-shard override (6 pods, with replicas) when you want to test real KubeBlocks topology:

```
docker compose -f docker-compose.yml -f docker-compose.redis-cluster.yml up -d
```

## Deploy order (companion infra PR: [cloud_infrastructure#312](https://github.com/Significant-Gravitas/AutoGPT_cloud_infrastructure/pull/312))

The existing `helm/redis` chart is updated in that PR to run as a 1-shard cluster (backwards-compatible toggle, default on). That rollout must land before this PR's image goes live so the backend's `RedisCluster` client has something to discover.

Sequence:

1. Infra: `helm upgrade redis` (1-shard cluster-enabled)
2. Infra: `helm upgrade rabbit-mq` (3-node cluster)
3. Backend: merge + deploy this PR
4. Follow-up: swap to KubeBlocks `redis-cluster` chart (3-shard sharded, already staged in infra PR)

## Caveats / follow-ups

- Classic pub/sub via the seed node means every node in the cluster sees every message (broadcast). Fine at current volume; if it becomes hot, migrate to `SPUBLISH`/`SSUBSCRIBE` (Redis 7+ sharded pub/sub).
- Per-user rate-limit counters (daily vs weekly) lost cross-counter transactionality, but per-counter atomicity is preserved. The two counters are independent budgets, so there is no correctness regression.
- If the local 2-master cluster crashes, the cluster state is lost; `redis-init` idempotently re-bootstraps it.
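
The per-counter split for the rate-limit counters can be sketched roughly as follows. This is a minimal illustration, not the code in `rate_limit.py`: the function names, key layout and TTLs are assumptions, the real backend path may be async, and `client` is assumed to be any object exposing a redis-py style `pipeline(transaction=True)` (for example a `RedisCluster`). Each transaction touches exactly one key, so it stays inside a single slot.

```python
from typing import Any, Tuple

DAY_SECONDS = 86_400  # assumed TTLs, for illustration only
WEEK_SECONDS = 7 * DAY_SECONDS


def incr_with_ttl(client: Any, key: str, amount: int, ttl_seconds: int) -> int:
    """Run INCRBY + EXPIRE for ONE key in a single MULTI/EXEC and return the new value."""
    pipe = client.pipeline(transaction=True)
    pipe.incrby(key, amount)
    pipe.expire(key, ttl_seconds)
    new_value, _expire_ok = pipe.execute()
    return int(new_value)


def record_usage(client: Any, user_id: str, amount: int) -> Tuple[int, int]:
    """Bump the daily and weekly counters in two independent single-key transactions.

    The key names below are hypothetical. Because the two keys may live in
    different hash slots, they are never combined into one transaction.
    """
    daily = incr_with_ttl(client, f"usage:daily:{user_id}", amount, DAY_SECONDS)
    weekly = incr_with_ttl(client, f"usage:weekly:{user_id}", amount, WEEK_SECONDS)
    return daily, weekly
```

The trade-off matches the caveat above: a failure between the two calls can leave the daily counter incremented and the weekly one not, which is acceptable only because the counters are treated as independent budgets.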

## Checklist

- [x] Lint + format pass (`poetry run format` + `poetry run lint`)
- [x] Unit tests pass: `redis_client_test`, `redis_helpers_test`, `event_bus_test`, `pending_messages_test`, `rate_limit_test`, `cluster_lock_test`
- [x] Live smoke against 2-master cluster: sync + async; MULTI/EXEC; EVAL; pub/sub; cross-slot rejection
- [x] Full stack smoke: `rest_server` /health, `executor` boot, keys distributed across both shards
- [ ] Dev deploy (pending infra PR merge + manual validation)