Files
AutoGPT/.github
Zamil Majdy 2879528308 feat(backend): Redis Cluster client support (#12900)
## Why

Pre-launch scaling. Redis is currently a single-master pod — a real
SPOF, and not scalable horizontally. To move it to a sharded Redis
Cluster (via KubeBlocks in GKE), the backend has to speak the cluster
protocol.

Keeping both "standalone" and "cluster" code paths would have local dev
not reflect prod. Going **cluster-only**.

## What

- `backend.data.redis_client` now always constructs `RedisCluster`
(sync) / `redis.asyncio.cluster.RedisCluster` (async). Type aliases
`RedisClient` / `AsyncRedisClient` point at the cluster classes.
- `RedisCluster` uses the existing `REDIS_HOST` / `REDIS_PORT` as a
startup node and auto-discovers peers via `CLUSTER SLOTS`.
- Classic Redis pub/sub is broadcast cluster-wide and redis-py's async
`RedisCluster` has no `.pubsub()`; dedicated `get_redis_pubsub[_async]`
helpers return plain `(Async)Redis` clients to the seed node. All
pub/sub callers (`event_bus`, `notification_bus`,
`copilot.pending_messages`) route through these helpers.
- `rate_limit.py` MULTI/EXEC pipelines are split per-counter — daily and
weekly counters hash to different slots, which `RedisCluster` correctly
rejects as `CrossSlotTransactionError`. Per-counter `INCRBY + EXPIRE`
atomicity is preserved; the counters are logically independent budgets.
- `util/cache.py` shared-cache client is also `RedisCluster` now.
- Pre-existing mock-based unit tests updated; new `redis_client_test.py`
covers the swap.

## Local dev

`docker-compose.platform.yml` now runs **2-master Redis Cluster**
(`redis` + `redis-2`, 16384 slots split 0-8191 / 8192-16383). A one-shot
`redis-init` sidecar bootstraps it on first boot via raw `CLUSTER MEET`
+ `CLUSTER ADDSLOTSRANGE` (bundled `redis-cli --cluster create` enforces
a 3-node minimum).

This deliberately catches cross-slot bugs on a laptop rather than in
prod:

```
>>> ALL SMOKE TESTS PASS <<<
[sync] class: RedisCluster
[sync] 20 keys across slots: OK
[sync] colocated MULTI/EXEC: OK [5, 12, 1]
[sync] cross-slot MULTI/EXEC rejected as expected: CrossSlotTransactionError
[sync] EVAL single-key: OK
[sync] pub/sub (classic, broadcast): OK
[async] class: RedisCluster
[async] 15 keys across slots: OK
[async] colocated pipeline: OK
[async] pub/sub: OK
```

`rest_server` `/health` → 200, both shards have connected clients + keys
distributed 19/19 under the smoke run. `executor` boots + connects to
RabbitMQ + Redis cleanly.

For a 3-shard override (6 pods, with replicas) when you want to test
real KubeBlocks topology:
```
docker compose -f docker-compose.yml -f docker-compose.redis-cluster.yml up -d
```

## Deploy order (companion infra PR:
[cloud_infrastructure#312](https://github.com/Significant-Gravitas/AutoGPT_cloud_infrastructure/pull/312))

The existing `helm/redis` chart is updated in that PR to run as a
1-shard cluster (backwards-compatible toggle, default on). That rollout
must land before this PR's image goes live so the backend's
`RedisCluster` client has something to discover.

Sequence:
1. Infra: `helm upgrade redis` (1-shard cluster-enabled)
2. Infra: `helm upgrade rabbit-mq` (3-node cluster)
3. Backend: merge + deploy this PR
4. Follow-up: swap to KubeBlocks `redis-cluster` chart (3-shard sharded,
already staged in infra PR)

## Caveats / follow-ups

- Classic pub/sub via seed node means every node in the cluster sees
every message (broadcast). Fine at current volume; if it becomes hot,
migrate to `SPUBLISH`/`SSUBSCRIBE` (Redis 7+ sharded pub/sub).
- Per-user rate-limit counters (daily vs weekly) lost cross-counter
transactionality, but per-counter atomicity is preserved — the two
counters are independent budgets so no correctness regression.
- Local 2-master cluster crashes lose the cluster state; `redis-init`
idempotently rebootstraps.

## Checklist

- [x] Lint + format pass (`poetry run format` + `poetry run lint`)
- [x] Unit tests pass — `redis_client_test`, `redis_helpers_test`,
`event_bus_test`, `pending_messages_test`, `rate_limit_test`,
`cluster_lock_test`
- [x] Live smoke against 2-master cluster — sync + async; MULTI/EXEC;
EVAL; pub/sub; cross-slot rejection
- [x] Full stack smoke — `rest_server` /health, `executor` boot, keys
distributed across both shards
- [ ] Dev deploy (pending infra PR merge + manual validation)
2026-04-28 22:21:23 +07:00
..