fix(copilot): keep Redis stream alive during sub-agent execution

During long sub-agent runs, the SDK sends only task_progress
SystemMessages, for which the response adapter previously produced no
stream chunks. publish_chunk was therefore never called during those
gaps, so BOTH the meta key and the stream key expired in Redis once a
gap outlasted stream_ttl (default 1h).

Fix:
- response_adapter: emit StreamHeartbeat for task_progress events,
  so publish_chunk is called even during sub-agent gaps
- stream_registry: refresh stream key TTL alongside meta key in the
  periodic keepalive block (every 60s)
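
The throttled refresh in the second bullet can be sketched roughly as
follows. This is a minimal illustration, not the stream_registry code:
FakeRedis, refresh_ttls, the "meta:"/"stream:" key names and the two
constants are assumptions made up for the example.

```python
import asyncio
import time

STREAM_TTL = 3600       # seconds; assumed to mirror config.stream_ttl (default 1h)
REFRESH_INTERVAL = 60   # seconds; assumed to mirror the 60s keepalive interval


class FakeRedis:
    """Hypothetical in-memory stand-in that only records expire() calls."""

    def __init__(self):
        self.ttls = {}
        self.expire_calls = []

    async def expire(self, key, ttl):
        self.ttls[key] = ttl
        self.expire_calls.append(key)


# Timestamp of the last refresh per session (hypothetical module state).
_last_refresh = {}


async def refresh_ttls(redis, session_id, now=None):
    """Refresh both keys at most once per REFRESH_INTERVAL.

    Returns True if the TTLs were refreshed, False if the call was throttled.
    """
    now = time.monotonic() if now is None else now
    last = _last_refresh.get(session_id)
    if last is not None and now - last < REFRESH_INTERVAL:
        return False
    # Refresh the meta key and the stream key together so neither
    # can outlive the other.
    await redis.expire(f"meta:{session_id}", STREAM_TTL)
    await redis.expire(f"stream:{session_id}", STREAM_TTL)
    _last_refresh[session_id] = now
    return True
```

Passing an explicit `now` keeps the sketch deterministic; a real caller
would rely on the clock.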

This ensures that as long as the SDK is producing any events (including
task_progress), both Redis keys stay alive. The failure was confirmed
via live reproduction: session d2f7cba3 T13 ran for 1h45min+ and both
keys expired because only task_progress events were arriving.
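
Why a heartbeat is enough can be sketched as below, assuming the
publisher is invoked once per adapter response. SystemMessage,
StreamHeartbeat, convert and pump here are simplified stand-ins defined
only for this example, not the SDK or backend classes.

```python
from dataclasses import dataclass


@dataclass
class SystemMessage:
    """Hypothetical stand-in for the SDK message type."""
    subtype: str


@dataclass
class StreamHeartbeat:
    """Hypothetical stand-in for the heartbeat response chunk."""


def convert(message):
    """Return the stream responses produced for one SDK message."""
    responses = []
    if isinstance(message, SystemMessage) and message.subtype == "task_progress":
        # Without this branch the message yields no responses, so the
        # publisher (and its TTL refresh) is never reached.
        responses.append(StreamHeartbeat())
    return responses


def pump(messages, publish):
    """Publish every response and return how many publish calls were made."""
    calls = 0
    for message in messages:
        for response in convert(message):
            publish(response)
            calls += 1
    return calls
```

In this sketch every task_progress message results in exactly one
publish call, which is the hook the TTL refresh depends on.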
Zamil Majdy
2026-03-30 16:29:07 +00:00
parent a504fe532a
commit 8763a94436
2 changed files with 15 additions and 3 deletions


@@ -27,6 +27,7 @@ from backend.copilot.response_model import (
     StreamError,
     StreamFinish,
     StreamFinishStep,
+    StreamHeartbeat,
     StreamStart,
     StreamStartStep,
     StreamTextDelta,
@@ -76,6 +77,12 @@ class SDKResponseAdapter:
         # Open the first step (matches non-SDK: StreamStart then StreamStartStep)
         responses.append(StreamStartStep())
         self.step_open = True
+    elif sdk_message.subtype == "task_progress":
+        # Emit a heartbeat so publish_chunk is called during long
+        # sub-agent runs. Without this, the Redis stream and meta
+        # key TTLs expire during gaps where no real chunks are
+        # produced (task_progress events were previously silent).
+        responses.append(StreamHeartbeat())
 elif isinstance(sdk_message, AssistantMessage):
     # Flush any SDK built-in tool calls that didn't get a UserMessage


@@ -280,16 +280,21 @@ async def publish_chunk(
 # Set TTL on stream to match session metadata TTL
 await redis.expire(stream_key, config.stream_ttl)
-# Periodically refresh the session meta key TTL so it doesn't expire
+# Periodically refresh session-related TTLs so they don't expire
 # during long-running turns. Without this, turns exceeding stream_ttl
-# (default 1h) lose their "running" status and become invisible to
-# the resume endpoint — causing empty sessions on page reload.
+# (default 1h) lose their "running" status and stream data, making
+# the session invisible to the resume endpoint (empty on page reload).
+# Both meta key AND stream key are refreshed: the stream key's expire
+# above only fires when publish_chunk is called, but during long
+# sub-agent gaps (task_progress events don't produce chunks), neither
+# key gets refreshed.
 if session_id:
     now = time.perf_counter()
     last_refresh = _meta_ttl_refresh_at.get(session_id, 0)
     if now - last_refresh >= _META_TTL_REFRESH_INTERVAL:
         meta_key = _get_session_meta_key(session_id)
         await redis.expire(meta_key, config.stream_ttl)
+        await redis.expire(stream_key, config.stream_ttl)
         _meta_ttl_refresh_at[session_id] = now
 total_time = (time.perf_counter() - start_time) * 1000