Compare commits


30 Commits

Author SHA1 Message Date
Zamil Majdy
79bc0aed91 fix(backend): guard intermediate DB flush with is_final_attempt flag
Add is_final_attempt field to _RetryState so intermediate DB flushes
only run on attempt 0 (optimistic — most turns succeed on the first try)
and the final retry attempt. Middle retry attempts may be rolled back, and
messages already flushed to DB would persist as orphans since the
in-memory rollback (session.messages truncation) has no corresponding
DB delete.
2026-04-01 06:44:02 +02:00
Zamil Majdy
2fa5c37413 Merge branch 'dev' of github.com:Significant-Gravitas/AutoGPT into fix/copilot-search-cap-and-persistence 2026-04-01 06:15:30 +02:00
Zamil Majdy
d7f324bc9f Merge branch 'dev' of github.com:Significant-Gravitas/AutoGPT into fix/copilot-search-cap-and-persistence 2026-03-31 19:07:29 +02:00
Zamil Majdy
ffcb88251a style: move StreamHeartbeat import to module top-level 2026-03-31 16:34:45 +02:00
Zamil Majdy
f3aef1ecbc revert(copilot): remove delete_messages_from_sequence
The orphan message scenario (intermediate flush persists messages, then
stream attempt is retried and rolls back) is practically unreachable:
retries only fire on context-too-long with events_yielded==0, meaning
the stream barely started and the flush threshold (30s/10 messages)
could not have been reached. Removing the delete operation eliminates
the risk of accidentally deleting legitimate messages.
2026-03-31 16:14:28 +02:00
Zamil Majdy
053afde64d revert(copilot): remove circuit breakers and re-enable Task tool
The WebSearch/total tool call caps and Task tool disabling were band-aid
fixes that limited capability rather than addressing root causes. The
real bug fixes (Redis TTL refresh, intermediate DB persistence, orphan
message cleanup, StreamHeartbeat) remain in place. Task concurrency
limits (max_subtasks) and prompt-level search best practices provide
sufficient guardrails without artificially capping tool usage.
2026-03-31 15:59:07 +02:00
Zamil Majdy
687ee1f280 Merge branch 'dev' of github.com:Significant-Gravitas/AutoGPT into fix/copilot-search-cap-and-persistence 2026-03-31 15:18:06 +02:00
Zamil Majdy
41a11d74b3 Merge branch 'fix/copilot-search-cap-and-persistence' of github.com:Significant-Gravitas/AutoGPT into fix/copilot-search-cap-and-persistence 2026-03-31 15:12:40 +02:00
Zamil Majdy
4df0714e2a fix(copilot): clean up orphaned DB messages in _HandledStreamError rollback 2026-03-31 15:12:18 +02:00
Zamil Majdy
b365a3337b Reapply "fix(copilot): detect truncated write_workspace_file and guide LLM to source_path"
This reverts commit ac49e72745.
2026-03-31 12:49:10 +00:00
Zamil Majdy
935b59ce43 Merge branch 'fix/copilot-search-cap-and-persistence' of github.com:Significant-Gravitas/AutoGPT into fix/copilot-search-cap-and-persistence 2026-03-31 14:46:03 +02:00
Zamil Majdy
77112e79a2 fix(copilot): clean up orphaned DB messages on stream attempt rollback
When intermediate flushes persist messages during a stream attempt that
later fails and is retried, the in-memory rollback now also deletes the
orphaned messages from the DB via delete_messages_from_sequence. This
prevents stale messages from resurfacing on page reload.
2026-03-31 14:45:21 +02:00
Zamil Majdy
ac49e72745 Revert "fix(copilot): detect truncated write_workspace_file and guide LLM to source_path"
This reverts commit 8ae344863e.
2026-03-31 12:45:04 +00:00
Zamil Majdy
8ae344863e fix(copilot): detect truncated write_workspace_file and guide LLM to source_path
When the LLM tries to inline a very large file as content in
write_workspace_file, the SDK silently truncates the tool call
arguments to {}. The tool then returns a generic 'filename required'
error, which the LLM doesn't understand and retries the same way
(wasting 500s+ per attempt, as seen in session c465eff9).

Now: when ALL parameters are missing (likely truncation), return an
actionable error explaining what happened and how to fix it — write
the file to disk with bash_exec first, then use source_path to copy
it to workspace. This gives the LLM a clear recovery path instead
of a retry loop.
2026-03-31 12:43:42 +00:00
Zamil Majdy
cbd3ebce00 feat(copilot): add proactive budget warnings before hitting tool call caps
When WebSearch or total tool call usage reaches 80% of the cap, the
PreToolUse hook now returns additionalContext warning the model about
remaining budget. This lets the model plan its remaining calls instead
of hitting a hard denial wall with no prior notice.
2026-03-31 13:36:22 +02:00
Zamil Majdy
2d00268516 fix(copilot): raise total tool call cap to 500 (10x web search limit)
100 total tool calls per turn was too tight for complex autopilot tasks
that involve many file reads/writes and sub-agent delegations. Bumping
to 500 keeps the circuit breaker effective against runaway loops while
giving legitimate long-running turns sufficient headroom.
2026-03-31 12:41:56 +02:00
Zamil Majdy
751382fcff fix(backend): clean up _meta_ttl_refresh_at on session completion
Remove session entries from the module-level _meta_ttl_refresh_at dict
when mark_session_completed is called, preventing unbounded memory
growth over the lifetime of the backend process.
2026-03-31 09:19:01 +02:00
Zamil Majdy
780c44c051 fix: update Task tests — Task is now in BLOCKED_TOOLS, always denied
Task was added to SDK_DISALLOWED_TOOLS so all Task tests now expect
denial. Removed concurrency slot tests since they're unreachable when
the tool is blocked at the access level.
2026-03-30 17:04:33 +00:00
Zamil Majdy
77eb07c458 style: black formatting for test files 2026-03-30 16:49:41 +00:00
Zamil Majdy
13a2e623a0 test: add tests for Task disallowed, StreamHeartbeat on task_progress, WebSearch denial budget
- tool_adapter_test: verify Task, Bash, WebFetch are in SDK_DISALLOWED_TOOLS
- response_adapter_test: verify task_progress emits StreamHeartbeat
- security_hooks_test: verify denied WebSearches don't consume total tool budget
2026-03-30 16:46:59 +00:00
Zamil Majdy
8d99660ba0 chore: bump WebSearch cap to 50 per turn 2026-03-30 16:38:26 +00:00
Zamil Majdy
bfbec703ce fix(copilot): disable SDK Task tool, bump search cap to 30
- Disable the SDK built-in Task (sub-agent) tool by adding it to
  SDK_DISALLOWED_TOOLS. The AutoPilotBlock via run_block is the
  preferred delegation mechanism — it has full Langfuse observability,
  unlike the SDK Task tool which runs opaquely.
  The Task tool was the root cause of the d2f7cba3 incident: it
  spawned 5 sub-agents with no shared context, each independently
  hammering WebSearch with overlapping queries.
- Bump WebSearch cap from 15 to 30 per turn — less restrictive
  while still preventing the worst-case runaway.
- Update prompt to reflect Task tool is disabled, point to
  AutoPilotBlock for sub-agent delegation.
2026-03-30 16:34:15 +00:00
Zamil Majdy
8763a94436 fix(copilot): keep Redis stream alive during sub-agent execution
During long sub-agent runs, the SDK sends task_progress SystemMessages
that were previously silent (no stream chunks produced). This meant
publish_chunk was never called during those gaps, causing BOTH the
meta key and stream key to expire in Redis.

Fix:
- response_adapter: emit StreamHeartbeat for task_progress events,
  so publish_chunk is called even during sub-agent gaps
- stream_registry: refresh stream key TTL alongside meta key in the
  periodic keepalive block (every 60s)

This ensures that as long as the SDK is producing any events (including
task_progress), both Redis keys stay alive. Confirmed via live
reproduction: session d2f7cba3 T13 ran for 1h45min+ with both keys
expired because only task_progress events were arriving.
2026-03-30 16:29:07 +00:00
Zamil Majdy
a504fe532a fix(copilot): refresh Redis session meta TTL during long-running turns
Root cause of empty session on reload: the session meta key in Redis has
a 1h TTL set once at create_session time. Turns exceeding 1h (like
session d2f7cba3 at 82min) cause the meta key to expire, making
get_active_session return False. The resume endpoint then returns 204
and the frontend shows an empty session.

Fix: publish_chunk now periodically refreshes the meta key TTL (every
60s) when session_id is provided. stream_and_publish already has
session_id and passes it through. This keeps the meta key alive for
as long as chunks are being published.

GCP logs confirmed the bug: at 09:49 (73 min into the turn),
GET_SESSION returned active_session=False, msg_count=1 — the meta
key had expired 13 minutes earlier.
2026-03-30 16:06:09 +00:00
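The periodic refresh above can be sketched as a per-session throttle around a Redis EXPIRE. The 60s interval and "refresh from publish_chunk" idea come from the commit; the function names and the injected `refresh` callable are assumptions so the sketch stays self-contained.

```python
import time

_REFRESH_INTERVAL = 60.0  # refresh the meta key TTL at most once per minute
_last_refresh: dict[str, float] = {}

def maybe_refresh_meta_ttl(session_id: str, refresh) -> bool:
    """Invoke refresh(session_id) if 60s have passed since the last refresh.

    Returns True when a refresh was issued. In the real code, `refresh`
    would wrap a Redis EXPIRE on the session meta key, and this would be
    called from publish_chunk whenever session_id is provided.
    """
    now = time.monotonic()
    if now - _last_refresh.get(session_id, 0.0) < _REFRESH_INTERVAL:
        return False  # throttled: a refresh fired within the last minute
    _last_refresh[session_id] = now
    refresh(session_id)
    return True
```

Because the throttle keys on `session_id`, concurrent sessions refresh independently, and the meta key stays alive exactly as long as chunks keep arriving.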
Zamil Majdy
b3f52ce3b3 revert: remove Perplexity guidance from code supplement
Perplexity guidance is managed in the Langfuse prompt (source of truth),
not in the code supplement. Reverts the redundant addition from fc13c30.
The Langfuse prompt has been updated with stronger, actionable guidance
including block ID, model names, and clear trigger (3+ searches).
2026-03-30 12:30:36 +00:00
Zamil Majdy
0ac208603c feat(copilot): nudge LLM to use Perplexity for deep research over WebSearch
For research-heavy tasks (5+ searches), the prompt now directs the LLM
to use run_block with PerplexityBlock (sonar-pro) instead of repeated
WebSearch calls. Perplexity returns synthesized, cited answers in a
single call — avoiding the 29-54s per-call latency of SDK WebSearch
and reducing total search count significantly.
2026-03-30 12:15:06 +00:00
Zamil Majdy
57401a9b13 fix: re-enable intermediate flush for all attempts, add rollback note
The is_final_attempt guard disabled intermediate flush for the common
case (first attempt succeeds, which is 99%+ of turns). Retries only
fire on context-too-long errors with events_yielded==0, meaning the
stream barely started and flush threshold was almost certainly not
reached. Keep flush always enabled and document the theoretical edge
case.
2026-03-30 11:53:33 +00:00
Zamil Majdy
df9ae41c25 fix: update prompt to remove 'per session' scope qualifier for web search cap 2026-03-30 11:50:16 +00:00
Zamil Majdy
bfd152dcc7 fix: guard intermediate flush against retry rollback, fix counter scope labels, reorder checks, top-level imports
- Guard intermediate DB flush with is_final_attempt flag to prevent
  persisting messages from attempts that may be rolled back on retry
- Fix 'per session' → 'per turn' in comments/docstrings/denial messages
  since hooks are recreated per stream invocation
- Reorder circuit breaker checks: WebSearch cap before total counter
  increment so denied searches don't consume total budget slots
- Move create_security_hooks import to module top-level in tests per
  CLAUDE.md coding guidelines
2026-03-30 11:48:21 +00:00
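The reordering in the third bullet above can be sketched as a gatekeeper that checks the WebSearch cap before touching the total counter. The counters dict, function name, and cap defaults are illustrative; only the ordering requirement (denied searches must not consume total budget) comes from the commit.

```python
def pre_tool_use(tool: str, counters: dict,
                 web_cap: int = 50, total_cap: int = 500) -> bool:
    """Return True to allow the tool call, False to deny it.

    The WebSearch cap is checked FIRST, before the total counter is
    incremented, so a denied search never consumes a total budget slot.
    """
    if tool == "WebSearch" and counters.get("web", 0) >= web_cap:
        return False  # denied before the total counter is touched
    if counters.get("total", 0) >= total_cap:
        return False
    counters["total"] = counters.get("total", 0) + 1
    if tool == "WebSearch":
        counters["web"] = counters.get("web", 0) + 1
    return True
```

With the old ordering (increment first, then check), a turn that hit the WebSearch cap could burn through the total budget on denials alone.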
Zamil Majdy
0f92e585ab fix(copilot): add tool call circuit breakers and intermediate persistence
- Add WebSearch call cap (15/session) to prevent runaway research loops
- Add total tool call cap (100/turn) as hard circuit breaker
- Add web search best practices guidance to system prompt
- Add intermediate session persistence (every 30s or 10 messages)
- Add tests for WebSearch cap and total tool call cap

Addresses findings from session d2f7cba3: 179 WebSearch calls,
$20.66 cost, 82 minutes for a single user message.
2026-03-30 11:31:17 +00:00


@@ -266,6 +266,7 @@ class _RetryState:
     adapter: SDKResponseAdapter
     transcript_builder: TranscriptBuilder
     usage: _TokenUsage
+    is_final_attempt: bool = True
 
 
 @dataclass
@@ -1493,9 +1494,14 @@ async def _run_stream_attempt(
         # --- Intermediate persistence ---
         # Flush session messages to DB periodically so page reloads
         # show progress during long-running turns.
+        # Guarded by is_final_attempt: earlier retry attempts may be
+        # rolled back in memory (session.messages truncated), but
+        # messages already flushed to DB would persist as orphans.
+        # is_final_attempt is True on attempt 0 (optimistic — most
+        # turns succeed on the first try) and on the last retry.
         _msgs_since_flush += 1
         now = time.monotonic()
-        if (
+        if state.is_final_attempt and (
             _msgs_since_flush >= _FLUSH_MESSAGE_THRESHOLD
             or (now - _last_flush_time) >= _FLUSH_INTERVAL_SECONDS
         ):
@@ -1986,6 +1992,11 @@ async def stream_chat_completion_sdk(
     )
 
     for attempt in range(_MAX_STREAM_ATTEMPTS):
+        # Enable intermediate DB flushes on attempt 0 (optimistic: most
+        # turns succeed on the first try) and the last attempt. Middle
+        # retry attempts may be rolled back, and flushed messages would
+        # persist as DB orphans — so flushes are disabled for those.
+        state.is_final_attempt = attempt == 0 or attempt == _MAX_STREAM_ATTEMPTS - 1
         # Clear any stale stash signal from the previous attempt so
         # wait_for_stash() doesn't fire prematurely on a leftover event.
         reset_stash_event()