Commit Graph

8428 Commits

Author SHA1 Message Date
Toran Bruce Richards
23b5e1272e feat: add OpenAI GPT-image models to image blocks
Add gpt-image-1, gpt-image-1.5, gpt-image-2, and gpt-image-1-mini
as model options in:
- AIImageGeneratorBlock
- AIImageEditorBlock
- AIImageCustomizerBlock

Changes:
- Expand model enums in all three blocks
- Update credentials to accept Union[Replicate, OpenAI]
- Add OpenAI API branches using images.generate and images.edit
- Add block pricing (8-20 credits) consistent with existing tiers
- Rename GeminiImageModel -> ImageCustomizerModel (backwards compatible)
- Rename FluxKontextModelName -> ImageEditorModel (backwards compatible alias)
2026-04-22 20:08:28 +00:00
Zamil Majdy
c56c1e5dd6 fix(backend/copilot): disable ask_question tool pending UX rework (#12887)
### Why / What / How

**Why:** The in-conversation Question GUI is unreliable in production —
users submitting answers can get their messages dropped and the agent
gets stuck on the auto-generated "please proceed" step with no way to
make progress. Discord report:
https://discord.com/channels/1126875755960336515/1496474512966029472/1496537943287005365
(see attached video). Pause/queue semantics still need a rework; until
then, the right call is to stop the model from reaching for this tool.

**What:** Removes `ask_question` from the copilot tool registry so the
model never sees or calls it. Historical sessions that already contain
`ask_question` tool calls still render (frontend renderers + response
model untouched), so this is non-destructive to existing chats.
Re-enabling once UX is reworked is a small revert.

**How:**
- Drop the `AskQuestionTool` import + registry entry from
`backend/copilot/tools/__init__.py`.
- Drop `"ask_question"` from the `ToolName` literal in
`backend/copilot/permissions.py` — required because a runtime
consistency check asserts the literal matches `TOOL_REGISTRY.keys()`.
- Delete the "Clarifying — Before or During Building" section from
`backend/copilot/sdk/agent_generation_guide.md` so the SDK-mode system
prompt no longer instructs the model to call `ask_question`.
- Drop the three `prompting_test.py` tests that asserted the guide
mentions that section.
- Keep `ask_question.py`, its unit test, `ClarificationNeededResponse`,
and the frontend `AskQuestion`/`ClarificationQuestionsCard` components
untouched so old sessions still render and re-enabling is a small
revert.

### Changes 🏗️

- `backend/copilot/tools/__init__.py` — remove `AskQuestionTool` import
and `"ask_question"` entry in `TOOL_REGISTRY`.
- `backend/copilot/permissions.py` — remove `"ask_question"` from the
`ToolName` literal.
- `backend/copilot/sdk/agent_generation_guide.md` — remove the
"Clarifying — Before or During Building" section.
- `backend/copilot/prompting_test.py` — remove
`TestAgentGenerationGuideContainsClarifySection` and the now-unused
`Path` import.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [x] `poetry run pytest backend/copilot/tools/
backend/copilot/permissions_test.py backend/copilot/prompting_test.py` —
805+78 tests pass, consistency check between `ToolName` literal and
`TOOL_REGISTRY` still holds.
- [ ] Smoke-test in dev: start a copilot session and confirm the model
no longer lists/calls `ask_question` (its OpenAI tool schema is gone
from `get_available_tools()` and from the SDK `allowed_tools`).
- [ ] Load a historical session that contains an `ask_question` tool
call in its transcript — confirm the frontend still renders the question
card (no regression on legacy sessions).
2026-04-22 23:34:04 +07:00
Bentlybro
6fcbe95645 Merge branch 'master' into dev autogpt-platform-beta-v0.6.57 2026-04-22 15:36:37 +01:00
Zamil Majdy
9703da3dfd refactor(backend/copilot): Moonshot module + cache_control widening + partial-messages default-on + title cost (#12882)
## Why

Several loose ends from the Kimi SDK-default merge (#12878), plus
follow-ups surfaced during review + E2E testing:

1. **Kimi-specific pricing lived inline in `sdk/service.py`** alongside
unrelated SDK plumbing — any future non-Anthropic vendor would have
piled onto the same file.
2. **Moonshot's Anthropic-compat endpoint honours `cache_control: {type:
ephemeral}`**, but the baseline cache-marking gate
(`_is_anthropic_model`) was narrow enough to exclude it → Moonshot fell
back to automatic prefix caching, which drifts readily between turns.
3. **Kimi reasoning rendered AFTER the answer text** on dev because the
summary-walk hoist only reorders within one `AssistantMessage.content`
list, and Moonshot splits each turn into multiple sequential
AssistantMessages (text-only, then thinking-only).
4. **Title generation's LLM call bypassed cost tracking** — admin
dashboard under-reported total provider spend by the aggregate of those
per-session calls.
5. **Cost override** was using the requested primary model, not the
actually-executed model — when the SDK fallback activates the override
mis-routes pricing.

## What

### Moonshot module
New `backend/copilot/moonshot.py`:
- `is_moonshot_model(model)` — prefix check against `moonshotai/`
- `rate_card_usd(model)` — published Moonshot rates, default `(0.60,
2.80)` per MTok with per-slug override slot
- `override_cost_usd(...)` — moved from `sdk/service.py`, replaces CLI's
Sonnet-rate estimate with real rate card
- `moonshot_supports_cache_control(model)` — narrow gate for cache
markers

Rate card is **not canonical** — authoritative cost comes from the
OpenRouter `/generation` reconcile; this module only improves the
in-turn estimate and the reconcile's lookup-fail fallback. Signal
authority: reconcile >> rate card >> CLI.

### Baseline cache-control widened to Moonshot
- New `_supports_prompt_cache_markers` = `_is_anthropic_model OR
is_moonshot_model`
- Both call sites (system-message cache dict, last-tool cache marker)
switched to the wider gate
- OpenAI / Grok / Gemini still return `false` — those endpoints 400 on
the unknown field

**Measured impact in /pr-test:** baseline Kimi continuation turns jumped
to ~98% cache hit (334 uncached + 12.8K cache_read on a 13.1K prompt).

### SDK partial-messages default-on (fixes the reasoning-order bug)
- `CHAT_SDK_INCLUDE_PARTIAL_MESSAGES` flipped from `default=False` →
`default=True`
- Kimi stream now emits `reasoning-start → reasoning-delta* →
reasoning-end → text-start → text-delta*` in the correct order —
verified in /pr-test
- Kill-switch: set `CHAT_SDK_INCLUDE_PARTIAL_MESSAGES=false` to fall
back to summary-only emission

### SDK cost override scoped to Moonshot
- Call site now explicitly gates `if _is_moonshot_model(active_model)` —
Anthropic turns trust CLI's number directly
- Added `_RetryState.observed_model` populated from
`AssistantMessage.model`, preferred over `state.options.model` so
fallback-model turns bill correctly (addresses CodeRabbit review)

### Title cost capture
- `_generate_session_title` now returns `(title, ChatCompletion)` so the
caller controls cost persistence
- `_update_title_async` runs title-persist and cost-record as
independent best-effort steps
- `_title_usage_from_response` helper reads `prompt_tokens /
completion_tokens / cost_usd` (OR's `usage.cost` off `model_extra`)
- Provider label derived from `ChatConfig.base_url` (`open_router` /
`openai`)
- No exception suppressors — `isinstance(cost_raw, (int, float))` check
replaces the inner `float()` try/except

### Misc
- Kimi tool-name whitespace strip in the response adapter — Kimi
occasionally emits tool names with leading spaces the CLI dispatcher
can't resolve
- TODO marker on the rate-card for post-prod-soak removal

## How

- Detection is **prefix-based** (`moonshotai/`) — future Kimi SKUs
transparently inherit rate card + cache-control gate
- Baseline cache-marking was already structured; only the gate changes
- Partial-messages default-on relies on the adapter's diff-based
reconcile (shipped in #12878) which has soaked stable
- Title cost path mirrors `tools/web_search.py`'s pattern for reading
OR's `usage.cost`

## Test plan

- [x] `pytest backend/copilot/moonshot_test.py` — 21 tests
- [x] `pytest backend/copilot/baseline/service_unit_test.py` — updated
for widened gate
- [x] `pytest backend/copilot/sdk/*_test.py
backend/copilot/service_test.py` — no regressions
- [x] Full E2E on local native stack — 10/10 scenarios pass (see
test-report comment)
- [x] Measured: baseline Kimi ~98% cache hit on continuation, SDK Kimi
~62% (capped by Moonshot's prefix ceiling)

## Deferred

SDK-path Moonshot cache hit rate stays at ~62% on long prompts.
`native_tokens_cached=18432` regardless of turn/session suggests a
Moonshot-side cap on cached prefix size. Not fixable by our code —
requires proxy rewriting requests or upstream Moonshot change.
2026-04-22 20:42:47 +07:00
Zamil Majdy
ebb0d3b95b feat(backend/copilot): LaunchDarkly per-user model routing (#12881)
## Summary

Per-user model routing for the copilot via LaunchDarkly. Replaces the
pure-env-var pick on every `(mode, tier)` cell of the model matrix with
an LD-first resolver that falls back to the `ChatConfig` default. Lets
us roll out non-default routes (e.g. Kimi K2.6 on baseline standard) to
a user cohort without shipping a deploy.

| | standard | advanced |

|----------|------------------------------------|------------------------------------|
| fast | `copilot-fast-standard-model` | `copilot-fast-advanced-model` |
| thinking | `copilot-thinking-standard-model` |
`copilot-thinking-advanced-model` |

All four flags are **string-valued** — the value IS the model identifier
(e.g. `"anthropic/claude-sonnet-4-6"` or `"moonshotai/kimi-k2.6"`).

## What ships

- **New module `backend/copilot/model_router.py`** with a single
`resolve_model(mode, tier, user_id, *, config)` coroutine. That's the
one place both paths consult.
- **4 new `Flag` enum values** in `backend/util/feature_flag.py`
(reusing the existing `get_feature_flag_value` helper which already
supports arbitrary return types).
- **`baseline/service.py::_resolve_baseline_model`** → async, takes
`user_id`.
- **`sdk/service.py::_resolve_sdk_model_for_request`** → takes
`user_id`, consults LD for both standard and advanced thinking cells.
- **Default flip**: `fast_standard_model` default goes back to
`anthropic/claude-sonnet-4-6`. Non-Anthropic routes now ship via LD
targeting — safer rollback, per-user cohort control, no redeploy
required to flip.

## Behavior preserved

- `config.claude_agent_model` explicit override still wins
unconditionally (existing escape hatch for ops).
- `use_claude_code_subscription=true` on the standard thinking tier
still returns `None` so the CLI picks the model tied to the user's
Claude Code subscription.
- All legacy env var aliases (`CHAT_MODEL`, `CHAT_ADVANCED_MODEL`,
`CHAT_FAST_MODEL`) still bind to their cells.
- LD client exceptions / misconfigured (non-string) flag values fall
back silently to config default with a single warning log — never fails
the request.

## Files

| File | Change |
|---|---|
| `backend/copilot/model_router.py` | new — `resolve_model` +
`_config_default` + `_FLAG_BY_CELL` map |
| `backend/copilot/model_router_test.py` | new — 11 cases |
| `backend/util/feature_flag.py` | add 4 string-valued `Flag` entries |
| `backend/copilot/config.py` | flip `fast_standard_model` default to
Sonnet |
| `backend/copilot/baseline/service.py` | `_resolve_baseline_model` →
async + LD resolver |
| `backend/copilot/sdk/service.py` | `_resolve_sdk_model_for_request` →
LD resolver + user_id |
| `backend/copilot/baseline/transcript_integration_test.py` | update
tests for new signature + default |

## Test plan

- [x] `poetry run pytest backend/copilot/model_router_test.py
backend/copilot/baseline/transcript_integration_test.py
backend/copilot/sdk/service_test.py backend/copilot/config_test.py` —
**112 passing**
- [x] 11 resolver cases: missing user → fallback, LD string wins,
whitespace stripped, non-string value → fallback, empty string →
fallback, LD exception → fallback + warn, each of 4 cells routes to its
distinct flag
- [x] Legacy env aliases still bind to their new fields
- [ ] Manual dev-env smoke: flip `copilot-fast-standard-model` LD
targeting to `moonshotai/kimi-k2.6` for one user and confirm baseline
uses Kimi while other users stay on Sonnet
- [ ] Confirm SDK path still honors subscription mode (LD not consulted
when `use_claude_code_subscription=true` + standard tier)

## Rollout

1. Merge this PR → default stays Sonnet / Opus across the matrix, no
behavior change.
2. Create the 4 LD flags as string-typed in the LaunchDarkly console
(defaults matching config, so no drift if targeting empty).
3. Add per-user / per-cohort targeting in LD for the routes we want to
roll out (Kimi on baseline standard for a percentage, etc.).
2026-04-22 20:08:37 +07:00
Zamil Majdy
b98bcf31c8 feat(backend/copilot): SDK fast tier defaults to Kimi K2.6 via OpenRouter + vendor-aware cost + cross-model fix (#12878)
## Summary

Make Kimi K2.6 the default for the SDK (extended-thinking) copilot path,
mirroring the baseline default landed in #12871. The SDK already routes
through OpenRouter (see
[`build_sdk_env`](autogpt_platform/backend/backend/copilot/sdk/env.py) —
`ANTHROPIC_BASE_URL` is set to OpenRouter's Anthropic-compatible
`/v1/messages` endpoint), but the model resolver was unconditionally
stripping the vendor prefix, which prevented routing to anything except
Anthropic models. This PR unblocks Kimi (and any other non-Anthropic
OpenRouter vendor) on the SDK fast tier and flips the default to match
the baseline path.

## Why

After #12871 the baseline (`fast_*`) path runs Kimi K2.6 by default —
~5x cheaper than Sonnet at SWE-Bench parity — but the SDK (`thinking_*`)
path was still pinned to Sonnet because:

1. **Model name normalization stripped the vendor prefix.**
`_normalize_model_name("moonshotai/kimi-k2.6")` returned `"kimi-k2.6"`,
which OpenRouter cannot route — the unprefixed form only resolves for
Anthropic models. The docstring on `thinking_standard_model` claimed
"the Claude Agent SDK CLI only speaks to Anthropic endpoints", but the
env builder shows the CLI happily talks to OpenRouter's `/messages`
endpoint, which routes to any vendor in the catalog.
2. **The default was `anthropic/claude-sonnet-4-6`.** Same model on a
more expensive route.
3. **Cost label was hardcoded to `provider="anthropic"`** on the SDK
path's `persist_and_record_usage` call, making cost-analytics rows
misleading once Kimi runs.

## What

1. **`_normalize_model_name`**
([sdk/service.py](autogpt_platform/backend/backend/copilot/sdk/service.py))
— when `config.openrouter_active` is True, the canonical `vendor/model`
slug is preserved unchanged so OpenRouter can route to the correct
provider. Direct-Anthropic mode keeps the existing strip-prefix +
dot-to-hyphen conversion (Anthropic API requires both) and now **raises
`ValueError`** when paired with a non-Anthropic vendor slug — silent
strip would have sent `kimi-k2.6` to the Anthropic API and produced an
opaque `model_not_found`.
2. **`thinking_standard_model`**
([config.py](autogpt_platform/backend/backend/copilot/config.py)) —
default flipped from `anthropic/claude-sonnet-4-6` to
`moonshotai/kimi-k2.6`. Field description rewritten; rollback to Sonnet
is one env var
(`CHAT_THINKING_STANDARD_MODEL=anthropic/claude-sonnet-4.6`).
3. **`@model_validator(mode="after")`** on `ChatConfig`
([config.py:_validate_sdk_model_vendor_compatibility](autogpt_platform/backend/backend/copilot/config.py))
— fail at config load when `use_openrouter=False` is paired with a
non-Anthropic SDK slug. The runtime guard in `_normalize_model_name` is
kept as defence-in-depth, but the validator turns a per-request 500 into
a boot-time error message the operator sees once, before any traffic
lands. Covers `thinking_standard_model`, `thinking_advanced_model`, and
`claude_agent_fallback_model`. Subscription mode is exempt (resolver
returns `None` and never normalizes). The credential-missing case
(`use_openrouter=True` + no `api_key`) is intentionally NOT a boot-time
error so CI builds and OpenAPI-schema export jobs that construct
`ChatConfig()` without secrets keep working — the runtime guard still
catches it on the first SDK turn.
4. **Cost provider attribution**
([sdk/service.py:stream_chat_completion_sdk](autogpt_platform/backend/backend/copilot/sdk/service.py))
— `persist_and_record_usage` now passes `provider="open_router" if
config.openrouter_active else "anthropic"` instead of hardcoded
`"anthropic"`. The dollar value still comes from
`ResultMessage.total_cost_usd`; this just fixes the analytics label.
5. **Baseline rollback example** ([config.py:fast_standard_model
description](autogpt_platform/backend/backend/copilot/config.py)) — same
dot-vs-hyphen footgun fix (CodeRabbit catch).
6. **Tests** — `TestNormalizeModelName` (sdk/) monkeypatches a
deterministic config per case (the helper-test variants were passing
accidentally based on ambient env). New
`TestSdkModelVendorCompatibility` class in `config_test.py` covers all
five validator shapes (default-Kimi + direct-Anthropic raises, anthropic
override succeeds, openrouter mode succeeds, subscription mode skips
check, advanced+fallback tier also validated, empty fallback skipped).
`_ENV_VARS_TO_CLEAR` extended to all model/SDK/subscription env aliases
so a leftover dev `.env` value can't mask validator behaviour. New
`_make_direct_safe_config` helper for direct-Anthropic tests.

## Test plan

- [x] `poetry run pytest backend/copilot/config_test.py
backend/copilot/sdk/service_test.py
backend/copilot/sdk/service_helpers_test.py
backend/copilot/sdk/env_test.py
backend/copilot/sdk/p0_guardrails_test.py` — 238 pass
- [x] `poetry run pytest backend/copilot/` — 2560 pass + 5 pre-existing
integration failures (need real API keys / browser env, unrelated)
- [x] CI green on `feat/copilot-sdk-kimi-default` (35 pass / 0 fail / 1
neutral)
- [x] Manual: SDK extended_thinking turn against Kimi K2.6 via
OpenRouter on the native dev stack — request lands with
`model=moonshotai/kimi-k2.6`, response streams back, multi-turn
`--resume` recalls facts across turns. Backend log: `[SDK] Per-request
model override: standard (moonshotai/kimi-k2.6)`.
- [x] Manual: rollback path —
`CHAT_THINKING_STANDARD_MODEL=anthropic/claude-sonnet-4.6` resumes
Sonnet routing.

## Known follow-ups (not in this PR)

These surfaced during manual testing and will need separate PRs:

- **SDK CLI cost is wrong for non-Anthropic models.**
`ResultMessage.total_cost_usd` comes from a static Anthropic pricing
table baked into the CLI binary; for Kimi K2.6 it falls back to Sonnet
rates, **over-billing ~5x** ($0.089 vs the real ~$0.018 for ~30K prompt
+ ~80 completion). The `provider` label is now correct but the dollar
value isn't. Needs either a per-model rate card override on our side or
a CLI patch upstream.
- **Mid-session model switch (Kimi → Opus) breaks.** Kimi's
`ThinkingBlock`s have no Anthropic `signature` field; when the user
toggles standard → advanced after a Kimi turn, Opus rejects the replayed
transcript with `Invalid signature in thinking block`. Needs transcript
scrubbing on model switch (similar to existing
`TestStripStaleThinkingBlocks` pattern).
- **Reasoning UI ordering on Kimi.** Moonshot/OpenRouter places
`reasoning` AFTER text in the response; the SDK's
`AssistantMessage.content` reflects that order, and `response_adapter`
emits SSE events in the same order — so reasoning lands BELOW the answer
in the UI instead of above. Needs `ThinkingBlock` hoisting in
`response_adapter.py`.
2026-04-22 18:35:01 +07:00
Zamil Majdy
4f11867d92 feat(backend/copilot): TodoWrite for baseline copilot (#12879)
## Summary

Add `TodoWrite` to baseline copilot so the "task checklist" UI works on
non-Claude models (Kimi, GPT, Grok, etc.) the same way it works on the
SDK path. Baseline previously had no `TodoWrite` tool at all — only SDK
mode did via the Claude Code CLI's built-in — so models on baseline just
couldn't reach for a planning checklist.

This closes the last clear feature gap blocking baseline from being the
primary copilot path without giving up model flexibility.

## What ships

- **New MCP tool `TodoWrite`** in `TOOL_REGISTRY`, schema matching the
one the frontend's `GenericTool.helpers.ts` (`getToolCategory → "todo"`)
already renders as the **Steps** accordion. The tool is a stateless echo
— the canonical list lives in the model's latest tool-call args and
replays from transcript on subsequent turns.
- **Prompt guidance** in `SHARED_TOOL_NOTES` teaching the model when to
use it (3+ step tasks; always send the full list; exactly one
`in_progress` at a time).
- **Sharpened `run_sub_session` guidance** in the same prompt section —
framed explicitly as the context-isolation primitive for baseline.
Clearer for the model, no dual-primitive confusion.

## How the SDK path stays untouched

- SDK mode keeps using the CLI-native `TodoWrite` built-in.
- `BASELINE_ONLY_MCP_TOOLS = {"TodoWrite"}` in `sdk/tool_adapter.py`
filters the baseline MCP wrapper out of SDK's `allowed_tools` — no name
shadowing.
- `SDK_BUILTIN_TOOL_NAMES` is now an explicit allowlist (not
auto-derived from capitalization) so the classification stays coherent
when a capitalized tool is platform-owned.

## Files

| File | Change |
|---|---|
| `backend/copilot/tools/todo_write.py` | new — `TodoWriteTool` |
| `backend/copilot/tools/__init__.py` | register in `TOOL_REGISTRY` |
| `backend/copilot/tools/models.py` | add `TodoItem` +
`TodoWriteResponse` + `ResponseType.TODO_WRITE` |
| `backend/copilot/permissions.py` | explicit `SDK_BUILTIN_TOOL_NAMES`;
`apply_tool_permissions` maps baseline-only tools to CLI name for SDK |
| `backend/copilot/sdk/tool_adapter.py` | `BASELINE_ONLY_MCP_TOOLS`
filter |
| `backend/copilot/prompting.py` | `TodoWrite` + sharpened
`run_sub_session` guidance |
| `backend/api/features/chat/routes.py` | add `TodoWriteResponse` to
`ToolResponseUnion` |
| `backend/copilot/tools/todo_write_test.py` | new — schema + execute
tests |
| `frontend/src/app/api/openapi.json` | regenerated |
| `tools/tool_schema_test.py` | budget bumped `32_800 → 34_000` (actual
33_865, +1_065 headroom) |

## Test plan

- [x] `poetry run pytest backend/copilot/
backend/api/features/chat/routes_test.py` — **1010 passing**
- [x] Tool schema char budget regression gate passes
- [x] `_assert_tool_names_consistent` passes
- [x] **E2E on local native stack (Kimi K2.6 via OpenRouter,
`CHAT_USE_CLAUDE_AGENT_SDK=false`)**: baseline called `TodoWrite` on a
3-step prompt, SSE stream carried the exact `{content, activeForm,
status}` shape the UI expects, "Steps" dialog renders `Task list — 0/3
completed` with all three items (see test-report comment below).
- [x] Negative cases covered: two `in_progress` → rejected, missing
`activeForm` → rejected, non-list `todos` → rejected.
2026-04-22 17:28:15 +07:00
Zamil Majdy
33a608ec78 feat(platform/copilot): live baseline streaming + render flag + Sonar web_search + simulator cost tracking + reconnect fixes (#12873)
### Why / What / How

**Why.** Three problems on the baseline copilot path that compound:
extended-thinking turns froze the UI for minutes because Kimi K2.6
events were buffered in `state.pending_events: list` until the full
`tool_call_loop` iteration finished (reasoning arrived in one lump at
the end); the SSE stream replayed 1000 events on every reconnect and the
frontend opened multiple SSE streams in quick succession on tab-focus
thrash (reconnect storm → UI flickers, tab freezes); the `web_search`
tool hit Anthropic's server-side beta directly via a dispatch-model
round-trip that fed entire page contents back through the model for a
second inference pass (observed $0.072 on a 74K-token call); and the
simulator dry-run path ran on Gemini Flash without any cost tracking at
all, so every dry-run was free on the platform's microdollar ledger.

**What.** Grouped deltas, all targeting reliability, cost, and UX of the
copilot live-answer pipeline:

- **Live per-token baseline streaming.** `state.pending_events` is now
an `asyncio.Queue` drained concurrently by the outer async generator.
The tool-call loop runs as a background task; reasoning / text / tool
events reach the SSE wire during the upstream OpenRouter stream, not
after it. `None` is the close sentinel; inner-task exceptions are
re-raised via `await loop_task` once the sentinel arrives. An
`emitted_events: list` mirror preserves post-hoc test inspection.
Coalescing widened 32/40 → 64/50 ms to halve the React re-render rate on
extended-thinking turns while staying under the ~100 ms perceptual
threshold.
- **Reasoning render flag** — `ChatConfig.render_reasoning_in_ui: bool =
True` wired through both `BaselineReasoningEmitter` and
`SDKResponseAdapter`. When False the wire `StreamReasoning*` events are
suppressed while the persisted `ChatMessage(role='reasoning')` rows
always survive (decoupled from the render flag so audit/replay is
unaffected); the service-layer yield filter does the gating. Tokens are
still billed upstream; operator kill-switch for UI-level flicker
investigations.
- **Reconnect storm mitigations** — `ChatConfig.stream_replay_count: int
= 200` (was hard-coded 1000) caps `stream_registry.subscribe_to_session`
XREAD size. Frontend `useCopilotStream::handleReconnect` adds a 1500 ms
debounce via `lastReconnectResumeAtRef`, so tab-focus thrash doesn't fan
out into 5–6 parallel replays in the same second.
- **web_search rewritten to Perplexity Sonar via OpenRouter** — single
unified credential, real `usage.cost` flows through
`persist_and_record_usage(provider='open_router')`. Two tiers via a
`deep` param: `perplexity/sonar` (~$0.005/call quick) and
`perplexity/sonar-deep-research` (~$0.50–$1.30/call multi-step
research). Replaces the Anthropic-native + server-tool dispatches; drops
the hardcoded pricing constants entirely.
- **Synthesised answer surfaced end-to-end** — Sonar already writes a
web-grounded answer on the same call we pay for; the new
`WebSearchResponse.answer` field passes it through and the accordion UI
renders it above citations so the agent doesn't re-fetch URLs that are
usually bot-protected anyway.
- **Deep-tier cost warning + UI affordances** — `deep` param description
is explicit that it's ~100× pricier; UI labels read "Researching /
Researched / N research sources" when `deep=true` so users know what's
running.
- **Simulator cost tracking + cheaper default** —
`google/gemini-2.5-flash` → `google/gemini-2.5-flash-lite` (3× cheaper
tokens) and every dry-run now hits
`persist_and_record_usage(provider='open_router')` with real
`usage.cost`. Previously each sim was free against the user's
microdollar budget.
- **Typed access everywhere** — cost extractors now use
`openai.types.CompletionUsage.model_extra["cost"]` and
`openai.types.chat.ChatCompletion` / `Annotation` /
`AnnotationURLCitation` with no `getattr` / duck typing. Mirrors the
baseline service's `_extract_usage_cost` pattern; keep in sync.

**How.** Key file touches:

1. `copilot/config.py` — `render_reasoning_in_ui`,
`stream_replay_count`, `simulation_model` default.
2. `copilot/baseline/service.py` — `_BaselineStreamState.pending_events:
asyncio.Queue`, `_emit` / `_emit_all` helpers, outer generator runs
`tool_call_loop` as a background task + yields from queue concurrently.
3. `copilot/baseline/reasoning.py` —
`BaselineReasoningEmitter(render_in_ui=...)`, coalescing bumped to 64
chars / 50 ms.
4. `copilot/sdk/service.py` — `state.adapter.render_reasoning_in_ui`
threaded through every adapter construction.
5. `copilot/sdk/response_adapter.py` — `render_reasoning_in_ui` wiring +
service-layer yield filter gating for wire suppression while persistence
stays intact.
6. `copilot/stream_registry.py` — `count=config.stream_replay_count`.
7. `frontend/.../useCopilotStream.ts::handleReconnect` — 1500 ms
debounce.
8. `copilot/tools/web_search.py` + `models.py` — Sonar quick/deep paths,
`WebSearchResponse.answer` + typed extractors.
9. `frontend/.../GenericTool/*` — `answer` render + deep-aware labels /
accordion titles.
10. `executor/simulator.py` + `executor/manager.py` +
`copilot/config.py` — cost tracking + model swap + `user_id` threading.

### Changes

- `copilot/config.py` — new `render_reasoning_in_ui`,
`stream_replay_count`; `simulation_model` default flipped to Flash-Lite.
- `copilot/baseline/service.py` — `pending_events: asyncio.Queue`
refactor; outer gen runs loop as task, yields from queue live.
- `copilot/baseline/reasoning.py` —
`BaselineReasoningEmitter(render_in_ui=...)` + 64/50 coalesce.
- `copilot/sdk/service.py` + `response_adapter.py` —
`render_reasoning_in_ui` wire suppression (persistence preserved).
- `copilot/stream_registry.py` — replay cap from config.
- `copilot/tools/web_search.py` + `models.py` — Sonar quick/deep +
`answer` field + typed extractors.
- `copilot/tools/helpers.py` — tool description tightens `deep=true`
cost warning.
- `frontend/.../useCopilotStream.ts` — reconnect debounce.
- `frontend/.../GenericTool/GenericTool.tsx` + `helpers.ts` + tests —
render `answer`, deep-aware verbs / titles.
- `executor/simulator.py` + `simulator_test.py` + `executor/manager.py`
— cost tracking + model swap + user_id plumbing.

### Follow-up (deferred to a separate PR)

SDK per-token streaming via `include_partial_messages=True` was
attempted (commits `599e83543` + `530fa8f95`) and reverted here. The
two-signal model (StreamEvent partial deltas + AssistantMessage summary)
needs proper per-block diff tracking — when the partial stream delivers
a subset of the final block content, emit only
`summary.text[len(already_emitted):]` from the summary rather than
gating on a binary flag. Binary gating truncated replies in the field
when the partial stream delivered less than the summary (observed: "The
analysis template you" cut off mid-sentence because partial had streamed
that much and the rest only lived in the summary). SDK reasoning still
renders end-of-phase (as today); this PR's baseline per-token streaming
is unaffected.

### Checklist

For code changes:
- [x] Changes listed above
- [x] Test plan below
- [x] Tested according to the test plan:
- [x] `poetry run pytest backend/copilot/baseline/ backend/copilot/sdk/
backend/copilot/tools/web_search_test.py
backend/executor/simulator_test.py` — all pass (155 baseline + 927 SDK +
web_search + simulator)
- [x] `pnpm types && pnpm vitest run
src/app/(platform)/copilot/tools/GenericTool/` — pass
- [x] Manual: baseline live-streaming — Kimi K2.6 reasoning arrives
token-by-token, coalesced (no end-of-stream burst).
- [x] Manual: quick web_search via copilot UI — ~$0.005/call, answer +
citations rendered, cost logged as `provider=open_router`.
- [x] Manual: deep web_search — dispatched only on explicit research
phrasing; `sonar-deep-research` billed, UI labels say "Researched" / "N
research sources".
- [x] Manual: simulator dry-run — Gemini Flash-Lite, `[simulator] Turn
usage` log entry, PlatformCostLog row visible.
- [x] Manual: reconnect debounce — tab-focus thrash no longer produces
parallel XREADs in backend log.
- [ ] Manual: `CHAT_RENDER_REASONING_IN_UI=false` smoke-check —
reasoning collapse absent, no persisted reasoning row on reload.

For configuration changes:
- [x] `.env.default` — new config knobs fall back to pydantic defaults;
existing `CHAT_MODEL`/`CHAT_FAST_MODEL`/`CHAT_ADVANCED_MODEL` legacy
envs still honored upstream (unchanged by this PR).

### Companion PR

PR #12876 closes the `run_block`-via-copilot cost-leak gap (registers
`PerplexityBlock` / `FactCheckerBlock` in `BLOCK_COSTS`; documents the
credit/microdollar wallet boundary). Separate because the credit-wallet
side is orthogonal to the copilot microdollar / rate-limit surface this
PR ships.
2026-04-22 13:52:18 +07:00
Zamil Majdy
e3f6d36759 feat(backend/blocks): register 13 paid blocks + document credit/microdollar wallet boundary (#12876)
### Why / What / How

**Why.** Audit of `BLOCK_COSTS` against `credentials_store.py` system
credentials revealed **13 paid blocks** running for free from the credit
wallet's perspective — `BLOCK_COSTS.get(type(block))` returned `None`,
`cost = 0`, no `spend_credits` deduction. Users without their own API
key consumed system credentials with zero credit drain. Separately, the
credit wallet (user-facing prepaid balance) and the copilot microdollar
counter (operator-side meter that gates `daily_cost_limit_microdollars`)
were never documented as separate systems, so future readers kept
tripping on the "why isn't this block charging my limit?" question.

**What.** Three deltas, all credit-wallet-side:

- **Register the 13 paid blocks in `BLOCK_COSTS`** with reasonable
per-call credit prices (1 credit = $0.01). Pricing researched against
the providers' published rates with ~2-3x markup.
- **Document the credit/microdollar boundary** in
`copilot/rate_limit.py`: credits = user-facing prepaid wallet with
marketplace-creator charging; microdollars = operator-side meter that
only ticks on copilot LLM turns (baseline / SDK / web_search /
simulator). Block execution bills credits, not microdollars — explicit
contract.
- **Populate `provider_cost`** on PerplexityBlock so PlatformCostLog
rows carry the real OpenRouter `x-total-cost` value via the existing
`executor/cost_tracking.log_system_credential_cost` path (separate flow
from credit deduction).

### Block costs registered

| Provider | Block | Credits | Raw cost / markup |
|---|---|---|---|
| Perplexity (OpenRouter) | PerplexityBlock — Sonar | 1 | $0.001-0.005 /
call |
| | PerplexityBlock — Sonar Pro | 5 | $0.025 / call |
| | PerplexityBlock — Sonar Deep Research | 10 | up to $0.05 / call |
| Jina | FactCheckerBlock | 1 | $0.005 / call |
| Mem0 | AddMemoryBlock | 1 | $0.0004 / call (1c floor) |
| | SearchMemoryBlock | 1 | $0.004 / call |
| | GetAllMemoriesBlock | 1 | $0.004 / call |
| | GetLatestMemoryBlock | 1 | $0.004 / call |
| ScreenshotOne | ScreenshotWebPageBlock | 2 | $0.0085 / call (2.4x) |
| Nvidia | NvidiaDeepfakeDetectBlock | 2 | est $0.005 (no public SKU) |
| Smartlead | CreateCampaignBlock | 2 | $0.0065 send-equivalent (3x) |
| | AddLeadToCampaignBlock | 1 | $0.0065 (1.5x) |
| | SaveCampaignSequencesBlock | 1 | config-only |
| ZeroBounce | ValidateEmailsBlock | 2 | $0.008 / email (2.5x) |
| E2B + Anthropic | ClaudeCodeBlock | **100** | $0.50-$2 / typical
session (E2B sandbox + in-sandbox Claude) |

**Not in scope** — already covered via the SDK
`ProviderBuilder.with_base_cost()` pattern in their respective
`_config.py`: Exa, Linear, Airtable, Bannerbear, Wolfram, Firecrawl,
Wordpress, Baas, Stagehand, Dataforseo.

### How

1. `backend/data/block_cost_config.py` — 13 new `BlockCost` entries (3
Perplexity models + Fact Checker + 11 from this round).
2. `backend/copilot/rate_limit.py` — boundary docstring.
3. `backend/blocks/perplexity.py` — populate
`NodeExecutionStats.provider_cost` so PlatformCostLog rows carry the
real OpenRouter `x-total-cost` value.
4. Tests — `TestUnregisteredBlockRunsFree` regression +
`TestNewlyRegisteredBlockCosts` pinning every new entry by `cost_amount`
so a future refactor can't quietly drop one.

The companion Notion "Platform System Credentials" database has been
updated with a new `Platform Credit Cost` column populated across all 30
provider rows.

### Scope trim

An earlier revision piped block execution cost into the **copilot
microdollar counter** via `_record_block_microdollar_cost` in
`copilot/tools/helpers.py::execute_block`. That was reverted in
`16ae0f7b5` — the microdollar counter stays scoped to copilot LLM turns
only, credit wallet handles block execution. The pipe-through crossed a
boundary we explicitly want to keep.

### Changes

- `backend/data/block_cost_config.py` — 13 × `BlockCost` entries across
7 providers.
- `backend/blocks/perplexity.py` — populate `provider_cost` on the
execution stats (feeds PlatformCostLog).
- `backend/copilot/rate_limit.py` — boundary docstring only (no
behaviour change).
- `backend/copilot/tools/helpers_test.py` —
`TestUnregisteredBlockRunsFree` + `TestNewlyRegisteredBlockCosts` (8 new
regression tests).
- `backend/blocks/block_cost_tracking_test.py` — provider-cost
extraction pins.

### Checklist

For code changes:
- [x] Changes listed above
- [x] Test plan below
- [x] Tested according to the test plan:
- [x] `poetry run pytest backend/copilot/tools/helpers_test.py
backend/copilot/tools/run_block_test.py
backend/copilot/tools/continue_run_block_test.py
backend/blocks/block_cost_tracking_test.py
backend/blocks/test/test_perplexity.py` — passes
- [x] `poetry run pytest backend/executor/manager_cost_tracking_test.py
backend/copilot/rate_limit_test.py
backend/copilot/token_tracking_test.py` — passes (confirms docstring
edits didn't regress the LLM-turn microdollar path)
  - [x] Pyright clean on all touched files
- [ ] Manual: run PerplexityBlock via copilot `run_block` — credits
deduct, PlatformCostLog row visible with `provider_cost`, no
microdollar-counter tick.
- [ ] Manual: run an unregistered block via copilot — no error, no
credit drain, no silent billing.
- [ ] Manual: run ClaudeCodeBlock via builder — 100 credits deducted
from wallet.

### Companion PR

PR #12873 ships the copilot microdollar / rate-limit work (web_search
cost, simulator cost, reasoning / reconnect fixes). This PR is
credit-wallet only.
2026-04-22 12:03:02 +07:00
Nicholas Tindle
c1b9ed1f5e fix(backend/copilot): allow multiple compactions per turn (#12834)
### Why / What / How

**Why:** The old `CompactionTracker` set a `_done` flag after the first
completion and short-circuited every subsequent compaction in the same
turn. That blocked the SDK-internal compaction from running after a
pre-query compaction had already fired, so prompt-too-long errors
couldn't actually recover — retries saw the flag, bailed, and we re-hit
the context limit.

**What:** Drop the `_done` flag, track attempts and completions as
separate lists, and expose counters + an observability metadata builder
so callers can record compaction activity per turn.

**How:**
- Remove `_done` and `_compact_start` short-circuits.
- Track `_attempted_sources` / `_completed_sources` /
`_completed_count`.
- Expose `attempt_count`, `completed_count`, and
`get_observability_metadata()` / `get_log_summary()` for downstream
instrumentation (no caller change required in this PR).

### Changes 🏗️

- `backend/copilot/sdk/compaction.py` — rewritten `CompactionTracker`
internals; adds properties + observability helpers.
- `backend/copilot/sdk/compaction_test.py` — tests for multi-compaction
flow + new counters.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] `poetry run pytest backend/copilot/sdk/compaction_test.py -xvs`
passes
- [ ] Local chat that hits prompt-too-long now recovers via SDK
compaction instead of failing the turn

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes core streaming compaction state transitions and persistence
timing, which could affect UI event sequencing or compaction completion
behavior under concurrency; coverage is improved with new
multi-compaction tests.
> 
> **Overview**
> Fixes `CompactionTracker` so compaction is no longer single-shot per
turn: removes the `_done`/event-gate behavior, queues multiple
`on_compact()` hook firings via a pending transcript-path deque, and
allows subsequent SDK-internal compactions after a pre-query compaction
within the same query.
> 
> Adds lightweight instrumentation by tracking attempt/completion
sources and counts, plus `get_observability_metadata()` and
`get_log_summary()` (including source summaries like `sdk_internal:2`).
Updates/expands tests to cover multi-compaction flows, transcript-path
handling, and the new counters/metadata.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
9bf8cdd367. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: majdyz <zamil.majdy@agpt.co>
2026-04-22 02:02:03 +00:00
Zamil Majdy
45bc167184 feat(backend/copilot): Kimi K2.6 fast default + 4-config matrix + coalesced reasoning + web_search tool (#12871)
### Why / What / How

**Why.** Three unrelated but interlocking problems on the baseline
(OpenRouter) copilot path, all blocking us from making Kimi K2.6 the
default fast model:

1. **Cost / capability gap on the default.** Kimi K2.6 prices at $0.60 /
$2.80 per MTok — ~5x cheaper input and ~5.4x cheaper output than Sonnet
4.6 — while tying Opus on SWE-Bench Verified (80.2% vs 80.8%) and
beating it on SWE-Bench Pro (58.6% vs 53.4%). OpenRouter exposes the
same `reasoning` / `include_reasoning` extension on Moonshot endpoints
that #12870 plumbed for Anthropic, so the reasoning collapse lights up
end-to-end without per-provider code.
2. **Kimi reasoning deltas freeze the UI.** K2.6 emits ~4,700
reasoning-delta SSE events per turn vs ~28 on Sonnet — the AI SDK v6
Reasoning UIMessagePart can't keep up and the tab locks. Needs a
coalescing buffer upstream.
3. **Kimi loops on `require_guide_read`.** The guide-guard checks
`session.messages` for a prior `agent_building_guide` call, but tool
calls aren't flushed to `session.messages` until the end of the turn —
mid-turn the check keeps returning False and Kimi calls the guide-load
tool repeatedly in the same turn. Needs an in-flight tracker that lives
on `ChatSession`.
4. **No `web_search` tool on either path.** Kimi doesn't have a native
web-search equivalent and the SDK path's native `WebSearch` (the Claude
Code CLI's built-in) doesn't carry cost accounting. We need one
implementation that both paths share and that reports cost through the
same tracker as every other tool call.

**What.** Five grouped deltas on the baseline service, tool layer, and
config:

- **Kimi K2.6 default.** `fast_standard_model` defaults to
`moonshotai/kimi-k2.6`. Full 2×2 model matrix below. Rollback is one env
var.
- **4-config model matrix.** `fast_standard_model` /
`fast_advanced_model` / `thinking_standard_model` /
`thinking_advanced_model`. Each cell independent so baseline can run a
cheap provider at the standard tier without leaking into the SDK path
(which is Anthropic-only by CLI contract). Legacy env vars
(`CHAT_MODEL`, `CHAT_FAST_MODEL`, `CHAT_ADVANCED_MODEL`) stay aliased
via `validation_alias` so live deployments keep resolving to the same
effective cell.
- **Reasoning delta coalescing.** `BaselineReasoningEmitter` buffers
deltas and flushes on a char-count OR time-interval threshold (32 chars
/ 40 ms). ~4,700 → ~150 SSE events per turn on Kimi; no perceptible
change on Sonnet (which was already well under the threshold).
- **In-flight tool-call tracker.** `ChatSession._inflight_tool_calls`
PrivateAttr is populated when a tool-call block is emitted and cleared
at turn end. `session.has_tool_been_called_this_turn(name)` now returns
True mid-turn, not just after the tool-result lands in
`session.messages` — which is what `require_guide_read` needs to cut the
loop.
- **New `web_search` copilot tool.** Wraps Anthropic's server-side
`web_search_20250305` beta via `AsyncAnthropic` (direct — OpenRouter
can't proxy server-side tool execution). Dispatches through
`claude-haiku-4-5` with `max_uses=1`. Cost estimated from published
rates ($0.010 per search + Haiku tokens) since the Anthropic Messages
API doesn't report cost on the response; reported to
`persist_and_record_usage(provider='anthropic')` on both paths. SDK
native `WebSearch` moved from `_SDK_BUILTIN_ALWAYS` into
`SDK_DISALLOWED_TOOLS` so both paths now dispatch through
`mcp__copilot__web_search`.

**How.**

1. `copilot/config.py` — 2×2 model fields with `AliasChoices` preserving
legacy env var names. `populate_by_name = True` so
`ChatConfig(fast_standard_model=...)` works in tests.
2. `copilot/baseline/service.py::_resolve_baseline_model` — resolves the
active baseline cell from `mode` + `tier`, no longer delegates to the
SDK resolver.
3. `copilot/baseline/reasoning.py` — `BaselineReasoningEmitter` gains
`_pending_delta` / `_last_flush_monotonic` and flushes on
`len(_pending_delta) >= _COALESCE_MIN_CHARS` OR `monotonic() -
_last_flush_monotonic >= _COALESCE_MAX_INTERVAL_MS / 1000`.
`_is_reasoning_route` rewritten as an anchored prefix match covering
`anthropic/`, `anthropic.`, `moonshotai/`, and `openrouter/kimi-` —
split from the narrower `_is_anthropic_model` gate that still governs
`cache_control` markers (which Kimi doesn't support).
4. `copilot/model.py::ChatSession` — `_inflight_tool_calls: set[str] =
PrivateAttr(default_factory=set)` plus `announce_inflight_tool_call` /
`clear_inflight_tool_calls` / `has_tool_been_called_this_turn`.
5. `copilot/tools/helpers.py::require_guide_read` — check
`session.has_tool_been_called_this_turn(_AGENT_GUIDE_TOOL_NAME)` before
falling back to scanning `session.messages`.
6. `copilot/tools/web_search.py` — new `WebSearchTool` +
`_extract_results` + `_estimate_cost_usd`. `is_available` gated on
`Settings().secrets.anthropic_api_key` so the deployment can roll back
just by unsetting the key.
7. `copilot/tools/__init__.py` — registers `web_search` in
`TOOL_REGISTRY` so it becomes `mcp__copilot__web_search` in the SDK
path.
8. `copilot/sdk/tool_adapter.py` — `WebSearch` moves to
`SDK_DISALLOWED_TOOLS`.

### Changes

- `copilot/config.py` — 2×2 model matrix with legacy env alias
preservation; `populate_by_name=True`.
- `copilot/baseline/service.py::_resolve_baseline_model` — resolves
against the new matrix.
- `copilot/baseline/reasoning.py` — `BaselineReasoningEmitter`
coalescing buffer; `_is_reasoning_route` rewritten as anchored prefix
match (covers `anthropic/`, `anthropic.`, `moonshotai/`,
`openrouter/kimi-`).
- `copilot/model.py::ChatSession` — `_inflight_tool_calls` PrivateAttr +
helpers.
- `copilot/baseline/service.py::_baseline_tool_executor` — calls
`announce_inflight_tool_call` after emitting `StreamToolInputAvailable`;
`clear_inflight_tool_calls` in the outer `finally` before persist.
- `copilot/tools/helpers.py::require_guide_read` — reads the new tracker
first.
- `copilot/tools/web_search.py` (new) — Anthropic `web_search_20250305`
wrapper + cost estimator.
- `copilot/tools/web_search_test.py` (new) — extractor / cost / dispatch
/ registry tests (12 total).
- `copilot/tools/models.py` — `WebSearchResponse` + `WebSearchResult` +
`ResponseType.WEB_SEARCH`.
- `copilot/tools/__init__.py` — registers `web_search`.
- `copilot/sdk/tool_adapter.py` — moves native `WebSearch` to
`SDK_DISALLOWED_TOOLS`.

### Checklist

For code changes:
- [x] Changes listed above
- [x] Test plan below
- [ ] Tested according to the test plan:
  - [x] `poetry run pytest backend/copilot/baseline/` — all pass
- [x] `poetry run pytest backend/copilot/sdk/` — all pass (SDK resolver
untouched)
- [x] `poetry run pytest backend/copilot/tools/web_search_test.py` — 12
pass
- [ ] Manual: send a multi-step prompt on fast mode with default config;
confirm backend routes to `moonshotai/kimi-k2.6`, SSE stream carries
`reasoning-start/delta/end` (coalesced), Reasoning collapse renders +
survives hard reload.
- [ ] Manual: 43-tool payload reliability on Kimi — watch for malformed
tool-call JSON or wrong-tool selection.
- [ ] Manual: `CHAT_FAST_STANDARD_MODEL=anthropic/claude-sonnet-4-6`
restarts confirm Sonnet routing (rollback path works).
- [ ] Manual: SDK path (`CHAT_USE_CLAUDE_AGENT_SDK=true`) still selects
the SDK service and uses `thinking_standard_model` = Sonnet (no Kimi
leaked into extended thinking).
- [ ] Manual: prompt that forces `web_search` — confirm results render,
`persist_and_record_usage(provider='anthropic')` runs, cost lands in the
per-user ledger.
- [ ] Manual: ask Kimi a question that would require
`agent_building_guide` — confirm the guide loads exactly once per turn
(no loop).

For configuration changes:
- [x] `.env.default` — all four model fields fall back to the pydantic
defaults; legacy `CHAT_MODEL` / `CHAT_FAST_MODEL` /
`CHAT_ADVANCED_MODEL` remain honored via `AliasChoices`.
2026-04-22 08:47:08 +07:00
Nicholas Tindle
e4f291e54b feat(frontend): add AutoGPT logo to share page and zip download for outputs (#11741)
### Why / What / How

**Why:** The share page was unbranded (no logo/navigation) and images
from workspace files couldn't render because the proxy didn't handle
public share URLs. Zip downloads also had several gaps — no size limits,
no workspace file support, silent failures on data URLs, and single
files got wrapped in unnecessary zips.

**What:** Adds AutoGPT branding to the share page, secure public access
to workspace files via a SharedExecutionFile allowlist, and a hardened
zip download module.

**How:** Backend scans execution outputs for `workspace://` URIs on
share-enable and persists an allowlist in a new `SharedExecutionFile`
table. A new unauthenticated endpoint serves files validated against
this allowlist. Frontend proxy routing is extended (with UUID
validation) to handle the 7-segment public share download path as a
binary response. Download logic is consolidated into a shared module
with size limits, parallel fetches, filename sanitization, and
single-file direct download.

### Changes 🏗️

**Share page branding:**
- AutoGPT logo header centered at top, linking to `/`
- Dark/light mode variants with correct `priority` on visible variant
only

**Secure public workspace file access (backend):**
- New `SharedExecutionFile` Prisma model with `@@unique([shareToken,
fileId])` constraint
- `_extract_workspace_file_ids()` scans outputs for `workspace://` URIs
(handles nested dicts/lists)
- `create_shared_execution_files()` / `delete_shared_execution_files()`
manage allowlist lifecycle
- Re-share cleans up stale records before creating new ones (prevents
old token access)
- `GET /public/shared/{token}/files/{id}/download` — validates against
allowlist, uniform 404 for all failures
- `Content-Disposition: inline` for share page rendering
- Hand-written Prisma migration
(`20260417000000_add_shared_execution_file`)

**Frontend proxy fix:**
- `isWorkspaceDownloadRequest` extended to match public share path
(7-segment)
- UUID format validation on dynamic path segments (file IDs, share
tokens)
- 30+ adversarial security tests: path traversal, SQL injection, SSRF
payloads, unicode homoglyphs, null bytes, prototype pollution, etc.

**Download module (`download-outputs.ts`):**
- Consolidated from two divergent copies into single shared module
- `fetchFileAsBlob` with content-length pre-check before buffering
- `sanitizeFilename` strips path traversal, leading dots, falls back to
"file"
- `getUniqueFilename` deduplicates with counter suffix
- `fetchInParallel` with configurable concurrency (5)
- 50 MB per-file limit, 200 MB aggregate limit
- Data URL try-catch, relative URL support (`/api/proxy/...`)
- Single-file downloads skip zip, go directly to browser download
- Dynamic JSZip import for bundle optimization
- 26 unit tests

**Share page file rendering:**
- `WorkspaceFileRenderer` builds public share URLs when `shareToken` is
in metadata
- `RunOutputs` propagates `shareToken` to renderer metadata

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Share page renders with centered AutoGPT logo
  - [x] Logo links to `/` and shows correct dark/light variant
  - [x] Workspace images render inline on share page
  - [x] Download all produces zip with workspace images included
  - [x] Single-file download skips zip, downloads directly
- [x] Re-sharing generates new token and cleans up old allowlist records
  - [x] Public file download returns 404 for files not in allowlist
  - [x] All frontend tests pass (122 tests across 3 suites)
  - [x] Backend formatter + pyright pass
  - [x] Frontend format + lint + types pass

#### For configuration changes:

- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)

> Note: New Prisma migration required. No env/docker changes needed.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Adds a new unauthenticated file download path gated by a database
allowlist plus a new Prisma model/migration; mistakes here could expose
workspace files or break sharing. Frontend download behavior also
changes significantly (zipping/fetching), which could impact
large-output performance and edge cases.
> 
> **Overview**
> Enables **public rendering and downloading of workspace files on
shared execution pages** by introducing a `SharedExecutionFile`
allowlist tied to the share token and populating it when sharing is
enabled (and clearing it on disable/re-share).
> 
> Adds `GET /public/shared/{share_token}/files/{file_id}/download` (no
auth) that validates the requested file against the allowlist and
returns a uniform 404 on failure; workspace download responses now
support `inline` `Content-Disposition` via the exported
`create_file_download_response` helper.
> 
> Frontend updates the share page to pass `shareToken` into output
renderers so `WorkspaceFileRenderer` can build public-share download
URLs; the proxy matcher is extended/strictly UUID-validated for both
workspace and public-share download paths with extensive adversarial
tests. Output downloading is consolidated into `download-outputs.ts`
using dynamic `jszip` import, filename sanitization/deduping,
concurrency + size limits, and a single-file non-zip fast path.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
e2f5bd9b5a. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <ntindle@users.noreply.github.com>
Co-authored-by: Otto <otto@agpt.co>
2026-04-21 16:26:37 +00:00
Bently
6efbc59fd8 feat(backend): platform server linking API for multi-platform CoPilot (#12615)
## Why
AutoPilot (CoPilot) needs to reach users across chat platforms — Discord
first, Telegram / Slack / Teams / WhatsApp next. To make usage and
billing coherent, every conversation resolves to one AutoGPT account.
There are two independent linking flows:

- **SERVER links**: the first person to claim a server (Discord guild,
Telegram group, …) becomes its owner. Anyone in the server can chat with
the bot; all usage bills to the owner.
- **USER links**: an individual links their 1:1 DMs with the bot to
their own AutoGPT account. Independent from server links — a server
owner still has to link their DMs separately.

## What
Backend for platform linking, split cleanly by trust boundary:

- **Bot-facing operations** run over cluster-internal RPC via a new
`PlatformLinkingManager(AppService)`. No shared bearer token; trust is
the cluster network itself.
- **User-facing operations** stay on REST under JWT auth (the same
pattern as every other feature).

### REST endpoints (JWT auth)

- `GET /api/platform-linking/tokens/{token}/info` — non-sensitive
display info for the link page
- `POST /api/platform-linking/tokens/{token}/confirm` — confirm a SERVER
link
- `POST /api/platform-linking/user-tokens/{token}/confirm` — confirm a
USER link
- `GET /api/platform-linking/links` / `DELETE /links/{id}` — manage
server links
- `GET /api/platform-linking/user-links` / `DELETE /user-links/{id}` —
manage DM links

### `PlatformLinkingManager` `@expose` methods (internal RPC)

- `resolve_server_link(platform, platform_server_id) -> ResolveResponse`
- `resolve_user_link(platform, platform_user_id) -> ResolveResponse`
- `create_server_link_token(req) -> LinkTokenResponse`
- `create_user_link_token(req) -> LinkTokenResponse`
- `get_link_token_status(token) -> LinkTokenStatusResponse`
- `start_chat_turn(req) -> ChatTurnHandle` — resolves the owner,
persists the user message, creates the stream-registry session, enqueues
the turn; returns `(session_id, turn_id, user_id, subscribe_from="0-0")`
so the caller subscribes directly to the per-turn Redis stream.

### New DB models
- `PlatformLink` — `(platform, platformServerId)` → owner's AutoGPT
`userId`
- `PlatformUserLink` — `(platform, platformUserId)` → AutoGPT `userId`
(for DMs)
- `PlatformLinkToken` — one-time token with `linkType` discriminator
(SERVER | USER) and 30-min TTL

## How

- **New `backend/platform_linking/` package**: `models.py` (Pydantic
types), `links.py` (link CRUD helpers — pure business logic), `chat.py`
(`start_chat_turn` orchestration), `manager.py`
(`PlatformLinkingManager(AppService)` + `PlatformLinkingManagerClient`).
Pattern matches `backend/notifications/` + `backend/data/db_manager.py`.
- **Exception translation at the edge**. Helpers raise domain exceptions
(`NotFoundError`, `LinkAlreadyExistsError`, `LinkTokenExpiredError`,
`LinkFlowMismatchError`, `NotAuthorizedError` — all `ValueError`
subclasses in `backend.util.exceptions` so they auto-register with the
RPC exception-mapping). REST routes translate to HTTP codes via a 7-line
`_translate()` helper.
- **Independent scopes, no DM fallback**. `find_server_link()` and
`find_user_link()` each query their own table. A user who owns a linked
server does not leak that identity into their DMs.
- **Race-safe token consumption**. Confirm paths do atomic `update_many`
with `usedAt = None` + `expiresAt > now` in the WHERE clause;
`create_*_token` invalidates pending tokens before issuing a new one.
- **Bug fix**: `start_chat_turn` persists the user message via
`append_and_save_message` before enqueueing the executor turn — mirrors
`backend/api/features/chat/routes.py`. The previous `chat_proxy.py`
skipped this and ran the executor with no user message in history.
- **Streaming**. Copilot streaming lives on Redis Streams (persistent,
replayable). The bot subscribes directly with `subscribe_from="0-0"`, so
late subscribers replay the full stream; no HTTP SSE proxy needed.
- **No PII in logs**: logs reference `session_id`, `turn_id`,
`server_id`, and AutoGPT `user_id` (last 8 chars), but never raw
platform user IDs.
- **New pod**. `PlatformLinkingManager` runs as its own `AppProcess` on
port `8009`; client via `get_platform_linking_manager_client()`. The
infra chart lands in
[cloud-infrastructure#310](https://github.com/Significant-Gravitas/AutoGPT_cloud_infrastructure/pull/310).

## Tests
- **Models** (`models_test.py`) — Platform / LinkType enums, request
validation (CreateLinkToken / ResolveServer / BotChat), response
schemas.
- **Helpers** (`links_test.py`) — resolve, token create (both flows, 409
on already-linked), token status (pending / linked / expired /
superseded-with-no-link), token info (404 / 410), confirm (404 / wrong
flow / already used / expired / same-user / other-user), delete authz.
- **AppService wiring** (`manager_test.py`) — `@expose` methods delegate
to helpers; client surface covers bot-facing ops and excludes
user-facing ones.
- **Adversarial** (`manager_test.py`, `routes_test.py`):
- `asyncio.gather` double-confirm with same user and with two different
users — exactly one winner, other gets clean `LinkTokenExpiredError`, no
double `PlatformLink.create`.
  - Server- and user-link confirm races.
- `TokenPath` regex guard: rejects `%24`, URL-encoded path traversal,
>64 chars; accepts `secrets.token_urlsafe` shape.
- DELETE `link_id` with SQL-injection-style and path-traversal inputs
returns 404 via `NotFoundError`.

## Stack
- #12618 — bot service (rebased onto this so it can consume
`PlatformLinkingManagerClient`)
- #12624 — `/link/{token}` frontend page
-
[cloud-infrastructure#310](https://github.com/Significant-Gravitas/AutoGPT_cloud_infrastructure/pull/310)
— Helm chart for `copilot-bot` + new `platform-linking-manager`

Merge order: this → #12618#12624, infra whenever.

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
2026-04-21 16:01:03 +00:00
Nicholas Tindle
6924cf90a5 fix(frontend/copilot): artifact panel fixes (SECRT-2254/2223/2220/2255/2224/2256/2221) (#12856)
### Why / What / How


https://github.com/user-attachments/assets/ca26e0b0-d35d-4a5b-b95f-2421b9907742


**Why** — The Artifact & Side Task List project
(https://linear.app/autogpt/project/artifact-and-side-task-list-ef863c93da3c)
accumulated seven related bugs in the copilot artifact panel. The user
kept seeing panels stuck open, previews broken, clicks not registering —
each ticket was small but they all lived in the same small surface area,
so one review pass is easier than five.

Closes SECRT-2254, SECRT-2223, SECRT-2220, SECRT-2255, SECRT-2224,
SECRT-2256, SECRT-2221.

**What** — Five independent fixes, each in its own commit, shipped
together:

1. **Fragment-link interceptor + render error boundary** (SECRT-2255
crash when clicking `<a href="#x">` in HTML artifacts). Sandboxed srcdoc
iframes resolve fragment links against the parent's URL, so clicking
`#activation` in a Plotly TOC tried to navigate the copilot page into
the iframe. Inject a click-capture script into every artifact iframe;
also wrap the renderer in `ArtifactErrorBoundary` so any future render
throw surfaces with a copyable error instead of a blank panel.
2. **Close panel on copilot page unmount** (SECRT-2254 / 2223 / 2220 —
panel stays open, reopens on unrelated navigation, opens by default on
session switch). The Zustand store outlived page unmounts, so `isOpen:
true` survived `/profile` → `/home` → back. One `useEffect` cleanup in
`useAutoOpenArtifacts` calls `resetArtifactPanel()` on unmount.
3. **Sync loading flip on Try Again** (SECRT-2224 "try again doesn't do
anything"). Retry was correct but the loading-state flip was deferred to
an effect, so a retry that re-failed was visually indistinguishable from
a no-op. `retry()` now sets `isLoading: true` / `error: null`
synchronously with the click so the skeleton flashes every time.
4. **Pointer capture on resize drag** (SECRT-2256 "can't drag right when
expanded far left, click doesn't stop it"). The sandboxed iframe was
eating `pointermove`/`pointerup` events when the cursor drifted over it,
freezing the drag and never delivering the release. `setPointerCapture`
on the handle routes all subsequent pointer events through it regardless
of what's under the cursor.
5. **Stop size-gating natively-rendered artifacts + cache-bust retry**
(SECRT-2221 "broken hi-res PNG preview"). The blanket >10 MB size gate
pushed large images / videos / PDFs into `download-only`, so clicking a
hi-res PNG offered a download instead of a preview. Split the gate so it
only applies to content we actually render in JS (text/html/code/etc).
Image and video retries also append a cache-bust query so the browser
can't silently reuse a negative-cached failure.

**How** — Five commits, one concern each, preserved in the order they
were written. Every fix lands with a regression test that fails on the
unfixed code and passes after.

### Changes 🏗️

- `iframe-sandbox-csp.ts` + usage sites —
`FRAGMENT_LINK_INTERCEPTOR_SCRIPT` injected into all three srcdoc iframe
templates (HTML artifact, inline HTMLRenderer, React artifact).
- `ArtifactErrorBoundary.tsx` (new) — class error boundary local to the
artifact panel with a copyable error fallback.
- `useAutoOpenArtifacts.ts` — unmount cleanup calls
`resetArtifactPanel()`.
- `useArtifactContent.ts` — `retry()` flips loading state synchronously.
- `ArtifactDragHandle.tsx` — `setPointerCapture` /
`releasePointerCapture`; `touch-action: none`.
- `helpers.ts` — split classifier; `NATIVELY_RENDERED` exempts
image/video/pdf from the size gate.
- `ArtifactContent.tsx` — image/video carry a retry nonce that appends
`?_retry=N` on Try Again.
- Test files — new
`ArtifactErrorBoundary`/`ArtifactDragHandle`/`HTMLRenderer` tests, plus
regression cases added to `ArtifactContent.test.tsx`, `helpers.test.ts`,
`iframe-sandbox-csp.test.ts`, `reactArtifactPreview.test.ts`,
`useAutoOpenArtifacts.test.ts`.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `pnpm vitest run src/app/\(platform\)/copilot
src/components/contextual/OutputRenderers
src/lib/__tests__/iframe-sandbox-csp.test.ts` — 247/247 pass
  - [x] `pnpm format && pnpm types` clean
- [x] Manual: open the Plotly-style TOC HTML artifact (SECRT-2255
repro), click each anchor — iframe scrolls internally, browser URL bar
stays put
- [x] Manual: open panel → navigate to /profile → navigate back → panel
closed (SECRT-2254)
- [x] Manual: panel open in session A → click different session → panel
closed (SECRT-2223)
- [ ] Manual: simulate a failed artifact fetch → click Try Again →
skeleton flashes before result (SECRT-2224)
- [x] Manual: expand panel to near-full width → drag back right,
crossing over the iframe → drag keeps working and release ends it
(SECRT-2256)
- [x] Manual: upload a ~25 MB PNG → clicking it previews in an `<img>`,
not a download button (SECRT-2221)

Replaces #12836, #12837, #12838, #12839, #12840 — same fixes, bundled
for review.


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches artifact rendering and iframe `srcDoc` generation (including
injected scripts) plus panel state/drag interactions; regressions could
break previews or resizing, but changes are scoped to the copilot
artifact UI with broad test coverage.
> 
> **Overview**
> Improves Copilot’s artifact panel resilience and UX by **resetting
panel state on page unmount/session changes**, making content retries
immediately show the loading skeleton, and fixing resize drags via
pointer capture so iframes can’t “steal” pointer events.
> 
> Hardens artifact rendering by adding a local `ArtifactErrorBoundary`
that reports to Sentry and shows a copyable error fallback instead of a
blank/crashed panel.
> 
> Fixes iframe-based previews by injecting a
`FRAGMENT_LINK_INTERCEPTOR_SCRIPT` into HTML and React artifact `srcDoc`
so `#anchor` clicks scroll within the iframe rather than navigating the
parent URL, and adjusts artifact classification/retry behavior so large
images/videos/PDFs remain previewable and image/video retries cache-bust
failed URLs.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
bde37a13fd. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 15:53:01 +00:00
Nicholas Tindle
07e5a6a9e4 [Snyk] Security upgrade next from 15.4.10 to 15.4.11 (#12715)
![snyk-top-banner](https://res.cloudinary.com/snyk/image/upload/r-d/scm-platform/snyk-pull-requests/pr-banner-default.svg)

### Snyk has created this PR to fix 1 vulnerabilities in the yarn
dependencies of this project.

#### Snyk changed the following file(s):

- `autogpt_platform/frontend/package.json`


#### Note for
[zero-installs](https://yarnpkg.com/features/zero-installs) users

If you are using the Yarn feature
[zero-installs](https://yarnpkg.com/features/zero-installs) that was
introduced in Yarn V2, note that this PR does not update the
`.yarn/cache/` directory meaning this code cannot be pulled and
immediately developed on as one would expect for a zero-install project
- you will need to run `yarn` to update the contents of the
`./yarn/cache` directory.
If you are not using zero-install you can ignore this as your flow
should likely be unchanged.



<details>
<summary>⚠️ <b>Warning</b></summary>

```
Failed to update the yarn.lock, please update manually before merging.
```

</details>



#### Vulnerabilities that will be fixed with an upgrade:

|  | Issue |  
:-------------------------:|:-------------------------
![high
severity](https://res.cloudinary.com/snyk/image/upload/w_20,h_20/v1561977819/icon/h.png
'high severity') | Allocation of Resources Without Limits or Throttling
<br/>[SNYK-JS-NEXT-15921797](https://snyk.io/vuln/SNYK-JS-NEXT-15921797)




---

> [!IMPORTANT]
>
> - Check the changes in this PR to ensure they won't cause issues with
your project.
> - Max score is 1000. Note that the real score may have changed since
the PR was raised.
> - This PR was automatically created by Snyk using the credentials of a
real user.

---

**Note:** _You are seeing this because you or someone else with access
to this repository has authorized Snyk to open fix PRs._

For more information: <img
src="https://api.segment.io/v1/pixel/track?data=eyJ3cml0ZUtleSI6InJyWmxZcEdHY2RyTHZsb0lYd0dUcVg4WkFRTnNCOUEwIiwiYW5vbnltb3VzSWQiOiJmM2NkN2NiMy1iYzU5LTRkMDMtOGExMi0xOTEwMDk4OGQwNmUiLCJldmVudCI6IlBSIHZpZXdlZCIsInByb3BlcnRpZXMiOnsicHJJZCI6ImYzY2Q3Y2IzLWJjNTktNGQwMy04YTEyLTE5MTAwOTg4ZDA2ZSJ9fQ=="
width="0" height="0"/>
🧐 [View latest project
report](https://app.snyk.io/org/significant-gravitas/project/3d924968-0cf3-4767-9609-501fa4962856?utm_source&#x3D;github&amp;utm_medium&#x3D;referral&amp;page&#x3D;fix-pr)
📜 [Customise PR
templates](https://docs.snyk.io/scan-using-snyk/pull-requests/snyk-fix-pull-or-merge-requests/customize-pr-templates?utm_source=github&utm_content=fix-pr-template)
🛠 [Adjust project
settings](https://app.snyk.io/org/significant-gravitas/project/3d924968-0cf3-4767-9609-501fa4962856?utm_source&#x3D;github&amp;utm_medium&#x3D;referral&amp;page&#x3D;fix-pr/settings)
📚 [Read about Snyk's upgrade
logic](https://docs.snyk.io/scan-with-snyk/snyk-open-source/manage-vulnerabilities/upgrade-package-versions-to-fix-vulnerabilities?utm_source=github&utm_content=fix-pr-template)

---

**Learn how to fix vulnerabilities with free interactive lessons:**

🦉 [Allocation of Resources Without Limits or
Throttling](https://learn.snyk.io/lesson/no-rate-limiting/?loc&#x3D;fix-pr)

[//]: #
'snyk:metadata:{"breakingChangeRiskLevel":null,"FF_showPullRequestBreakingChanges":false,"FF_showPullRequestBreakingChangesWebSearch":false,"customTemplate":{"variablesUsed":[],"fieldsUsed":[]},"dependencies":[{"name":"next","from":"15.4.10","to":"15.4.11"}],"env":"prod","issuesToFix":["SNYK-JS-NEXT-15921797"],"prId":"f3cd7cb3-bc59-4d03-8a12-19100988d06e","prPublicId":"f3cd7cb3-bc59-4d03-8a12-19100988d06e","packageManager":"yarn","priorityScoreList":[null],"projectPublicId":"3d924968-0cf3-4767-9609-501fa4962856","projectUrl":"https://app.snyk.io/org/significant-gravitas/project/3d924968-0cf3-4767-9609-501fa4962856?utm_source=github&utm_medium=referral&page=fix-pr","prType":"fix","templateFieldSources":{"branchName":"default","commitMessage":"default","description":"default","title":"default"},"templateVariants":["updated-fix-title","pr-warning-shown"],"type":"auto","upgrade":["SNYK-JS-NEXT-15921797"],"vulns":["SNYK-JS-NEXT-15921797"],"patch":[],"isBreakingChange":false,"remediationStrategy":"vuln"}'

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Patch-level upgrade of a core runtime/build dependency (Next.js) can
affect app rendering/build behavior despite being scoped to
dependency/lockfile changes.
> 
> **Overview**
> Upgrades the frontend framework dependency `next` from `15.4.10` to
`15.4.11` in `package.json`.
> 
> Updates `pnpm-lock.yaml` to reflect the new Next.js version (including
`@next/env`) and re-resolves dependent packages that pin `next` in their
peer/optional dependency graphs (e.g., `@sentry/nextjs`,
`@vercel/analytics`, Storybook Next integration).
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
dc19e1f178. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 15:44:47 +00:00
Zamil Majdy
a098f01bd2 feat(builder): AI chat panel for the flow builder (#12699)
### Why

The flow builder had no AI assistance. Users had to switch to a separate
Copilot session to ask about or modify the agent they were looking at,
and that session had no context on the graph — so the LLM guessed, or
the user had to describe the graph by hand.

### What

An AI chat panel anchored to the `/build` page. Opens with a chat-circle
button (bottom-right), binds to the currently-opened agent, and offers
**only** two tools: `edit_agent` and `run_agent`. Per-agent session is
persisted server-side, so a refresh resumes the same conversation. Gated
behind `Flag.BUILDER_CHAT_PANEL` (default off;
`NEXT_PUBLIC_FORCE_FLAG_BUILDER_CHAT_PANEL=true` to enable locally).

### How

**Frontend — new**:
- `(platform)/build/components/BuilderChatPanel/` — panel shell +
`useBuilderChatPanel.ts` coordinator. Renders the shared Copilot
`ChatMessagesContainer` + `ChatInput` (thought rendering, pulse chips,
fast-mode toggle — all reused, no parallel chat stack). Auto-creates a
blank agent when opened with no `flowID`. Listens for `edit_agent` /
`run_agent` tool outputs and wires them to the builder in-place: edit →
`flowVersion` URL param + canvas refetch; run → `flowExecutionID` URL
param → builder's existing execution-follow UI opens.

**Frontend — touched (minimal)**:
- `copilot/components/CopilotChatActionsProvider` — new `chatSurface:
"copilot" | "builder"` flag so cards can suppress "Open in library" /
"Open in builder" / "View Execution" buttons when the chat is the
builder panel (you're already there).
- `copilot/tools/RunAgent/components/ExecutionStartedCard` — title is
now status-aware (`QUEUED → "Execution started"`, `COMPLETED →
"Execution completed"`, `FAILED → "Execution failed"`, etc.).
- `build/components/FlowEditor/Flow/Flow.tsx` — mount the panel behind
the feature flag.

**Backend — new**:
- `copilot/builder_context.py` — the builder-session logic module. Holds
the tool whitelist (`edit_agent`, `run_agent`), the permissions
resolver, the session-long system-prompt suffix (graph id/name + full
agent-building guide — cacheable across turns), and the per-turn
`<builder_context>` prefix (live version + compact nodes/links
snapshot).
- `copilot/builder_context_test.py` — covers both builders, ownership
forwarding, and cap behavior.

**Backend — touched**:
- `api/features/chat/routes.py` — `CreateSessionRequest` gains
`builder_graph_id`. When set, the endpoint routes through
`get_or_create_builder_session` (keyed on `user_id`+`graph_id`, with a
graph-ownership check). No new route; the former `/sessions/builder` is
folded into `POST /sessions`.
- `copilot/model.py` — `ChatSessionMetadata.builder_graph_id`;
`get_or_create_builder_session` helper.
- `data/graph.py` — `GraphSettings.builder_chat_session_id` (new typed
field; stores the builder-chat session pointer per library agent).
- `api/features/library/db.py` —
`update_library_agent_version_and_settings` preserves
`builder_chat_session_id` across graph-version bumps.
- `copilot/tools/edit_agent.py`, `run_agent.py` — builder-bound guard:
default missing `agent_id` to the bound graph, reject any other id.
`run_agent` additionally inlines `node_executions` into dry-run
responses so the LLM can inspect per-node status in the same turn
instead of a follow-up `view_agent_output`. `wait_for_result` docs now
explain the two dispatch modes.
- `copilot/tools/helpers.py::require_guide_read` — bypassed for
builder-bound sessions (the guide is already in the system-prompt
suffix).
- `copilot/tools/agent_generator/pipeline.py` + `tools/models.py` —
`AgentSavedResponse.graph_version` so the frontend can flip
`flowVersion` to the newly-saved version.
- `copilot/baseline/service.py` + `sdk/service.py` — inject the builder
context suffix into the system prompt and the per-turn prefix into the
current user message.
- `blocks/_base.py` — `validate_data(..., exclude_fields=)` so dry-run
can bypass credential required-checks for blocks that need creds in
normal mode (OrchestratorBlock). `blocks/perplexity.py` override
signature matches.
- `executor/simulator.py` — OrchestratorBlock dry-run iteration cap `1 →
min(original, 10)` so multi-role patterns (Advocate/Critic) actually
close the loop; `manager.py` synthesizes placeholder creds in dry-run so
the block's schema validation passes.

### Session lookup

The builder-chat session pointer lives on
`LibraryAgent.settings.builder_chat_session_id` (typed via
`GraphSettings`). `get_or_create_builder_session` reads/writes it
through `library_db().get_library_agent_by_graph_id` +
`update_library_agent(settings=...)` — no raw SQL or JSON-path filter.
Ownership is enforced by the library-agent query's `userId` filter. The
per-session builder binding still lives on
`ChatSession.metadata.builder_graph_id` (used by
`edit_agent`/`run_agent` guards and the system-prompt injection).

### Scope footnotes

- Feature flag defaults **false**. Rollout gate lives in LaunchDarkly.
- No schema migration required: `builder_chat_session_id` slots into the
existing `LibraryAgent.settings` JSON column via the typed
`GraphSettings` model.
- Commits that address review / CI cycles are interleaved with feature
commits — see the commit log for the per-change rationale.

### Test plan

- [x] `pnpm test:unit` + backend `poetry run test` for new and touched
modules
- [x] Agent-browser pass: panel toggle / auto-create / real-time edit
re-render / real-time exec URL subscribe / queue-while-streaming /
cross-graph reset / hard-refresh session persist
- [x] Codecov patch ≥ 80% on diff

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 22:47:23 +07:00
Nicholas Tindle
59273fe6a0 fix(frontend): forward sentry-trace and baggage across API proxy (#12835)
### Why / What / How

**Why:** Every request that went through Next's rewrite proxy broke
distributed tracing. The browser Sentry SDK emitted `sentry-trace` and
`baggage`, but `createRequestHeaders` only forwarded impersonation + API
key, so the backend started a disconnected transaction. The frontend →
backend lineage never appeared in Sentry. Same gap on
direct-from-browser requests: the custom mutator never attached the
trace headers itself, so even non-proxied paths lost the link.

**What:**
- **Server side:** forward `sentry-trace` and `baggage` from
`originalRequest.headers` alongside the existing impersonation/API key
forwarding.
- **Client side:** the custom mutator pulls trace data via
`Sentry.getTraceData()` and attaches it to outgoing headers when running
on the client.

**How:** Inline additions — no new observability module, no new
dependencies beyond `@sentry/nextjs` which the frontend already uses for
Sentry init.

### Changes 🏗️

- `src/lib/autogpt-server-api/helpers.ts` — forward `sentry-trace` +
`baggage` in `createRequestHeaders`.
- `src/app/api/mutators/custom-mutator.ts` — import `@sentry/nextjs`,
attach `Sentry.getTraceData()` on client-side requests.
- `src/app/api/mutators/__tests__/custom-mutator.test.ts` — three new
tests: trace-data present, trace-data empty, server-side no-op.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [x] `pnpm vitest run
src/app/api/mutators/__tests__/custom-mutator.test.ts` passes (6/6
locally)
  - [x] `pnpm format && pnpm lint` clean
- [x] `pnpm types` clean for touched files (pre-existing unrelated type
errors on dev are untouched)
- [ ] In a local session with Sentry enabled, a `/copilot` chat turn
produces a distributed trace that spans frontend transaction → backend
transaction (single trace ID in Sentry)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk: header-only changes to request construction for
observability, with added tests; primary risk is unintended header
propagation affecting upstream/proxy behavior.
> 
> **Overview**
> Restores **Sentry distributed tracing continuity** for
frontend→backend calls by propagating `sentry-trace`/`baggage` headers.
> 
> On the client, `customMutator` now reads `Sentry.getTraceData()` and
attaches string trace headers to outgoing requests (guarded for
server-side and older Sentry builds). On the server/proxy path,
`createRequestHeaders` now forwards `sentry-trace` and `baggage` from
the incoming `originalRequest` alongside existing impersonation/API-key
forwarding, with new unit tests covering these cases.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
0f6946b776. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 15:29:19 +00:00
Nicholas Tindle
38c2844b83 feat(admin): Add system diagnostics and execution management dashboard (#11235)
### Changes 🏗️
This PR adds a comprehensive admin diagnostics dashboard for monitoring
system health and managing running executions.


https://github.com/user-attachments/assets/f7afa3ed-63d8-4b5c-85e4-8756d9e3879e


#### Backend Changes:
- **New data layer** (backend/data/diagnostics.py): Created a dedicated
diagnostics module following the established data layer pattern
- get_execution_diagnostics() - Retrieves execution metrics (running,
queued, completed counts)
  - get_agent_diagnostics() - Fetches agent-related metrics
- get_running_executions_details() - Lists all running executions with
detailed info
- stop_execution() and stop_executions_bulk() - Admin controls for
stopping executions

- **Admin API endpoints**
(backend/server/v2/admin/diagnostics_admin_routes.py):
  - GET /admin/diagnostics/executions - Execution status metrics
  - GET /admin/diagnostics/agents - Agent utilization metrics
- GET /admin/diagnostics/executions/running - Paginated list of running
executions
  - POST /admin/diagnostics/executions/stop - Stop single execution
- POST /admin/diagnostics/executions/stop-bulk - Stop multiple
executions
  - All endpoints secured with admin-only access

#### Frontend Changes:
- **Diagnostics Dashboard**
(frontend/src/app/(platform)/admin/diagnostics/page.tsx):
- Real-time system metrics display (running, queued, completed
executions)
  - RabbitMQ queue depth monitoring
  - Agent utilization statistics
  - Auto-refresh every 30 seconds

- **Execution Management Table**
(frontend/src/app/(platform)/admin/diagnostics/components/ExecutionsTable.tsx):
- Displays running executions with: ID, Agent Name, Version, User
Email/ID, Status, Start Time
  - Multi-select functionality with checkboxes
  - Individual stop buttons for each execution
  - "Stop Selected" and "Stop All" bulk actions
  - Confirmation dialogs for safety
  - Pagination for handling large datasets
  - Toast notifications for user feedback

#### Security:
- All admin endpoints properly secured with requires_admin_user
decorator
- Frontend routes protected with role-based access controls
- Admin navigation link only visible to admin users

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  
  - [x] Verified admin-only access to diagnostics page
  - [x] Tested execution metrics display and auto-refresh
  - [x] Confirmed RabbitMQ queue depth monitoring works
  - [x] Tested stopping individual executions
  - [x] Tested bulk stop operations with multi-select
  - [x] Verified pagination works for large datasets
  - [x] Confirmed toast notifications appear for all actions

#### For configuration changes:

- [x] `.env.default` is updated or already compatible with my changes
(no changes needed)
- [x] `docker-compose.yml` is updated or already compatible with my
changes (no changes needed)
- [x] I have included a list of my configuration changes in the PR
description (no config changes required)



<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Adds new admin-only endpoints that can stop, requeue, and bulk-mark
executions as `FAILED`, plus schedule deletion, which can directly
impact production workload and data integrity if misused or buggy.
> 
> **Overview**
> Introduces a **System Diagnostics** admin feature spanning backend +
frontend to monitor execution/schedule health and perform remediation
actions.
> 
> On the backend, adds a new `backend/data/diagnostics.py` data layer
and `diagnostics_admin_routes.py` with admin-secured endpoints to fetch
execution/agent/schedule metrics (including RabbitMQ queue depths and
invalid-state detection), list problem executions/schedules, and perform
bulk operations like `stop`, `requeue`, and `cleanup` (marking
orphaned/stuck items as `FAILED` or deleting orphaned schedules). It
also extends `get_graph_executions`/`get_graph_executions_count` with
`execution_ids` filtering, pagination, started/updated time filters, and
configurable ordering to support efficient bulk/admin queries.
> 
> On the frontend, adds an admin diagnostics page with summary cards and
tables for executions and schedules (tabs for
orphaned/failed/long-running/stuck-queued/invalid, plus confirmation
dialogs for destructive actions), wires it into admin navigation, and
adds comprehensive unit tests for both the new API routes and UI
behavior.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
15b9ed26f9. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <ntindle@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-04-21 15:28:44 +00:00
Zamil Majdy
24850e2a3e feat(backend/autopilot): stream extended_thinking on baseline via OpenRouter (#12870)
### Why / What / How

**Why:** Fast-mode autopilot never renders a Reasoning block. The
frontend already has `ReasoningCollapse` wired up and the wire protocol
already carries `StreamReasoning*` events (landed for SDK mode in
#12853), but the baseline (OpenRouter OpenAI-compat) path never asks
Anthropic for extended thinking and never parses reasoning deltas off
the stream. Result: users on fast/standard get a good answer with no
visible chain-of-thought, while SDK users see the full Reasoning
collapse.

**What:** Plumb reasoning end-to-end through the baseline path by opting
into OpenRouter's non-OpenAI `reasoning` extension, parsing the
reasoning delta fields off each chunk, and emitting the same
`StreamReasoningStart/Delta/End` events the SDK adapter already uses.

**How:**
- **New config:** `baseline_reasoning_max_tokens` (default 8192; 0
disables). Sent as `extra_body={"reasoning": {"max_tokens": N}}` only on
Anthropic routes — other providers drop the field, and
`is_anthropic_model()` already gates this.
- **Delta extraction:** `_extract_reasoning_delta()` handles all three
OpenRouter/provider variants in priority order — legacy
`delta.reasoning` (string), DeepSeek-style `delta.reasoning_content`,
and the structured `delta.reasoning_details` list (text/summary entries;
encrypted or unknown entries are skipped).
- **Event emission:** Reasoning uses the same state-machine rules the
SDK adapter uses — a text delta or tool_use delta arriving mid-stream
closes the open reasoning block first, so the AI SDK v5 transport keeps
reasoning / text / tool-use as distinct UI parts. On stream end, any
still-open reasoning block gets a matching `reasoning-end` so a
reasoning-only turn still finalises the frontend collapse.
- **Scope:** Live streaming only. Reasoning is not persisted to
`ChatMessage` rows or the transcript builder in this PR (SDK path does
so via `content_blocks=[{type: 'thinking', ...}]`, but that round-trip
requires Anthropic signature plumbing baseline doesn't have today).
Reload will still not show reasoning on baseline sessions — can follow
up if we decide it's worth the signature handling.

### Changes

- `backend/copilot/config.py` — new `baseline_reasoning_max_tokens`
field.
- `backend/copilot/baseline/service.py` — new
`_extract_reasoning_delta()` helper; reasoning block state on
`_BaselineStreamState`; `reasoning` gated into `extra_body`; chunk loop
emits `StreamReasoning*` events with text/tool_use transition rules;
stream-end closes any open reasoning block.
- `backend/copilot/baseline/service_unit_test.py` — 11 new tests
covering extractor variants (legacy string, deepseek alias, structured
list with text/summary aliases, encrypted-skip, empty), paired event
ordering (reasoning-end before text-start), reasoning-only streams, and
that the `reasoning` request param is correctly gated by model route
(Anthropic vs non-Anthropic) and by the config flag.

### Checklist

For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [x] `poetry run pytest backend/copilot/baseline/service_unit_test.py
backend/copilot/baseline/transcript_integration_test.py` — 103 passed
- [ ] Manual: with `CHAT_USE_CLAUDE_AGENT_SDK=false` and
`CHAT_MODEL=anthropic/claude-sonnet-4-6`, send a multi-step prompt on
fast mode and confirm a Reasoning collapse appears alongside the final
text
- [ ] Manual: flip `CHAT_BASELINE_REASONING_MAX_TOKENS=0` and confirm
baseline responses revert to text-only (no reasoning param, no reasoning
UI)
- [ ] Manual: with a non-Anthropic baseline model (`openai/gpt-4o`),
confirm the request does NOT include `reasoning` and nothing regresses

For configuration changes:
- [x] `.env.default` is compatible — new setting falls back to the
pydantic default
2026-04-21 21:05:00 +07:00
Zamil Majdy
e17e9f13c4 fix(backend/copilot): reduce SDK + baseline prompt cache waste (#12866)
## Summary

Four cost-reduction changes for the copilot feature. Consolidated into
one PR at user request; each commit is self-contained and bisectable.

### 1. SDK: full cross-user cache on every turn (CLI 2.1.116 bump)
Previous behavior: CLI 2.1.97 crashed when `excludeDynamicSections=True`
was combined with `--resume`, so the code fell back to a raw
`system_prompt` string on resume, losing Claude Code's default prompt
and all cache markers. Every Turn 2+ of an SDK session wrote ~33K tokens
to cache instead of reading.

Fix: install `@anthropic-ai/claude-code@2.1.116` in the backend Docker
image and point the SDK at it via
`CHAT_CLAUDE_AGENT_CLI_PATH=/usr/bin/claude`. CLI 2.1.98+ fixes the
crash, so we can use the preset with `exclude_dynamic_sections=True` on
every turn — Turn 1, 2, 3+ all share the same static prefix and hit the
**cross-user** prompt cache.

**Local dev requirement:** if `CHAT_CLAUDE_AGENT_CLI_PATH` is unset, the
bundled 2.1.97 fallback will crash on `--resume`. Install the CLI
globally (`npm install -g @anthropic-ai/claude-code@2.1.116`) or set the
env var.

### 2. Baseline: add `cache_control` markers (commit `756b3ecd9` +
follow-ups)
Baseline path had zero `cache_control` across `backend/copilot/**`.
Every turn was full uncached input (~18.6K tokens, ~$0.058). Two
ephemeral markers — on the system message (content-blocks form) and the
last tool schema — plus `anthropic-beta: prompt-caching-2024-07-31` via
`extra_headers` as defense-in-depth. Helpers split into `_mark_tools_*`
(precomputed once per session) and `_mark_system_*` (per-round, O(1)).
Repeat hellos: ~$0.058 → ~$0.006.

### 3. Drop `get_baseline_supplement()` (commit `6e6c4d791`)
`_generate_tool_documentation()` emitted ~4.3K tokens of `(tool_name,
description)` pairs that exactly duplicated the tools array already in
the same request. Deleted. `SHARED_TOOL_NOTES` (cross-tool workflow
rules) is preserved. Baseline "hello" input: ~18.7K → ~14.4K tokens.

### 4. Langfuse "CoPilot Prompt" v26 (published under `review` label)
Separate, out-of-repo change. v25 had three duplicate "Example Response"
blocks + a 10-step "Internal Reasoning Process" section. v26 collapses
to one example + bullet-form reasoning. Char count 20,481 → 7,075 (rough
4 chars/token → ~5,100 → ~1,770 tokens).

- v26 is published with label `review` (NOT `production`); v25 remains
active.
- Promote via `mcp__langfuse__updatePromptLabels(name="CoPilot Prompt",
version=26, newLabels=["production"])` after smoke-test.
- Rollback: relabel v25 `production`.

## Test plan
- [x] Unit tests for `_build_system_prompt_value` (fresh vs resumed
turns emit identical preset dict)
- [x] SDK compat tests pass including
`test_bundled_cli_version_is_known_good_against_openrouter`
- [x] `cli_openrouter_compat_test.py` passes against CLI 2.1.116
(locally verified with
`CHAT_CLAUDE_AGENT_CLI_PATH=/opt/homebrew/bin/claude`)
- [x] 8 new `_mark_*` unit tests + identity regression test for
`_fresh_*` helpers
- [x] `SHARED_TOOL_NOTES` public-constant test passes; 5 old tool-docs
tests removed
- [ ] **Manual cost verification (commit 1):** send two consecutive SDK
turns; Turn 2 and Turn 3 should both show `cacheReadTokens` ≈ 33K (full
cross-user cache hits).
- [ ] **Manual cost verification (commit 2):** send two "hello" turns on
baseline <5 min apart; Turn 2 reports `cacheReadTokens` ≈ 18K and cost ≈
$0.006.
- [ ] **Regression sweep for commit 3:** one turn per tool family —
`search_agents`, `run_agent`,
`add_memory`/`forget_memory`/`search_memory`, `search_docs`,
`read_workspace_file` — to verify no tool-selection regression from
dropping the prose tool docs.
- [ ] **Langfuse v26 smoke test:** 5-10 varied turns after relabelling
to `production`; compare responses vs v25 for regression on persona,
concision, capability-gap handling, credential security flows.

## Deployment notes
- Production Docker image now installs CLI 2.1.116 (~20 MB added).
- `CHAT_CLAUDE_AGENT_CLI_PATH=/usr/bin/claude` set in the Dockerfile;
runtime can override via env.
- First deploy after this merge needs a fresh image rebuild to pick up
the new CLI.
2026-04-21 16:34:10 +07:00
Zamil Majdy
f238c153a5 fix(backend/copilot): release session cluster lock on completion (#12867)
## Summary

Fixes a bug where a chat session gets silently stuck after the user
presses Stop mid-turn.

**Root cause:** the cancel endpoint marks the session `failed` after
polling 5s, but the cluster lock held by the still-running task is only
released by `on_run_done` when the task actually finishes. If the task
hangs past the 5s poll (slow LLM call, agent-browser step, etc.), the
lock lingers for up to 5 min — `stream_chat_post`'s `is_turn_in_flight`
check sees the flipped meta (`failed`) and enqueues a new turn, but the
run handler sees the stale lock and drops the user's message at
`manager.py:379` (`reject+requeue=False`). The new SSE stream hangs
until its 60s idle timeout.

### Fix

Two cooperating changes:

1. **`mark_session_completed` force-releases the cluster lock** in the
same transaction that flips status to `completed`/`failed`.
Unconditional delete — by the time we're declaring the session dead, we
don't care who the current lock holder is; the lock has to go so the
next enqueued turn can acquire. This is what closes the stuck-session
window.
2. **`ClusterLock.release()` is now owner-checked** (Lua CAS — `GET ==
token ? DEL : noop` atomically). Force-release means another pod may
legitimately own the key by the time the original task's `on_run_done`
eventually fires. Without the CAS, that late `release()` would wipe the
successor's lock. With it, the late `release()` is a safe no-op when the
owner has changed.

Together: prompt release on completion (via force-delete) + safe cleanup
when on_run_done catches up (via CAS). That re-syncs the API-level
`is_turn_in_flight` check with the actual lock state, so the contention
window disappears.

No changes to the worker-level contention handler: `stream_chat_post`
already queues incoming messages into the pending buffer when a turn is
in flight (via `queue_pending_for_http`). With these fixes, the worker
never sees contention in the common case; if it does (true multi-pod
race), the pre-existing `reject+requeue=False` behaviour still applies —
we'll revisit that path with its own PR if it becomes a production
symptom.

### Verification

- Reproduced the original stuck-session symptom locally (Stop mid-turn →
send new message → backend logs `Session … already running on pod …`,
user message silently lost, SSE stream idle 60s then closes).
- After the fix: cancel → new message → turn starts normally (lock
released by `mark_session_completed`).
- `poetry run pyright` — 0 errors on edited files.
- `pytest backend/copilot/stream_registry_test.py
backend/executor/cluster_lock_test.py` — 33 passed (includes the
successor-not-wiped test).

## Changes

- `autogpt_platform/backend/backend/copilot/executor/utils.py` — extract
`get_session_lock_key(session_id)` helper so the lock-key format has a
single source of truth.
- `autogpt_platform/backend/backend/copilot/executor/manager.py` — use
the helper where the cluster lock is created.
- `autogpt_platform/backend/backend/copilot/stream_registry.py` —
`mark_session_completed` deletes the lock key after the atomic status
swap (force-release).
- `autogpt_platform/backend/backend/executor/cluster_lock.py` —
`ClusterLock.release()` (sync + async) uses a Lua CAS to only delete
when `GET == token`, protecting against wiping a successor after a
force-release.

## Test plan

- [ ] Send a message in /copilot that triggers a long turn (e.g.
`run_agent`), press Stop before it finishes, then send another message.
Expect: new turn starts promptly (no 5-min wait for lock TTL).
- [ ] Happy path regression — send a normal message, verify turn
completes and the session lock key is deleted after completion.
- [ ] Successor protection — unit test
`test_release_does_not_wipe_successor_lock` covers: A acquires, external
DEL, B acquires, A.release() is a no-op, B's lock intact.
2026-04-21 16:27:01 +07:00
Zamil Majdy
01f1289aac feat(copilot): real OpenRouter cost + cost-based rate limits (percent-only public API) (#12864)
## Why

After d7653acd0 removed cost estimation, most baseline turns log with
`tracking_type="tokens"` and no authoritative USD figure (see: dashboard
flipped from `cost_usd` to `tokens` after 4/14/2026). Rate-limit
counters were also token-weighted with hand-rolled cache discounts
(cache_read @ 10%, cache_create @ 25%) and a 5× Opus multiplier — a
proxy for cost that drifts from real OpenRouter billing.

This PR wires real generation cost from OpenRouter into both the
cost-tracking log and the rate limiter, and hides raw spend figures from
the user-facing API so clients can't reverse-engineer per-turn cost or
platform margins.

## What

1. **Real cost from OpenRouter** — baseline passes `extra_body={"usage":
{"include": True}}` and reads `chunk.usage.cost` from the final
streaming chunk. `x-total-cost` header path removed. Missing cost logs
an error and skips the counter update (vs the old estimator that
silently under-counted).
2. **Cost-based rate limiting** — `record_token_usage(...)` →
`record_cost_usage(cost_microdollars)`. The weighted-token math, cache
discount factors, and `_OPUS_COST_MULTIPLIER` are gone; real USD already
reflects model + cache pricing.
3. **Redis key migration** — `copilot:usage:*` → `copilot:cost:*` so
stale token counters can't be misinterpreted as microdollars.
4. **LD flags + config** — renamed to
`copilot-daily-cost-limit-microdollars` /
`copilot-weekly-cost-limit-microdollars` (unit in the LD key so values
can't accidentally be set in dollars or cents).
5. **Public `/usage` hides raw $$** — new `CoPilotUsagePublic` /
`UsageWindowPublic` schemas expose only `percent_used` (0-100) +
`resets_at` + `tier` + `reset_cost`. Admin endpoint keeps raw
microdollars for debugging.
6. **Admin API contract** — `UserRateLimitResponse` fields renamed
`daily/weekly_token_limit` → `daily/weekly_cost_limit_microdollars`,
`daily/weekly_tokens_used` → `daily/weekly_cost_used_microdollars`.
Admin UI displays `$X.XX`.

## How

- `baseline/service.py` — pass `extra_body`, extract cost from
`chunk.usage.cost`, drop the `x-total-cost` header fallback entirely.
- `rate_limit.py` — rewritten around `record_cost_usage`,
`check_rate_limit(daily_cost_limit, weekly_cost_limit)`, new Redis key
prefix. Adds `CoPilotUsagePublic.from_status()` projector for the public
API.
- `token_tracking.py` — converts `cost_usd` → microdollars via
`usd_to_microdollars` and calls `record_cost_usage` only when cost is
present.
- `sdk/service.py` — deletes `_OPUS_COST_MULTIPLIER` and simplifies
`_resolve_model_and_multiplier` to `_resolve_sdk_model_for_request`.
- Chat routes: `/usage` and `/usage/reset` return `CoPilotUsagePublic`.
Internal server-side limit checks still use the raw microdollar
`CoPilotUsageStatus`.
- Admin routes: unchanged response shape (renamed fields only).
- Frontend: `UsagePanelContent`, `UsageLimits`, `CopilotPage`,
`BriefingTabContent`, `credits/page.tsx` consume the new public schema
and render "N% used" + progress bar. Admin `RateLimitDisplay` /
`UsageBar` keep `$X.XX`. Helper `formatMicrodollarsAsUsd` retained for
admin use.
- Tests + snapshots rewritten; new assertions explicitly check that raw
`used`/`limit` keys are absent from the public payload.

## Deploy notes

1. **Before rolling this out, create the new LD flags:**
`copilot-daily-cost-limit-microdollars` (default `500000`) and
`copilot-weekly-cost-limit-microdollars` (default `2500000`). Old
`copilot-*-token-limit` flags can stay in LD for rollback.
2. **One-time Redis cleanup (optional):** token-based counters under
`copilot:usage:*` are orphaned and will TTL out within 7 days. Safe to
ignore or delete manually.

## Test plan

- [x] `poetry run test` — all impacted backend tests pass (182/182 in
targeted scope)
- [x] `pnpm test:unit` — all 1628 integration tests pass
- [x] `poetry run format` / `pnpm format` / `pnpm types` clean
- [x] Manual sanity against dev env — Baseline turn logged $0.1221 for
40K/139 tokens on Sonnet 4 (matches expected pricing)
- [ ] `/pr-test --fix` end-to-end against local native stack
2026-04-21 14:34:43 +07:00
Zamil Majdy
343222ace1 feat(platform): defer paid-to-paid subscription downgrades + cancel-pending flow (#12865)
### Why / What / How

**Why:** Only downgrades to FREE were scheduled at period end; paid→paid
downgrades (e.g. BUSINESS→PRO) applied immediately via Stripe proration.
The asymmetry meant users lost their higher tier mid-cycle in exchange
for a Stripe credit voucher only redeemable on a future subscription — a
confusing pattern that produces negative-value paths for users actually
cancelling. There was also no way to cancel a pending downgrade or
paid→FREE cancellation once scheduled.

**What:** Standardize on "upgrade = immediate, downgrade = next cycle"
and let users cancel a pending change by clicking their current tier.
Harden the new code against conflicting subscription state, concurrent
tab races, flaky Stripe calls, and hot-path latency regressions.

**How:**

Subscription state machine:
- **Upgrade** (PRO→BUSINESS) — `stripe.Subscription.modify` with
immediate proration (unchanged). If a downgrade schedule is already
attached, release it first so the upgrade wins.
- **Paid→paid downgrade** (BUSINESS→PRO) — creates a
`stripe.SubscriptionSchedule` with two phases (current tier until
`current_period_end`, target tier after). No mid-cycle tier demotion.
Defensive pre-clear: existing schedule → release;
`cancel_at_period_end=True` → set to False.
- **Paid→FREE** — unchanged: `cancel_at_period_end=True`.
- **Same-tier update** — reuses the existing `POST
/credits/subscription` route. When `target_tier == current_tier`,
backend calls `release_pending_subscription_schedule` (idempotent) and
returns status. No dedicated cancel-pending endpoint — "Keep my current
tier" IS the cancel operation.
- `release_pending_subscription_schedule` is idempotent on
terminal-state schedules and clears both `schedule` and
`cancel_at_period_end` atomically per call.

API surface:
- New fields on `SubscriptionStatusResponse`: `pending_tier` +
`pending_tier_effective_at` (pulled from the schedule's next-phase
`start_date` so dashboard-authored schedules report the correct
timestamp).
- `POST /credits/subscription` now returns `SubscriptionStatusResponse`
(previously `SubscriptionCheckoutResponse`); the response still carries
`url` for checkout flows and adds the status fields inline.
- `get_pending_subscription_change` is cached with a 30s TTL — avoids
hammering Stripe on every home-page load.
- Webhook dispatches
`subscription_schedule.{released,completed,updated}` through the main
`sync_subscription_from_stripe` flow so both event sources converge to
the same DB state.

Implementation notes:
- New Stripe calls use native async (`stripe.Subscription.list_async`
etc.) and typed attribute access — no `run_in_threadpool` wrapping in
the new helpers.
- Shared `_get_active_subscription` helper collapses the "list
active/trialing subs, take first" pattern used by 4 callers.

Frontend:
- `PendingChangeBanner` sub-component above the tier grid with formatted
effective date + "Keep [CurrentTier]" button. `aria-live="polite"` for
screen readers; locale pinned to `en-US` to avoid SSR/CSR hydration
mismatch.
- "Keep [CurrentTier]" also available as a button on the current tier
card.
- Other tier buttons disabled while a change is pending — user must
resolve pending first to prevent stacked schedules.
- `cancelPendingChange` reuses `useUpdateSubscriptionTier` with `tier:
current_tier`; awaits `refetch()` on both success and error paths so the
UI reconciles even if the server succeeded but the client didn't receive
the response.

### Changes

**Backend (`credit.py`, `v1.py`)**
- Tier-ordering helpers (`is_tier_upgrade`/`is_tier_downgrade`).
- `modify_stripe_subscription_for_tier` routes downgrades through
`_schedule_downgrade_at_period_end`; upgrade path releases any pending
schedule first.
- `_schedule_downgrade_at_period_end` defensively releases pre-existing
schedules and clears `cancel_at_period_end` before creating the new
schedule.
- `release_pending_subscription_schedule` idempotent on terminal-state
schedules; logs partial-failure outcomes.
- `_next_phase_tier_and_start` returns both tier and phase-start
timestamp; warns on unknown prices.
- `get_pending_subscription_change` cached (30s TTL), narrow exception
handling.
- `sync_subscription_schedule_from_stripe` delegates to
`sync_subscription_from_stripe` for convergence with the main webhook
path.
- Shared `_get_active_subscription` +
`_release_schedule_ignoring_terminal` helpers.
- `POST /credits/subscription` absorbs the same-tier "cancel pending
change" branch.

**Frontend (`SubscriptionTierSection/*`)**
- `PendingChangeBanner` new sub-component (a11y, locale-pinned date,
paid→FREE vs paid→paid copy split, non-null effective-date assertion, no
`dark:` utilities).
- "Keep [CurrentTier]" button on current tier card.
- `useSubscriptionTierSection` — `cancelPendingChange` reuses the
update-tier mutation.
- Copy: downgrade dialog + status hint updated.
- `helpers.ts` extracted from the main component.

**Tests**
- Backend: +24 tests (95/95 passing): upgrade-releases-pending-schedule,
schedule-releases-existing-schedule, cancel-at-period-end collision,
terminal-state release idempotency, unknown-price logging, status
response population, same-tier-POST-with-pending, webhook delegation.
- Frontend: +5 integration tests (21/21 passing): banner render/hide,
Keep-button click from banner + current card, paid→paid dialog copy.

### Checklist

- [x] Backend unit tests: 95 pass
- [x] Frontend integration tests: 21 pass
- [x] `poetry run format` / `poetry run lint` clean
- [x] `pnpm format` / `pnpm lint` / `pnpm types` clean
- [ ] Manual E2E on live Stripe (dev env) — pending deploy: BUSINESS→PRO
creates schedule, DB tier unchanged until period end
- [ ] Manual E2E: "Keep BUSINESS" in banner releases schedule
- [ ] Manual E2E: cancel pending paid→FREE flips `cancel_at_period_end`
back to false
- [ ] Manual E2E: BUSINESS→PRO (scheduled) then attempt BUSINESS→FREE
clears the PRO schedule, sets cancel_at_period_end
- [ ] Manual E2E: BUSINESS→PRO (scheduled) then upgrade back to BUSINESS
releases the schedule
2026-04-21 14:01:09 +07:00
Zamil Majdy
a8226af725 fix(copilot): dedupe tool row, lift bash_exec timeout, Stop+resend recovery (#12862)
Closes #12861 · [OPEN-3096](https://linear.app/autogpt/issue/OPEN-3096)

## Why

Four related copilot UX / stability issues surfaced on dev once action
tools started rendering inline in the chat (see #12813):

### 1. Duplicate bash_exec row

`GenericTool` rendered two rows saying the same thing for every
completed tool call — a muted subtitle line ("Command exited with code
1" / "Ran: sleep 20") **and** a `ToolAccordion` with the command echoed
in its description. Previously hidden inside the "Show reasoning" /
"Show steps" collapse, now visibly duplicated.

### 2. `bash_exec` capped at 120s via advisory text

The tool schema said `"Max seconds (default 30, max 120)"`; the model
obeyed, so long-running scripts got clipped at 120s with a vague `Timed
out after 120s` even though the E2B sandbox has no such limit. Confirmed
via Langfuse traces — the model picks `120` for long scripts because
that's what the schema told it the max was. E2B path never had a
server-side clamp.

Originally added in #12103 (default 30) and tightened to "max 120"
advisory in #12398 (token-reduction pass).

### 3. 30s default was too aggressive

`pip install`, small data-processing scripts, etc. routinely cross 30s
and got killed before the model thought to retry with a bigger timeout.

### 4. Stop + edit + resend → "The assistant encountered an error"
([OPEN-3096](https://linear.app/autogpt/issue/OPEN-3096))

Two independent bugs both land on the same banner — fixing only one
leaves the other visible on the next action.

**4a. Stream lock never released on Stop** *(the error in the ticket
screenshot)*. The executor's `async for chunk in
stream_and_publish(...)` broke out on `cancel.is_set()` without calling
`aclose()` on the wrapper. `async for` does NOT auto-close iterators on
`break`, so `stream_chat_completion_sdk` stayed suspended at its current
`await` — still holding the per-session Redis lock (TTL 120s) until GC
eventually closed it. The next `POST /stream` hit `lock.try_acquire()`
at
[sdk/service.py](autogpt_platform/backend/backend/copilot/sdk/service.py)
and yielded `StreamError("Another stream is already active for this
session. Please wait or stop it.")`. The `except GeneratorExit →
lock.release()` handler written exactly for this case never fired
because nothing sent GeneratorExit.

**4b. Orphan `tool_use` after stop-mid-tool.** Even with the lock
released, the stop path persists the session ending on an assistant row
whose `tool_calls` have no matching `role="tool"` row. On the next turn,
`_session_messages_to_transcript` hands Claude CLI `--resume` a JSONL
with a `tool_use` and no paired `tool_result`, and the SDK raises a
vague error — same banner. The ticket's "Open questions" explicitly
flags this.

## What

**Frontend — `GenericTool.tsx`** split responsibilities between the two
rows so they don't duplicate:
- **Subtitle row** (always visible, muted): *what ran* — `Ran: sleep
120`. Never the exit code.
- **Accordion description**: *how it ended* — `completed` / `status code
127 · bash: missing-bin: command not found` / `Timed out after 120s` /
(fallback to command preview for legacy rows missing `exit_code` /
`timed_out`). Pulled from the first non-empty line of `stdout` /
`stderr` when available.
- **Expanded accordion**: full command + stdout + stderr code blocks
(unchanged).

**Backend — `bash_exec.py`**:
- Drop the "max 120" advisory from the schema description.
- Bump default `timeout: 30 → 120`.
- Clean up the result message — `"Command executed with status code 0"`
(no "on E2B", no parens).

**Backend — `executor/processor.py` + `stream_registry.py` (OPEN-3096
#4a)**: wrap the consumer `async for` in `try/finally: await
stream.aclose()`. Close now propagates through `stream_and_publish` into
`stream_chat_completion_sdk`, whose existing `except GeneratorExit →
lock.release()` releases the Redis lock immediately on cancel. Stream
types tightened to `AsyncGenerator[StreamBaseResponse, None]` so the
defensive `getattr(stream, "aclose", None)` goes away.

**Backend — `session_cleanup.py` (OPEN-3096 #4b)**: new
`prune_orphan_tool_calls()` helper walks the trailing session tail and
drops any trailing assistant row whose `tool_calls` have unresolved ids
(plus everything after it) and any trailing `STOPPED_BY_USER_MARKER`
system-stop row. Single backward pass — tolerates the marker being
present or absent. Called from the existing turn-start cleanup in both
`sdk/service.py` and `baseline/service.py`; takes an optional
`log_prefix` so both paths emit the same INFO log when something was
popped. In-memory only — the DB save path is append-only via
`start_sequence`.

## Test plan

- [x] `pnpm exec vitest run src/app/(platform)/copilot/tools/GenericTool
src/app/(platform)/copilot/components/ChatMessagesContainer` — 105 pass
(6 new for GenericTool subtitle/description variants + legacy-fallback
case).
- [x] `pnpm format` / `pnpm lint` / `pnpm types` — clean.
- [x] `poetry run pytest
backend/copilot/sdk/session_persistence_test.py` — 17 pass (6 + 3 new
covering the orphan-tool-call prune and its optional-log-prefix branch).
- [x] `poetry run pytest backend/copilot/stream_registry_test.py
backend/copilot/executor/processor_test.py` — 19 pass (2 for aclose
propagation on the `stream_and_publish` wrapper, 2 for `_execute_async`
aclose propagation on both exit paths, 1 for publish_chunk RedisError
warning ladder).
- [x] `poetry run ruff check` / `poetry run pyright` on touched files —
clean.
- [x] Manual: fire a `bash_exec` — one labelled row, accordion
description reads sensibly (`completed` / `status code 1 · …` / `Timed
out after 120s`).
- [x] Manual: script that needs >120s — no longer clipped.
- [x] Manual: Stop mid-tool + edit + resend — Autopilot resumes without
"Another stream is already active" and without the vague SDK error.

## Scope note

Does not touch `splitReasoningAndResponse` — re-collapsing action tools
back into "Show steps" is #12813's responsibility.
2026-04-21 10:18:52 +07:00
Ubbe
f06b5293de fix(frontend/library): compute monthly spend for AgentBriefingPanel (#12854)
### Why / What / How

<img width="900" alt="Screenshot 2026-04-20 at 19 52 22"
src="https://github.com/user-attachments/assets/c30d5f18-2842-4a8a-ac3d-5bfee18fcd56"
/>

**Why:** The "Spent this month" tile in the Agent Briefing Panel on the
Library page always showed `$0`, even for users with real execution
usage. The tile is meant to give a quick sense of monthly spend across
all agents.

**What:** Compute `monthlySpend` from actual execution data and format
it as currency.

**How:**
- `useLibraryFleetSummary` now sums `stats.cost` (cents) across every
execution whose `started_at` falls within the current calendar month.
Previously `monthlySpend` was hardcoded to `0`.
- `FleetSummary.monthlySpend` is documented as being in cents
(consistent with backend + `formatCents`).
- `StatsGrid` now uses `formatCents` from the copilot usage helpers to
render the tile (e.g. `$12.34` instead of the broken `$0`).

### Changes 🏗️

-
`autogpt_platform/frontend/src/app/(platform)/library/hooks/useLibraryFleetSummary.ts`:
aggregate `stats.cost` across executions started in the current calendar
month; add `toTimestamp` and `startOfCurrentMonth` helpers.
-
`autogpt_platform/frontend/src/app/(platform)/library/components/AgentBriefingPanel/StatsGrid.tsx`:
format the "Spent this month" tile via shared `formatCents` helper.
- `autogpt_platform/frontend/src/app/(platform)/library/types.ts`:
document that `FleetSummary.monthlySpend` is in cents.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Load `/library` with the `AGENT_BRIEFING` flag enabled and at
least one completed execution in the current month — the "Spent this
month" tile shows the correct cumulative cost.
  - [ ] With no executions this month, the tile shows `$0.00`.
- [ ] Type-check (`pnpm types`), lint (`pnpm lint`), and integration
tests (`pnpm test:unit`) pass locally.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:28:47 +07:00
Zamil Majdy
70b591d74f fix(copilot): persist reasoning, split steps/reasoning UX, fix mid-turn promote stream stall (#12853)
## Why

Four related issues that surfaced when queued follow-ups hit an
extended_thinking turn:

1. **Mid-turn promote stalled the SSE stream.** `pollBackendAndPromote`
used `setMessages((prev) => [...prev, bubble])` — Vercel AI SDK's
`useChat` streams SSE deltas into `messages[-1]`, so once a user bubble
ended up there, every subsequent chunk silently landed on the wrong
message. Chat sat frozen until a page refresh, even though the backend's
stream completed cleanly.
2. **Thinking-only final turn looked identical to a frozen UI.** When
Claude's last LLM call after a tool_result produced only a
`ThinkingBlock` (no `TextBlock`, no `ToolUseBlock`), the response
adapter silently dropped it and the UI hung on "Thought for Xs" with no
response text.
3. **Reasoning was invisible.** `ThinkingBlock` was dropped live and
never persisted in a way the frontend could render — sessions on reload
/ shared links showed no thinking, a confusing UX gap ("display for
nothing").
4. **Cross-pod Redis replay dropped reasoning events.** The
`stream_registry._reconstruct_chunk` type map had no entries for
`reasoning-*` types, so any client that subscribed mid-stream (share,
reload, cross-pod) silently dropped them with `Unknown chunk type:
reasoning-delta`.

## What

### Mid-turn promote — splice before the trailing assistant

In `useCopilotPendingChips.ts::pollBackendAndPromote`:

```ts
setMessages((prev) => {
  const bubble = makePromotedUserBubble(drained, "midturn", crypto.randomUUID());
  const lastIdx = prev.length - 1;
  if (lastIdx >= 0 && prev[lastIdx].role === "assistant") {
    return [...prev.slice(0, lastIdx), bubble, prev[lastIdx]];
  }
  return [...prev, bubble];
});
```

Streaming assistant stays at `messages[-1]`, AI SDK deltas keep routing
correctly. `useHydrateOnStreamEnd` snaps the bubble to the DB-canonical
position when the stream ends.

### Reasoning — end-to-end visibility (live + persisted)

- **Wire protocol**: new `StreamReasoningStart` / `StreamReasoningDelta`
/ `StreamReasoningEnd` events matching AI SDK v5's `reasoning-*` wire
names, so `useChat` accumulates them into a `type: 'reasoning'`
UIMessage part natively.
- **Response adapter**: every `ThinkingBlock` now emits reasoning
events; text/tool_use transitions close the open reasoning block so AI
SDK doesn't merge distinct parts.
- **Stream registry**: added `reasoning-*` types to
`_reconstruct_chunk`'s type_to_class map so Redis replay no longer drops
them on cross-pod / reload / share.
- **Persistence** (new): each `StreamReasoningStart` opens a
`ChatMessage(role="reasoning")` row in `session.messages`; deltas
accumulate into its content; `StreamReasoningEnd` closes it. No schema
migration — `ChatMessage.role` is already `String`.
`extract_context_messages` filters `role="reasoning"` out of LLM context
(the `--resume` CLI session already carries thinking separately) so the
model never re-ingests prior reasoning.
- **Frontend conversion**: `convertChatSessionMessagesToUiMessages` maps
`role="reasoning"` DB rows into `{type: "reasoning", text}` parts on the
surrounding assistant bubble, so reload / shared-link sessions render
reasoning identically to live stream.

### Steps / Reasoning UX — modal + accordion split

- **`StepsCollapse`** (new): a Dialog-backed "Show steps" modal wraps
the pre-final-answer group (tool timeline + per-block reasoning). Modal
keeps the steps visually grouped and out of the reading flow.
- **`ReasoningCollapse`** (rewritten): inline accordion with "Show
reasoning" / "Hide reasoning" toggle — no longer a modal, so it expands
*inside* the Steps modal without stacking two dialogs. Reasoning text
appears indented with a left border.
- **`splitReasoningAndResponse`**: reasoning parts now stay in the
reasoning group (instead of being pinned out), so they show up inside
the Steps modal alongside the tool-use timeline.

### Thinking-only final turn — synthesize a closing line
(belt-and-suspenders)

- **Prompt rule** (`_USER_FOLLOW_UP_NOTE`): "Every turn MUST end with at
least one short user-facing text sentence."
- **Adapter fallback**: tracks `_text_since_last_tool_result`; at
`ResultMessage success` with tools run + zero text since, opens a fresh
step (`UserMessage` already closed the previous one) and injects `"(Done
— no further commentary.)"` before `StreamFinish`. Only fires for the
pathological case — pure-text turns untouched.

## Test plan

- [x] `pnpm vitest run` on copilot files — all 638 prior tests pass;
**17 new tests** added covering:
- `convertChatSessionToUiMessages`: reasoning row alone / merged with
assistant text / multi-row / empty skip / duration capture
- `ReasoningCollapse`: initial collapsed, toggle, `rotate-90`,
`aria-expanded`
  - `StepsCollapse`: trigger + dialog open renders children
- `MessagePartRenderer`: reasoning → `<pre>` inside collapse,
whitespace/missing text → null
  - `splitReasoningAndResponse`: reasoning-stays-in-reasoning regression
- [x] `poetry run pytest backend/copilot/sdk/response_adapter_test.py` —
36 pass (7 new: 4 reasoning streaming, 3 thinking-only fallback)
- [x] Manual: reasoning streams live and persists across reload on a
fresh session
- [x] Manual: previously-created sessions (pre-persistence) don't have
`role="reasoning"` rows — behaves as a clean no-op (no reasoning shown,
no error), new sessions render reasoning inside Steps modal

## Notes

- No DB migration — `ChatMessage.role` is already an open `String`;
`role="reasoning"` is simply filtered out of LLM context builds but
rendered by the frontend.
- Addresses /pr-review blockers: (a) stream_registry missing reasoning
types in Redis round-trip, (b) fallback text emitted outside a step, (c)
dead `case "thinking"` in renderer (now uses the live `reasoning` type
uniformly).
2026-04-19 10:37:04 +07:00
Zamil Majdy
b1c043c2d8 feat(copilot): queue follow-up messages on busy sessions (UI + run_sub_session + AutoPilot block) (#12737)
## Why

Users and tools can target a copilot session that already has a turn
running. Before this PR there was no uniform behaviour for that case —
the UI manually routed to a separate queue endpoint, `run_sub_session`
and the AutoPilot block raced the cluster lock, and in-turn follow-ups
only reached the model at turn-end via auto-continue. Outcome: dropped
messages, duplicate tool rows, missed mid-turn intent, latent
correctness bugs in block execution.

## What

A single "message arrived → turn already running?" primitive, shared by
every caller:

1. **POST `/stream`** (UI chat): self-defensive. Session idle → SSE as
today; session busy → `202 application/json` with `{buffer_length,
max_buffer_length, turn_in_flight}`. The deprecated `POST
/messages/pending` endpoint is removed (`GET /messages/pending` peek
stays).
2. **`run_copilot_turn_via_queue`** (shared primitive from #12841, used
by `run_sub_session` + `AutoPilotBlock`): gains the same busy-check.
Busy session → push to pending buffer, return `("queued",
SessionResult(queued=True, pending_buffer_length=N))` without creating a
stream registry session or enqueueing a RabbitMQ job. All callers
inherit queueing.
3. **Mid-turn delivery**: drained follow-ups are attached to every
tool_result's `additionalContext` via the SDK's `PostToolUse` hook —
covers both MCP and built-in tools (WebSearch/Read/Agent/etc.), not just
`run_block`. Claude reads the queued text on the next LLM round of the
same turn.
4. **UI observability**: chips promote to a proper user bubble at the
correct chronological position (after the tool_result row that consumed
them). Auto-continue handles end-of-turn drainage; mid-turn backend poll
handles the tool-boundary drainage path.

## How

**Data plane**
- `backend/copilot/pending_messages.py` — Redis list per session
(LPOP-count for atomic drain), TTL, fire-and-forget pub/sub notify. MAX
10 per session.
- `backend/copilot/pending_message_helpers.py` — `is_turn_in_flight`,
`queue_user_message`, `drain_and_format_for_injection`,
`persist_pending_as_user_rows` (shared persist+rollback used by both
baseline and SDK paths).
- `backend/data/redis_helpers.py` — centralised `incr_with_ttl`,
`capped_rpush`, `hash_compare_and_set`; every Lua script and pipeline
atomicity lives in one place.

**Injection sites**
- `backend/copilot/sdk/security_hooks.py::post_tool_use_hook` — drains +
returns `additionalContext`. Single hook covers built-in + MCP tools.
- `backend/copilot/sdk/service.py` — `StreamToolOutputAvailable`
dispatch persists the drained follow-up as a real user row right after
the tool_result (UI bubble at the right index).
`state.midturn_user_rows` keeps the CLI upload watermark honest.
- `backend/copilot/baseline/service.py` — same drain at round
boundaries, uses the shared `persist_pending_as_user_rows` helper so
baseline + SDK code paths don't diverge.

**Dispatch**
- `backend/copilot/sdk/session_waiter.py::run_copilot_turn_via_queue` —
`is_turn_in_flight` short-circuit; `SessionResult` gains `queued` +
`pending_buffer_length`; `SessionOutcome` gains `"queued"`.
- `backend/api/features/chat/routes.py::stream_chat_post` — busy-check
returns 202 with `QueuePendingMessageResponse`; `POST /messages/pending`
deleted.
- `backend/copilot/tools/run_sub_session.py` / `models.py` —
`SubSessionStatusResponse.status` gains `"queued"`;
`response_from_outcome` renders a clear queued-state message with the
pending-buffer depth and a link to watch live.
- `backend/blocks/autopilot.py::execute_copilot` — surfaces queued state
as descriptive response text + empty `tool_calls`/history when
`result.queued`.

**Frontend**
- `src/app/(platform)/copilot/useCopilotPendingChips.ts` — hook owning
the chip lifecycle: backend peek on session load, auto-continue
promotion when a second assistant id appears, mid-turn poll that
promotes when the backend count drops.
- `src/app/(platform)/copilot/useHydrateOnStreamEnd.ts` —
force-hydrate-waits-for-fresh-reference dance extracted.
- `src/app/(platform)/copilot/helpers/stripReplayPrefix.ts` — pure
function with drop / strip / streaming-catch-up cases + helper
decomposition.
- `src/app/(platform)/copilot/helpers/makePromotedBubble.ts` — one-line
helper for the promoted bubble shape.
- `src/app/(platform)/copilot/helpers/queueFollowUpMessage.ts` — thin
`fetch` wrapper for the 202 path (AI SDK's `useChat` fetcher only
handles SSE, so we can't reuse `sendMessage` for the queued response).

## Test plan

Backend unit + integration (`poetry run pytest backend/copilot
backend/api/features/chat`):
- [x] 107 tests pass — pending buffer, drain helpers, routes,
session_waiter queue branch, run_sub_session outcome rendering,
autopilot block
- [x] New `session_waiter_test.py` proves the queue branch
short-circuits `stream_registry.create_session` + `enqueue_copilot_turn`
- [x] Mid-turn persist has a rollback-and-re-queue path tested for when
`session.messages` persist silently fails to back-fill sequences

Frontend unit (`pnpm vitest run`):
- [x] 630 tests pass incl. 22 new for extracted helpers + hooks
- [x] Frontend coverage on touched copilot files: 91%+ (patch 87.37%)

Manual (once merged):
- [ ] Queue two chips while a tool is running; Claude acknowledges both
on the next round, UI shows bubbles in typing order after the tool
output
- [ ] Hand AutoPilot block an existing session_id that has a live turn;
block returns queued status, in-flight turn drains the message on its
next round
- [ ] `run_sub_session` against a busy sub — status=`queued`,
`sub_autopilot_session_link` lets user watch live

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-19 00:48:59 +07:00
Zamil Majdy
fcaebd1bb7 refactor(backend/copilot): unified queue-backed copilot turns + async sub-AutoPilot + guide-read gate (#12841)
### Why / What / How

**Why:** the 10-min stream-level idle timeout was killing legitimate
long-running tool calls — notably sub-AutoPilot runs via
`run_block(AutoPilotBlock)`, which routinely take 15–45 min. The symptom
users saw was `"A tool call appears to be stuck"` even though AutoPilot
was actively working. A second long-standing rough edge was shipped
alongside: agents often skipped `get_agent_building_guide` when
generating agent JSON, producing schemas that failed validation and
burned turns on auto-fix loops.

**What:** three threaded pieces.

1. **Async sub-AutoPilot via `run_sub_session`.** New copilot tool that
delegates a task to a fresh (or resumed) sub-AutoPilot, and its
companion `get_sub_session_result` for polling/cancelling. The agent
starts with `run_sub_session(prompt, wait_for_result≤300s)` and, if the
sub isn't done inside the cap, receives a handle + polls via
`get_sub_session_result(wait_if_running≤300s)`. No single MCP call ever
blocks the stream for more than 5 min, so the 10-min stream-idle timer
stays simple and effective (derived as `MAX_TOOL_WAIT_SECONDS * 2`).

2. **Queue-backed copilot turn dispatch** — one code path for all three
callers.
- `run_sub_session` enqueues a `CoPilotExecutionEntry` on the existing
`copilot_execution` exchange instead of spawning an in-process
`asyncio.Task`.
- `AutoPilotBlock.execute_copilot` (graph block) now uses the **same
queue** instead of `collect_copilot_response` inline.
   - The HTTP SSE endpoint was already queue-backed.
- All three share a single primitive: `run_copilot_turn_via_queue` →
`create_session` → `enqueue_copilot_turn` → `wait_for_session_result`.
The event-aggregation logic (`EventAccumulator`/`process_event`) is a
shared module used by both the direct-stream path and the cross-process
waiter.
- Benefits: **deploy/crash resilience** (RabbitMQ redelivery survives
worker restarts), **natural load balancing** across copilot_executor
workers, **sessions as first-class resources** (UI users can
`/copilot?sessionId=<inner>` into any sub or AutoPilot block's session),
and every future stream-level feature (pending-messages drain #12737,
compaction policies, etc.) applies uniformly instead of bypassing
graph-block sessions.

3. **Guide-read gate on agent-generation tools.** `create_agent` /
`edit_agent` / `validate_agent_graph` / `fix_agent_graph` refuse until
the session has called `get_agent_building_guide`. The pre-existing soft
hint was routinely ignored; the gate makes the dependency enforceable.
All four tool descriptions advertise the requirement in one tightened
sentence ("Requires get_agent_building_guide first (refuses
otherwise).") that stays under the 32000-char schema budget.

**How:**

#### Queue-backed sub-AutoPilot + AutoPilotBlock

- `sdk/session_waiter.py` — new module. `SessionResult` dataclass
mirrors `CopilotResult`. `wait_for_session_result` subscribes to
`stream_registry`, drains events via shared `process_event`, returns
`(outcome, result)`. `wait_for_session_completion` is the cheaper
outcome-only variant. `run_copilot_turn_via_queue` is the canonical
three-step dispatch. Every exit path unsubscribes the listener.
- `sdk/stream_accumulator.py` — new module. `EventAccumulator`,
`ToolCallEntry`, `process_event` extracted from `collect.py`. Both the
direct-stream and cross-process paths now use the same fold logic.
- `tools/run_sub_session.py` / `tools/get_sub_session_result.py` —
rewritten around the shared primitive. `sub_session_id` is now the sub's
`ChatSession` id directly (no separate registry handle). Ownership
re-verified on every call via `get_chat_session`. Cancel via
`enqueue_cancel_task` on the existing `copilot_cancel` fan-out exchange.
- `blocks/autopilot.py` — `execute_copilot` replaced its inline
`collect_copilot_response` with `run_copilot_turn_via_queue`.
`SessionResult` carries response text, tool calls, and token usage back
from the worker so no DB round-trip is needed. The block's public I/O
contract (inputs, outputs, `ToolCallEntry` shape) is unchanged.
- `CoPilotExecutionEntry` gains a `permissions: CopilotPermissions |
None` field forwarded to the worker's `stream_fn` so the sub's
capability filter survives the queue hop. The processor passes it
through to `stream_chat_completion_sdk` /
`stream_chat_completion_baseline`.
- **Deleted**: `sdk/sub_session_registry.py` (module-level dict,
done-callback, abandoned-task cap, `notify_shutdown_and_cancel_all`,
`_reset_for_test`), plus the shutdown-notifier hook in
`copilot_executor.processor.cleanup` — redundant under queue-backed
execution.

#### Run_block single-tool cap (3)

- `tools/helpers.execute_block` caps block execution at
`MAX_TOOL_WAIT_SECONDS = 5 min` via `asyncio.wait_for` around the
generator consumption.
- On timeout: logs `copilot_tool_timeout tool=run_block block=…
block_id=… input_keys=… user=… session=… cap_s=…` (grep-friendly) and
returns an `ErrorResponse` that redirects the LLM to `run_agent` /
`run_sub_session`.
- Billing protection: `_charge_block_credits` is called in a `finally`
guarded by `asyncio.shield` and marked `charge_handled` **before** the
await so cancel-mid-charge doesn't double-bill and
cancel-mid-generator-before-charge still settles via the finally.

#### Guide-read gate

- `helpers.require_guide_read(session, tool_name)` scans
`session.messages` for any prior assistant tool call named
`get_agent_building_guide` (handles both OpenAI and flat shapes).
Applied at the top of `_execute` in `create_agent`, `edit_agent`,
`validate_agent_graph`, `fix_agent_graph`. Tool descriptions advertise
the requirement.

#### Shared timing constants

- `MAX_TOOL_WAIT_SECONDS = 5 * 60` + `STREAM_IDLE_TIMEOUT_SECONDS = 2 *
MAX_TOOL_WAIT_SECONDS` in `constants.py`. Every long-running tool
(`run_agent`, `view_agent_output`, `run_sub_session`,
`get_sub_session_result`, `run_block`) imports from one place; no more
hardcoded 300 / `10*60` literals drifting apart. Stream-idle invariant
("no single tool blocks close to the idle timeout") holds by
construction.

### Frontend

- Friendlier tool-card labels: `run_sub_session` → "Sub-AutoPilot",
`get_sub_session_result` → "Sub-AutoPilot result", `run_block` →
"Action" (matches the builder UI's own naming), `run_agent` → "Agent".
Fixes the double-verb "Running Run …" phrasing.
- `SubSessionStatusResponse.sub_autopilot_session_link` surfaces
`/copilot?sessionId=<inner>` so users can click into any sub's session
from the tool-call card — same pattern as `run_agent`'s
`library_agent_link`.

### Changes 🏗️

- **New modules**: `sdk/session_waiter.py`, `sdk/stream_accumulator.py`,
`tools/run_sub_session.py`, `tools/get_sub_session_result.py`,
`tools/sub_session_test.py`, `tools/agent_guide_gate_test.py`.
- **New response types**: `SubSessionStatusResponse`,
`SubSessionProgressSnapshot`, `SessionResult`.
- **New gate helper**: `require_guide_read` in `tools/helpers.py`.
- **Queue protocol**: `permissions` field on `CoPilotExecutionEntry`,
threaded through `processor.py` → `stream_fn`.
- **Hidden**: `AUTOPILOT_BLOCK_ID` in `COPILOT_EXCLUDED_BLOCK_IDS`
(run_block can't execute AutoPilotBlock; agents use `run_sub_session`
instead).
- **Deleted**: `sdk/sub_session_registry.py`, processor
shutdown-notifier hook.
- **Regenerated**: `openapi.json` for the new response types; block-docs
for the updated `ToolName` Literal.
- **Tool descriptions**: tightened the guide-gate hint across the four
agent-builder tools to stay under the 32000-char schema budget.
- **40+ tests** across sub_session, execute_block cap + billing races,
stream_accumulator, agent_guide_gate, frontend helpers.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Unit suite green on the full copilot tree; `poetry run format` +
`pyright` clean
- [x] Schema character budget test passes (tool descriptions trimmed to
stay under 32000)
- [x] Native UI E2E (`poetry run app` + `pnpm dev`):
`run_sub_session(wait_for_result=60)` returns `status="completed"` +
`sub_autopilot_session_link` inline;
`run_sub_session(wait_for_result=1)` returns `status="running"` +
handle, `get_sub_session_result(wait_if_running=60)` observes `running →
completed` transition
- [x] AutoPilotBlock (graph) goes through `copilot_executor` queue
end-to-end (verified via logs: ExecutionManager's AutoPilotBlock node
spawned session `f6de335b-…`, a different `CoPilotExecutor` worker
acquired its cluster lock and ran the SDK stream)
- [x] Guide gate: `create_agent` without a prior
`get_agent_building_guide` returns the refusal; agent reads the guide
and retries successfully
2026-04-18 23:11:41 +07:00
Toran Bruce Richards
1c0c7a6b44 fix(copilot): add gh auth status check to Tool Discovery Priority section (#12832)
## Problem

The CoPilot system prompt contains a `gh auth status` instruction in the
E2B-specific `GitHub CLI` section, but models pattern-match to
`connect_integration` from the **Tool Discovery Priority** section —
which is where the actual decision to call an external service is made.

Because the GitHub auth check lives in a separate, later section, it's
not salient at the point of decision-making. This causes the model to
call `connect_integration(provider='github')` even when `gh` is already
authenticated via `GH_TOKEN`, unnecessarily prompting the user.

## Fix

Add a 3-line callout directly inside the **Tool Discovery Priority**
section:

```
> 🔑 **GitHub exception:** Before calling `connect_integration` for GitHub,
> always run `gh auth status` first. If it shows `Logged in`, proceed
> directly with `gh`/`git` — no integration connection needed.
```

This places the rule at the exact location where the model decides which
tool path to take, preventing the miss.

## Why this works

- **Placement over repetition**: The existing instruction isn't wrong —
it's just in the wrong spot relative to where the decision is made
- **Negative framing**: Explicitly says "before calling
`connect_integration`" which directly intercepts the incorrect reflex
- **Minimal change**: 4 lines added, zero removed

Co-authored-by: Toran Bruce Richards <22963551+Torantulino@users.noreply.github.com>
2026-04-17 15:22:10 +00:00
Joe Munene
3a01874911 fix(frontend/builder): preserve agent name in AgentExecutor node title after reload (#12805)
## Summary

Fixes #11041

When an `AgentExecutorBlock` is placed in the builder, it initially
displays the agent's name (e.g., "Researcher v2"). After saving and
reloading the page, the title reverts to the generic "Agent Executor."

## Root Cause

The backend correctly persists `agent_name` and `graph_version` in
`hardcodedValues` (via `input_default` in `AgentExecutorBlock`).
However, `NodeHeader.tsx` always resolves the display title from
`data.title` (the generic block name), ignoring the persisted agent
name.

## Fix

Modified the title resolution chain in `NodeHeader.tsx` to check
`data.hardcodedValues.agent_name` between the user's custom name and the
generic block title:

1. `data.metadata.customized_name` (user's manual rename) — highest
priority
2. `agent_name` + ` v{graph_version}` from `hardcodedValues` — **new**
3. `data.title` (generic block name) — fallback

This is a frontend-only change. No backend modifications needed.

## Files Changed

-
`autogpt_platform/frontend/src/app/(platform)/build/components/FlowEditor/nodes/CustomNode/components/NodeHeader.tsx`
(+11, -1)

## Test Plan

- [x] Place an AgentExecutorBlock, select an agent — title shows agent
name
- [x] Save graph, reload page — title still shows agent name (was "Agent
Executor" before)
- [x] Double-click to rename — custom name takes priority over agent
name
- [x] Clear custom name — falls back to agent name
- [x] Non-AgentExecutor blocks — unaffected, show generic title as
before

---------

Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
2026-04-17 15:20:32 +00:00
Zamil Majdy
6d770d9917 fix(platform/copilot): revert forward pagination, add visibility guarantee for blank chat (#12831)
## Why / What / How

**Why:** PR #12796 changed completed copilot sessions to load messages
from sequence 0 forward (ascending), which broke the standard chat UX —
users now land at the beginning of the conversation instead of the most
recent messages. Reported in Discord.

**What:** Reverts the forward pagination approach and replaces it with a
visibility guarantee that ensures every page contains at least one
user/assistant message.

**How:**
- **Backend**: Removed after_sequence, from_start, forward_paginated,
newest_sequence — always use backward (newest-first) pagination. Added
_expand_for_visibility() helper: after fetching, if the entire page is
tool messages (invisible in UI), expand backward up to 200 messages
until a visible user/assistant message is found.
- **Frontend**: Removed all forwardPaginated/newestSequence plumbing
from hooks and components. Removed bottom LoadMoreSentinel. Simplified
message merge to always prepend paged messages.

### Changes
- routes.py: Reverted to simple backward pagination, removed TOCTOU
re-fetch logic
- db.py: Removed forward mode, extracted _expand_tool_boundary() and
added _expand_for_visibility()
- SessionDetailResponse: Removed newest_sequence and forward_paginated
fields
- openapi.json: Removed after_sequence param and forward pagination
response fields
- Frontend hooks/components: Removed forward pagination props and logic
(-1000 lines)
- Updated all tests (backend: 63 pass, frontend: 1517 pass)

### Checklist
- [x] I have clearly listed my changes in the PR description
- [x] Backend unit tests: 63 pass
- [x] Frontend unit tests: 1517 pass
- [x] Frontend lint + types: clean
- [x] Backend format + pyright: clean
2026-04-17 19:23:28 +07:00
slepybear
334ec18c31 docs: convert in-code comments to MkDocs admonitions in block-sdk-gui… (#12819)
### Why / What / How

<!-- Why: Why does this PR exist? What problem does it solve, or what's
broken/missing without it? -->
This PR converts inline Python comments in code examples within
`block-sdk-guide.md` into MkDocs `!!! note` admonitions. This makes code
examples cleaner and more copy-paste friendly while preserving all
explanatory content.

<!-- What: What does this PR change? Summarize the changes at a high
level. -->
Converts inline comments in code blocks to admonitions following the
pattern established in PR #12396 (new_blocks.md) and PR #12313.

<!-- How: How does it work? Describe the approach, key implementation
details, or architecture decisions. -->
- Wrapped code examples with `!!! note` admonitions
- Removed inline comments from code blocks for clean copy-paste
- Added explanatory admonitions after each code block

### Changes 🏗️

- Provider configuration examples (API key and OAuth)
- Block class Input/Output schema annotations
- Block initialization parameters
- Test configuration
- OAuth and webhook handler implementations
- Authentication types and file handling patterns

### Checklist 📋

#### For documentation changes:
- [x] Follows the admonition pattern from PR #12396
- [x] No code changes, documentation only
- [x] Admonition syntax verified correct

#### For configuration changes:
- [ ] `.env.default` is updated or already compatible with my changes
- [ ] `docker-compose.yml` is updated or already compatible with my
changes

---

**Related Issues**: Closes #8946

Co-authored-by: slepybear <slepybear@users.noreply.github.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
2026-04-17 07:47:52 +00:00
slepybear
ea5cfdfa2e fix(frontend): remove debug console.log statements (#12823)
## Why
Debug console.log statements were left in production code, which can
leak
sensitive information and pollute browser developer consoles.

## What
Removed console.log from 4 non-legacy frontend components:
- useNavbar.ts: isLoggedIn debug log
- WalletRefill.tsx: autoRefillForm debug log  
- EditAgentForm.tsx: category field debug log
- TimezoneForm.tsx: currentTimezone debug log

## How
Simply deleted the console.log lines as they served no purpose 
other than debugging during development.

## Checklist
- [x] Code follows project conventions
- [x] Only frontend changes (4 files, 6 lines removed)
- [x] No functionality changes

Co-authored-by: slepybear <slepybear@users.noreply.github.com>
2026-04-17 07:31:51 +00:00
Ubbe
d13a85bef7 feat(frontend): surface scheduled agents in library & copilot briefings (#12818)
## Why

Scheduled agents weren't well-surfaced in the Library and Copilot
briefings:

- The Library fleet summary didn't count agents that are scheduled
purely via the scheduler (only those with a `recommended_schedule_cron`
set at the agent level).
- Sitrep items didn't distinguish scheduled or listening (trigger-based)
agents, so they often fell back to a generic "idle" state.
- Scheduled chips showed a generic message with no indication of when
the next run would happen.
- The Copilot Agent Briefing surfaced every scheduled agent regardless
of how far out the next run was — an agent scheduled a month away would
take a slot from something actually happening soon.
- Long sitrep messages overflowed the row.

## What

- Add `is_scheduled` to `LibraryAgent` (sourced from the scheduler) so
the frontend can reliably detect schedule-only agents.
- Count scheduled agents in `useLibraryFleetSummary`.
- Include scheduled and listening agents in sitrep items, with a
priority ordering (error → running → stale → success → listening →
scheduled → idle).
- Show a relative next-run time on scheduled sitrep chips (e.g.
"Scheduled to run in 2h" / "in 3d").
- Filter the Copilot Agent Briefing to scheduled agents whose next run
is within the next 3 days.
- Truncate long sitrep messages to 1 line with `OverflowText` and show
the full text in a tooltip on hover.

## How

- Scheduler → `LibraryAgent` mapping populates `is_scheduled` /
`next_scheduled_run`.
- `useSitrepItems` gains an optional `scheduledWithinMs` parameter.
Copilot's `usePulseChips` passes `3 * 24 * 60 * 60 * 1000`; the Library
briefing omits it to keep its existing (unbounded) behavior.
- Scheduled config-based sitrep items are skipped when
`next_scheduled_run` is missing or outside the window.
- `SitrepItem` wraps the message in `OverflowText` so a single-line
ellipsis + hover tooltip replaces raw overflow.

## Test plan

- [ ] `/library` — scheduled and listening agents appear in the sitrep
with accurate copy; fleet summary counts scheduled agents correctly;
long messages truncate with a tooltip on hover.
- [ ] `/copilot` — on an empty session with the `AGENT_BRIEFING` flag
on, the briefing only shows scheduled agents whose next run is within 3
days; agents scheduled further out no longer appear as "scheduled"
chips.
- [ ] Scheduled chip text reads "Scheduled to run in {Nm|Nh|Nd}"
matching `next_scheduled_run`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 14:36:15 +07:00
Zamil Majdy
60b85640e7 fix(backend/copilot): replace dedup lock with idempotent append_and_save_message (#12814)
## Why

The Redis dedup lock (`chat:msg_dedup:{session}:{content_hash}`, 30s
TTL) was solving the wrong problem:

- Its purpose: block infra/nginx retries from calling
`append_and_save_message` twice after a client disconnect, writing a
duplicate user message to the DB.
- The approach: deliberately hold the lock for 30s on `GeneratorExit`.
- Why unnecessary: the executor's cluster lock already prevents
duplicate *execution*. The only real gap was duplicate *DB writes* in
the ~1s before the executor picks up the turn.

## What

- **Deleted** `message_dedup.py` and `message_dedup_test.py` (~150 lines
removed).
- **Removed** all dedup lock code from `routes.py` (~40 lines removed).
- **`append_and_save_message`** is now idempotent and self-contained:
- Uses redis-py's built-in `Lock(timeout=10, blocking_timeout=2)` —
Lua-script atomic acquire/release, no manual poll/sleep loop.
- Lock context manager yields `bool` (`True` = acquired, `False` =
degraded). When degraded (Redis down or 2s timeout), reads from DB
directly instead of cache to avoid stale-state duplicates.
- Idempotency check: if `session.messages[-1]` already matches the
incoming role+content, returns `None` instead of the session.
- Lock released explicitly as soon as the write completes; `try/except`
in `finally` so a cleanup error after a successful write never surfaces
a false 500.
- On cache-write failure, the stale cache entry is invalidated so future
reads fall back to the authoritative DB.
- **`routes.py`** uses the `None` signal: `is_duplicate_message = (await
append_and_save_message(...)) is None`
- Skips `create_session` and `enqueue_copilot_turn` for duplicates —
client re-attaches to the existing turn's Redis stream.
- `track_user_message` and `turn_id` generation only happen when
`is_duplicate_message` is false.
- **`subscribe_to_session`** retry window increased from 1×50ms to
3×100ms — covers the window where a duplicate request subscribes before
the original's `create_session` hset completes.
- **Cleaned up** `routes_test.py`: removed 5 dedup-specific tests and
the `mock_redis` setup from `_mock_stream_internals`; added
duplicate-skips-enqueue test.

## How

The idempotency guard distinguishes legit same-text messages from
retries via the **assistant turn between them**: if the user said "yes",
got a response, and says "yes" again, `session.messages[-1]` is the
assistant reply, so the role check fails and the second message goes
through. A retry (no response yet) sees the user message as the last
entry and is blocked.

```python
if (
    session.messages
    and session.messages[-1].role == message.role
    and session.messages[-1].content == message.content
):
    return None  # duplicate — caller skips enqueue
```

The Redis lock ensures this check always sees authoritative state even
in multi-replica deployments. When the lock is unavailable (Redis down
or contention), reading from DB directly (bypassing potentially stale
cache) provides the same safety guarantee at the cost of a DB
round-trip.

## Checklist

- [x] PR targets `dev`
- [x] Conventional commit title with scope
- [x] Tests added/updated (duplicate detection, lock degradation, DB
error, cache invalidation paths)
- [x] `poetry run format` and `poetry run pyright` pass clean
- [x] No new linter suppressors
2026-04-16 22:12:30 +07:00
Zamil Majdy
87e4d42750 fix(backend/copilot): fix initial load missing messages + forward pagination for completed sessions (#12796)
### Why / What / How

**Why:** Completed copilot sessions with many messages showed a
completely empty chat view. A user reported a 158-message session that
appeared blank on reload.

**What:** Two bugs fixed:
1. **Backend** — initial page load always returned the newest 50
messages in DESC order. For sessions heavy in tool calls, the user's
original messages (seq 0–5) were never included; all 50 slots consumed
by mid-session tool outputs.
2. **Frontend** — convertChatSessionToUiMessages silently dropped user
messages with null/empty content.

**How:** For completed sessions (no active stream), the backend now
loads from sequence 0 in ASC order. Active/streaming sessions keep
newest-first for streaming context. A new after_sequence forward cursor
enables infinite-scroll for subsequent pages (sentinel moves to bottom).
The frontend wires forward_paginated + newest_sequence end-to-end.

### Changes 🏗️

- db.py: added from_start (ASC) and after_sequence (forward cursor)
modes; added newest_sequence to PaginatedMessages
- routes.py: detect completed vs active on initial load; pass
from_start=True for completed; expose newest_sequence +
forward_paginated; accept after_sequence param
- convertChatSessionToUiMessages.ts: never drop user messages with empty
content
- useLoadMoreMessages.ts: forward pagination via after_sequence; append
pages to end
- ChatMessagesContainer.tsx: LoadMoreSentinel at bottom for
forward-paginated sessions
- Wire newestSequence + forwardPaginated end-to-end through
useChatSession/useCopilotPage/ChatContainer
- openapi.json: add after_sequence + newest_sequence/forward_paginated;
regenerate types
- db_test.py: 9 new unit tests for from_start and after_sequence modes

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Open a completed session with many messages — first user message
visible on initial load
- [x] Scroll to bottom of completed session — load more appends next
page
- [x] Open active/streaming session — newest messages shown first,
streaming unaffected
  - [x] Backend unit tests: all 28 pass
  - [x] Frontend lint/format: clean, no new type errors

---------

Co-authored-by: chernistry <73943355+chernistry@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
2026-04-16 14:16:54 +00:00
Ubbe
0339d95d12 fix(frontend): small UI fixes, sort menu bg, name update auth, stats grid overflow, pulse chips (#12815)
## Summary
- **LibrarySortMenu / AgentFilterMenu**: Force `!bg-transparent` and
neutralise legacy `SelectTrigger` styles (`m-0.5`, `ring-offset-white`,
`shadow-sm`) that caused a white background around the trigger
- **EditNameDialog**: Replace client-side `supabase.auth.updateUser()`
with server-side `PUT /api/auth/user` route — fixes "Auth session
missing!" error caused by `httpOnly` cookies being inaccessible to
browser JS
- **StatsGrid**: Swap label `Text` for `OverflowText` so tile labels
truncate with `…` and show a tooltip instead of wrapping when the grid
is squeezed
- **PulseChips**: Set fixed `15rem` chip width with `shrink-0`,
horizontal scroll, and styled thin scrollbar
- **Tests**: Updated `EditNameDialog` tests to use MSW instead of
mocking Supabase client; added 7 new `PulseChips` integration tests

## Test plan
- [x] `pnpm test:unit` — all 1495 tests pass (91 files)
- [x] `pnpm format && pnpm lint` — clean
- [x] `pnpm types` — no new errors (pre-existing only)
- [ ] QA `/library?sort=updatedAt` — sort menu trigger has no white bg
- [ ] QA `/library` — StatsGrid labels truncate with tooltip on narrow
viewports
- [ ] QA `/copilot` — PulseChips scroll horizontally at fixed width
- [ ] QA `/copilot` — Edit name dialog saves successfully (no "Auth
session missing!")

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 20:11:21 +07:00
Toran Bruce Richards
f410929560 feat(platform): Add xAI Grok 4.20 models from OpenRouter (#12620)
Requested by @Torantulino

Adds the 2 xAI Grok 4.20 models available on OpenRouter that are missing
from the platform.

## Why

`x-ai/grok-4.20` and `x-ai/grok-4.20-multi-agent` are xAI's current
flagship models (released March 2026) and are available via OpenRouter,
but weren't accessible from the platform's LLM blocks.

## Changes

**`autogpt_platform/backend/backend/blocks/llm.py`**
- Added `GROK_4_20` and `GROK_4_20_MULTI_AGENT` enum members
- Added corresponding `MODEL_METADATA` entries (open_router provider, 2M
context window, price tier 3)

**`autogpt_platform/backend/backend/data/block_cost_config.py`**
- Added `MODEL_COST` entries at 5 credits each (flagship tier, $2/M in)

**`docs/integrations/block-integrations/llm.md`**
- Added new model IDs to all LLM block tables

| Model | Pricing | Context |
|-------|---------|---------|
| `x-ai/grok-4.20` | $2/M in, $6/M out | 2M |
| `x-ai/grok-4.20-multi-agent` | $2/M in, $6/M out | 2M |

Both models use the standard OpenRouter chat completions API — no
special handling needed.

Resolves: SECRT-2196

---------

Co-authored-by: Torantulino <22963551+Torantulino@users.noreply.github.com>
Co-authored-by: Toran Bruce Richards <Torantulino@users.noreply.github.com>
Co-authored-by: Otto (AGPT) <otto@agpt.co>
2026-04-16 12:14:56 +00:00
Zamil Majdy
2bbec09e1a feat(platform): subscription tier billing via Stripe Checkout (#12727)
## Why

Introducing paid subscription tiers (PRO, BUSINESS) so we can charge for
AutoPilot capacity beyond the free tier. Without a billing integration,
all users share the same rate limits regardless of their willingness to
pay for additional capacity.

## What

End-to-end subscription billing system using Stripe Checkout Sessions:

**Backend:**
- `SubscriptionTier` enum (`FREE`, `PRO`, `BUSINESS`, `ENTERPRISE`) on
the `User` model
- `POST /credits/subscription` — creates a Stripe Checkout Session for
paid upgrades; for FREE tier or when `ENABLE_PLATFORM_PAYMENT` is off,
sets tier directly
- `GET /credits/subscription` — returns current tier, monthly cost
(cents), and all tier costs
- `POST /credits/stripe_webhook` — handles
`customer.subscription.created/updated/deleted`,
`checkout.session.completed`, `charge.dispute.*`, `refund.created`
- `sync_subscription_from_stripe()` — keeps `User.subscriptionTier` in
sync from webhook events; guards against out-of-order delivery
(cancelled event after new sub created), ENTERPRISE overwrite, and
duplicate webhook replay
- Open-redirect protection on `success_url`/`cancel_url` via
`_validate_checkout_redirect_url()`
- `_cancel_customer_subscriptions()` — cancels both active and trialing
subs; propagates errors so callers can avoid updating DB tier on Stripe
failure
- `_cleanup_stale_subscriptions()` — best-effort cancellation of old
subs when a new one becomes active (paid-to-paid upgrade), to prevent
double-billing
- `get_stripe_customer_id()` with idempotency key to prevent duplicate
Stripe customers on concurrent requests
- `cache_none=False` sentinel fix in `@cached` decorator so Stripe price
lookups retry on transient error instead of poisoning the cache with
`None`
- Stripe Price IDs read from LaunchDarkly (`stripe-price-id-pro`,
`stripe-price-id-business`). If not configured, upgrade returns 422.

**Frontend:**
- `SubscriptionTierSection` component on billing page: tier cards
(FREE/PRO/BUSINESS), upgrade/downgrade buttons, per-tier cost display,
Stripe redirect on upgrade
- Confirmation dialog for downgrades
- ENTERPRISE users see a read-only admin-managed banner
- Success toast on return from Stripe Checkout (`?subscription=success`)
- Uses generated `useGetSubscriptionStatus` /
`useUpdateSubscriptionTier` hooks

## How

- Paid upgrades use Stripe Checkout Sessions (not server-side
subscription creation) — Stripe handles PCI-compliant card collection
and the subscription lifecycle
- Tier is synced back via webhook on
`customer.subscription.created/updated/deleted`
- Downgrade to FREE cancels via Stripe API immediately; a
`stripe.StripeError` during cancellation returns 502 with a generic
message (no Stripe detail leakage)
- LaunchDarkly flags: `stripe-price-id-pro` (string),
`stripe-price-id-business` (string), `enable-platform-payment` (bool)
- `ENABLE_PLATFORM_PAYMENT=false` bypasses Stripe for beta/internal
access (sets tier directly)

## Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `ENABLE_PLATFORM_PAYMENT=false` → tier change updates directly, no
Stripe redirect
- [x] `ENABLE_PLATFORM_PAYMENT=true` with price IDs configured → paid
upgrade redirects to Stripe Checkout
- [x] Stripe webhook `customer.subscription.created` →
`User.subscriptionTier` updated
  - [x] Unrecognised price ID in webhook → logs warning, tier unchanged
  - [x] ENTERPRISE user webhook event → tier not overwritten
  - [x] Empty `STRIPE_WEBHOOK_SECRET` → 503 (prevents HMAC bypass)
  - [x] Open-redirect attack on `success_url`/`cancel_url` → 422

#### For configuration changes:
- [x] No `.env` or `docker-compose.yml` changes
- [x] LaunchDarkly flags to create: `stripe-price-id-pro` (string),
`stripe-price-id-business` (string), `enable-platform-payment` (bool)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: majdyz <majdy.zamil@gmail.com>
2026-04-16 17:52:06 +07:00
Ubbe
31b88a6e56 feat(frontend): add Agent Briefing Panel (#12764)
## Summary

<img width="800" height="772" alt="Screenshot_2026-04-13_at_18 29 19"
src="https://github.com/user-attachments/assets/3da6eaf2-1485-4c08-9651-18f2f4220eba"
/>
<img width="800" height="285" alt="Screenshot_2026-04-13_at_18 29 24"
src="https://github.com/user-attachments/assets/6a5f981a-1e1d-4d22-a33d-9e1b0e7555a7"
/>
<img width="800" height="288" alt="Screenshot_2026-04-13_at_18 29 27"
src="https://github.com/user-attachments/assets/f97b4611-7c23-4fc9-a12d-edf6314a77ef"
/>
<img width="800" height="433" alt="Screenshot_2026-04-13_at_18 29 31"
src="https://github.com/user-attachments/assets/e6d7241d-84f3-4936-b8cd-e0b12df392bb"
/>
<img width="700" height="554" alt="Screenshot_2026-04-13_at_18 29 40"
src="https://github.com/user-attachments/assets/92c08f21-f950-45cd-8c1d-529905a6e85f"
/>


Implements the Agent Intelligence Layer — real-time agent awareness
across the Library and Copilot pages.

### Core Features
- **Agent Briefing Panel** — stats grid with fleet-wide counts (running,
recently completed, needs attention, scheduled, idle, monthly spend) and
tab-driven content below
- **Enhanced Library Cards** — StatusBadge, run counts, contextual
action buttons (See tasks, Start, Chat) with consistent icon-left
styling
- **Situation Report Items** — prioritized sitrep with error-first
ranking, "See task" deep-links for completed runs, and "Ask AutoPilot"
bridge
- **Home Pulse Chips** — agent status chips on Copilot empty state with
hover-reveal actions (slide-up animation + backdrop blur on desktop,
always visible on touch)
- **Edit Display Name** — pencil icon on Copilot greeting to update
Supabase user metadata inline

### Backend
- **Execution count API** — batch `COUNT(*)` query on
`AgentGraphExecution` grouped by `agentGraphId` for the current user,
avoiding loading full execution rows. Wired into `list_library_agents`
and `list_favorite_library_agents` via `execution_count_override` on
`LibraryAgent.from_db()`

### UI Polish
- Subtler gradient on AgentBriefingPanel (reduced opacity on background
+ animated border)
- Consistent button styles across all action buttons (icon-left, same
sizing)
- Removed duplicate "Open in builder" menu item (kept "Edit agent")
- "Recently completed" tab replaces "Listening" in briefing panel,
showing agents with completed runs in last 72h

## Changes

### Backend
- `backend/api/features/library/db.py` — added
`_fetch_execution_counts()` batch COUNT query, wired into list endpoints
- `backend/api/features/library/model.py` — added
`execution_count_override` param to `LibraryAgent.from_db()`

### Frontend — New files
- `EditNameDialog/EditNameDialog.tsx` — modal to update display name via
Supabase auth
- `PulseChips/PulseChips.module.css` — hover-reveal animation + glass
panel styles

### Frontend — Modified files
- `EmptySession.tsx` — added EditNameDialog and PulseChips
- `PulseChips.tsx` — redesigned with See/Ask buttons, hover overlay on
desktop
- `usePulseChips.ts` — added agentID for deep-linking
- `AgentBriefingPanel.tsx` — subtler gradient, adjusted padding
- `AgentBriefingPanel.module.css` — reduced conic gradient opacity
- `BriefingTabContent.tsx` — added "completed" tab routing
- `StatsGrid.tsx` — replaced Listening with Recently completed,
reordered tabs
- `SitrepItem.tsx` — consistent button styles, "See task" link for
completed items, updated copilot prompt
- `ContextualActionButton.tsx` — icon-left, smaller icon, renamed Run to
Start
- `LibraryAgentCard.tsx` — icon-left on all buttons, EyeIcon for See
tasks
- `AgentCardMenu.tsx` — removed duplicate "Open in builder"
- `useAgentStatus.ts` — added completed count to FleetSummary
- `useLibraryFleetSummary.ts` — added recent completion tracking
- `types.ts` — added `completed` to FleetSummary and AgentStatusFilter

## Test plan
- [ ] Library page renders Agent Briefing Panel with stats grid
- [ ] "Recently completed" tab shows agents with completed runs in last
72h
- [ ] Agent cards show real execution counts (not 0)
- [ ] Action buttons have consistent styling with icon on the left
- [ ] "See task" on completed items deep-links to agent page with
execution selected
- [ ] "Ask AutoPilot" generates last-run-specific prompt for completed
items
- [ ] Copilot empty state shows PulseChips with hover-reveal actions on
desktop
- [ ] PulseChips show See/Ask buttons always on touch screens
- [ ] Pencil icon on greeting opens edit name dialog
- [ ] Name update persists via Supabase and refreshes greeting
- [ ] `pnpm format && pnpm lint && pnpm types` pass
- [ ] `poetry run format` passes for backend changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: John Ababseh <jababseh7@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Bentlybro <Github@bentlybro.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
Co-authored-by: majdyz <zamil.majdy@agpt.co>
2026-04-16 17:32:17 +07:00
Zamil Majdy
d357956d98 refactor(backend/copilot): make session-file helper fns public to fix Pyright warnings (#12812)
## Why
After PR #12804 was squashed into dev, two module-level helper functions
in `backend/copilot/sdk/service.py` remained private (`_`-prefixed)
while being directly imported by name in `sdk/transcript_test.py`.
Pyright reports `reportAttributeAccessIssue` when tests (even those
excluded from CI lint) import private symbols from outside their
defining module.

## What
Rename two helpers to remove the underscore prefix:
- `_process_cli_restore` → `process_cli_restore`
- `_read_cli_session_from_disk` → `read_cli_session_from_disk`

Update call sites in `service.py` and imports/calls/docstrings in
`sdk/transcript_test.py`.

## How
Pure rename — no logic change. Both functions were already module-level
helpers with no reason to be private; the underscore was convention
carried over during the refactor but they are directly unit-tested and
should be public.

All 66 `sdk/transcript_test.py` tests pass after the rename.

## Checklist
- [x] Tests pass (`poetry run pytest
backend/copilot/sdk/transcript_test.py`)
- [x] No `_`-prefixed symbols imported across module boundaries
- [x] No linter suppressors added
2026-04-16 17:00:02 +07:00
Zamil Majdy
697ffa81f0 fix(backend/copilot): update transcript_test to use strip_for_upload after upload_cli_session removal 2026-04-16 16:17:02 +07:00
Zamil Majdy
2b4727e8b2 chore: merge master into dev, resolve baseline/transcript conflicts
Conflicts in baseline/service.py, baseline/transcript_integration_test.py,
and transcript.py arose because dev-only commit 0cd0a76305
(baseline upload fix) overlapped with the same fix in PR #12804 which
landed in master. Took master's version for all three files — it is the
complete, reviewed implementation.
2026-04-16 15:38:46 +07:00
Zamil Majdy
0d4b31e8a1 refactor(backend/copilot): unified transcript context — extract_context_messages, mode-gated --resume, compaction-aware gap-fill (#12804)
### Why / What / How

**Why:** The copilot had two separate GCS paths (`cli-sessions/` and
`chat-transcripts/`), redundant function names
(`upload_cli_session`/`restore_cli_session`), and no shared context
strategy between modes. When switching from baseline→SDK or
SDK→baseline, the receiving mode discarded the stored transcript and
fell back to full DB reconstruction — loading all raw messages instead
of the compacted form — causing inflated context, wasted tokens, and
loss of CLI compaction summaries.

**What:**
- Single GCS path (`cli-sessions/`) for both modes — `chat-transcripts/`
removed
- Unified public API: `upload_transcript` / `download_transcript` /
`TranscriptDownload`
- `TranscriptMode = Literal["sdk", "baseline"]` persisted in
`.meta.json` — SDK skips `--resume` when `mode != "sdk"`
(baseline-written JSONL has stripped fields / synthetic IDs)
- `extract_context_messages(download, session_messages)` — shared
context primitive used by **both SDK and baseline**: reads compacted
transcript content + fills only the DB gap (messages after watermark),
so CLI compaction summaries are preserved across mode switches
- Watermark fix: `_jsonl_covered = transcript_msg_count + 2` when a real
transcript is present, preventing false gap detection after `--resume`
- Baseline gap-fill: `_append_gap_to_builder` converts `ChatMessage` →
JSONL entries; no more silently discarded stale transcripts

**How:**

```
SDK turn (mode="sdk" transcript available):
  ──► --resume  [full CLI session restored natively]
  ──► inject gap prefix if DB has messages after watermark

SDK turn (mode="baseline" transcript available):
  ──► cannot --resume (synthetic CLI IDs)
  ──► extract_context_messages(download, session_messages):
        returns transcript JSONL (compacted, isCompactSummary preserved) + gap
        excludes session_messages[-1] (current turn — caller injects it separately)
  ──► format as <conversation_history> + "Now, the user says: {current}"

Baseline turn (any transcript):
  ──► _load_prior_transcript → TranscriptDownload
  ──► extract_context_messages(download, session_messages) + session_messages[-1]
        replaces full session.messages DB read
  ──► LLM messages: [compacted history + gap] + [current user turn]

Transcript unavailable — both SDK (use_resume=False) and baseline:
  ──► extract_context_messages(None, session_messages) returns session_messages[:-1]
        (all prior DB messages except the current user turn at [-1])
  ──► graceful fallback — no crash, no empty context
  ──► covers: first turn, GCS error, corrupt JSONL, missing .meta.json
  ──► next successful response uploads a fresh transcript
```

`extract_context_messages` is the shared primitive — both modes call the
same function, which handles:
- `download=None` (first turn, GCS unavailable) → falls back to
`session_messages[:-1]`
- Empty/corrupt content → falls back to `session_messages[:-1]`
- `bytes` content (raw GCS) or `str` content (pre-decoded baseline path)
- `isCompactSummary=True` entries → preserved so CLI compaction survives
mode switches
- Missing/corrupt `.meta.json` → `message_count` defaults to `0`, `mode`
defaults to `"sdk"`

**Why `[:-1]` and not all messages?** `session_messages[-1]` is always
the current user turn being handled right now. Both callers inject it
separately — SDK wraps it as `"Now, the user says: ..."`, baseline
appends it as the final message in the LLM array. Returning it inside
`extract_context_messages` would double-inject it.

### Changes 🏗️

- **`transcript.py`**: `CliSessionRestore` → `TranscriptDownload` +
`mode` field; `upload_cli_session` → `upload_transcript`;
`restore_cli_session` → `download_transcript`; add `TranscriptMode`,
`detect_gap`, `extract_context_messages`; import `ChatMessage` via
relative path to match `service.py` style
- **`sdk/service.py`**: mode-check before `--resume`; `_RestoreResult`
carries `baseline_download` + `context_messages` + `transcript_content`;
`_build_query_message` accepts `prior_messages` override;
`_restore_cli_session_for_turn` populates `context_messages` via
`extract_context_messages` and sets `transcript_content` to prevent
duplicate DB reconstruction; watermark fix (`_jsonl_covered =
transcript_msg_count + 2`)
- **`baseline/service.py`**: `_load_prior_transcript` returns `(bool,
TranscriptDownload | None)`; LLM context replaced with
`extract_context_messages(download, messages)`; `_append_gap_to_builder`
+ `detect_gap` call; `upload_transcript(mode="baseline")`
- **`sdk/transcript.py`**: updated re-exports, old aliases removed
- **`scripts/download_transcripts.py`**: updated for `bytes | str`
content type
- **Test files**: 179 tests total; `transcript_test.py`,
`baseline/transcript_integration_test.py`,
`sdk/service_helpers_test.py`, `sdk/test_transcript_watermark.py`,
`test/copilot/test_transcript_watermark.py` all updated/added

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] 179 unit tests pass — `transcript_test`,
`baseline/transcript_integration_test`, `sdk/service_helpers_test`,
`sdk/test_transcript_watermark`
  - [x] pyright 0 errors on all changed files
- [x] SDK `--resume` path still works when `mode="sdk"` transcript is
present
- [x] SDK fallback uses `extract_context_messages` (compacted baseline
content + gap) when `mode="baseline"` transcript is stored — no more
full DB reconstruction
- [x] Baseline uses `extract_context_messages` per turn instead of full
`session.messages` DB read
  - [x] `isCompactSummary=True` entries preserved across mode switches
- [x] Watermark (`_jsonl_covered`) fix prevents false gap detection
after `--resume`
- [x] Baseline gap detection no longer silently discards stale
transcripts
- [x] `TranscriptDownload.content` accepts `bytes | str` — backward
compatible
- [x] Transcript unavailable (GCS error, first turn, corrupt file)
gracefully falls back to `session_messages[:-1]` without crash — applies
to both SDK and baseline paths

---------

Co-authored-by: chernistry <73943355+chernistry@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
2026-04-16 15:35:18 +07:00
Zamil Majdy
0cd0a76305 fix(backend/copilot): baseline always uploads when GCS has no transcript
_load_prior_transcript was returning False for missing/invalid transcripts,
which caused should_upload_transcript to suppress the upload. The original
intent was to protect against overwriting a *newer* GCS version — but a
missing or corrupt file is not 'newer'. Only stale (watermark ahead) and
download errors (unknown GCS state) should suppress upload.

Also renames transcript_covers_prefix → transcript_upload_safe throughout
to accurately describe what the flag means.
2026-04-16 14:58:42 +07:00
Toran Bruce Richards
d01a51be0e Add check for GitHub account connection status (#12807)
Added instruction to check GitHub authentication status before prompting
user. This prevents repeated, unnecessary asking of the user to add
their GitHub credentials when they're already added, which is currently
a prevalent bug.

### Changes 🏗️
- Added one line to
`autogpt_platform/backend/backend/copilot/prompting.py` instructing
AutoPilot to run `gh auth status` before prompting the user to connect
their GitHub account.

Co-authored-by: Toran Bruce Richards <22963551+Torantulino@users.noreply.github.com>
2026-04-16 12:09:00 +07:00
chernistry
bd2efed080 fix(frontend): allow zooming out more in the builder (#12690)
Reduced minZoom on the builder canvas from 0.1 to 0.05 to allow zooming
out further when working with large agent graphs.

Fixes #9325

Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
2026-04-15 21:25:07 +00:00
Zamil Majdy
5fccd8a762 Merge branch 'master' of github.com:Significant-Gravitas/AutoGPT into dev 2026-04-16 01:23:07 +07:00
Zamil Majdy
2740b2be3a fix(backend/copilot): disable fallback model to fix prod CLI rejection (#12802)
### Why / What / How

**Why:** `fffbe0aad8` changed both `ChatConfig.model` and
`ChatConfig.claude_agent_fallback_model` to `claude-sonnet-4-6`. The
Claude Code CLI rejects this with `Error: Fallback model cannot be the
same as the main model`, causing every standard-mode copilot turn to
fail with exit code 1 — the session "completes" in ~30s but produces no
response and drops the transcript.

**What:** Set `claude_agent_fallback_model` default to `""`.
`_resolve_fallback_model()` already returns `None` on empty string,
which means the `--fallback-model` flag is simply not passed to the CLI.
On 529 overload errors the turn will surface normally instead of
silently retrying with a fallback.

**How:** One-line config change + test update.

### Changes 🏗️

- `ChatConfig.claude_agent_fallback_model` default:
`"claude-sonnet-4-6"` → `""`
- Update `test_fallback_model_default` to assert the empty default

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] `poetry run pytest backend/copilot/sdk/p0_guardrails_test.py`

#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
2026-04-16 01:22:20 +07:00
Zamil Majdy
d27d22159d Merge branch 'master' of github.com:Significant-Gravitas/AutoGPT into dev 2026-04-16 00:05:32 +07:00