WriteWorkspaceFileTool: when write_file rejects with "Storage limit exceeded",
append a recovery hint pointing at list_workspace_files + delete_workspace_file.
The agent reads this on the next turn and can offer to clean up before falling
back to "ask the user to upgrade", so the user doesn't have to leave the chat.
Upload route TOCTOU rollback: route the over-quota cleanup through
WorkspaceManager.delete_file instead of soft_delete_workspace_file. The latter
only flips isDeleted on the DB row and renames the path — the storage backend
blob would survive, leaking storage on every concurrent-upload race. delete_file
removes the blob first, then soft-deletes the DB record.
The 80% warning fired on every write inside the band, producing repeated log
entries during a single CoPilot turn that writes multiple files. The frontend
storage bar already conveys usage to the user, and ops alerting belongs on a
metric, not a log line that scales with write rate.
Also document `VirusDetectedError` / `VirusScanError` in `WorkspaceManager.write_file`
so callers can decide between explicit handling and generic `Exception` catches.
## Why
Two regressions surfaced after
[#12933](https://github.com/Significant-Gravitas/AutoGPT/pull/12933)
merged to `dev`:
1. **Cancel-pending banner shows wrong copy.** The merged PR moved
cancel-at-period-end from `BASIC` → `NO_TIER`, but
`PendingChangeBanner.isCancellation` was still keyed on `"BASIC"`. As a
result, a user who cancels their sub now sees *"Scheduled to downgrade
to No subscription on …"* instead of the intended *"Scheduled to cancel
your subscription on …"*. Caught by Sentry on the merged PR.
2. **Currency-mismatch downgrade returns 502 (looks like outage).** A
user with an existing GBP-active sub (Max Price has
`currency_options.gbp`) tried to downgrade to Pro and got 502. The
backend logs show:
```
stripe._error.InvalidRequestError: The price specified only supports
`usd`.
This doesn't match the expected currency: `gbp`.
```
The Pro Price is USD-only; Stripe rejects `SubscriptionSchedule.modify`
because phases must share currency. Wrapping that in a generic 502 hid
the real cause and made it read like a Stripe outage.
## What
* Frontend: flip `PendingChangeBanner.isCancellation` from `pendingTier
=== "BASIC"` to `"NO_TIER"`. Update both component and page-level tests
that exercised the cancellation branch.
* Backend: catch `stripe.InvalidRequestError` whose message mentions
`currency` in `update_subscription_tier`, and return **422** with *"Tier
change unavailable for your current billing currency. Cancel your
subscription and re-subscribe at the target tier, or contact support."*
— so users see the actual reason, not a misleading outage message. Other
`StripeError` paths still return 502.
* New backend test asserts the currency-mismatch branch returns 422 with
the new copy.
## How
* `PendingChangeBanner.tsx` line 28: 1-char change (`"BASIC"` →
`"NO_TIER"`).
* `subscription_routes_test.py` and `PendingChangeBanner.test.tsx`
updated to use `NO_TIER` for the cancellation fixture.
* `v1.py` `update_subscription_tier` adds a typed `except
stripe.InvalidRequestError` branch ahead of the generic `StripeError`;
only currency-mismatch messages get the special 422, everything else
falls through to the existing 502.
## The real fix lives in Stripe config
The defensive 422 here is just a clearer error surface. To actually
unblock GBP/EUR users from changing tiers, the per-tier Stripe Prices
(Pro, and Basic if priced) need `currency_options` for GBP added — Max
already has this, which is why Max checkout shows the £/$ toggle. Stripe
locks `currency_options` after a Price has been transacted, so the
procedure is: create a new Price with USD + GBP from the start → update
the `stripe-price-ids` LD flag to the new Price ID. No further code
change required; same Price ID stays per tier, multiple currencies
inside it.
## Checklist
- [x] Component test for new banner copy
- [x] Backend test for 422 currency-mismatch branch
- [x] Format / lint / types pass
- [x] No protected route added — N/A
- Change LD workspace storage validation from `< 0` to `<= 0` to prevent
a zero value from silently disabling quota enforcement
- Fix prettier formatting in UsagePanelContentRender test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Why
[#12723](https://github.com/Significant-Gravitas/AutoGPT/pull/12723)
wired Web Push fanout into `AsyncRedisNotificationEventBus.publish()` so
copilot completion events reach users with the tab closed. But the bus
is also used by `data/onboarding.py` for in-page step toasts, and those
started firing OS-level system notifications (`increment_runs`,
`step_completed`, etc.) — unwanted noise.
## What
Smallest possible patch: skip the OS push fanout when `payload.type ==
"onboarding"`. WebSocket delivery is unchanged.
## How
```python
async def publish(self, event: NotificationEvent) -> None:
await self.publish_event(event, event.user_id)
# Skip OS push for onboarding step toasts — those are in-page only.
# TODO: remove once the onboarding/wallet rework lands.
if event.payload.model_dump().get("type") == "onboarding":
return
...
```
Five-line addition in `backend/data/notification_bus.py`. Marked `TODO`
to remove once the upcoming onboarding/wallet rework decides per-event
whether a system notification is desired.
Tests: added `test_publish_skips_web_push_for_onboarding`; existing
fanout tests continue to validate the happy path with non-onboarding
payloads.
## Test plan
- [x] `poetry run format` (ruff + isort + black + pyright)
- [ ] CI: `poetry run pytest backend/data/notification_bus_test.py`
- [ ] Manual on dev: trigger onboarding step → confirm no OS
notification; finish copilot session → confirm OS notification still
fires.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Why
Started as a regression fix for admin-granted user downgrades hitting
Stripe Checkout, broadened to close the surrounding gaps in the Stripe
billing flow that surfaced during testing. Three concrete user-facing
problems the PR resolves:
1. **Admin-granted users couldn't change tier in-app** when their
current tier had no `stripe-price-id-*` LD configured — clicking
Downgrade silently routed to a paid-signup Stripe Checkout instead of
just changing the tier.
2. **Subscription payments granted nothing visible to users** — paying
£20–£320/mo gave higher rate-limit multipliers but no AutoPilot credits
in the user's balance, despite a dialog promising "credit to your next
Stripe invoice" (which users naturally read as AutoGPT credits).
3. **Tier oscillated across page refreshes** — `get_user_by_id` was
process-local cached, so dev's 4 server pods each held their own copy.
Tier could read MAX on one pod and BASIC on another for ~5 min after a
webhook update, depending on which pod the request landed on.
Plus three structural improvements caught during review:
4. **No paywall enforcement for paid-cohort users without subscription**
— non-beta users on `BASIC` (no Stripe sub) could freely use AutoPilot.
5. **Upgrade/downgrade dialog copy was misleading** — implied a Stripe
redirect that doesn't happen for existing-sub modifications, used
"credit" ambiguously, and didn't surface the next-invoice date.
6. **Top-up Checkout created an ephemeral Stripe Product per session** —
no canonical Product for dashboard reporting, no way to scope coupons to
top-ups.
## What
### 1. Admin-granted downgrades skip Checkout (price-id-pruning
regression)
`update_subscription_tier()` used to gate its modify-or-DB-flip block on
`current_tier_price_id is not None`. When a tier was pruned from
`stripe-price-ids` LD, that gate skipped the inner DB-flip branch and
the request fell through to Checkout — sending admin-granted users to a
paid-signup flow when they were trying to *reduce* their tier. Drop the
gate and call `modify_stripe_subscription_for_tier()` unconditionally —
the function self-reports `False` when there's no Stripe sub. One
uniform path for everyone now.
### 2. Subscription credit grant on every paid Stripe invoice
New `invoice.payment_succeeded` webhook handler at
[`credit.py:handle_subscription_payment_success`](autogpt_platform/backend/backend/data/credit.py)
adds a `GRANT` transaction equal to `invoice.amount_paid`, keyed by
`INVOICE-{id}` for idempotency (Stripe webhook retries cannot
double-grant). Initial signup, monthly renewal, and prorated upgrade
charges all surface as AutoGPT balance bumps the moment Stripe confirms
the charge. Skipped: non-subscription invoices, $0 invoices, ENTERPRISE
users.
### 3. Cross-pod user cache
[`user.py:31`](autogpt_platform/backend/backend/data/user.py#L31)
`cache_user_lookup = cached(maxsize=1000, ttl_seconds=300,
shared_cache=True)`. Single line — moves the cache to Redis so all
server pods read/write the same key. The existing
`get_user_by_id.cache_delete(user_id)` invalidations now propagate
cross-pod.
### 4. PaywallGate
New
[`PaywallGate`](autogpt_platform/frontend/src/app/(platform)/PaywallGate/PaywallGate.tsx)
wraps the `(platform)/layout.tsx` route group. When
`ENABLE_PLATFORM_PAYMENT === true` (paid cohort) AND `subscription.tier
=== "BASIC"`, redirects to `/profile/credits` where the credits page
shows a "Pick a plan to continue using AutoGPT" banner above the tier
picker.
Notes:
- **Beta cohort skips entirely** (flag off → `useGetSubscriptionStatus`
query disabled, no redirect).
- **Gates on DB tier, not `has_active_stripe_subscription`** — Sentry
caught that a transient Stripe API error in
`get_active_subscription_period_end()` would set `has_active=false` for
paying users, locking them out. The DB tier is set by webhooks and
persists locally; Stripe API hiccups don't flip it.
- **Exempt routes**: `/profile`, `/admin`, `/auth`, `/login`, `/signup`,
`/reset-password`, `/error`, `/unauthorized`, `/health`. Onboarding
lives in the sibling `(no-navbar)` group, so this gate doesn't conflict
with the in-flight onboarding-paywall integration.
### 5. Upgrade/downgrade dialog clarity
`SubscriptionStatusResponse` now exposes
`has_active_stripe_subscription: bool` and `current_period_end: int |
None`, computed via a new
[`get_active_subscription_period_end`](autogpt_platform/backend/backend/data/credit.py)
helper. Frontend dialogs branch on those:
**Upgrade — modify-in-place** (existing sub):
> Your subscription is upgraded to MAX immediately. On your next invoice
on May 21, 2026, your saved card is charged for the upgrade proration
since today plus the next month at the new rate, with the unused portion
of your current plan automatically deducted. Credits matching the paid
amount are added to your AutoGPT balance once Stripe confirms the
charge.
**Upgrade — Checkout** (no sub):
> You'll be redirected to Stripe to enter payment details and start your
MAX subscription. The first invoice's amount is added to your AutoGPT
balance once Stripe confirms the charge.
**Downgrade (paid → paid)**:
> Switching to PRO takes effect at the end of your current billing
period on May 21, 2026 — no charge today. You keep your current plan
until then. From that date your saved card is billed at the PRO rate,
and matching credits are added to your AutoGPT balance with each paid
invoice.
Toast wording on success matches dialog. Tier labels run through
`getTierLabel()` so we render "Pro/Max/Business" not "PRO/MAX/BUSINESS"
(Sentry-flagged in review).
### 6. Top-up Stripe Product ID via LD flag
New `STRIPE_PRODUCT_ID_TOPUP` LD flag. **Unset (default)** → legacy
inline `product_data` (Stripe creates an ephemeral product per Checkout
— backward-compatible with current behavior). **Set to a Stripe Product
ID** → line item references that Product so all top-ups group under one
entity in Stripe Dashboard reporting; per-session amount stays dynamic
via `price_data.unit_amount`. The two paths are mutually exclusive
(Stripe rejects `product` + `product_data` together).
## How
- Backend changes confined to
[`v1.py`](autogpt_platform/backend/backend/api/features/v1.py),
[`credit.py`](autogpt_platform/backend/backend/data/credit.py),
[`user.py`](autogpt_platform/backend/backend/data/user.py),
[`feature_flag.py`](autogpt_platform/backend/backend/util/feature_flag.py).
- Frontend changes: new
[`PaywallGate`](autogpt_platform/frontend/src/app/(platform)/PaywallGate/PaywallGate.tsx)
component + small edits to
[`(platform)/layout.tsx`](autogpt_platform/frontend/src/app/(platform)/layout.tsx),
`SubscriptionTierSection.tsx`, `useSubscriptionTierSection.ts`,
`helpers.ts`.
- Both backend and frontend pass `user.id` to LD context (verified in
[`feature_flag.py:_fetch_user_context_data`](autogpt_platform/backend/backend/util/feature_flag.py)
and
[`feature-flag-provider.tsx`](autogpt_platform/frontend/src/services/feature-flags/feature-flag-provider.tsx))
for proper per-user targeting.
### Out of scope (follow-ups)
- Hard-paywall onboarding integration (Lluis's work — coordinated;
PaywallGate wraps `(platform)/layout.tsx` and onboarding lives in
`(no-navbar)`, so they don't conflict).
- Beta-users-as-Stripe-trial migration.
- Max-cap usage alerting + "Contact us" routing.
- "No Active Subscription" state rename.
- "Your credits" → "Automation Credits" rename + helper tooltip.
- BASIC tier resurface as a free / cancel-subscription option
(deliberately deferred per current product direction).
## Test plan
### Backend (all green in CI)
- [x] `poetry run pytest
backend/api/features/subscription_routes_test.py` — 41 passed.
- [x] `poetry run pytest backend/data/credit_subscription_test.py`
covering: `handle_subscription_payment_success` (grants credits, skips
non-sub/zero/missing-customer/unknown-user/ENTERPRISE, idempotent on
retry), `get_active_subscription_period_end` (happy path, no-customer
short-circuit, Stripe error swallow), top-up Product ID flag both
branches.
- [x] Type-check (3.11/3.12/3.13) — green after explicit
`list[stripe.checkout.Session.CreateParamsLineItem]` typing on top-up
`line_items`.
- [x] Codecov patch — both backend + frontend green.
### Frontend (all green in CI)
- [x] `pnpm test:unit` — 2154/2154 pass, including 5 new PaywallGate
tests (beta-cohort skip, paid-cohort BASIC redirect, no-redirect for
PRO/MAX/BUSINESS, exempt-prefix matrix, loading-state guard) and updated
`formatCost`/dialog-copy assertions.
- [x] `pnpm types`, `pnpm format`, `pnpm lint` — clean.
### Live verification on `dev-builder.agpt.co` (5/5 pass — see PR
comments)
- [x] Login + credits page renders correctly with Pro + Max cards, BASIC
+ BUSINESS hidden, no paywall banner for active subscriber.
- [x] Downgrade dialog shows new copy with concrete date + "no charge
today" + credit-grant explanation.
- [x] PaywallGate does NOT redirect paying users (MAX tier with active
sub).
- [x] PaywallGate REDIRECTS BASIC user (DB-flipped via `kubectl exec`
for testing, restored after) → `/build` redirects to `/profile/credits`,
violet "Pick a plan to continue using AutoGPT" banner displayed.
- [x] Upgrade dialog (modify-in-place) shows the corrected proration
phrasing.
- [ ] Manual: real production-like test of `invoice.payment_succeeded`
granting credits — fires on next billing cycle (2026-05-21 for the dev
test user); not testable today without manipulating Stripe webhook.
### Why / What / How
**Why:** The onboarding `SubscriptionStep` (added in #12935) is
currently shown to every new user, but the platform payment system is
rolled out behind the `ENABLE_PLATFORM_PAYMENT` LaunchDarkly flag. We
need the onboarding plan-selection step to honor the same flag so users
in flag-off cohorts don't hit a payment surface that the rest of the
product won't support.
**What:** Conditionally render the `SubscriptionStep` based on
`ENABLE_PLATFORM_PAYMENT`. When the flag is off the wizard runs `Welcome
→ Role → PainPoints → Preparing` (3 user-interactive steps +
transition); when on, behavior is unchanged (`Welcome → Role →
PainPoints → Subscription → Preparing`).
**How:**
- `page.tsx` reads the flag, computes `totalSteps` (3 vs. 4) and
`preparingStep` (4 vs. 5), and only renders `SubscriptionStep` when the
flag is on.
- `useOnboardingPage.ts` threads the same `preparingStep` into the URL
`parseStep` clamp and into the "submit profile when entering Preparing"
effect, so both adapt to the flag state.
- The Zustand store is left unchanged — its hard `Math.min(5, …)` clamp
is unreachable in flag-off flow because PainPointsStep advances 3 → 4
(Preparing) and that's the terminal step.
- `playwright/utils/onboarding.ts`: with `NEXT_PUBLIC_PW_TEST=true`
LaunchDarkly returns `defaultFlags` (`ENABLE_PLATFORM_PAYMENT: false`),
so the helper now waits up to 2s for the Subscription header and only
clicks a plan CTA if the step is actually rendered.
### Changes 🏗️
- `autogpt_platform/frontend/src/app/(no-navbar)/onboarding/page.tsx` —
gate `SubscriptionStep` on `ENABLE_PLATFORM_PAYMENT`; derive
`totalSteps`/`preparingStep` from the flag.
-
`autogpt_platform/frontend/src/app/(no-navbar)/onboarding/useOnboardingPage.ts`
— make `parseStep` and the profile-submission effect respect the
flag-derived `preparingStep`.
- `autogpt_platform/frontend/src/playwright/utils/onboarding.ts` — make
the Subscription step optional in `completeOnboardingWizard` so E2E
works in both flag states.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [x] Existing onboarding unit tests pass (`pnpm test:unit` — 2447
passed, including `PainPointsStep`, `RoleStep`, `SubscriptionStep`,
store)
- [x] `pnpm format`, `pnpm lint`, `pnpm types` clean
- [ ] Manual: with flag **off**, walk onboarding and confirm wizard goes
Welcome → Role → PainPoints → Preparing → /copilot, progress bar shows 3
steps
- [ ] Manual: with flag **on** (LD or
`NEXT_PUBLIC_FORCE_FLAG_ENABLE_PLATFORM_PAYMENT=true`), walk onboarding
and confirm SubscriptionStep is present at step 4, progress bar shows 4
steps
- [ ] Manual: with flag **off**, hit `/onboarding?step=5` directly and
confirm it clamps back to step 1 (no orphan Subscription state)
- [ ] Playwright: `completeOnboardingWizard` E2E flow continues to pass
under default `NEXT_PUBLIC_PW_TEST=true` (flag off path)
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
(no config changes — flag already exists in LaunchDarkly +
`defaultFlags`)
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (none needed)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
Adds a new **Subscription Step** (Step 4) to the onboarding wizard,
allowing users to choose a plan (Pro, Max, or Team) before reaching the
"Preparing" step.
## Changes
### New files
- **`steps/SubscriptionStep.tsx`** — Full subscription UI with:
- Three plan cards (Pro $50/mo, Max $320/mo, Team — coming soon)
- Monthly / yearly billing toggle (yearly shows annual total with 20%
discount, plus monthly equivalent)
- Country selector (28 Stripe-supported countries) that opens upward as
a search modal
- Localized pricing using live exchange rates
- **`steps/countries.ts`** — Currency data module with exchange rates,
`formatPrice()` helper, and zero-decimal currency handling (JPY, KRW,
HUF, CLP)
### Modified files
- **`store.ts`** — Extended `Step` type to `1 | 2 | 3 | 4 | 5`, added
`selectedPlan` and `selectedBilling` state/actions
- **`page.tsx`** — Wired `SubscriptionStep` as Step 4, moved
`PreparingStep` to Step 5, adjusted progress bar and dot indicators
- **`useOnboardingPage.ts`** — Updated `parseStep` range to 1–5, profile
submission now triggers at Step 5
## Design decisions
- Follows existing component patterns: uses `FadeIn`, `Text`, `Button`
atoms, `cn()` utility, Phosphor icons
- Country selector opens **upward** to avoid clipping below the viewport
- Plan selection advances to Step 5 immediately (Stripe integration is
TODO)
- Exchange rates are hardcoded for now — should be fetched from an API
in production
## TODO
- [ ] Integrate with Stripe checkout / backend subscription API
- [ ] Fetch live exchange rates instead of hardcoded values
- [ ] Add responsive layout for mobile viewports
---------
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
Co-authored-by: Lluis Agusti <hi@llu.lu>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Ubbe <hi@ubbe.dev>
### Why / What / How
**Why:** When a user kicks off an AutoPilot task and leaves the platform
(closes the tab, switches to another page, or minimizes the browser),
they have no way of knowing when it completes unless they come back and
check. This breaks the "set it and forget it" promise of automation.
**What:** Adds Web Push notifications using the standard Push API
(VAPID). Push notifications are delivered through free browser vendor
services (Google FCM, Apple APNs, Mozilla Push) to a service worker —
even when all AutoGPT tabs are closed, as long as the browser process is
running. The system is generic and extensible to all notification types,
with copilot session completion as the first integration.
**How:**
- **Backend:** A new `PushSubscription` Prisma model stores per-user
push subscriptions. When a `NotificationEvent` is published to the Redis
notification bus, the existing `notification_worker` in `ws_api.py`
fires a tracked background `send_push_for_user()` task. This uses
`pywebpush` to call the browser push services with VAPID authentication.
Includes per-user TTL-bounded debounce (5s), per-user subscription cap
(20), 410/404 auto-cleanup, periodic scheduler-driven cleanup of
high-failure rows, and route-level SSRF rejection of untrusted
endpoints.
- **Frontend:** A `push-sw.js` service worker handles `push` events and
shows OS notifications via `self.registration.showNotification()`, with
click-to-navigate. A `PushNotificationProvider` mounted at the platform
layout registers the SW and subscription on all pages, posts the current
URL to the SW on every Next.js navigation (since Chrome's
`WindowClient.url` is stale for SPA routing), forwards the user's
notifications-toggle setting to the SW, and tears down on logout. The
copilot in-page notification path defers to the SW when a push
subscription is active so users don't get duplicate alerts.
### Behavior — when does an OS notification fire?
| Where the user is focused | Notifications toggle | OS notification? |
|---|---|---|
| Any `/copilot` page (any session, tab visible + browser focused) | on
| suppressed — sidebar green check + title badge handle it |
| `/library` (or any non-`/copilot` route) | on | **fires** |
| `/copilot` but tab hidden (Cmd-Tab away, minimized, different tab) |
on | **fires** |
| All AutoGPT tabs closed (browser process still running) | on |
**fires** |
| Any state | off | suppressed |
| Anywhere | permission not granted / no push subscription | falls back
to in-page `Notification()` if user is away on `/copilot`; nothing
otherwise |
Click any OS notification → focuses an existing tab and navigates it to
`/copilot?sessionId=<id>`, or opens a new window if no AutoGPT tab is
open.
### Test plan
#### Setup
- [ ] Generate VAPID keys via the snippet in `backend/.env.default` and
set `VAPID_PRIVATE_KEY`, `VAPID_PUBLIC_KEY`, `VAPID_CLAIM_EMAIL` in
`backend/.env`
- [ ] Leave `NEXT_PUBLIC_VAPID_PUBLIC_KEY` unset on the frontend (single
source of truth via `/api/push/vapid-key`)
- [ ] Start backend + frontend, grant notification permission on the
copilot page
- [ ] Verify `push-sw.js` is "activated and is running" in DevTools →
Application → Service Workers
- [ ] Verify `POST /api/push/subscribe` created exactly one DB row in
`PushSubscription` for your user
#### Notification show / suppress matrix
- [ ] Trigger completion **on `/copilot` viewing the same session**, tab
visible + focused → no OS notification (sidebar green check appears)
- [ ] Trigger completion **on `/copilot` viewing a different session**,
tab visible + focused → no OS notification (still considered "in the
feature")
- [ ] Trigger completion **on `/library`**, tab visible + focused → OS
notification fires
- [ ] Trigger completion **on `/copilot`** but with the tab hidden
(Cmd-Tab to another app) → OS notification fires
- [ ] Trigger completion with all AutoGPT tabs closed → OS notification
fires (browser must still be running)
- [ ] Toggle notifications **off** in the copilot UI → trigger
completion → no OS notification
- [ ] Toggle notifications **back on** → trigger completion → OS
notification fires
#### Click behavior
- [ ] OS notification → click → focuses an existing AutoGPT tab and
navigates to `/copilot?sessionId=<id>`
- [ ] OS notification with no AutoGPT tab open → click → opens a new tab
on `/copilot?sessionId=<id>`
#### Lifecycle
- [ ] Logout → DB row removed, browser unsubscribed; no further OS
notifications until login + re-subscribe
- [ ] Stale subscription (e.g. unsubscribed externally) → backend gets
410 from FCM → row auto-deleted; second push attempts no longer fan out
to it
### Changes 🏗️
**Backend — New files:**
- `backend/data/push_subscription.py` — CRUD for push subscriptions:
`upsert` (with `MAX_SUBSCRIPTIONS_PER_USER` cap), `find_many`, `delete`,
`increment_fail_count`, `cleanup_failed_subscriptions`,
`validate_push_endpoint` (HTTPS + push-service hostname allowlist for
SSRF prevention)
- `backend/data/push_sender.py` — Fire-and-forget push delivery with
`cachetools.TTLCache`-bounded debounce, defense-in-depth re-validation
at send time, 410/404 auto-cleanup with regex-based status extraction
(covers pywebpush versions where `e.response` is unset)
- `backend/api/features/push/routes.py` — 3 endpoints: `GET
/api/push/vapid-key`, `POST /api/push/subscribe`, `POST
/api/push/unsubscribe` (all with `requires_user` auth and 400 on invalid
endpoints)
- `backend/api/features/push/model.py` — Pydantic models with
`min_length`/`max_length` constraints on endpoint and crypto keys
**Backend — Modified files:**
- `schema.prisma` — Added `PushSubscription` model + `User` relation
- `pyproject.toml` — Added `pywebpush ^2.3` dependency
- `backend/util/settings.py` — VAPID key fields on `Secrets`;
`push_subscription_cleanup_interval_hours` config
- `backend/api/rest_api.py` — Registered push router at `/api/push`
- `backend/api/ws_api.py` — Notification worker now fires
`send_push_for_user()` as a tracked background task (strong-ref set +
done callback so asyncio doesn't GC it mid-run)
- `backend/data/db_manager.py` — Exposed push subscription RPC methods
on the DB manager async client
- `backend/executor/scheduler.py` — Periodic
`cleanup_failed_push_subscriptions` job (default 24h)
- `backend/.env.default` — VAPID env vars with key generation snippet
**Frontend — New files:**
- `public/push-sw.js` — Service worker: routes pushes via
`NOTIFICATION_MAP`, suppresses when user is on `/copilot`, accepts
`CLIENT_URL` and `NOTIFICATIONS_ENABLED` postMessages so SW logic stays
in sync with SPA navigation and the toggle, click handler with focus →
navigate → openWindow fallback, `pushsubscriptionchange` re-subscribe
with `credentials: include`
- `src/services/push-notifications/registration.ts`, `api.ts`,
`helpers.ts` — SW registration / Push API subscription / backend API
helpers
- `src/services/push-notifications/usePushNotifications.ts` — Hook that
auto-subscribes on login and tears down on logout
- `src/services/push-notifications/useReportClientUrl.ts` — Posts
current pathname+search to SW on every Next.js route change (works
around stale `WindowClient.url`)
- `src/services/push-notifications/useReportNotificationsEnabled.ts` —
Forwards the user's notifications toggle to the SW
- `src/services/push-notifications/PushNotificationProvider.tsx` —
Mounts all three hooks at the platform layout level
**Frontend — Modified files:**
- `src/app/(platform)/layout.tsx` — Mounted `<PushNotificationProvider
/>`
- `src/app/(platform)/copilot/useCopilotNotifications.ts` — Skips
in-page `Notification()` when a SW push subscription is active (avoids
duplicate alerts)
- `src/services/storage/local-storage.ts` — Added
`PUSH_SUBSCRIPTION_REGISTERED` key
- `frontend/.env.default` — Optional `NEXT_PUBLIC_VAPID_PUBLIC_KEY`
(left unset by default to keep `/api/push/vapid-key` as the single
source of truth)
**Configuration changes:**
- New env vars: `VAPID_PRIVATE_KEY`, `VAPID_PUBLIC_KEY`,
`VAPID_CLAIM_EMAIL` (backend); optional `NEXT_PUBLIC_VAPID_PUBLIC_KEY`
(frontend)
- New `push_subscription_cleanup_interval_hours` setting (default 24,
range 1–168)
- New DB migration: `PushSubscription` table
(`20260420120000_add_push_subscription`)
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] All blockers and should-fixes from the autogpt-pr-reviewer review
have been addressed (see PR thread)
- [x] All inline review threads resolved (49 threads addressed)
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Why
AutoPilot needs to reach users on chat platforms — Discord first,
Telegram / Slack / Teams / WhatsApp next. This PR adds the bot service
that bridges those platforms to the AutoPilot backend via the
`PlatformLinkingManager` AppService introduced in #12615.
Two independent linking flows (see #12615 for the rationale):
- **SERVER links**: first person to run `/setup` in a guild claims it.
Anyone in the server can mention the bot; all usage bills to the owner.
- **USER links**: an individual DMs the bot, links their personal
account, DMs bill to their own AutoPilot. A server owner still has to
link their DMs separately.
## What
A Python service using `discord.py`, living alongside the rest of the
backend. Connects to the platform_linking service via cluster-internal
RPC (no shared bearer token) and subscribes to copilot streams directly
on Redis (no HTTP SSE proxy).
Originally prototyped in Node.js with Vercel's Chat SDK — rewritten in
Python after team feedback: the rest of the platform is Python,
`discord.py` was already a dependency, and the Chat SDK's streaming-UI
abstractions don't apply to a headless chat bot.
### Deployment
- **Shares the existing backend Docker image** — no separate Dockerfile,
no separate Artifact Registry. A `copilot-bot` poetry script entry lets
the same image run with `command: ["copilot-bot"]` in the Helm chart.
- **Auto-starts with `poetry run app`** when
`AUTOPILOT_BOT_DISCORD_TOKEN` is set, so the full local dev stack
includes the bot without extra setup.
- **Runs standalone** via `poetry run copilot-bot` for the production
pod.
Infra PR:
[AutoGPT_cloud_infrastructure#310](https://github.com/Significant-Gravitas/AutoGPT_cloud_infrastructure/pull/310).
### File layout
```
backend/copilot/bot/
├── app.py # CoPilotChatBridge(AppService) + adapter factory + outbound @expose
├── config.py # Shared (platform-agnostic) config
├── handler.py # Core logic: routing, linking, batched streaming
├── platform_api.py # Thin facade over PlatformLinkingManagerClient + stream_registry
├── platform_api_test.py
├── text.py # split_at_boundary + format_batch
├── threads.py # Redis-backed thread subscription tracking
├── README.md
└── adapters/
├── base.py # PlatformAdapter ABC + MessageContext
└── discord/
├── adapter.py # Gateway connection, events, thread creation, buttons
├── commands.py # /setup, /help, /unlink
└── config.py # Discord token + message limits
```
**Locality rule:** anything platform-specific lives under
`adapters/<platform>/`. `app.py` is the only file that names specific
platforms — it's the factory that picks adapters based on which tokens
are set. Adding Telegram later = drop in `adapters/telegram/` with the
same shape.
### `CoPilotChatBridge` — now an `AppService`
Previously `AppProcess`. Now inherits `AppService`, runs its RPC server
on `Config.copilot_chat_bridge_port=8010`, and exposes two scaffolding
`@expose` methods for the backend→chat-platform direction:
- `send_message_to_channel(platform, channel_id, content)` — stub
- `send_dm(platform, platform_user_id, content)` — stub
Both currently raise `NotImplementedError` — they unlock the
architecture for future features (scheduled agent outputs piped to
Discord, etc.) without another structural change. A matching
`CoPilotChatBridgeClient` + `get_copilot_chat_bridge_client()` factory
lets other services call the bot by the same `AppServiceClient` pattern
used for `NotificationManager` and `PlatformLinkingManager`.
### Bot behaviour
- `/setup` — server only, ephemeral, returns a "Link Server" button.
Rejects DM invocations up front.
- `/help` — ephemeral usage info.
- `/unlink` — ephemeral, opens a "Settings" button pointing at
`AUTOGPT_FRONTEND_URL/profile/settings` (real unlinking needs JWT auth).
- **Thread per conversation**: @mentioning the bot in a channel creates
a thread and routes the reply there. Subsequent messages in that thread
don't need another @mention — thread subscriptions are tracked in Redis
with a 7-day TTL.
- **Batched follow-ups**: messages arriving mid-stream append to a
per-thread pending list; drained as a single follow-up turn when the
current stream ends.
- **Persistent typing indicator**: 8-second re-fire loop.
- **Per-user identity prefix**: every forwarded message tagged `[Message
sent by {name} (Discord user ID: ...)]`.
- **Platform-aware chunking**: long responses split at paragraph → line
→ sentence → word boundaries (1900 chars for Discord).
- **Link buttons** for DM link prompts and `/setup` / `/unlink`
responses.
- **Duplicate message guard**: on `DuplicateChatMessageError` the bot
stays quiet — no double response.
### Env vars
| Variable | Purpose |
|----------|---------|
| `AUTOPILOT_BOT_DISCORD_TOKEN` | Discord bot token — enables the
Discord adapter |
| `AUTOGPT_FRONTEND_URL` | Frontend base URL for link confirmation pages
|
| `REDIS_HOST` / `REDIS_PORT` | Shared with backend — session +
thread-subscription state + direct copilot stream subscription |
| `PLATFORMLINKINGMANAGER_HOST` | Cluster DNS name of the
`PlatformLinkingManager` service (RPC target) |
Gone vs. the previous REST design: `AUTOGPT_API_URL`,
`PLATFORM_BOT_API_KEY`, `SSE_IDLE_TIMEOUT`.
## How
- **Adapter pattern**: `PlatformAdapter` ABC defines `start`, `stop`,
`send_message`, `send_link`, `start_typing`, `create_thread`,
`max_message_length`, `chunk_flush_at`, etc. Each platform implements
the interface; the shared `MessageHandler` calls through it.
- **Control plane over RPC**: `PlatformAPI` (~180 lines) is a thin
facade over `PlatformLinkingManagerClient` — `resolve_server`,
`resolve_user`, `create_link_token`, `create_user_link_token`,
`stream_chat`. The bot never constructs HTTP requests or handles an API
key.
- **Streaming over Redis Streams**: `stream_chat` calls
`start_chat_turn` (backend `@expose`), receives a
`ChatTurnHandle(session_id, turn_id, user_id, subscribe_from="0-0")`,
then subscribes directly via
`stream_registry.subscribe_to_session(...)`. Yields text from
`StreamTextDelta`, terminates on `StreamFinish`, surfaces
`StreamError.errorText` to the user. No SSE parsing, no X-Session-Id
header dance.
- **Error model**: backend domain exceptions (`NotFoundError`,
`LinkAlreadyExistsError`, `DuplicateChatMessageError`) cross the RPC
boundary cleanly (all `ValueError`-based, registered in
`backend.util.exceptions`). The bot catches them by type instead of
inspecting HTTP status codes.
- **Cooperative batching**: `TargetState.processing` flag + per-target
`pending` list. Messages arriving while `processing=True` append; the
running stream's finally block loops to drain the list before releasing.
- **Typing helper for `endpoint_to_async`**: added an `@overload` so
`async def` `@expose` methods on the server type-check correctly on the
client side (the scheduler pattern avoids this by using sync `@expose`,
but the new managers are async).
## Tests
- `backend/copilot/bot/platform_api_test.py` — new. Covers resolve
(server + user), create link tokens (success + `LinkAlreadyExistsError`
propagation), stream chat (yields deltas, terminates on `StreamFinish`,
surfaces `StreamError`, propagates `DuplicateChatMessageError` and
`NotFoundError`, handles `subscribe_to_session` returning `None`).
- `poetry run pyright backend/copilot/bot/` — clean.
- `poetry run ruff check backend/copilot/bot/` — clean.
- `poetry run copilot-bot` starts and connects to Discord Gateway, syncs
slash commands.
- `/setup` in a guild → confirm on frontend → mention bot → AutoPilot
streams back in a created thread.
- Thread follow-ups work without re-mentioning.
- Spamming messages mid-stream produces one batched follow-up.
- Long responses chunk at natural boundaries.
- DM to unlinked user → "Link Account" button → confirm → DMs stream as
that user's AutoPilot.
## Stack
- Backend API: #12615 — merge first
- Frontend link page: #12624
- Infra:
[AutoGPT_cloud_infrastructure#310](https://github.com/Significant-Gravitas/AutoGPT_cloud_infrastructure/pull/310)
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
### Why / What / How
**Why:** The creator dashboard route under settings v2 currently shows a
"Coming soon" placeholder. SECRT-2281 fills it in so creators can manage
their store submissions from one place.
**What:** Implements the full creator dashboard at
`/settings/creator-dashboard` — stats overview, desktop submissions
table, mobile submissions list, filtering/sorting, selection bar, edit
modal, and empty/loading/error states.
**How:** Page logic lives in `useCreatorDashboardPage.ts` (data fetch,
filter state, modal state, CRUD callbacks); pure transforms in
`helpers.ts`; UI broken into colocated `components/*` (one folder per
component, each ~200–400 lines). Reuses generated API hooks,
`ErrorCard`, and `EditAgentModal` from the design system. Mobile/desktop
split via Tailwind `md:` breakpoints rather than runtime detection.
### Changes 🏗️
- Replace placeholder `page.tsx` with the real dashboard, wired to
`useCreatorDashboardPage`
- Add `useCreatorDashboardPage.ts` (page-level state + handlers) and
`helpers.ts` (filter/sort/stat utilities)
- Add components: `DashboardHeader`, `DashboardSkeleton`, `EmptyState`,
`StatsOverview`, `SubmissionsList` (+ `columns/*`,
`useSubmissionSelection`), `SubmissionItem` (+ `useSubmissionItem`),
`SubmissionSelectionBar`, `MobileSubmissionsList` (+
`MobileSelectionBar`), `MobileSubmissionItem`, `ColumnFilter`
- Set document title to "Creator dashboard – AutoGPT Platform"
- Surface fetch errors via `ErrorCard` with retry; show
`DashboardSkeleton` while loading; show `EmptyState` when there are no
submissions
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Loading state renders skeleton until submissions load
- [ ] Empty state renders when the creator has no submissions
- [ ] Error state renders `ErrorCard` and retry refetches the list
- [ ] Stats overview reflects approved/pending/rejected/draft counts
- [ ] Desktop list: sort/filter by status and other columns updates the
visible rows
- [ ] Desktop list: selection bar appears on row select and clears on
reset
- [ ] Mobile list (≤ md breakpoint): renders mobile items + selection
bar
- [ ] Edit modal opens for a submission, saves, and refreshes the list
on success
- [ ] Delete action removes the submission and updates stats
- [ ] View action navigates to the submission's public detail
- [ ] Submit/publish entry point opens the publish modal
- [ ] Document title shows "Creator dashboard – AutoGPT Platform"
## Why
The new Settings v2 surfaces (preferences, api-keys, integrations,
profile) shipped with a few rough edges spotted in self-review:
- **Timezone saves silently dropped on refresh.** Backend `GET
/auth/user/timezone` resolved the user via
`get_or_create_user(user_data)` (a 5-min in-process cache keyed by the
JWT-payload dict). `update_user_timezone` only invalidates
`get_user_by_id`'s cache, so the GET kept returning the pre-save tz
until TTL expired — looked exactly like "save did nothing."
- **Confusing "Looks like you're in X" CTA on the Time zone card** that
did nothing in the common case (server tz already matched the browser
tz, so clicking it produced no dirty state).
- **Save was disabled out of the gate when server tz was `"not-set"`** —
the hook substituted the browser tz into both `formState` and
`savedState`, so they were equal and `dirty` was false.
- **Lists felt static.** No motion when API keys / integrations mount,
and the loading skeletons popped in all at once instead of handing off
cleanly to the loaded rows.
- **Profile bio textarea** corner clipped against the rounded-3xl border
and the scrollbar overflowed the rounded container.
## What
### Bug fixes
- `GET /auth/user/timezone` now reads via `get_user_by_id(user_id)` —
the same cache `update_user_timezone` already invalidates — so a save
followed by refresh shows the new tz immediately.
- `usePreferencesPage` now treats the raw server tz (`"not-set"`
included) as the saved baseline, while `formState` uses the browser tz
only as a *display* fallback. Effect: when the user has never set a tz,
Save is enabled on first paint and a single click persists the detected
tz.
- Frontend save flow swapped `setQueryData` for `invalidateQueries`,
mirroring the older `/profile/(user)/settings` page so we always re-read
the persisted value.
- Removed the auto-detect "Looks like you're in X" button + its dead
helpers.
### Animations (per Emil Kowalski's guidelines)
Added orchestrated stagger animations that run on both the loaded list
**and** its skeleton, so the loading→loaded handoff is continuous
in-position:
- **API keys list + skeleton:** 280ms ease-out `cubic-bezier(0.16, 1,
0.3, 1)`, 40ms stagger, opacity + 6px translate.
- **Integrations list + skeleton:** 300ms ease-out, 80ms stagger,
opacity + 16px translate (rows are bigger / fewer).
- Both honor `prefers-reduced-motion` via `useReducedMotion`; only
`opacity` and `transform` are animated.
### Misc polish
- Profile bio textarea: `!rounded-tr-md` so the top-right corner doesn't
fight the surrounding `rounded-3xl`, plus a thin styled scrollbar
(`scrollbar-thin scrollbar-thumb-zinc-200
hover:scrollbar-thumb-zinc-300`) that lives inside the rounded container
instead of breaking out of it.
## How
| File | Change |
| --- | --- |
| `backend/api/features/v1.py` | `get_user_timezone_route` now uses
`get_user_by_id` + `Security(get_user_id)` instead of
`get_or_create_user(user_data)` |
| `frontend/.../preferences/usePreferencesPage.ts` | Split init into
`initialFormState` (browser-fallback display) vs `initialSavedState`
(raw server value); swap optimistic `setQueryData` for
`invalidateQueries` after tz mutate |
| `frontend/.../preferences/components/TimezoneCard/TimezoneCard.tsx` |
Drop `initialValue` prop, remove auto-detect button + unused imports |
| `frontend/.../preferences/page.tsx` | Drop `savedState`/`initialValue`
wiring |
| `frontend/.../api-keys/components/APIKeyList/APIKeyList.tsx` | Wrap
rows in container `motion.div` with `staggerChildren`; per-row
`motion.div` with opacity + y variants |
|
`frontend/.../api-keys/components/APIKeyListSkeleton/APIKeyListSkeleton.tsx`
| Same stagger config so loading→loaded matches |
|
`frontend/.../integrations/components/IntegrationsList/IntegrationsList.tsx`
+ `IntegrationsListSkeleton.tsx` | Same pattern for the providers list |
| `frontend/.../profile/components/ProfileForm/ProfileForm.tsx` |
Tailwind classes only — `!rounded-tr-md` + `scrollbar-thin
scrollbar-thumb-zinc-200 hover:scrollbar-thumb-zinc-300` |
## Test plan
- [ ] On `/settings/preferences`: pick a different tz → Save →
hard-refresh → new tz still selected.
- [ ] First-time user (server tz = `not-set`): land on page, Save button
should already be enabled; click Save → toast confirms; refresh → tz
persists.
- [ ] No "Looks like you're in X" button visible.
- [ ] On `/settings/api-keys`: rows fade/slide in staggered on first
mount; loading skeleton uses the same motion.
- [ ] On `/settings/integrations`: provider groups fade/slide in
staggered; skeleton matches.
- [ ] OS "Reduce motion" enabled → no transforms, content appears
instantly on all four surfaces.
- [ ] On `/settings/profile`: bio textarea top-right corner is no longer
hard-cornered against the card; scrollbar fits inside the rounded shape.
- [ ] Existing unit tests still pass: `pnpm test:unit
src/app/\(platform\)/settings/preferences` and `.../api-keys`.
## Why
Pre-launch scaling. Redis is currently a single-master pod — a real
SPOF, and not scalable horizontally. To move it to a sharded Redis
Cluster (via KubeBlocks in GKE), the backend has to speak the cluster
protocol.
Keeping both "standalone" and "cluster" code paths would have local dev
not reflect prod. Going **cluster-only**.
## What
- `backend.data.redis_client` now always constructs `RedisCluster`
(sync) / `redis.asyncio.cluster.RedisCluster` (async). Type aliases
`RedisClient` / `AsyncRedisClient` point at the cluster classes.
- `RedisCluster` uses the existing `REDIS_HOST` / `REDIS_PORT` as a
startup node and auto-discovers peers via `CLUSTER SLOTS`.
- Classic Redis pub/sub is broadcast cluster-wide and redis-py's async
`RedisCluster` has no `.pubsub()`; dedicated `get_redis_pubsub[_async]`
helpers return plain `(Async)Redis` clients to the seed node. All
pub/sub callers (`event_bus`, `notification_bus`,
`copilot.pending_messages`) route through these helpers.
- `rate_limit.py` MULTI/EXEC pipelines are split per-counter — daily and
weekly counters hash to different slots, which `RedisCluster` correctly
rejects as `CrossSlotTransactionError`. Per-counter `INCRBY + EXPIRE`
atomicity is preserved; the counters are logically independent budgets.
- `util/cache.py` shared-cache client is also `RedisCluster` now.
- Pre-existing mock-based unit tests updated; new `redis_client_test.py`
covers the swap.
## Local dev
`docker-compose.platform.yml` now runs **2-master Redis Cluster**
(`redis` + `redis-2`, 16384 slots split 0-8191 / 8192-16383). A one-shot
`redis-init` sidecar bootstraps it on first boot via raw `CLUSTER MEET`
+ `CLUSTER ADDSLOTSRANGE` (bundled `redis-cli --cluster create` enforces
a 3-node minimum).
This deliberately catches cross-slot bugs on a laptop rather than in
prod:
```
>>> ALL SMOKE TESTS PASS <<<
[sync] class: RedisCluster
[sync] 20 keys across slots: OK
[sync] colocated MULTI/EXEC: OK [5, 12, 1]
[sync] cross-slot MULTI/EXEC rejected as expected: CrossSlotTransactionError
[sync] EVAL single-key: OK
[sync] pub/sub (classic, broadcast): OK
[async] class: RedisCluster
[async] 15 keys across slots: OK
[async] colocated pipeline: OK
[async] pub/sub: OK
```
`rest_server` `/health` → 200, both shards have connected clients + keys
distributed 19/19 under the smoke run. `executor` boots + connects to
RabbitMQ + Redis cleanly.
For a 3-shard override (6 pods, with replicas) when you want to test
real KubeBlocks topology:
```
docker compose -f docker-compose.yml -f docker-compose.redis-cluster.yml up -d
```
## Deploy order (companion infra PR:
[cloud_infrastructure#312](https://github.com/Significant-Gravitas/AutoGPT_cloud_infrastructure/pull/312))
The existing `helm/redis` chart is updated in that PR to run as a
1-shard cluster (backwards-compatible toggle, default on). That rollout
must land before this PR's image goes live so the backend's
`RedisCluster` client has something to discover.
Sequence:
1. Infra: `helm upgrade redis` (1-shard cluster-enabled)
2. Infra: `helm upgrade rabbit-mq` (3-node cluster)
3. Backend: merge + deploy this PR
4. Follow-up: swap to KubeBlocks `redis-cluster` chart (3-shard sharded,
already staged in infra PR)
## Caveats / follow-ups
- Classic pub/sub via seed node means every node in the cluster sees
every message (broadcast). Fine at current volume; if it becomes hot,
migrate to `SPUBLISH`/`SSUBSCRIBE` (Redis 7+ sharded pub/sub).
- Per-user rate-limit counters (daily vs weekly) lost cross-counter
transactionality, but per-counter atomicity is preserved — the two
counters are independent budgets so no correctness regression.
- Local 2-master cluster crashes lose the cluster state; `redis-init`
idempotently rebootstraps.
## Checklist
- [x] Lint + format pass (`poetry run format` + `poetry run lint`)
- [x] Unit tests pass — `redis_client_test`, `redis_helpers_test`,
`event_bus_test`, `pending_messages_test`, `rate_limit_test`,
`cluster_lock_test`
- [x] Live smoke against 2-master cluster — sync + async; MULTI/EXEC;
EVAL; pub/sub; cross-slot rejection
- [x] Full stack smoke — `rest_server` /health, `executor` boot, keys
distributed across both shards
- [ ] Dev deploy (pending infra PR merge + manual validation)
## Summary
Improves Copilot/AutoPilot streaming reliability across frontend and
backend. The diff now covers the original streaming investigation issues
plus follow-up CI and review fixes from the latest merge with `dev`.
Addresses [SECRT-2240](https://linear.app/autogpt/issue/SECRT-2240),
[SECRT-2241](https://linear.app/autogpt/issue/SECRT-2241), and
[SECRT-2242](https://linear.app/autogpt/issue/SECRT-2242).
## Changes
- Fixes reasoning vs response rendering so action tools such as
`run_block` and `run_agent` do not cause assistant response text to be
hidden inside the collapsed reasoning section.
- Reworks Copilot session lifecycle handling: active-stream hydration,
resume ordering, reconnect timeout recovery, wake resync, session
deletion, title polling, stop handling, and session-switch stale
callback guards.
- Adds a per-session Copilot stream store/registry and transport helpers
to prevent duplicate resumes, duplicate sends, and cross-session
contamination during reconnect or reload flows.
- Adds pending follow-up message support and backend pending-message
safeguards, including sanitization of queued user content and
requeue-on-persist-failure behavior.
- Improves backend stream and executor robustness: active stream
registry checks, bounded cancellation drain with sync fail-close
fallback, Redis helper coverage, and updated SDK response adapter
expectations for post-tool status events.
- Adds and polishes usage-limit UI, including reset gate behavior,
backdrop blending behind the usage-limit card, and usage panel/card
coverage.
- Fixes a chat input Enter-submit race where Playwright and fast users
could fill the textarea and press Enter before React had re-enabled the
submit button, causing the visible message not to send.
- Refactors the Copilot page into smaller hooks/components and adds
focused tests around stream recovery, hydration, pending queueing,
rate-limit gates, and message rendering.
## Test plan
- [x] `poetry run format`
- [x] `poetry run pytest backend/copilot/sdk/response_adapter_test.py
backend/copilot/executor/processor_test.py`
- [x] `pnpm prettier --write` on touched frontend files
- [x] `pnpm vitest run
src/app/(platform)/copilot/components/ChatInput/__tests__/useChatInput.test.ts`
- [x] `pnpm types`
- [x] `pnpm lint` (passes with existing unrelated `next/no-img-element`
warnings)
- [ ] Full GitHub CI after latest push
## Review notes
- The Sentry review thread about unbounded cancellation cleanup is
addressed in `375ec9d5f`: cancellation now waits for normal async
cleanup but exits after `_CANCEL_GRACE_SECONDS` and falls through to the
sync fail-close path.
- The previous backend CI failures were stale test expectations around
the new `StreamStatus("Analyzing result…")` event after tool output;
tests now assert that event explicitly.
- The previous full-stack E2E failure was the Copilot input Enter race;
the input now submits from the live form value instead of depending on a
possibly stale disabled button state.
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
## Summary
When `CHAT_USE_CLAUDE_CODE_SUBSCRIPTION=true` is paired with a populated
`CHAT_BASE_URL=https://openrouter.ai/api/v1` (e.g. left over from an
earlier OpenRouter setup), the SDK was passing the OpenRouter slug
`anthropic/claude-opus-4.7` straight through to the Claude Code CLI
subprocess. The CLI uses OAuth and ignores
`CHAT_BASE_URL`/`CHAT_API_KEY`, so it rejects the slug:
> There's an issue with the selected model (anthropic/claude-opus-4.7).
It may not exist or you may not have access to it.
The bug was in `_normalize_model_name`, which gated on
`config.openrouter_active` (config-shape check) instead of the transport
the CLI actually uses for the turn.
## Changes
- Add `ChatConfig.effective_transport` property returning `subscription`
| `openrouter` | `direct_anthropic`, detected in that priority order.
Subscription wins over OpenRouter config because the CLI subprocess uses
OAuth and ignores the OpenRouter env vars (see `build_sdk_env` mode 1).
- Switch `_normalize_model_name` to gate on `effective_transport`.
Subscription and direct-Anthropic transports both produce the
CLI-friendly hyphenated form (`claude-opus-4-7`) and reject
non-Anthropic vendors loudly.
- `_resolve_sdk_model_for_request` already routes any LD-served override
through `_normalize_model_name`, so a per-user advanced-tier override
under subscription now correctly becomes `claude-opus-4-7` instead of
the OpenRouter slug. The standard-tier \"no LD override → return None\"
behaviour is preserved.
- Update two existing service tests to assert the corrected behaviour
(Kimi LD override under subscription falls back to tier default
normalised for the CLI; Opus advanced override returns hyphenated form).
## Test plan
- [x] `poetry run pytest backend/copilot/sdk/service_helpers_test.py
backend/copilot/sdk/service_test.py backend/copilot/config_test.py -v` —
165 passed.
- [x] `poetry run pytest backend/copilot/sdk/env_test.py
backend/copilot/sdk/p0_guardrails_test.py` — 136 passed (other call
sites of `openrouter_active` unchanged).
- [x] `poetry run ruff format` + `ruff check` clean on touched files.
### New tests added (service_helpers_test.py)
- Subscription transport with OpenRouter base URL set + advanced-tier LD
override → returns `claude-opus-4-7` (not the OpenRouter slug, not
None).
- Subscription transport with OpenRouter base URL set + standard-tier no
override → returns None (existing behaviour preserved).
- Subscription transport rejects non-Anthropic vendor (`moonshotai/...`)
→ ValueError.
- `effective_transport` returns `subscription` when subscription is on
regardless of OpenRouter config; returns `openrouter` when subscription
is off and OpenRouter is fully configured; returns `direct_anthropic`
otherwise.
## Summary
Pause the SDK idle timer while a tool call is pending, with a 2-hour
hung-tool cap as backstop. Fixes SECRT-2239 — long-running tools (10+
min, e.g. sub-agent execution) were being silently aborted by the
10-minute idle timeout introduced in #12660.
## What changed (backend only)
- `_IDLE_TIMEOUT_SECONDS = 1800` (30 min) — soft cap when no tool
pending (raised from 10 min)
- `_HUNG_TOOL_CAP_SECONDS = 7200` (2 h) — hard cap when a tool is
pending; protects against truly hung tool calls without false-aborting
legitimate long-running ones
- `_idle_timeout_threshold(adapter)` — returns the appropriate threshold
based on whether any tool is currently pending in the adapter
Backed by 7 regression tests in
`service_test.py::TestIdleTimeoutThreshold`.
## Frontend coordination
The original cherry-pick batch included a `useStreamActivityWatchdog`
hook for client-side wire-silence detection. That hook is dropped from
this PR because it overlaps with Lluis's #12813, which ships the same
component as part of a comprehensive copilot streaming refactor. End
state on dev: his PR contributes the watchdog, this PR contributes the
backend pause + cap.
## Test plan
- 7/7 unit tests in
`backend/copilot/sdk/service_test.py::TestIdleTimeoutThreshold` pass
- pyright clean on `service.py` + `service_test.py`
- /pr-test --fix posted with native-stack run + screenshots:
https://github.com/Significant-Gravitas/AutoGPT/pull/12927#issuecomment-4328320714
## Linear
SECRT-2239
Filtering store creators to only show profiles with an approved agent
keeps the marketplace focused on usable inventory and prevents empty
creator cards.
### Changes 🏗️
- add a `num_agents > 0` filter to `get_store_creators`
- add a regression test ensuring we only return creators with approved
agents
- keep the existing SQL injection regression tests intact after rebasing
onto `dev`
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] python3 -m pytest
autogpt_platform/backend/backend/server/v2/store/db_test.py -k
get_store_creators_only_returns_approved *(blocked: repo environment
lacks pytest and related deps)*
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> Filter `get_store_creators` to creators with `num_agents > 0` and add
a test to validate the behavior.
>
> - **Store backend**:
> - Update `get_store_creators` in `backend/server/v2/store/db.py` to
filter creators by `num_agents > 0`.
> - **Tests**:
> - Add `test_get_store_creators_only_returns_approved` in
`backend/server/v2/store/db_test.py` to verify filtering and pagination
count calls.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
c2fca584cce5a8c26dbdadd68696a0033642f193. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <ntindle@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: ntindle <8845353+ntindle@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- Detect ghost-finished sessions where the SDK returns a `ResultMessage`
with `subtype="success"`, empty `result`, no produced content, and
`output_tokens == 0`.
- Emit `StreamError(code="empty_completion")` instead of silently
calling `StreamFinish`, so the caller (and the user) sees the failure.
## Background
Linear: [SECRT-2252](https://linear.app/agpt/issue/SECRT-2252) — SDK
silent empty completion not retried, leaving the user with a blank
stream (`start -> start-step -> finish-step -> finish`).
## Changes
- `response_adapter.py::convert_message`: in the `ResultMessage` branch,
check `_is_empty_completion()` before falling through to the existing
success path. When matched, close any open step, emit `StreamError`, and
skip `StreamFinish`.
- `response_adapter.py::_is_empty_completion`: new helper that returns
`True` only when `result` is falsy, no text/reasoning was emitted, no
tool calls were registered, no tool results were seen, and
`usage["output_tokens"]` is `0`.
- `response_adapter_test.py`: 4 new unit tests covering empty-success
(None and empty-string variants), non-empty success, and the
non-empty-tokens-but-empty-result fallthrough.
## Out of scope (per ticket)
- Retry-once behavior. This PR only surfaces the error; the caller
decides retry semantics. Follow-up work can wire automatic retry on
`code="empty_completion"`.
## Test plan
- [x] `poetry run pytest backend/copilot/sdk/response_adapter_test.py` —
all 58 tests pass (4 new + 54 existing).
- [x] `poetry run pyright backend/copilot/sdk/response_adapter.py
backend/copilot/sdk/response_adapter_test.py` — clean.
## Checklist
- [x] My code follows the style of this project.
- [x] I have added tests covering my changes.
- [x] I have updated the documentation accordingly. (N/A — internal
adapter behavior)
### Why / What / How
**Why** — Settings v2 needs a dedicated Preferences page covering
account info (email, password reset), time zone, and notification
preferences with a single, predictable save flow. The existing legacy
page at `/profile/(user)/settings` mixes concerns and uses `__legacy__`
UI primitives we are migrating away from. SECRT-2279.
**What** — A new preferences page at `/settings/preferences` built from
atomic / molecular design-system components. Three cards (Account, Time
zone, Notifications) share one inline Save / Discard bar at the bottom.
The page replaces the legacy settings page from a UX standpoint while
keeping the same backend mutations.
<img width="1511" height="899" alt="Screenshot 2026-04-27 at 6 43 09 PM"
src="https://github.com/user-attachments/assets/5762fc41-1654-4764-8fbf-d5dd262e031a"
/>
**How**
- **Page composition (`page.tsx`)** uses a single `usePreferencesPage`
hook that owns dirty/saved/form state. Renders `PreferencesHeader`,
`AccountCard`, `TimezoneCard`, `NotificationsCard`, and the inline
`SaveBar` below the cards.
- **Account card** — Email row shows the current address with a compact
pencil button that opens a width-constrained `Dialog` for editing
(Cancel / Update). Password row is a `NextLink` button that routes to
`/reset-password`.
- **Time zone card** — Single row inside a card: label + info-icon
tooltip on the left; small-size `Select` and the GMT offset chip on the
right. The "auto-detect" prompt shows up only when the saved tz differs
from the browser tz, rendered as a pill on the right side of the card.
- **Notifications card** — Tabs for Agents / Marketplace / Credits;
toggling a switch flips a flag in formState and enables Save.
- **Save flow** — `usePreferencesPage` keeps a `savedState` snapshot
(mirrors the old `react-hook-form` `defaultValues` capture-once
semantics) so the dirty check is fully decoupled from any backend GET
refetch. After a successful mutation, `savedState ← formState`, and the
timezone query cache gets an optimistic `setQueryData` write so the
value isn't snapped back by the (cached) GET endpoint.
- **Skeleton** — `PreferencesSkeleton` mirrors the real layout — header,
Account card with the two row shapes, Time zone single row,
Notifications tabs + toggle rows, and the Save / Discard buttons.
- **Sidebar** — Renames the entry "Settings" → "Preferences" with
`SlidersHorizontalIcon`. Profile and Creator Dashboard get clearer
affordances (`UserIcon`, `ChartLineUpIcon`).
### Changes 🏗️
- New page: `src/app/(platform)/settings/preferences/page.tsx` and
`usePreferencesPage.ts`
- New components: `AccountCard`, `TimezoneCard`, `NotificationsCard`,
`PreferencesHeader`, `PreferencesSkeleton`, `SaveBar`
- Helpers: `helpers.ts` (timezones list, GMT-offset formatter, dirty
utilities, notification-group definitions)
- Sidebar: rename "Settings" → "Preferences", swap to
`SlidersHorizontalIcon` + cleaner Profile / Creator Dashboard icons
(`SettingsSidebar/helpers.ts`)
- Tests:
- `preferences/__tests__/main.test.tsx` — page render, edit-email dialog
open/cancel, time zone info trigger, Save/Discard disabled-on-clean,
notification toggle → save submission, discard revert
- Updated `SettingsSidebar.test.tsx` and `SettingsMobileNav.test.tsx`
for the "Preferences" label
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Sidebar shows "Preferences" with the new icon; clicking routes to
`/settings/preferences`
- [ ] Account card: pencil button opens a 420px-wide dialog; Update
button stays disabled until the email differs and is valid; Cancel
closes the dialog
- [ ] Password row: "Reset password" button navigates to
`/reset-password`
- [ ] Time zone: changing the select enables Save; clicking Save
persists the value and the form keeps showing the saved value (does not
snap back), the inline GMT offset chip updates, info tooltip appears on
hover
- [ ] Auto-detect prompt appears only when saved tz ≠ browser tz;
clicking it sets the select to the browser tz
- [ ] Notifications: toggling any switch enables Save; saving flips the
flag in the request body; switching tabs preserves toggles; Discard
reverts unsaved toggles
- [ ] Save / Discard render below the cards on the right and stay
disabled until any field is dirty
- [ ] Loading state shows the new skeleton that mirrors the layout
- [ ] `pnpm test:unit` passes (covers the page-level integration tests
above)
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
### Why / What / How
**Why:** SECRT-2278 — settings v2 needs a Profile page so users can
manage how they appear on the marketplace (display name, handle, bio,
links, avatar) without hopping into the legacy settings UI.
**What:** Adds the `/settings/profile` page end-to-end:
- Form fields for display name, handle, bio, avatar, and up to 5 links —
wired to the `getV2GetUserProfile`, `postV2UpdateUserProfile`, and
`postV2UploadSubmissionMedia` endpoints.
- Bio editor gets a markdown toolbar (bold / italic / strikethrough /
link / bulleted list) with a live preview toggle that renders via
`react-markdown` + `remark-gfm`.
- Save/Discard bar with full validation (handle regex, bio length, dirty
tracking) and toast feedback for success and failure paths.
- Forward refs through the `Input` atom so consumers can target the
underlying `<textarea>` / `<input>` (needed for the toolbar's
selection/cursor manipulation).
- Comprehensive integration tests (Vitest + RTL + MSW) at the page level
plus pure-helper unit tests, in line with the project's
"integration-first" testing strategy. Coverage is reported via
`cobertura` for Codecov.
**How:**
- The toolbar applies markdown syntax by reading `selectionStart` /
`selectionEnd` from a forwarded textarea ref. To avoid the textarea
jumping to the top on click: buttons `preventDefault` on `mousedown` (so
the textarea keeps focus), and the handler captures `scrollTop` before
mutation and restores it (with `focus({ preventScroll: true })`) after
React commits the new value in the next animation frame.
- The preview pane styles markdown elements via Tailwind arbitrary child
selectors (`[&_ul]:list-disc` etc.) instead of pulling in
`@tailwindcss/typography`, since the plugin isn't installed and the
project's `prose` usage was a no-op.
- Profile data hydration tolerates nullish API fields by mapping through
`profileToFormState`, padding `links` to 3 slots so the UI always has
the initial layout.
- Tests use Orval-generated MSW handlers from `store.msw.ts`, mock
`useSupabase` to inject an authenticated user, and assert UI behavior
via Testing Library queries.
### Changes 🏗️
- New: `app/(platform)/settings/profile/__tests__/helpers.test.ts`,
`__tests__/main.test.tsx`
- Updated: `settings/profile/page.tsx`, `useProfilePage.ts`,
`helpers.ts`, plus `ProfileForm`, `ProfileHeader`, `ProfileSkeleton`,
`LinksSection`, `SaveBar`
- Updated: `settings/layout.tsx` (settings v2 chrome adjustments to host
the profile page)
- Atom change: `components/atoms/Input/Input.tsx` now forwards refs
(`HTMLInputElement | HTMLTextAreaElement`) — backward-compatible for
existing consumers
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Open `/settings/profile` and confirm the page hydrates the
existing display name, handle, bio, avatar, and links
- [x] Edit each field and verify validation messages (empty name,
invalid handle with spaces, bio over 280 chars)
- [x] Bio markdown toolbar: select text and click Bold / Italic / Strike
— selection wraps, cursor stays in place, textarea does not scroll to
top
- [x] Bio toolbar with no selection: each button inserts the markdown
placeholder template
- [x] Click Bulleted list twice on the same line — line is prefixed with
`- ` only once
- [x] Toggle Preview — bio renders bullets, bold, italic, strikethrough,
links correctly; toolbar buttons dim and become inert
- [x] Toggle Edit — textarea returns with the same content
- [x] Add link → 4th and 5th slots appear; the 6th attempt is blocked by
the "Limit of 5 reached" button label
- [x] Remove link 1 — the rest reorder correctly
- [x] Avatar upload — happy path replaces the avatar; failure path
surfaces a destructive toast
- [x] Save with valid data → success toast, query invalidates, save
button disables until next edit
- [x] Save with a server 422 → destructive toast, no state corruption
- [x] Discard reverts every field back to the loaded profile
- [x] `pnpm test:unit` passes locally; `coverage/cobertura-coverage.xml`
shows ≥ 80% line coverage for `src/app/(platform)/settings/profile/**`
Requested by @Bentlybro
Anthropic released [Claude Opus
4.7](https://www.anthropic.com/news/claude-opus-4-7) today. This PR adds
it to the platform's supported model list.
## Why
Users and developers need access to `claude-opus-4-7` via the platform's
LLM block and API. The model is available on Anthropic's API today.
## What
- Adds `CLAUDE_4_7_OPUS = "claude-opus-4-7"` to the `LlmModel` enum
- Adds corresponding `ModelMetadata` entry: 200k context, 128k output,
price tier 3 ($5/M input, $25/M output — same as Opus 4.6)
## How
Two lines added to `llm.py`, following the exact same pattern as all
other Anthropic model additions. No migrations, no frontend changes
needed — the frontend reads model metadata from the backend's JSON
schema endpoint automatically.
Closes SECRT-2248
---------
Co-authored-by: Bentlybro <Github@bentlybro.com>
## Summary
Fixes#9188 — redirects `www.` to non-www to prevent cookie/auth domain
mismatch.
Uses **308** (Permanent Redirect) instead of 301, which preserves the
HTTP method and request body. This is important because the middleware
matcher runs on `/auth/authorize` and `/auth/integrations/*`, where
OAuth callbacks may use POST.
Split from #12895 per reviewer request.
## Changes
- Added www→non-www redirect in Next.js middleware using
`NextResponse.redirect(url, 308)`
---------
Co-authored-by: majdyz <zamil.majdy@agpt.co>
## Summary
Fixes#11237 — `buildUrlWithQuery` now filters out `null` values in
addition to `undefined`, preventing them from being serialized as
literal `"null"` strings in URL query parameters.
Split from #12895 per reviewer request.
## Changes
- Added `value !== null` check alongside existing `value !== undefined`
in `buildUrlWithQuery`
---------
Co-authored-by: majdyz <zamil.majdy@agpt.co>
### Why / What / How
**Why:** The settings v2 integrations surface renders from a hardcoded
`MOCK_PROVIDERS` array — no real user credentials, no delete, no way to
connect a new service. Provider display metadata (descriptions +
supported auth types) was scattered across frontend maps with no backend
source of truth, leaving each new provider to be manually registered in
two places.
**What:** Full-featured settings v2 integrations page driven by live
backend data, plus a backend SDK extension so every provider carries a
description **and** declares its supported auth types. Settings UI uses
both to render the connect-a-service dialog: descriptions in the list,
auth types to pick the right tabs in the detail view.
**How:**
- **Credentials list** — single-fetch via `useGetV1ListCredentials`,
grouped client-side by provider, debounced (250 ms) Unicode-normalized
in-memory search (no roundtrip per keystroke), managed/system creds
filtered via the shared `filterSystemCredentials` helper from
`CredentialsInput`. Loading → skeletons that mirror the real accordion
shape, error → `ErrorCard` with retry, empty → `IntegrationsListEmpty`
with custom marquee illustration.
- **Delete flow** — `useDeleteIntegration` returns per-target `succeeded
/ failed / needsConfirmation` so the UI can name failed items and keep
them selected for one-click retry. Single + bulk both gated by
`DeleteConfirmDialog`. Per-row delete button disables + shows a spinner
via `isDeletingId` so double-clicks can't fire two requests. Success
toast names the credential ("Removed GitHub key").
- **Connect-a-service dialog** — backend-driven (`useGetV1ListProviders`
returns `ProviderMetadata[]` with description + supported_auth_types),
Emil-spec animations (150 ms ease-out step swap, 200 ms ease-out height
resize, 180 ms entry fade+slide+blur on tab swap, all respecting
`prefers-reduced-motion`). Detail view picks tab order via deterministic
`TAB_PRIORITY` (oauth → api_key → user_password → host_scoped) and
remembers last-selected tab per provider for the session.
- **OAuth tab** → `openOAuthPopup` + `getV1InitiateOauthFlow` →
`postV1ExchangeOauthCodeForTokens`
- **API key tab** → zod-validated form with
`autoComplete="new-password"` + `spellCheck=false` so browsers don't
autofill the wrong stored key → `postV1CreateCredentials`
- **Provider metadata SDK** — chainable
`ProviderBuilder.with_description(...)` +
`.with_supported_auth_types(...)` (the latter populated automatically by
`with_oauth` / `with_api_key` / `with_managed_api_key` /
`with_user_password`; explicit form reserved for legacy providers whose
auth lives outside the builder chain). `GET /integrations/providers`
upgraded from `List[str]` → `List[ProviderMetadata]` carrying both
fields.
- **Backward compat** — `BackendAPI.listProviders()` maps the new
`ProviderMetadata[]` shape down to `string[]` so the deprecated
`CredentialsProvider` (used by the builder/library credential pickers)
keeps working without ripple changes.
- **Routing** — page lives at `/settings/integrations` directly. No
feature flag gate (settings v2 layout is already on dev).
### Changes 🏗️
**Backend**
- `backend/sdk/builder.py` — `with_description()` +
`with_supported_auth_types()` chain methods; the latter is
auto-populated by every existing auth-method chain call so explicit
declaration is only needed for legacy providers.
- `backend/sdk/provider.py` — `description` + `supported_auth_types`
fields on `Provider`.
- `backend/api/features/integrations/router.py` — `GET /providers` now
returns `List[ProviderMetadata]`; calls `load_all_blocks()` (cached
`@cached(ttl_seconds=3600)`) before reading `AutoRegistry`.
- `backend/api/features/integrations/models.py` — `ProviderMetadata`
with `name + description + supported_auth_types`;
`get_provider_description` + `get_supported_auth_types` helpers reading
from `AutoRegistry`.
- 13 existing `_config.py` files updated with `.with_description(...)`:
agent_mail, airtable, ayrshare, baas, bannerbear, dataforseo, exa,
firecrawl, linear, stagehand, wordpress, wolfram, generic_webhook.
- 20 new `_config.py` files (one per provider block dir): apollo,
compass, discord, elevenlabs, enrichlayer, fal, github, google, hubspot,
jina, mcp, notion, nvidia, replicate, slant3d, smartlead, telegram,
todoist, twitter, zerobounce. Each declares
`with_supported_auth_types(...)` because their auth handlers live in
`backend/integrations/oauth/` (legacy) or are block-level
`CredentialsMetaInput` declarations — outside the builder chain.
- 1 new `backend/blocks/_static_provider_configs.py` registering
description + auth types for ~24 providers that live in shared files
(openai, anthropic, groq, ollama, open_router, v0, aiml_api, llama_api,
reddit, medium, d_id, e2b, http, ideogram, openweathermap, pinecone,
revid, screenshotone, unreal_speech, webshare_proxy, google_maps, mem0,
smtp, database). Comment documents the migration path (each entry
retires when the provider graduates to its own `_config.py`).
**Frontend**
- `src/app/(platform)/settings/integrations/page.tsx` — replaces mock
page; composes header + list + connect dialog.
-
`src/app/(platform)/settings/integrations/components/IntegrationsList/`
— list + skeleton + selection (Record<string, true> instead of Set) +
delete orchestration hook.
-
`src/app/(platform)/settings/integrations/components/ConnectServiceDialog/`
— split per the 200-line house rule into `ConnectServiceDialog`,
`ListView`, `ProviderRow`, `useMeasuredHeight`. DetailView's nested
helpers extracted to siblings: `MethodPanel`, `UnsupportedNotice`,
`ProviderAvatar`. Tabs render in deterministic priority order;
last-selected tab persisted per provider in module-scope.
-
`src/app/(platform)/settings/integrations/components/DeleteConfirmDialog/`
— new confirm dialog gating single + bulk deletes (shows up to 3 names +
remaining count for bulk).
-
`src/app/(platform)/settings/integrations/components/IntegrationsListEmpty/components/IntegrationsMarquee.tsx`
— switched from `next/image unoptimized` to plain `<img loading="lazy"
decoding="async">` for decorative logos (no LCP candidate, avoids Next
Image runtime overhead).
-
`src/app/(platform)/settings/integrations/components/hooks/useDeleteIntegration.ts`
— bulk delete now returns per-target
`succeeded/failed/needsConfirmation`; failed items stay selected for
retry; per-id pending tracking via `isDeletingId`; toast names the
credential.
-
`src/app/(platform)/settings/integrations/components/hooks/useDebouncedValue.ts`
— small reusable debounce hook (250 ms, used by both list + dialog
search).
- `src/app/(platform)/settings/integrations/helpers.ts` —
`formatProviderName` guarded against non-string input; `filterProviders`
now Unicode-normalized (NFKD + strip combining marks) so accented
queries match.
- `src/providers/agent-credentials/helper.ts` — `toDisplayName` same
`typeof string` guard.
- `src/components/contextual/CredentialsInput/helpers.ts` — loosened
`filterSystemCredentials` / `getSystemCredentials` generic constraint to
accept `title?: string | null` so it consumes `CredentialsMetaResponse`
directly.
- `src/lib/autogpt-server-api/client.ts` — `listProviders()` maps the
new shape to `string[]` for backward compat.
- `src/app/api/openapi.json` — regenerated spec includes
`ProviderMetadata` with `supported_auth_types`.
### PR feedback addressed 🛠️
Round of fixes after the first review pass:
- Bulk delete: per-item names in failure toast, failed items kept
selected for retry.
- Confirmation dialog before any delete (single or bulk) —
`DeleteConfirmDialog`.
- Per-row delete button disabled + spinner while pending (no
double-click double-fire).
- Toast names the credential ("Removed GitHub key") instead of generic
copy.
- API key input: `autoComplete="new-password"` + `spellCheck=false`;
title field `autoComplete="off"`.
- Search debounced 250 ms on both list + dialog; Unicode-normalized so
"Açai" matches "acai".
- `toDisplayName` / `formatProviderName` guarded against non-string
input (`provider.split is not a function` was reproducible).
- Skeleton mirrors the real accordion shape — no layout shift on data
load.
- Selection bar sticky position fixed for <375 px (`top-2 sm:top-0`).
- Last-selected auth tab persisted per provider for the session.
- Tabs ordered deterministically (oauth → api_key → user_password →
host_scoped) instead of insertion order.
- `useMemo` removed from `useIntegrationsList` per project rule (no
measured perf need).
- Selection state migrated from `Set<string>` to `Record<string, true>`
(idiomatic React state shape).
- ConnectServiceDialog 288 LoC → ~130 (extracted `ListView`,
`ProviderRow`, `useMeasuredHeight`); DetailView helpers → siblings.
- `next/image unoptimized` → plain `<img>` for decorative logos in
marquee + provider rows + avatar.
- `with_supported_auth_types(...)` pruned in 11 `_config.py` files where
it was redundant with `with_oauth` / `with_api_key` /
`with_managed_api_key` / `with_user_password`. Kept in legacy ones
(github, discord, google, notion, ...) where the docstring says it's
required because auth handlers live outside the builder chain.
- Tab swap + dialog step animations re-tuned vs Emil Kowalski's
animation rules: ease-out default, under 300 ms,
transform+opacity+filter only, blur-bridge to soften swap, willChange to
dodge the 1px shift, reduced-motion fallbacks via `useReducedMotion`.
- Merged latest `dev` (api-keys SECRT-2273 took dev's version, no
api-keys diff in this PR; settings layout took dev's version, no
`SETTINGS_V2` feature flag in this PR; `scroll-area` took dev's
version).
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Navigate to `/settings/integrations` — loading skeletons appear,
then user's real credentials grouped by provider.
- [ ] Type in the search box — filters client-side after 250 ms, no new
network requests (DevTools → Network stays quiet). Try "açai" with
diacritics — matches "acai" providers.
- [ ] Connect a service → dialog loads provider list with backend
descriptions; search matches name, slug, and description.
- [ ] Click a provider → dialog tweens height smoothly (no jump); header
shows provider avatar + name + description; tabs render in oauth →
api_key priority; last-used tab restored on reopen.
- [ ] Open `linear` (oauth + api_key) — switching tabs animates with a
quick fade+slide+blur entry, no flash.
- [ ] OAuth tab → "Continue with X" opens popup, completes consent,
popup closes, dialog closes, new credential appears with success toast.
- [ ] API key tab → paste a key (browser does NOT offer to autofill any
stored password), Save → toast names the credential, dialog closes.
- [ ] Delete (single) via trash icon → confirmation dialog → button
disables with spinner during the request → toast names the credential.
- [ ] Delete (bulk) via selection bar → confirmation lists up to 3 names
→ if some fail, failed ones stay selected for retry; toast lists which
failed.
- [ ] Double-click a delete button rapidly — only one request fires.
- [ ] Managed credentials (e.g. "Use Credits for AI/ML API") do **not**
appear in the list.
- [ ] Test on a fresh account (no credentials) — `IntegrationsMarquee`
empty state renders.
- [ ] Throttle network to Slow 3G — skeleton (mirroring real shape)
visible, then list slides in.
- [ ] Block `/api/integrations/credentials` → `ErrorCard` with retry.
- [ ] `curl /api/integrations/providers` returns `[{ name, description,
supported_auth_types }, ...]` with every provider carrying both fields.
- [ ] `prefers-reduced-motion: reduce` set → all motion collapses to
opacity-only fades.
- [ ] On <375 px viewport — selection bar clears the mobile nav.
- [ ] `pnpm format && pnpm lint && pnpm types && pnpm test:unit` all
pass.
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**) — no flag changes in this PR.
## Background
[SECRT-2275](https://linear.app/autogpt/issue/SECRT-2275). User report:
when a copilot ("autopilot") turn is interrupted by a usage-limit,
tool-call-limit, or other run interruption, the user's recent work
disappears. User described it as: "my initial message was lost 3 times
and it disappeared, then when I would say 'continue' it would do a
random old task."
Investigation surfaced two distinct failure modes. This PR addresses
both.
- **Mode 1** — rate-limit (or other pre-stream rejection) at turn start:
the user's text only ever lives in the optimistic `useChat` bubble; the
backend rejects before the message is persisted, so the bubble is a lie
and a refresh / retry would lose the text.
- **Mode 2** — long-running turn interrupted mid-stream: the entire
turn's progress (assistant text, tool calls, reasoning) vanishes on
interruption — what users describe as "the turn is gone."
## Mode 1 — frontend: restore unsent text on 429
Backend can't recover this on its own: `check_rate_limit` raises before
`append_and_save_message`, so by the time the 429 surfaces there is no
DB row to roll forward. See
`autogpt_platform/backend/backend/api/features/chat/routes.py:916-922`
(rate-limit check) and `routes.py:945` (later append-and-save).
Frontend fix in
`autogpt_platform/frontend/src/app/(platform)/copilot/useCopilotStream.ts`:
when `useChat`'s `onError` reports a usage-limit error, we
- drop the optimistic user bubble (DB has no record of it, so leaving it
would be a phantom),
- push `lastSubmittedMsgRef.current` back into the composer via the
existing `setInitialPrompt` slot — the same slot URL pre-fills use, so
`useChatInput`'s `consumeInitialPrompt` effect picks it up
automatically,
- clear `lastSubmittedMsgRef` so the dedup guard doesn't block re-send.
In-memory only; surviving a hard refresh while rate-limited is a
separate follow-up (would need localStorage persistence with TTL).
Test:
`autogpt_platform/frontend/src/app/(platform)/copilot/__tests__/useCopilotStream.test.ts`
— verifies the composer is repopulated and the optimistic bubble is
dropped on a 429.
## Mode 2 — backend: preserve interrupted partial in DB
### Root cause
The SDK retry loop in `stream_chat_completion_sdk` always rolls back
`session.messages` to the pre-attempt watermark on any exception. That
rollback is correct **before a retry** so attempt #2 doesn't duplicate
attempt #1's content. But it runs **before the retry decision is made**,
so when retries are exhausted (or no retry is attempted) the partial
work is discarded too.
Three branches of the retry loop ended in a final-failure state with
side effects worse than just losing the partial:
- `_HandledStreamError` non-transient: rollback then add error marker —
partial gone
- `Exception` with `events_yielded > 0`: rollback then break — **no
error marker added either**, so on refresh the chat looks like nothing
happened even though the user just watched tokens stream live
- `Exception` non-context-non-transient + the while-`else:` exhaustion
path: same, no marker
- Outer except (cancellation, GeneratorExit cleanup): didn't restore
captured partial
### Fix
`autogpt_platform/backend/backend/copilot/sdk/service.py`:
1. **`_InterruptedAttempt` dataclass** — holds the rolled-back `partial:
list[ChatMessage]` + optional `handled_error: _HandledErrorInfo`. Three
methods drive the contract:
- `capture(session, transcript_builder, transcript_snap,
pre_attempt_msg_count)` — slices `session.messages`, restores the
transcript, strips trailing error markers to prevent duplicate markers
after restore.
- `clear()` — drops captured state on a successful retry so outer
cleanup paths don't replay pre-retry content.
- `finalize(session, state, display_msg, retryable=...) ->
list[StreamBaseResponse]` — re-attaches partial, synthesizes
`tool_result` rows for orphan `tool_use` blocks, appends the canonical
error marker, and returns the flushed events so the caller can yield
them to the client (no double-flush).
2. **`_flush_orphan_tool_uses_to_session(session, state) ->
list[StreamBaseResponse]`** — synthesizes `tool_result` rows for any
`tool_use` that never resolved before the error so the next turn's LLM
context stays API-valid (Anthropic rejects orphan tool_use). Uses the
public `adapter.flush_unresolved_tool_calls` and returns the events for
the caller to yield.
3. **`_classify_final_failure(...) -> _FinalFailure | None`** — picks
the display message + stream code + retryable flag for the final-failure
exit. One source of truth for the in-history error marker and the
client-facing `StreamError` SSE yield so they can't drift.
4. **Consolidated post-loop emit**: the former three scattered blocks
(partial restore + redundant re-flush + two separate `yield StreamError`
sites) collapsed to one block driven by `_classify_final_failure` →
`_FinalFailure` → `finalize()` → yield events + single `StreamError`.
5. **Adapter `flush_unresolved_tool_calls`** (renamed from
`_flush_unresolved_tool_calls` to drop the `# noqa: SLF001` suppressors
on cross-module callers).
Each retry-loop rollback site calls `interrupted.capture(...)`; the
success break calls `interrupted.clear()`; the post-loop failure block
calls `interrupted.finalize(...)` exactly once.
The baseline service already preserves partial work via its existing
finally block — no change needed there.
## Tests
Backend (`backend/copilot/sdk/interrupted_partial_test.py`, new, 18
tests):
- `TestInterruptedAttemptCapture` — slice semantics + stale-marker
stripping
- `TestInterruptedAttemptFinalize` — appends partial then marker,
handles empty partial, no-op on `None` session, flushes unresolved tools
between partial and marker, returns flushed events for caller to yield
- `TestFlushOrphanToolUses` — synthesizes `tool_result` rows, returns
events, no-op on None state / no unresolved
- `TestClassifyFinalFailure` — handled_error wins, attempts_exhausted,
transient_exhausted, stream_err fallback, returns None on success path
- `TestRetryRollbackContract` — end-to-end: capture + finalize yields
the exact content the user saw streaming live plus the error marker
1022 total SDK tests pass (baseline + new).
Frontend (`useCopilotStream.test.ts`): 1 new test — `restores the unsent
text and drops the optimistic user bubble on 429 usage-limit`.
## Out of scope
- Frontend rendering tweaks for the interrupted-turn marker (existing
error-marker rendering already works).
- Refresh-survival of the unsent text in Mode 1 (would require
localStorage persistence with TTL) — separate follow-up.
- Hard process-kill / OOM where Python `finally` doesn't run — needs a
different mechanism (pod-level checkpoint sweeper).
## Checklist
- [x] My code follows the style guidelines of this project
(black/isort/ruff via `poetry run format`)
- [x] I have performed a self-review of my own code
- [x] I have added relevant unit tests
- [x] I have run lint and tests locally (1022 SDK tests pass)
## Test plan
- [ ] Verify a long-running turn that hits transient-retry exhaustion
preserves partial assistant text + tool results in chat history after
refresh
- [ ] Verify the next user message after an interrupted turn carries
enough context that the model can continue the prior task instead of
inventing a new one
- [ ] Verify a successful retry (attempt #1 fails, attempt #2 succeeds)
shows ONLY attempt #2's content (no leaked partial from #1)
- [ ] Verify hitting daily usage limit at turn start re-populates the
composer with the unsent text and removes the optimistic user bubble
---------
Co-authored-by: Claude <noreply@anthropic.com>
## Problem
`create_or_add_to_user_notification_batch` does:
1. `find_unique(..., include={"Notifications": True})` — loads ALL
notifications for the batch (thousands for heavy AGENT_RUN users),
causing Postgres `statement_timeout` in dev.
2. Find-then-update is non-atomic — concurrent invocations either hit
`@@unique([userId, type])` violations or drop notifications.
Real Sentry: `canceling statement due to statement timeout` on this
exact query, traced to `database-manager` pod.
## Fix
Use Prisma `upsert` (atomic) and skip the eager include. Only load
notifications if the caller actually needs them (audited — the sole
caller `NotificationManager._should_batch` ignores the returned DTO and
separately fetches the oldest message via
`get_user_notification_oldest_message_in_batch`).
## Tests
- Single + existing-batch upsert paths
- Concurrent race regression
## Out of scope
Unrelated to PR #12900 (Redis Cluster migration). Separate change.
## What
Replaces 4 string-valued LaunchDarkly flags with a single JSON-valued
flag for copilot model routing:
- ~~`copilot-fast-standard-model`~~
- ~~`copilot-fast-advanced-model`~~
- ~~`copilot-thinking-standard-model`~~
- ~~`copilot-thinking-advanced-model`~~
**New:** `copilot-model-routing` (JSON), keyed `{mode: {tier: model}}`:
```json
{
"fast": { "standard": "anthropic/claude-sonnet-4-6", "advanced": "anthropic/claude-opus-4-6" },
"thinking": { "standard": "moonshotai/kimi-k2.6", "advanced": "anthropic/claude-opus-4-6" }
}
```
## Why
Same pattern as the sibling consolidation in #12915 (pricing /
cost-limits flags) and the merged #12910 (tier-multipliers):
- One flag per config domain — less LD UI clutter, easier audit trail.
- Atomic updates — rotating fast.standard + thinking.standard is a
single save.
- Fewer LD entities to name, version, target, explain.
- Mirrors the now-uniform copilot-* JSON-flag shape.
## How
- `backend/util/feature_flag.py`: drop the four `COPILOT_*_MODEL` enum
values, add `COPILOT_MODEL_ROUTING`.
- `backend/copilot/model_router.py`: rewrite `resolve_model` to fetch
the JSON flag once per call and walk `payload[mode][tier]`. Missing
mode, missing tier-within-mode, non-string cell value, non-dict payload,
or LD failure all fall back to the corresponding `ChatConfig` default
(same user-visible semantics as before). `_FLAG_BY_CELL` removed
entirely; `_config_default` / `ModelMode` / `ModelTier` unchanged.
- Per-user LD targeting preserved — cohorts can still receive different
routing.
- No caching added (preserves existing uncached behaviour).
- Docstring references in `copilot/config.py` + `copilot/sdk/service.py`
updated to point at the new nested key path; one docstring in
`service_test.py` likewise.
## Operator action required BEFORE merging
This PR removes 4 LD flags and introduces 1 replacement.
1. In LaunchDarkly, create `copilot-model-routing` (type: **JSON**,
server-side only). Default variation = union of the current four string
flags, shaped as:
```json
{
"fast": { "standard": "<current copilot-fast-standard-model>",
"advanced": "<current copilot-fast-advanced-model>" },
"thinking": { "standard": "<current copilot-thinking-standard-model>",
"advanced": "<current copilot-thinking-advanced-model>" }
}
```
Omit any cell that's currently unset (its `ChatConfig` default will be
used).
2. Merge this PR.
3. After deploy + smoke, delete the four legacy flags:
- `copilot-fast-standard-model`
- `copilot-fast-advanced-model`
- `copilot-thinking-standard-model`
- `copilot-thinking-advanced-model`
## Testing
- `backend/copilot/model_router_test.py` rewritten — 27 tests pass:
- LD unset / `None` payload → fallback for every cell.
- Full JSON → each cell maps to its value (parametrized).
- Partial JSON (missing mode, missing tier-within-mode, mode value not a
dict).
- Non-dict payloads (str / list / int / bool) → fallback + warning.
- Non-string cell values (number, list, bool, dict) → fallback +
'non-string' warning.
- Empty-string cell → fallback + 'empty string' warning (not
'non-string').
- LD raises → fallback + warning with `exc_info`.
- `user_id=None` → skip LD entirely.
- Single-LD-call regression guard against re-introducing per-cell flag
fan-out.
- `backend/copilot/sdk/service_test.py`: 61 tests still pass (it mocks
`_resolve_thinking_model_for_user`, so the inner flag change is
transparent).
- `black --check` / `ruff check` / `isort --check` all clean.
## Sibling
- #12915 — same consolidation pattern for stripe-price / cost-limits
flags.
## Checklist
- [x] I have read the project's contributing guide.
- [x] I have clearly described what this PR changes and why.
- [x] My code follows the style guidelines of this project.
- [x] I have added tests that prove my fix is effective or that my
feature works.
- [ ] New and existing unit tests pass locally with my changes (CI will
confirm).
## Why
PR #12909's pricing refresh was sourced from aggregators (pricepertoken,
blog mirrors) instead of provider pricing pages. Follow-up audit against
**official provider docs** caught **22 stale entries** — 9 LLM token
rates + 12 non-LLM block rates + 1 block that needed a code refactor to
bill dynamically. Also flagged by Sentry: Mistral models were sitting on
the wrong provider's rate table.
Cross-verified JS-rendered pages (docs.x.ai, DeepSeek, Kimi) via
agent-browser.
## Corrections applied
### LLM TOKEN_COST (9 entries)
| Model | Old | New | Reason |
|---|---|---|---|
| `GPT5` | 94/750 | **188/1500** | Was OpenAI Batch API rate; Standard
is $1.25/$10 |
| `DEEPSEEK_CHAT` | 42/63 | **21/42** | Unified to deepseek-v4-flash at
$0.14/$0.28 (Sept 2025) |
| `DEEPSEEK_R1_0528` | 82/329 | **21/42** | Same v4-flash routing |
| `MISTRAL_LARGE_3` | 300/900 | **300/900** (restored after brief 75/225
detour) | Routes via OpenRouter ($2/$6), not Mistral direct |
| `MISTRAL_NEMO` | 3/6 → 23/23 | **5/5** | Routes via OpenRouter
($0.035/$0.035); Mistral-direct $0.15 doesn't apply |
| `KIMI_K2_0905` | 82/330 | **90/375** | Matches K2 family $0.60/$2.50 |
| `KIMI_K2_5` | 90/450 | **66/300** | OpenRouter pass-through $0.44/$2 |
| `KIMI_K2_6` | 143/600 | **112/698** | OpenRouter pass-through
$0.7448/$4.655 |
| `META_LLAMA_4_MAVERICK` | 30/90 | **75/116** | Groq $0.50/$0.77
(deprecated 2026-02-20) |
### Non-LLM BLOCK_COSTS — rate corrections (11 entries)
Under-billing fixes:
- `AIVideoGeneratorBlock` (FAL) SECOND 3 → **15 cr/s**
- `CreateTalkingAvatarVideoBlock` (D-ID) RUN 15 → **100 cr**
- Nano Banana Pro/2 across 3 blocks: RUN 14 → **21 cr**
- `UnrealTextToSpeechBlock` RUN 5 → **COST_USD 150 cr/$** (block now
emits `chars × $0.000016`)
Over-billing fixes:
- `IdeogramModelBlock` default 16 → **12**, V_3 18 → **14**
- `AIImageEditorBlock` FLUX_KONTEXT_MAX 20 → **12**
- `ValidateEmailsBlock` 250 → **150 cr/$**
- `SearchTheWebBlock` 100 → **150 cr/$**
- `GetLinkedinProfilePictureBlock` 3 → **1 cr**
### Non-LLM BLOCK_COSTS — block refactored for dynamic billing (1 entry)
- **`ReplicateModelBlock`** (the generic "run any Replicate model"
wrapper) migrated from flat RUN 10 cr → **COST_USD 150 cr/$**. Block now
uses `client.predictions.async_create + async_wait` instead of
`async_run(wait=False)` so it can read `prediction.metrics.predict_time`
and bill `predict_time × $0.0014/s` (Nvidia L40S mid-tier, where most
popular public models run).
Additionally (addressing CodeRabbit's critical review on this refactor):
`async_wait()` returns normally regardless of terminal status — it
doesn't raise on `failed`/`canceled` like the old `async_run` did. The
block now explicitly checks `prediction.status` after `async_wait()` and
raises `RuntimeError` on `failed` (with `prediction.error` as context)
or `canceled` **before** `merge_stats`, so failed runs are never billed
for partial compute time.
**Why this matters:** flat 10 cr was 10–500× under-billing long
video/LLM runs (users could wire in a $50/hr A100 Llama inference and
pay us $0.10). It was also 20× over-billing trivial SDXL runs. Now
scales with real compute time AND no longer bills failed predictions.
### Documentation-only
- **Grok legacy models** (grok-3, grok-4-0709, grok-4-fast,
grok-code-fast-1): dropped from docs.x.ai's public pricing page but
still callable via the API. Added inline comment noting this; rates kept
at their verified launch pricing.
- **Mistral routing**: added comment explaining why TOKEN_COST for
MISTRAL_* is the OpenRouter safety floor (not Mistral-direct) since
`ModelMetadata.provider = "open_router"` for all Mistral entries.
## How
- For each entry, opened the **official provider pricing page** directly
and computed `our_cr = round(1.5 × provider_usd × 100)`.
- For JS-rendered pages (docs.x.ai, api-docs.deepseek.com), used
agent-browser headless to render + extract rates from the DOM.
- Migrated 2 blocks (`UnrealTextToSpeechBlock`, `ReplicateModelBlock`)
from flat RUN to COST_USD — the Replicate migration touched the block's
SDK interaction.
- Updated 2 FAL-video unit tests that asserted the old `3 cr/s` rate.
- Updated 3 stale test assertions: 2 for Unreal TTS (still on
`characters` cost_type) + 1 for ZeroBounce (old 250 cr).
## Known remaining risk (explicitly out of scope)
- **`ReplicateFluxAdvancedModelBlock`** not migrated — bounded to Flux
models ($0.04–$0.08), flat 10 cr stays within 1.25–2.5× margin. Separate
PR if desired.
- **AgentMail** on free tier (1 RUN). When paid pricing publishes,
revisit.
- **Live Replicate API verification**: mitigated via 9 unit tests
covering the refactored path (`async_create` version-vs-model branching,
metrics-based billing emission, failed/canceled raises,
zero/missing-metrics no-emission, `async_wait` ordering), and SDK
signature confirmed via `inspect.signature` — but no real API call
executed. A smoke test on a cheap model before merge is still
recommended.
## Test plan
- [x] `poetry run pytest backend/data/block_cost_config_test.py
backend/executor/block_usage_cost_test.py
backend/blocks/claude_code_cost_test.py
backend/blocks/cost_leak_fixes_test.py
backend/blocks/block_cost_tracking_test.py
backend/copilot/tools/helpers_test.py
backend/blocks/replicate/replicate_block_cost_test.py -q` — all passing
(80+ tests).
- [x] Sources: openai.com/api/pricing, claude.com/pricing,
api-docs.deepseek.com, mistral.ai/pricing,
platform.kimi.ai/docs/pricing, docs.x.ai, groq.com/pricing,
replicate.com, fal.ai, d-id.com, ideogram.ai, zerobounce.net, jina.ai,
unrealspeech.com, enrichlayer.com.
- [ ] Live Replicate API call to verify `predictions.async_create +
async_wait + metrics.predict_time` path.
## What
Consolidates two groups of LaunchDarkly flags into single JSON-valued
flags, matching the pattern established by `copilot-tier-multipliers`
(merged in #12910):
**Stripe prices** — 4 string flags → 1 JSON flag:
- ~~`stripe-price-id-basic`~~ / ~~`-pro`~~ / ~~`-max`~~ /
~~`-business`~~
- **New:** `copilot-tier-stripe-prices` (JSON)
```json
{ "PRO": "price_xxx", "MAX": "price_yyy" }
```
**Cost limits** — 2 number flags → 1 JSON flag:
- ~~`copilot-daily-cost-limit-microdollars`~~ /
~~`copilot-weekly-cost-limit-microdollars`~~
- **New:** `copilot-cost-limits` (JSON)
```json
{ "daily": 625000, "weekly": 3125000 }
```
## Why
- One flag to manage per config domain (LD UI less cluttered, easier
audit trail).
- Atomic updates — e.g., rotating Pro + Max prices happens in a single
save.
- Fewer LD entities to name, version, target, and explain.
- Mirrors the just-merged `copilot-tier-multipliers` shape so the whole
pricing/limits config is uniform.
## How
- `get_subscription_price_id(tier)` now parses
`copilot-tier-stripe-prices` and looks up `tier.value` — returns `None`
when the flag is unset, non-dict, tier key missing, or value isn't a
non-empty string.
- `get_global_rate_limits` uses a new sibling
`_fetch_cost_limits_flag()` helper (60s cache, `cache_none=False`) that
extracts `daily` / `weekly` int keys independently and falls back to the
existing `ChatConfig` defaults when any key is missing / non-int /
negative. A broken `daily` doesn't wipe out `weekly` (or vice versa).
- Tests rewritten to mock the new JSON shapes + cover partial / invalid
/ missing-key fallbacks.
## ⚠️ Operator action required BEFORE merging
This PR **removes 6 LD flags** and introduces 2 replacements. To avoid a
pricing/rate-limit outage, do this in LaunchDarkly first:
1. Create `copilot-tier-stripe-prices` (type: **JSON**). Default
variation = union of the current `stripe-price-id-*` values:
```json
{ "PRO": "<current stripe-price-id-pro>", "MAX": "<current
stripe-price-id-max>" }
```
Omit BASIC / BUSINESS if those flags are unset today.
2. Create `copilot-cost-limits` (type: **JSON**). Default variation =
the current two flags' values:
```json
{ "daily": <current daily microdollars>, "weekly": <current weekly
microdollars> }
```
3. Merge this PR.
4. After deploy + smoke test, delete the six legacy flags:
- `stripe-price-id-{basic,pro,max,business}`
- `copilot-daily-cost-limit-microdollars`
- `copilot-weekly-cost-limit-microdollars`
## Testing
- Backend unit tests: `pytest backend/copilot/rate_limit_test.py
backend/data/credit_subscription_test.py
backend/api/features/subscription_routes_test.py` — rewritten to
exercise the JSON flag shapes + fallback paths; passes locally.
- `black --check` / `ruff check` / `isort --check` — all clean.
## Checklist
- [x] I have read the project's contributing guide.
- [x] I have clearly described what this PR changes and why.
- [x] My code follows the style guidelines of this project.
- [x] I have added tests that prove my fix is effective or that my
feature works.
- [ ] New and existing unit tests pass locally with my changes (CI will
confirm).
- Rename _format_bytes → format_bytes (public API, used cross-module)
- Add unit boundary rollup (1024 KB → 1.0 MB) matching frontend formatBytes
- Show WorkspaceStorageSection even when token limits are null
- Move import to top level in routes.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensures every SubscriptionTier has an entry in _DEFAULT_TIER_MULTIPLIERS.
Matches the existing storage limit guard — both fail immediately if
someone adds a new tier without updating the mappings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move _format_bytes import to top of routes.py (was mid-function)
- Add test_every_subscription_tier_has_storage_limit that iterates all
SubscriptionTier enum values and asserts each has a TIER_WORKSPACE_STORAGE_MB
entry — fails immediately if someone adds a new tier without a storage limit
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Resolve conflicts in test files and SubscriptionTierSection
- Update TIER_WORKSPACE_STORAGE_MB for renamed tiers: FREE→BASIC, add MAX
- Update tier descriptions in helpers.ts with storage limits
- Update rate_limit_test.py for new tier names
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- `.claude/skills/pr-test/SKILL.md` referenced
`/Users/majdyz/Code/AutoGPT/.ign.testing.{lock,log}` in 5 places, which
breaks the skill for anyone else who clones the repo.
- Replaced with `$REPO_ROOT`, which is already defined in Step 0 as `git
-C "$WORKTREE_PATH" worktree list | head -1 | awk '{print $1}'`. That
resolves to the main/primary worktree from any sibling worktree,
preserving the original "always pin the lock to the root checkout so all
siblings see the same file" semantics.
- No behavior change for the existing user; repo becomes portable for
everyone else.
## Test plan
- [x] `grep -n "/Users/majdyz" .claude/skills/pr-test/SKILL.md` returns
only the two intentional mentions in the "never paste absolute paths
into PR comments" warning.
- [x] `$REPO_ROOT` is defined in Step 0 before any Step 3.0 usage.
## Summary
- **Backend (`copilot/rate_limit`)** — ``TIER_MULTIPLIERS`` is now
float-typed and resolvable through a new LaunchDarkly flag
``copilot-tier-multipliers``. The integer defaults live on as
``_DEFAULT_TIER_MULTIPLIERS`` and are merged with whatever LD returns
(missing / invalid keys inherit defaults; LD failures fall back to
defaults without raising). ``get_global_rate_limits`` now honours the
flag per-user and casts ``int(base * multiplier)`` so downstream
microdollar math stays integer even when LD hands back a fractional
multiplier (e.g. 8.5×). Cached for 60 s via ``@cached(ttl_seconds=60,
maxsize=8, cache_none=False)`` to match the pattern in
``get_subscription_price_id``.
- **Backend (`api/features/v1`)** — ``SubscriptionStatusResponse`` gains
``tier_multipliers: dict[str, float]``, populated for the same set of
tiers that make it into ``tier_costs`` so hidden tiers never get a
rendered badge.
- **Frontend (`SubscriptionTierSection`)** — drops the hard-coded ``"5x"
/ "20x"`` strings from ``TIERS`` and introduces
``formatRelativeMultiplier(tierKey, tierMultipliers)``: the lowest
*visible* multiplier becomes the baseline (no badge), every other tier
renders ``"N.Nx rate limits"`` relative to it. Fractional LD values like
8.5× round to one decimal.
The admin rate-limit page (``/admin/rate-limits``) keeps the static
``TIER_MULTIPLIERS`` defaults — it's admin-facing, infrequently viewed,
and fine to lag the LD value until next deploy (noted in-code).
Related upstream: this PR stacks logically after #12903 (which added the
``MAX`` tier + LD-configurable prices) but does **not** require it —
each PR can merge in either order. No schema changes, no migration.
## Test plan
- [x] ``poetry run black backend/... --check`` + ``poetry run ruff check
backend/...`` pass
- [x] ``pnpm format`` pass (modified files unchanged)
- [x] New backend tests: ``TestGetTierMultipliers`` (defaults, LD
override, invalid JSON, unknown tier / non-positive values, LD failure)
— **5 / 5 pass**
- [x] New backend test:
``TestGetGlobalRateLimitsWithTiers::test_ld_override_applies_fractional_multiplier``
— **pass**
- [x] ``backend/copilot/rate_limit_test.py`` — non-DB subset **72 / 72
pass**; ``TestGetUserTier`` / ``TestSetUserTier`` require the full
test-server fixture (Redis + Prisma) and are not run in this worktree —
same behaviour on clean ``dev``
- [x] ``backend/api/features/subscription_routes_test.py`` — **40 / 40
pass** (includes new
``test_get_subscription_status_tier_multipliers_ld_override``)
- [x] Frontend vitest targeted suite — **51 / 51 pass**
- ``helpers.test.ts`` — new ``formatRelativeMultiplier`` cases
(lowest-tier null, integer ratio, fractional ratio, hidden-tier null,
fractional LD)
- ``SubscriptionTierSection.test.tsx`` — three new cases for relative
badges, rebasing when the lowest tier is hidden, fractional LD overrides
## Why
`ClaudeCodeBlock` was a flat `RUN, 100 cr/run` entry when real cost is
**$0.02–$1.50/run**. Plugging that leak surfaced the question "are other
blocks doing the same?" — an audit found **7 more cost-leak surfaces**.
This PR closes all of them atomically so the cost pipeline is uniform
post-#12894.
## What
### 1. ClaudeCodeBlock → COST_USD 150 cr/$ (the headline)
Claude Code CLI's `--output-format json` already returns
`total_cost_usd` on every call, rolling up Anthropic LLM + internal
tool-call spend. Block now emits it via `merge_stats`:
```python
total_cost_usd = output_data.get("total_cost_usd")
if total_cost_usd is not None:
self.merge_stats(NodeExecutionStats(
provider_cost=float(total_cost_usd),
provider_cost_type="cost_usd",
))
```
Registered as `COST_USD, 150 cr/$` — matches the 1.5× margin baked into
every `TOKEN_COST` entry.
### 2. Exa websets — ~40 blocks instrumented
Registered as `COST_USD 100 cr/$` but **never emitted `provider_cost`**
→ ran wallet-free. Added `extract_exa_cost_usd` + `merge_exa_cost`
helpers in `exa/helpers.py` and threaded `merge_exa_cost(self,
response)` through every Exa SDK call across 14 files (59 call sites).
Future-proof: lights up as soon as `exa_py` surfaces `cost_dollars` on
webset response types.
### 3. AIConditionBlock — registered under LLM_COST
Full LLM block with token-count instrumentation already in place, but
**no `BLOCK_COSTS` entry at all** → wallet-free. One-line fix: added to
the LLM_COST group next to AIConversationBlock.
### 4. Pinecone × 3 — added BLOCK_COSTS
- `PineconeInitBlock` + `PineconeQueryBlock`: 1 cr/run RUN (platform
overhead; user pays Pinecone directly).
- `PineconeInsertBlock`: ITEMS scaling with `len(vectors)` emitted via
`merge_stats`.
### 5. Perplexity Sonar (all 3 tiers) → COST_USD 150 cr/$
Block already extracted OpenRouter's `x-total-cost` header into
`execution_stats.provider_cost`; just tagged it `cost_usd` and flipped
the registry. **Deep Research was under-billing up to 30×** ($0.20–$2.00
real vs flat 10 cr).
### 6. CodeGenerationBlock (Codex / GPT-5.1-Codex) → COST_USD 150 cr/$
Block computes USD from `response.usage.input_tokens / output_tokens`
using GPT-5.1-Codex rates ($1.25/M in + $10/M out) and emits `cost_usd`.
Was flat 5 cr for arbitrary-length generations.
### 7. VideoNarrationBlock (ElevenLabs) → COST_USD 150 cr/$
Block computes USD from `len(script) × $0.000167` (Starter tier per-char
price) and emits `cost_usd`. **Was under-billing ~25–30× on long
scripts** (5K-char narration: flat 5 cr vs ~$0.83 real = 125 cr).
### 8. Meeting BaaS FetchMeetingData → COST_USD 150 cr/$
Join block keeps its flat 30 cr commit. FetchMeetingData now extracts
`duration_seconds` from the response metadata, computes USD via
`duration × $0.000192/sec`, and emits `cost_usd`. Long meetings (hours)
no longer fit inside the 30 cr deposit.
## Why 150 cr/$
Matches the **1.5× margin already baked into `TOKEN_COST` for every
direct LLM block**:
| Model | Real | Our rate (per 1M) | Markup |
|---|---|---|---|
| Claude Sonnet 4 | $3/$15 | 450/2250 cr | 1.5× |
| GPT-5 | $2.50/$10 | 375/1500 cr | 1.5× |
| Gemini 2.5 Pro | $1.25/$5 | 187/750 cr | 1.5× |
Applying the same ratio to `total_cost_usd` ≡ `cost_amount=150` (1 cr ≈
$0.01 → 100 cr/$ pass-through × 1.5× = 150).
## Test plan
- [x] **Unit**: new `claude_code_cost_test.py` (9 tests) + existing
`exa/cost_tracking_test.py` (16 tests) + full cost pipeline. **119/119
pass**.
- [x] `poetry run ruff format` + `poetry run ruff check backend/` —
clean.
- [ ] Live E2E: real ClaudeCode / Perplexity Deep Research / Codex run
with balance delta verification (post-merge).
## Follow-ups (not in this PR)
- `exa_py` SDK update to surface `cost_dollars` on Webset response types
(upstream) — unlocks real billing for the 40 webset blocks.
- Replicate suite: migrate per-model RUN entries to COST_USD via
`prediction.metrics["predict_time"] × per-model $/sec`.
### Why / What / How
**Why:** The Settings v2 API keys page was a UI-only stub with 100 mock
rows, a noop "Create Key" button, noop delete buttons, and no
empty/loading states. Users couldn't actually manage their keys from the
new Settings UI. Ships SECRT-2273.
**What:** Replaces the mock with a working page: paginated list
(15/page) with infinite scroll, create flow with one-time plaintext
reveal, single + batch revoke with confirmation dialogs, per-key details
dialog, skeleton loader, animated empty state, toast + mutation-loading
feedback, and responsive header.
https://github.com/user-attachments/assets/bc576de3-0369-4e73-b945-c66c142ebfe5
<img width="397" height="860" alt="Screenshot 2026-04-24 at 11 26 53 AM"
src="https://github.com/user-attachments/assets/ed8681ea-7d16-40cc-96f7-72d798857229"
/>
**How:**
- **Backend** adds a new `GET /api/api-keys/paginated` route returning
`{ items, total_count, page, page_size, has_more }`. The legacy `GET
/api/api-keys` is untouched so the existing profile page keeps working.
The list fn runs `find_many` + `count` in parallel and filters to
`ACTIVE` status by default so revoked keys stay hidden.
- **Frontend** fetches via TanStack Query. Right now the hook consumes
the legacy endpoint with client-side slicing (15/page) so the page works
against staging today; once the paginated route ships we swap to the
generated `useGetV1ListUserApiKeysPaginatedInfinite` hook that's already
in the regenerated client.
- All new UI lives in `src/app/(platform)/settings/api-keys/components/`
— no legacy components reused. Shared primitives (Dialog, Form, Toast,
Skeleton, InfiniteScroll, BaseTooltip) come from the atoms/molecules
design system.
- Empty state uses a vertical marquee of ghost key-cards (framer-motion,
translateY 0→-50% on a duplicated stack, linear easing, symmetric mask
fade). Respects `prefers-reduced-motion`.
- Settings layout ScrollArea switched to `h-full` on mobile and
`md:h-[calc(100vh-60px)]` on desktop to remove a double scrollbar that
appeared when the mobile nav took space above the fixed-height scroll
region.
### Changes 🏗️
**Backend**
- `GET /api/api-keys/paginated` — new route, page + page_size query
params, `ListAPIKeysPaginatedResponse`.
- `list_user_api_keys_paginated` — new data fn, gathers find_many +
count, default ACTIVE-only filter.
- Existing `/api/api-keys` routes untouched.
**Frontend (settings/api-keys)**
- `page.tsx` + `components/APIKeyList/`, `APIKeyRow/`, `APIKeysHeader/`,
`APIKeySelectionBar/` — real-data wiring, drop mock array.
- `components/hooks/` — `useAPIKeysList`, `useCreateAPIKey`,
`useRevokeAPIKey`.
- `components/CreateAPIKeyDialog/` — zod-validated form + success view
with copy.
- `components/DeleteAPIKeyDialog/` — confirm with loading state; single
+ batch.
- `components/APIKeyInfoDialog/` — shows masked key, scopes,
description, created/last_used.
- `components/APIKeyListEmpty/` +
`APIKeyListEmpty/components/APIKeyMarquee.tsx` — animated empty state.
- `components/APIKeyListSkeleton/` — 6-row skeleton.
**Other**
- `settings/layout.tsx` — responsive ScrollArea height (fixes
double-scrollbar on mobile).
- `components/ui/scroll-area.tsx` — optional `showScrollToTop` FAB.
- `__tests__/placeholder-pages.test.tsx` — drop api-keys from
placeholder list.
- `AGENTS.md` — Phosphor `-Icon` suffix convention note.
- `api/openapi.json` — regenerated with new paginated endpoint.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Page loads → skeleton → list with real keys
- [ ] Empty state renders with the vertical marquee (and stays static
with `prefers-reduced-motion`)
- [ ] Create key dialog: name + description + permissions validates;
success view shows plaintext once + copy works; closing resets state
- [ ] Revoke single key via row trash icon → confirm dialog → toast on
success → row disappears
- [ ] Batch-revoke via selection bar → confirm dialog → all revoked
- [ ] Info icon next to each key opens the details dialog (scopes,
timestamps, masked key)
- [ ] Infinite scroll loads more rows when scrolling past page 1 (≥16
keys)
- [ ] Mobile (<640px): single scrollbar, Create Key button below title
at size=small
- [ ] Desktop (md+): same layout as before, scroll-to-top FAB appears
after scrolling
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
### Why / What / How
**Why:** The Settings area is getting a redesign (per Figma
[Settings-Page](https://www.figma.com/design/YGck0Hb0GEgFzwbX47kSNs/Settings-Page?node-id=1-2)).
Ticket SECRT-2272 covers just the shell so content/forms for each
section can land in follow-up PRs without blocking on the nav
restructure. v1 at `/profile/settings` must stay intact for end users
during the rollout.
**What:** Adds a new parallel Settings hub at `/settings` (dedicated
sidebar + 7 placeholder sub-routes) behind a new `SETTINGS_V2`
LaunchDarkly flag. Default `false` so nothing changes for users until
the flag flips. Backend is untouched.
https://github.com/user-attachments/assets/dd680eaf-3d41-4a9a-87f3-d06d536a2503
**How:**
- New `Flag.SETTINGS_V2 = "settings-v2"` added to `use-get-flag.ts` with
`defaultFlags[Flag.SETTINGS_V2] = false`. Gate the whole route group at
`layout.tsx` via existing `FeatureFlagPage` HOC which redirects to
`/profile/settings` when the flag is off.
- `SettingsSidebar` replicates the Figma spec (237px, 7 items at 217×38,
`gap-[7px]`, rounded-[8px], active `bg-[#EFEFF0]` + text `#1F1F20` Geist
Medium, inactive text `#505057` Geist Regular, icon 16px Phosphor
light/regular at `#1F1F20`). Colors + typography use the canonical
tokens exported by Figma (zinc-50 `#F9F9FA`, zinc-200 `#DADADC` for the
right-border, etc.).
- `SettingsNavItem` is extracted as its own component and owns its
per-item entrance variant.
- Per-link loading indicator uses Next.js 15's `useLinkStatus()` hook —
spinner appears on the right of the clicked item and clears
automatically once the target page renders.
- `SettingsMobileNav` (< md breakpoint): sidebar hides; a pill trigger
with the current section's icon + label opens a Radix Popover listing
all 7 sections.
- Entrance animations via framer-motion, tuned to Emil Kowalski's
guidelines — `cubic-bezier(0, 0, 0.2, 1)` ease-out, all durations ≤
280ms, only `transform` and `opacity`, `useReducedMotion` disables
movement but keeps fade. Sidebar items stagger in (40ms offset). Main
content re-animates on every route change via `key={pathname}`.
- All 7 placeholder pages render the section title (Poppins Medium 22/28
via `variant="h4"`, `#1F1F20`) + "Coming soon" copy; they are
intentionally client components to avoid hook-order issues with the
client-side flag gate in the layout.
### Changes 🏗️
- `src/services/feature-flags/use-get-flag.ts`: register
`Flag.SETTINGS_V2` + default `false`
- `src/app/(platform)/settings/layout.tsx`: flag gate + responsive shell
+ route-keyed content animation
- `src/app/(platform)/settings/page.tsx`: client-side redirect to
`/settings/profile`
- `src/app/(platform)/settings/components/SettingsSidebar/`:
- `SettingsSidebar.tsx` — aside with staggered entrance
- `SettingsNavItem.tsx` — per-item Link + icon + label + loader
(extracted)
- `useSettingsSidebar.ts` — hook mapping nav items with `isActive` from
`usePathname`
- `helpers.ts` — typed nav item config (label / href / Phosphor icon) ×
7
-
`src/app/(platform)/settings/components/SettingsMobileNav/SettingsMobileNav.tsx`:
mobile Popover trigger
- 7 placeholder pages: `profile`, `creator-dashboard`, `billing`,
`integrations`, `preferences`, `api-keys`, `oauth-apps`
**Follow-up PRs will migrate real content into each tab.** LaunchDarkly
flag key `settings-v2` must be created in the LD dashboard before
enabling for users.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `NEXT_PUBLIC_FORCE_FLAG_SETTINGS_V2=true` → `/settings` redirects
to `/settings/profile`, sidebar renders 7 items with "Profile" active
- [x] Click each nav item → URL changes, active item highlights, content
pane re-animates, per-link spinner shows during navigation
- [x] Viewport < 768px → sidebar hides, mobile pill trigger opens
Popover with all 7 items; selecting one navigates and closes
- [x] Without the flag env override, `/settings` redirects to
`/profile/settings` (v1 unchanged)
- [x] `pnpm types` clean; prettier clean on touched files
- [x] Manual a11y pass with `prefers-reduced-motion` enabled — fade
remains, translations disabled
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
*(no new env vars required; existing `NEXT_PUBLIC_FORCE_FLAG_*` pattern
covers local override)*
- [x] `docker-compose.yml` is updated or already compatible with my
changes *(no docker changes)*
- [x] I have included a list of my configuration changes in the PR
description *(LaunchDarkly dashboard must have `settings-v2` flag
created before enabling; no other config changes)*
## Why
Review comments on #12883 (thanks @Pwuts) surfaced a few spots where the
managed-credential plumbing's names and docstrings didn't match what the
code actually does:
- `_read_or_create_profile_key` suggests "read from any source or create
new", but only migrates the legacy
`managed_credentials.ayrshare_profile_key` side-channel — it doesn't
read an existing managed credential. (That check lives in the outer
`_provision_under_lock`.)
- Docstrings refer to "the startup sweep" in several places — there's no
startup hook; the sweep runs on `/credentials` fetches.
- `is_available` / `auto_provision` relationship wasn't explicit;
readers couldn't tell whether `is_available` was a config check or a
liveness check, or which of the two gates the sweep checks first.
## What
Naming + docstring cleanup. **Zero behavior changes.**
- Rename `_read_or_create_profile_key` →
`_migrate_legacy_or_create_profile_key` with docstring explaining why it
doesn't re-check the managed cred.
- Replace "startup sweep" → "credentials sweep" everywhere.
- `ManagedCredentialProvider` class docstring now names the two gates:
1. `auto_provision` — does this provider participate in the sweep at
all?
2. `is_available` — are the required env vars / secrets set?
- `is_available` docstring now spells out: what it checks (env vars),
what it does NOT check (upstream health), and that it's only consulted
when `auto_provision=True`.
- `ensure_managed_credentials` docstring defines "credentials sweep",
when it fires, how the per-user in-memory cache works.
- Module-level docstring drops the stale "non-blocking background task"
wording (#12883 made the sweep bounded-await).
## How
4 files, all backend:
- `backend/integrations/managed_credentials.py`
- `backend/integrations/managed_providers/ayrshare.py`
- `backend/integrations/managed_providers/ayrshare_test.py`
- `backend/api/features/integrations/router.py`
Tests: 13/13 Ayrshare tests pass against the rename.
## Checklist
- [x] Follows style guide
- [x] Existing tests still pass (no functional change)
- [x] No new tests needed — pure rename + docstring change
### Why / What / How
**Why:** Resolves#12875. CoPilot's agent-builder was hardcoding Google
Drive file IDs into consuming blocks' `input_default` instead of wiring
an `AgentGoogleDriveFileInputBlock`. A beta user hit this across **13
saved versions** of one agent. Root causes:
1. `validate_io_blocks` only accepted the literal base `AgentInputBlock`
/ `AgentOutputBlock` IDs, so even when CoPilot used a specialized
subclass like `AgentGoogleDriveFileInputBlock` as the only input, the
validator forced it to keep a throwaway base alongside — entrenching the
anti-pattern.
2. Running a Drive consumer directly via CoPilot's `run_block` silently
failed because the auto-credentials flow (picker attaches
`_credentials_id`) existed only in the graph executor, never in
CoPilot's direct-execution path.
3. Drive picker guidance lived in `agent_generation_guide.md` instead of
on the blocks themselves, so it duplicated and drifted from the code.
4. Observed in a live session: when asked to read a private sheet,
CoPilot refused with "share publicly or use the builder" instead of
calling `run_block` and letting the picker render — the prompt rule was
buried and the fallback path (omitted required picker field) returned a
generic schema preview.
**What:** Four coordinated platform + CoPilot improvements. No
block-specific validator rules, no Drive-specific code in UI or prompt.
**How:**
#### 1. `validate_io_blocks` subclass support
Accepts any block with `uiType == "Input"` / `"Output"` (populated from
`Block.block_type` at registration). `AgentGoogleDriveFileInputBlock`,
`AgentDropdownInputBlock`, `AgentTableInputBlock`, etc. stand alone.
Base-ID fallback preserved for call sites that pass a minimal blocks
list.
#### 2. Inline picker via `run_block`
- Extracted `_acquire_auto_credentials` from
`backend/executor/manager.py` into shared
`backend/executor/auto_credentials.py` (exports
`acquire_auto_credentials` + `MissingAutoCredentialsError`).
- Wired it into `backend/copilot/tools/helpers.py::execute_block`. When
`_credentials_id` is present, the block executes with creds injected
(chained flows work). When missing/null, `execute_block` returns the
existing `SetupRequirementsResponse` — frontend's `FormRenderer` renders
the picker inline via the existing
`GoogleDrivePickerField`/`GoogleDrivePickerInput`. On pick, the LLM
re-invokes `run_block` with the populated input — same continuation
pattern as OAuth-missing-credentials. No new response types, no new
continuation tool, no new frontend component.
- `run_block` now short-circuits to `SetupRequirementsResponse` when
missing required fields include a picker-backed field, skipping the
schema-preview round trip the LLM would otherwise take.
- `get_inputs_from_schema` spreads the full property schema (`**schema`)
instead of whitelisting — any `format` / `json_schema_extra` / custom
widget config flows through to the generic custom-field dispatch on the
frontend. Future picker formats (date pickers, file pickers, etc.) work
without backend changes.
- Frontend `SetupRequirementsCard/helpers.ts` uses index-signature
passthrough for arbitrary schema keys — no widget-specific code in that
layer.
#### 3. `validate_only` parameter on `run_block`
`run_block(id, {})` is not always a safe probe — for blocks with zero
required inputs, it executes. New `validate_only: true` parameter
returns `BlockDetailsResponse` (schema + missing-input list) without
executing, rendering picker cards, or charging credits. Same response
shape as the existing schema preview — no new branch, just an extra
condition on the existing one. LLM uses this for pre-flight when it's
unsure whether a block has required inputs.
#### 4. Block-local picker guidance
Agent-generation picker guidance relocated from the guide onto the
blocks themselves — surfaced at `find_block` time, exactly when the LLM
decides to wire a picker-backed consumer:
- `GoogleDriveFileField` (shared factory for every Drive field on
Sheets/Docs/etc.) appends a standard hint to the caller's description
covering: feed from the specialized input block, never hardcode (even
one parsed from a URL), picker is the only credential source.
- `AgentGoogleDriveFileInputBlock`'s block description now covers when
it's required, the `allowed_views` mapping, wiring direction, and a
concrete link-shape example.
- `agent_generation_guide.md` loses the dedicated 71-line Drive section.
The IO-blocks section now tells the LLM specialized subclasses satisfy
the requirement and carry their own usage guidance in block/field
descriptions — read them when `find_block` surfaces a match.
- New "Picker-backed inputs via `run_block`" section in the CoPilot
prompt, written generically (picker fields detected via `format` /
`auto_credentials` schema hints, no provider names hardcoded) — covers:
don't ask the user for URLs/IDs, don't refuse private-resource asks,
chained picker objects pass through as-is.
- Sharpened `MissingAutoCredentialsError` message so when a bare ID
reaches execution, the error explicitly tells the LLM the picker renders
inline (not "ask the user for something").
### Changes 🏗️
- `backend/copilot/tools/agent_generator/validator.py` —
`_collect_io_block_ids` + subclass-aware `validate_io_blocks`.
- `backend/executor/auto_credentials.py` (new) — shared
`acquire_auto_credentials` + `MissingAutoCredentialsError`.
- `backend/executor/manager.py` — imports from the shared module, drops
the local copy.
- `backend/copilot/tools/helpers.py` — `execute_block` calls
`acquire_auto_credentials`, merges kwargs, releases locks in `finally`,
returns `SetupRequirementsResponse` on missing creds.
`get_inputs_from_schema` spreads the full property schema.
- `backend/copilot/tools/run_block.py` — picker-field short-circuit +
`validate_only` parameter.
- `backend/copilot/prompting.py` — "Picker-backed inputs via
`run_block`" + "Pre-flight with `validate_only`" sections.
- `backend/blocks/google/_drive.py` — `GoogleDriveFileField` appends the
agent-builder hint to every Drive consumer's description.
- `backend/blocks/io.py` — `AgentGoogleDriveFileInputBlock` description
expanded.
- `backend/copilot/sdk/agent_generation_guide.md` — Drive section
removed, IO-blocks subclass note expanded.
- `frontend/.../SetupRequirementsCard/helpers.ts` — index-signature
passthrough for arbitrary schema keys; schema fields propagate into the
generated RJSF schema.
- Tests: new `TestExecuteBlockAutoCredentials` (4 cases) +
`validate_only` + picker-short-circuit cases in `run_block_test.py`;
`manager_auto_credentials_test.py` moved to new import path; 6 new
frontend cases in `SetupRequirementsCard/__tests__/helpers.test.ts`
covering schema passthrough.
- Also: one-line hoist of `import secrets` in
`backend/integrations/managed_providers/ayrshare.py` — ruff E402
introduced by #12883 was blocking our lint post-merge.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Backend unit suites: validator_test (48), helpers_test (40),
run_block_test (19), manager_auto_credentials_test (15) — **all green**
- [x] Frontend `SetupRequirementsCard` helpers — **75/75 pass**
(including 6 new passthrough cases)
- [x] `poetry run format` (ruff + isort + black) clean on touched files
(pre-existing pyright errors in unrelated `graphiti_core` /
`StreamEvent` / etc. files not introduced by this PR)
- [x] Live CoPilot chat on dev-builder confirmed the setup card renders
`custom/google_drive_picker_field` for a Drive consumer block called via
`run_block`
- [x] Live agent-generation confirmed CoPilot creates a subclass-only
agent (`AgentGoogleDriveFileInputBlock` → `GoogleSheetsReadBlock` →
`AgentOutputBlock`) with no throwaway base `AgentInputBlock`
#### For configuration changes:
- [x] N/A — no config changes
---------
Co-authored-by: majdyz <zamil.majdy@agpt.co>
## What
Introduces a new `MAX` tier slot between `PRO` and `BUSINESS`
(self-service $320/mo at 20× capacity), routes every self-service tier's
Stripe price ID through LaunchDarkly, and hides tiers from the UI when
their price isn't configured. `BUSINESS` stays in the enum at 60× as a
reserved/future self-service slot (hidden by default until its LD price
flag is set). ENTERPRISE stays admin-managed.
## Tier shape after this PR
| Enum | UI label | Multiplier | LD price flag | Surfaced in UI by
default |
|---|---|---|---|---|
| `FREE` | Basic | 1× | `stripe-price-id-basic` | no (flag unset) |
| `PRO` | Pro | 5× | `stripe-price-id-pro` | yes (already live) |
| `MAX` **(new)** | Max | 20× | `stripe-price-id-max` | no (flag unset
until $320 price ready) |
| `BUSINESS` | Business | 60× | `stripe-price-id-business` | no
(reserved / future) |
| `ENTERPRISE` | — | 60× | — (admin-managed) | no (Contact-Us only) |
## Prisma
- Added `MAX` between `PRO` and `BUSINESS` in `SubscriptionTier`.
- Migration `add_subscription_tier_max/migration.sql` uses `ALTER TYPE
... ADD VALUE IF NOT EXISTS 'MAX' BEFORE 'BUSINESS'` (transactional
since PG 12). No data migration — no rows currently on BUSINESS via
self-service flows.
## Backend
- `get_subscription_price_id` flag map covers
`FREE`/`PRO`/`MAX`/`BUSINESS`. ENTERPRISE returns `None`.
- `GET /credits/subscription.tier_costs` only includes tiers whose LD
price ID is set. Current tier always present as a safety net.
- `POST /credits/subscription` routes by LD-resolved prices instead of
hard-coding `tier == FREE`:
- Target `FREE` + `stripe-price-id-basic` unset → legacy
cancel-at-period-end (unchanged behaviour).
- Target has LD price → modify in-place when user has an active sub,
else Checkout Session.
- Priced-FREE users with no sub fall through to Checkout (admin-granted
DB-flip shortcut gated on `current_tier != FREE`).
- `sync_subscription_from_stripe` + `get_pending_subscription_change`
cover FREE/PRO/MAX/BUSINESS in the price-to-tier map so every tier's
Stripe webhook reconciles cleanly.
- Pending-tier mapping collapsed into a single membership check.
- `TIER_MULTIPLIERS`: `FREE=1, PRO=5, MAX=20, BUSINESS=60,
ENTERPRISE=60`.
## Frontend
- UI labels: FREE→"Basic", MAX→"Max", BUSINESS→"Business" (PRO
unchanged). `TIER_ORDER` now `[FREE, PRO, MAX, BUSINESS, ENTERPRISE]`.
- `SubscriptionTierSection` filters by `tier_costs` — any tier without a
backend-provided price is hidden (current tier always visible).
- `formatCost` surfaces "Free" only when `FREE` is actually `$0`;
non-zero `stripe-price-id-basic` renders `$X.XX/mo`.
- Admin rate-limit display lists all five tiers with multiplier badges.
## LaunchDarkly flag actions (operator)
- **New:** `stripe-price-id-basic` → FREE tier. Set to `""` or a `$0`
Stripe price.
- **New:** `stripe-price-id-max` → MAX tier. Point at the `$320` Stripe
price when you launch the Max tier.
- **Unchanged:** `stripe-price-id-pro` (PRO), `stripe-price-id-business`
(BUSINESS — leave unset until you're ready for the 60× Business tier).
- Base rate limits stay on `copilot-daily-cost-limit-microdollars` /
`copilot-weekly-cost-limit-microdollars` (Basic's limit; everything else
= × tier multiplier).
## Out of scope
- Subscription-required onboarding screen / middleware gating (separate
PR).
- "Pricing available soon" vs Stripe-failure disambiguation in the UI
(follow-up).
## Testing
- Backend: 213 tests across `subscription_routes_test.py`,
`credit_subscription_test.py`, `rate_limit_test.py`,
`admin/rate_limit_admin_routes_test.py` — all passing.
- Frontend: 91 tests across `credits/` + `admin/rate-limits/` — all
passing.
- Fresh-backend manual E2E on the pre-MAX commit confirmed tier-hiding
works (`tier_costs` returns only the current tier when LD flags are
unset).
## Checklist
- [x] I have read the project's contributing guide.
- [x] I have clearly described what this PR changes and why.
- [x] My code follows the style guidelines of this project.
- [x] I have added tests that prove my fix is effective or that my
feature works.
- [ ] New and existing unit tests pass locally with my changes (CI will
confirm).
### Why / What / How
**Why.** Backend CI was failing at startup with `relation
"platform.AgentNode" does not exist`. Prisma's `migrate deploy` uses the
`schema.prisma` datasource, which doesn't declare a schema, so when
`DATABASE_URL` has no `?schema=platform` query param (as in CI / raw
Supabase), Prisma creates tables in `public` — but the lifespan
migration `backend.data.graph.migrate_llm_models` hardcoded
`platform."AgentNode"` in its raw SQL and crashed the boot.
**What.** Switched `migrate_llm_models` to use the
`execute_raw_with_schema` helper and the `{schema_prefix}` placeholder —
the same pattern already used by the sibling
`fix_llm_provider_credentials` migration in the same file. The helper in
`backend/data/db.py` reads the schema from `DATABASE_URL` at runtime and
substitutes `"platform".` or an empty prefix, so the query works in both
dev (schema=platform) and CI / raw Supabase (public).
**How.**
- Template change: `UPDATE platform."AgentNode"` → `UPDATE
{{schema_prefix}}"AgentNode"` (f-string double-brace escape so
`{schema_prefix}` survives to `.format()` inside
`execute_raw_with_schema`).
- Replace `db.execute_raw(...)` with `execute_raw_with_schema(...)`;
drop the now-unused `prisma as db` import.
- Regression test: mocks `execute_raw_with_schema` and asserts every
emitted query contains `{schema_prefix}` and no longer contains
`platform."AgentNode"`.
### Audit
Audited the other three lifespan migrations in
`backend/api/rest_api.py::lifespan_context`:
- `backend.data.user.migrate_and_encrypt_user_integrations` — uses
Prisma ORM, no raw SQL. OK.
- `backend.data.graph.fix_llm_provider_credentials` — already uses
`query_raw_with_schema` + `{schema_prefix}`. OK.
- `backend.integrations.webhooks.utils.migrate_legacy_triggered_graphs`
— uses Prisma ORM, no raw SQL. OK.
Also grepped the whole backend for `platform."` in Python files —
`migrate_llm_models` was the only offender; the other hits were
unrelated string content (docstrings, error messages, test data).
### Changes
- `autogpt_platform/backend/backend/data/graph.py`: `migrate_llm_models`
now uses `execute_raw_with_schema` with the `{schema_prefix}`
placeholder; unused `prisma as db` import dropped.
- `autogpt_platform/backend/backend/data/graph_test.py`: added
`test_migrate_llm_models_uses_schema_prefix_placeholder` regression
test.
### Checklist
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Ran `migrate_llm_models` under mocked `execute_raw_with_schema` —
all 7 emitted UPDATE queries contain `{schema_prefix}` and none hardcode
`platform."AgentNode"`.
- [x] Verified the f-string double-brace escape by evaluating the
template and running `.format(schema_prefix=...)` — substitution is
correct for both `"platform".` and empty-prefix (public-schema) cases.
- [x] `poetry run pyright backend/data/graph.py` clean (pre-existing
pyright error on `backend/api/features/v1.py:834` on `origin/dev` is
unrelated).
- [x] Grepped the whole backend for other hardcoded `platform."..."`
raw-SQL occurrences — none found.
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
(N/A — no config changes)
- [x] `docker-compose.yml` is updated or already compatible with my
changes (N/A — no config changes)
## Why
Beta user report: AutoPilot told them to sign up for Ayrshare themselves
— which AutoGPT actually manages — because AutoPilot inferred the
requirement from the block description string rather than any structured
schema. Root cause: Ayrshare was the only block family whose
"credential" lived in a bespoke
`UserIntegrations.managed_credentials.ayrshare_profile_key` side channel
and whose blocks declared **no** `credentials` field. `find_block` /
`resolve_block_credentials` had nothing to show the LLM, so the LLM
guessed.
(An initial commit added a runtime `gh` CLI bootstrap for a separate "gh
isn't installed in the sandbox" report — that work was empirically
verified unnecessary and reverted; see the commit history for the bench
results.)
## What
**Ayrshare now goes through the standard managed-credential flow:**
- New `AyrshareManagedProvider` alongside the existing
`AgentMailManagedProvider`. Provisions the per-user profile as
`APIKeyCredentials(provider="ayrshare", is_managed=True)` via the shared
`add_managed_credential` path. Reuses any legacy
`managed_credentials.ayrshare_profile_key` value on first provision so
existing users keep their linked social accounts.
- `AyrshareManagedProvider.is_available()` returns `False` so the
`ensure_managed_credentials` startup sweep **never** auto-provisions
Ayrshare (profile quota is a real per-user subscription cost). New
public `ensure_managed_credential(user_id, store, provider)` helper lets
the `/api/integrations/ayrshare/sso_url` route provision on demand,
reusing the same distributed Redis lock + upsert path as AgentMail.
- New `ProviderBuilder.with_managed_api_key()` method registers
`api_key` as a supported auth type without the env-var-backed default
credential that `with_api_key()` creates — so the org-level Ayrshare
admin key cannot leak to blocks as a "profile key".
- `BaseAyrshareInput` gains a shared `credentials` field; all 13 social
blocks inherit it. Each `run()` now takes `credentials:
APIKeyCredentials`; the inline `get_profile_key` guard + "please link a
social account" error is gone. Standard `resolve_block_credentials`
pre-run check owns the "not connected" path, returning a normal
`SetupRequirementsResponse`.
- **Migration-ordering safety:** `post_provision` hook on
`ManagedCredentialProvider` clears the legacy `ayrshare_profile_key`
field **only after** `add_managed_credential` has durably stored the
managed credential. If persistence fails, the legacy key stays intact so
a retry can reuse it — covered by `TestMigrationOrderingSafety`.
- New public `IntegrationCredentialsStore.get_user_integrations()` —
reads no longer have to reach past the `_get_user_integrations` privacy
fence or abuse `edit_user_integrations` as a pseudo-read.
- `/api/integrations/ayrshare/sso_url` collapses from a 60-line
provision-then-sign dance to: pre-flight `settings_available()`,
`ensure_managed_credential`, fetch the credential, sign a JWT.
- `IntegrationCredentialsStore.set_ayrshare_profile_key` removed — the
managed credential is now the only write path.
- Legacy `UserIntegrations.ManagedCredentials.ayrshare_profile_key`
field is retained so the managed provider can migrate existing users on
first provision; removing the field is a follow-up once rollout has
propagated.
## How
After this PR, `find_block` returns Ayrshare blocks with a structured
`credentials_provider: ['ayrshare']`. AutoPilot sees the credential
requirement the same way it sees GitHub's or AgentMail's, calls
`run_block`, and gets a plain `SetupRequirementsResponse` when the
managed credential has not been provisioned yet. No more
description-string speculation; the whole Ayrshare flow is the normal
flow.
The Builder's `AyrshareConnectButton` (`BlockType.AYRSHARE`) still works
— it hits the same endpoint, now a thin wrapper over the managed
provider — so users still get the "Connect Social Accounts" popup for
OAuth'ing individual social networks.
## Test plan
- [x] `poetry run pytest backend/blocks/test/test_block.py -k "ayrshare
or PostTo"` — 26/26 pass.
- [x] `poetry run pytest
backend/integrations/managed_providers/ayrshare_test.py` — 10/10 pass.
- [x] `poetry run pytest
backend/api/features/integrations/router_test.py` — 21/21 pass.
- [x] `poetry run pyright` on all touched backend files — 0 errors.
- [x] Runtime sanity: `find_block` on `PostToXBlock` lists
`credentials_provider: ['ayrshare']` in the JSON schema.
- [ ] Manual QA in preview: connect social account via Builder's
"Connect Social Accounts" button → post to X via CoPilot end-to-end.
- [ ] Verify existing users with
`managed_credentials.ayrshare_profile_key` continue to work without
re-linking.
## Why
PR #12893 shipped flat-floor credit charges so no provider sits
wallet-free. This PR is the next step: make dynamic pricing actually
dynamic. Blocks that scale with walltime, item count, provider-reported
USD, or token volume now get billed based on captured execution stats
instead of a fixed floor.
Before this PR `BlockCostType` only had `RUN` / `BYTE` / `SECOND`, and
`SECOND` was dead code — no caller ever passed `run_time > 0`, so every
per-second entry evaluated to 0. This PR wires the stats plumbing
through, adds the cost-type variants that cover the real billing models
our providers charge on, and migrates blocks across the codebase to use
them.
## What
### Machinery
- `BlockCostType` gains `ITEMS`, `COST_USD`, `TOKENS`. `BlockCost` gains
`cost_divisor: int = 1` so SECOND/ITEMS/TOKENS can express "1 credit per
N units" without fractional amounts.
- `block_usage_cost(..., stats: NodeExecutionStats | None = None)` —
pre-flight (no stats) dynamic types return 0 so the balance check isn't
blocked on unknown-future cost; post-flight (stats populated) they
consume captured execution stats.
- `TokenRate` model + `TOKEN_COST` table (~60 models: Claude family,
GPT-5 family, Gemini 2.5, Groq/Llama, Mistral, Cohere, DeepSeek, Grok,
Kimi, Perplexity Sonar). Rates are credits per 1M tokens with input /
output / cache-read / cache-creation split.
- `compute_token_credits(input_data, stats)` — reads
`stats.input_token_count / output_token_count / cache_read_token_count /
cache_creation_token_count`, multiplies by `TOKEN_COST[model]`, ceils to
integer credits. Falls back to flat `MODEL_COST[model]` for unmapped
models (no silent under-billing).
- `billing.charge_reconciled_usage(node_exec, stats)` — runs
post-flight, charges positive delta / refunds negative delta. RUN-only
blocks produce zero delta (no-op). Swallows `InsufficientBalanceError` +
unexpected errors so reconciliation never poisons the success path.
- Pre-flight balance guard — dynamic-cost blocks (0 pre-flight charge)
are blocked when the wallet is non-positive. Closes Sentry `r3132206798`
(HIGH).
- Reconciliation fires `handle_low_balance` on positive delta so users
still get alerted after post-flight reconciliation.
### Block migrations — cost-type changes
| Provider / block family | Old | New | Cost type |
|---|---|---|---|
| All LLM blocks (Anthropic / OpenAI / Groq / Open Router / Llama API /
v0 / AIML, via `LLM_COST` list) | RUN, flat per-model from `MODEL_COST`
| `TOKEN_COST` per-token rate table (input / output / cache-read /
cache-creation) | **TOKENS** |
| Jina `SearchTheWebBlock` | RUN, 1 cr | 100 cr / $ (≈ 1 cr per $0.01
call) | **COST_USD** |
| ZeroBounce `ValidateEmailsBlock` | RUN, 2 cr | 250 cr / $ (≈ 2 cr per
$0.008 validation) | **COST_USD** |
| Apollo `SearchOrganizationsBlock` | RUN, 2 cr flat | 1 cr / 2 orgs
(divisor=2) | **ITEMS** |
| Apollo `SearchPeopleBlock` (no enrich) | RUN, 10 cr flat | 1 cr /
person | **ITEMS** |
| Apollo `SearchPeopleBlock` (enrich_info=true) | RUN, 20 cr flat | 2 cr
/ person | **ITEMS** |
| Firecrawl (all blocks — Crawl, MapWebsite, Search, Extract, Scrape,
via `ProviderBuilder.with_base_cost`) | RUN, 1 cr | 1000 cr / $ (1 cr
per Firecrawl credit ≈ $0.001) | **COST_USD** |
| DataForSEO (KeywordSuggestions, RelatedKeywords, via `with_base_cost`)
| RUN, 1 cr | 1000 cr / $ | **COST_USD** |
| Exa (~45 blocks, via `with_base_cost`) | RUN, 1 cr | 100 cr / $ (Deep
Research $0.20 → 20 cr) | **COST_USD** |
| E2B `ExecuteCodeBlock` / `InstantiateCodeSandboxBlock` /
`ExecuteCodeStepBlock` | RUN, 2 cr flat | 1 cr / 10 s walltime
(divisor=10) | **SECOND** |
| FAL `AIVideoGeneratorBlock` | RUN, 10 cr flat | 3 cr / walltime s |
**SECOND** |
### Cost-leak fixes — interim values (flagged 🔴 CONSERVATIVE INTERIM in
Notion)
Separate from the type migrations above, these 3 providers had real API
costs but were under-billed (or wallet-free):
| Provider / block | Old | New | Cost type | Plan for proper fix |
|---|---|---|---|---|
| Stagehand (`StagehandObserve` / `Act` / `Extract`, via
`with_base_cost`) | RUN, 1 cr | 1 cr / 3 walltime s (divisor=3) |
**SECOND** | Have blocks emit `provider_cost` USD (session_seconds ×
$0.00028 + real LLM USD) → migrate to `COST_USD 100 cr/$`. |
| Meeting BaaS `BaasBotJoinMeetingBlock` (via `@cost` decorator
override) | RUN, 5 cr | RUN, 30 cr | RUN | Surface meeting duration on
`FetchMeetingData` response → migrate Join to `SECOND` or `COST_USD`
post-flight. |
| AgentMail (~37 blocks, via `with_base_cost`) | **0 cr (unbilled)** |
RUN, 1 cr | RUN | Revisit when AgentMail publishes paid-tier pricing
(currently beta). |
### UI
- `NodeCost.tsx` dynamic labels: RUN → `N /run`, SECOND → `~N /sec` (or
`~N / Xs` with divisor), ITEMS → `~N /item` (or `/ X items`), COST_USD →
`~N · by USD`, TOKENS → `~N · by tokens` (tooltip explains cache
discount).
- Floor amounts prefixed with `~` for dynamic types so users see an
estimate, not a hard guarantee.
## How
The resolver split is the key design decision. Instead of charging the
"true" cost entirely post-flight (which would let a user burn credits
they don't have), pre-flight returns a safe estimate:
- RUN: full `cost_amount` (same as before — backwards compatible).
- SECOND/ITEMS/COST_USD: `0` when stats aren't populated yet.
- TOKENS: `MODEL_COST[model]` as a flat floor from the existing rate
table.
Post-flight, the executor calls `charge_reconciled_usage`, which
evaluates the same resolver with stats and charges the positive delta
(or refunds the negative delta). RUN blocks get a 0-delta no-op; dynamic
blocks get their actual charge. Failure modes are bounded: insufficient
balance is logged (not raised; reconciliation must never poison a
success), unexpected errors are swallowed and alerted via Discord.
TOKENS routes through a dedicated `compute_token_credits` helper so the
rate table (`TOKEN_COST`) can grow organically without touching resolver
logic. Models not yet in `TOKEN_COST` fall back to the flat `MODEL_COST`
tier.
Migration for providers with a real USD spend (Exa, Firecrawl,
DataForSEO, Jina Search, ZeroBounce) is a one-line `_config.py` change
via the extended `ProviderBuilder.with_base_cost`. Each block's `run()`
populates `provider_cost` from the response (Exa's `cost_dollars.total`,
Firecrawl's `credits_used`, etc.) via `merge_stats`, and the post-flight
resolver multiplies by `cost_amount` credits/$.
## Test plan
- [x] 92/92 cost-pipeline tests pass — `block_usage_cost_test.py`,
`billing_reconciliation_test.py`, `manager_cost_tracking_test.py`,
`block_cost_config_test.py`.
- [x] Deep E2E against live stack (real DB, `database_manager` RPC): 8/8
scenarios pass — RUN pre-flight, dry-run no-charge, TOKENS refund, ITEMS
scaling, ITEMS zero-items short-circuit, COST_USD exact + ceil
semantics, pre-flight balance guard. Report:
https://github.com/Significant-Gravitas/AutoGPT/pull/12894#issuecomment-4307672357
- [x] `poetry run ruff check` / `ruff format` / `pnpm format` / `pnpm
lint` / `pnpm types` — clean.
- [x] Manual UI: `NodeCost.tsx` renders `~N · by tokens` for
AITextGeneratorBlock, `~N · by USD` for Jina/Exa/Firecrawl.
## Follow-ups (not in this PR)
- Stagehand / Meeting BaaS / Ayrshare: expose provider-side unit cost
(session-seconds, meeting duration, platform analytics credits) to
migrate from interim flat/walltime to fully dynamic `COST_USD`.
- Replicate / Revid: walltime-based billing once response cost is piped
through.
- AgentMail: final rate once paid tier is published.
## Why
`gh auth status` looked flaky in the E2B sandbox. Not actually flaky: it
fails deterministically when the user has not connected GitHub (or the
token is missing/expired), and our wrapper disguises that legitimate
exit-1 as a sandbox infrastructure failure.
Root cause: E2B's `sandbox.commands.run()` raises `CommandExitException`
for **any** non-zero exit. We caught it as a generic `Exception` and
returned an `ErrorResponse` with message:
```
E2B execution failed: Command exited with code 1 and error:
{stderr}
```
When the model runs `gh auth status 2>&1`, stderr is redirected to
stdout — so `exc.stderr` is empty **and** `exc.stdout` (which carries
the real info, e.g. "You are not logged into any GitHub hosts") is
discarded. The model sees a generic infra failure, can't tell it's an
auth-check signal, and prompts the user with broken-looking errors
instead of calling `connect_integration(provider="github")`.
Compare: the local bubblewrap path already handles non-zero exits
correctly by returning a `BashExecResponse` with `exit_code` set. The
E2B path was asymmetric.
## What
- Import `CommandExitException` and catch it explicitly in
`_execute_on_e2b` before the generic handler.
- Return a `BashExecResponse` with the real `exit_code`, `stdout`, and
`stderr` from the exception (scrubbed of injected secret values, same as
the success path).
- Extract shared scrub/build logic into `_build_response` to avoid
duplicating it across the success and exit-exception branches.
- Keep `TimeoutException` and the catch-all `except Exception` for real
infra failures.
## How
Result shape now matches bubblewrap: non-zero exit is a valid result,
not an error. The model sees:
```
message: "Command executed with status code 1"
exit_code: 1
stdout: "You are not logged into any GitHub hosts. ..."
stderr: ""
```
instead of the prior cryptic "E2B execution failed" message.
## Test plan
- [x] New unit test `test_nonzero_exit_returned_as_bash_exec_response`
in `bash_exec_test.py` — mocks `sandbox.commands.run` to raise
`CommandExitException`, asserts `BashExecResponse` with correct
`exit_code`, and verifies secret scrubbing on both `stdout` and
`stderr`.
- [x] `poetry run pytest backend/copilot/tools/bash_exec_test.py` — 5
passed.
- [x] `poetry run pyright` on changed files — 0 errors.
- [x] `poetry run ruff` — clean.
Backend: Replace raw byte counts ("262,144,000 bytes") with formatted
sizes ("250 MB of your 1 GB quota") and include upgrade guidance.
Frontend: Parse 413 JSON response in direct-upload.ts to extract the
human-readable detail message instead of dumping raw JSON to the user.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### Why / What / How
**Why:** A beta user spent significant time trying to build and run
agents that read Google Sheets. Four separate failures compounded on
their session — all already open in Linear as SECRT-2266 through
SECRT-2269. Three in-flight PRs each addressed a piece but conflicted on
the same files (`backend/data/model.py`, `backend/blocks/_base.py`,
`autogpt_libs/.../types.py`), so landing them individually would have
been churn. One of the four reported issues (the credential-delete
crash) is also the top unresolved Sentry issue `AUTOGPT-SERVER-6HB` with
100+ events going back to 2025-10-20 — it was archived as "ignored" but
is a real regression. Bug #4 required new work; the others we got by
adopting the existing open PRs and addressing a pending review comment.
**What:** This PR consolidates the three in-flight PRs, adds the two
pieces of new work needed to fully close the beta blockers, and
addresses the pending review on one of the three PRs so it doesn't
require a second round.
- **Closes PR #12004** — Google Drive auto-credentials handling (merged
in)
- **Closes PR #12748** — Incremental OAuth for scope upgrades (merged
in)
- **Closes PR #12588** — superseded by the systemic None-guard here (see
"How" below)
- **Adds Bug 2 fix** — Google credential deletion no longer crashes on
`revoke_tokens`
- **Adds Bug 4 validator** — the agent builder can no longer save a
graph with a hardcoded Drive file ID
**How:**
1. **Adopt PR #12004 (Bug 1 — auto-credentials resolution).** Tags
Drive-file fields as `is_auto_credential` on `CredentialsFieldInfo`,
exposes `BlockSchema.get_auto_credentials_fields()` and
`Graph.regular_credentials_inputs` / `auto_credentials_inputs`, extracts
`_acquire_auto_credentials()` in the executor to resolve embedded
`_credentials_id` at run time, clears `_credentials_id` on agent fork so
cloned agents don't inherit the original author's credential, and fixes
the Firefox referrer policy on the Google Drive picker script load.
2. **Adopt PR #12748 (Bug 3 — credential accumulation).** OAuth callback
now merges scopes into an existing credential (explicit via
`credential_id` in OAuth state, or implicit via `provider + username`
match) instead of appending a new row on every reconnect. GitHub's
non-incremental OAuth path requests the union of existing + new scopes
at login so the upgrade path works there too.
3. **Replace PR #12588 with a systemic None-guard (addresses reviewer
feedback).** The original PR added a per-block `credentials:
GoogleCredentials | None = None` + early guard pattern that would need
to be repeated across 50+ blocks with `GoogleDriveFileField`. Per the
reviewer's ask, we moved the guard into `Block._execute()` once: after
the `setdefault` loop, if `kwargs[kwarg_name] is None` we raise
`BlockExecutionError` with a clean user-facing message. The per-block
change in `sheets.py` is dropped so `credentials: GoogleCredentials`
stays non-`Optional`. Dry-run path skips the guard (executor
intentionally runs blocks without resolved creds for schema validation).
4. **Fix Bug 2 — Google revoke_tokens (SECRT-2267,
AUTOGPT-SERVER-6HB).** `revoke_tokens()` was handing our Pydantic
`OAuth2Credentials` into google-auth's `AuthorizedSession`, which calls
`self.credentials.before_request(...)` on the object and crashes with
`AttributeError: 'OAuth2Credentials' object has no attribute
'before_request'`. Google's token revoke endpoint doesn't need any auth
header — just `token=<token>` in the form body per [Google's
docs](https://developers.google.com/identity/protocols/oauth2/web-server#tokenrevoke).
Switched to the platform's async `Requests` helper, matching how
`reddit.py` / `github.py` / `todoist.py` / other providers do
revocation. No google-auth objects involved.
5. **Fix Bug 4 — hardcoded Drive file IDs in agent graphs
(SECRT-2269).** Evidence from the beta user's session: CoPilot's
agent-builder produced 13 saved graph versions in one session where each
one stuffed either a bare string (`"1KAv…"`) or a partial object
(`{"id": "1KAv…"}`) into
`GoogleSheetsReadBlock.constantInput.spreadsheet`, never wiring an
`AgentGoogleDriveFileInputBlock` as the intended input. Bare-string
versions failed pydantic validation with `is not of type 'object'`;
object-with-only-`id` versions would have crashed at run time because
`_acquire_auto_credentials` has no `_credentials_id` to resolve. Added a
validator in `GraphModel._validate_graph_get_errors` that flags any
auto-credentials field whose `input_default.<field>` is a bare string OR
a dict missing `_credentials_id`, when there's no upstream link feeding
the field. Remediation text is format-aware: when
`field_schema["format"] == "google-drive-picker"` it names
`AgentGoogleDriveFileInputBlock` specifically; for any other future
auto-credentials format (OneDrive / Dropbox / etc.) the remediation is
generic, so we don't ship a stale Google-specific hint that doesn't
apply.
A companion handoff for the CoPilot agent-builder team is drafted at
`/tmp/agent-builder-ticket-drive-file-input.md` (to be filed in their
tracker). The validator here is a safety net so reviewers and the LLM
both get a clear error with the correct remediation; the agent-builder
itself still needs to learn the correct pattern so it stops trying to
hardcode Drive files in the first place.
### Changes 🏗️
**Backend**
- `backend/data/model.py` — merged `is_auto_credential` +
`input_field_name` (#12004) with `OAuthState.credential_id` (#12748);
kept HEAD's defensive `set()` copy on `discriminator_values`.
- `backend/blocks/_base.py` — `_execute()` runs the auto-credentials
setdefault loop + raises `BlockExecutionError` when a resolved value is
`None`.
- `backend/blocks/google/sheets_test.py` — 2 new tests (systemic
None-guard behaviour).
- `backend/blocks/google/_drive.py`, `_drive_test.py` — unchanged on
this branch (earlier bare-string validator was reverted after feedback;
see "Out of scope" below).
- `backend/data/graph.py` — auto-credentials anti-pattern validator in
`_validate_graph_get_errors`.
- `backend/data/graph_test.py` — 11 new tests for the validator.
- `backend/integrations/oauth/google.py` — `revoke_tokens` swapped to
`Requests().post`, removed `AuthorizedSession` misuse.
- `backend/integrations/oauth/google_test.py` — 3 new tests covering the
revoke happy path, no-access-token, and non-2xx-response.
- `backend/integrations/credentials_store.py` — from #12748.
- `backend/api/features/integrations/router.py` — incremental-OAuth
callback + scope upgrade helpers (from #12748).
- `backend/api/features/integrations/incremental_oauth_test.py` — 15
tests (from #12748).
- `backend/api/features/chat/tools/utils.py` → renamed to
`backend/copilot/tools/utils.py` during merge; now uses
`regular_credentials_inputs` for missing-creds + matching (from #12004).
- `backend/copilot/tools/utils_test.py` — moved from
`api/features/chat/tools/`, import paths updated.
- `backend/api/features/library/db.py` — library preset guard uses
`regular_credentials_inputs` (from #12004).
- `backend/data/graph.py` — `regular_credentials_inputs` /
`auto_credentials_inputs` properties + `_reassign_ids` clears
`_credentials_id` on fork (from #12004).
- `backend/executor/manager.py` — `_acquire_auto_credentials()`
extracted + validation (from #12004).
- `backend/executor/utils.py`, `utils_test.py`,
`manager_auto_credentials_test.py` — auto-credentials tests (from
#12004).
**Frontend**
- `frontend/src/components/contextual/GoogleDrivePicker/helpers.ts` —
Firefox referrer fix (from #12004).
-
`frontend/src/components/contextual/CredentialsInput/useCredentialsInput.ts`,
`src/hooks/useCredentials.ts`, `src/lib/autogpt-server-api/client.ts`,
`src/providers/agent-credentials/credentials-provider.tsx`,
`src/app/api/openapi.json` — incremental-OAuth scope upgrade UI (from
#12748).
**Shared libs**
- `autogpt_libs/supabase_integration_credentials_store/types.py` —
merged additions from both #12004 and #12748.
### Test plan 📋
- [x] `poetry run lint` — clean
- [x] `poetry run pytest backend/data/graph_test.py` — 55 passed
including 11 new validator tests
- [x] `poetry run pytest backend/integrations/oauth/google_test.py` — 3
new tests passing
- [x] `poetry run pytest backend/blocks/google/sheets_test.py` — 2 new
tests passing
- [x] `poetry run pytest backend/blocks/google/
backend/integrations/oauth/ backend/executor/ backend/data/graph_test.py
backend/api/features/integrations/ backend/copilot/tools/utils_test.py`
— 250 passed, 6 pre-existing failures that require the docker stack
(RabbitMQ/Redis/Postgres) and fail identically on `origin/dev`
- [x] `pnpm format` — clean
- [x] `pnpm lint` — 3 pre-existing `<img>` warnings on files I didn't
touch, no errors
- [x] `pnpm types` — pre-existing errors on `AgentActivityDropdown` that
also fail on `origin/dev` (unrelated to this PR; needs a separate fix on
dev)
- [x] Live repro on dev verified Bug 2 fires against current prod code —
two fresh Sentry events in `AUTOGPT-SERVER-6HB` at 2026-04-21T21:35:54Z
on `app:dev-behave:cloud` matching the exact `DELETE
/api/integrations/google/credentials/{cred_id}` path. Airtable OAuth2
delete as a control worked cleanly, confirming Google-specific.
- [x] Live repro on dev verified Bug 4 (CoPilot direct-run variant) —
`{"spreadsheet": {"id": "..."}}` → `Cannot use file 'None' (type: None)`
from `_validate_spreadsheet_file` mimeType check, as expected.
Reviewer post-merge verification:
- [ ] Delete a Google OAuth credential via the Integrations UI —
succeeds cleanly, no Sentry event fires
- [ ] Connect Google twice (same account, same scopes) — credential
count stays at 1 (dedup)
- [ ] Save an agent graph with
`GoogleSheetsReadBlock.constantInput.spreadsheet = "bare-id"` via API —
graph validator rejects with `AgentGoogleDriveFileInputBlock`
remediation
- [ ] Save an agent graph with `GoogleSheetsReadBlock` whose
`spreadsheet` is fed by an upstream
`AgentGoogleDriveFileInputBlock.result` — validator accepts, agent runs
### Out of scope (for follow-ups)
- **Bug 1 — "Failed to retrieve Google OAuth credentials"** in
`frontend/src/components/contextual/GoogleDrivePicker/useGoogleDrivePicker.ts:163`.
Zero hits for this string in the beta user's Langfuse traces and we
weren't able to reproduce it from a clean flow. Most likely a
stale-credential race condition (delete in another tab, picker queries a
stale React-Query cache). Tracked as a separate task; not blocking.
- **CoPilot first-attempt mimeType retry loop.** Observed on dev:
CoPilot's first call to `GoogleSheetsReadBlock` sends `{"spreadsheet":
{"id": "..."}}` without `mimeType`, hits `_validate_spreadsheet_file`,
retries with mimeType. Costs a round-trip. Two possible fixes (relax
`_validate_spreadsheet_file` to skip when mimeType is `None` and let
Google's API surface the real error; OR extend
`get_auto_credentials_fields` metadata so CoPilot's tool description
prompts it to always include mimeType). Deliberately deferred — fixing
only one of "API caller sends a bare string" or "CoPilot sends an
incomplete object" risked the same auth-ambiguity the bare-string commit
in this branch history hit.
- **CoPilot agent-builder prompt/guide update.** The validator here
produces the correct error message, but the agent-builder model still
needs to learn to use `AgentGoogleDriveFileInputBlock` upfront rather
than discover it through validator retries. Separate handoff ticket
filed.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Touches OAuth credential issuance/upgrade paths and introduces a new
endpoint that returns raw access tokens (scope-gated), plus broad
changes to execution-time credential resolution/validation; mistakes
could impact auth/security or break integrations.
>
> **Overview**
> Fixes several Google/Drive agent-builder blockers by **supporting
incremental OAuth scope upgrades** and by hardening how
credential-bearing file inputs (“auto-credentials”) are validated,
resolved, and cleared on graph fork.
>
> On the integrations API, `/{provider}/login` now accepts
`credential_id` and persists it in `OAuthState` to upgrade an existing
OAuth2 credential on callback (explicit upgrade), with an implicit merge
path for same `provider+username`. The callback path now merges
scopes/metadata, preserves ID/title, preserves existing
`refresh_token`/`username` when missing from incremental responses,
blocks upgrades for managed/system credentials, and adds a **new
`/{provider}/credentials/{cred_id}/picker-token` endpoint** to return a
short-lived access token for provider-hosted pickers (currently
allowlisted to Google Drive scopes).
>
> For auto-credentials, `CredentialsFieldInfo` gains
`is_auto_credential` + `input_field_name`, graphs now expose
`regular_credentials_inputs` vs `auto_credentials_inputs`, and multiple
callers switch from `aggregate_credentials_inputs()` to
`regular_credentials_inputs` so embedded picker credentials aren’t
treated as user-mapped inputs. Execution-time auto-credential
acquisition is extracted into `_acquire_auto_credentials()` with clearer
error handling and lock cleanup; block execution adds a systemic guard
to surface a clean `Missing credentials` error when auto-credentials are
absent.
>
> Separately fixes Google credential deletion by rewriting
`GoogleOAuthHandler.revoke_tokens()` to use the platform `Requests`
helper (bounded retries) instead of `AuthorizedSession`, and expands
test coverage across these flows (incremental OAuth, picker-token,
auto-credential validation/acquisition, graph validator, and frontend
diagnostics test stubs).
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
cac36eae9f. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Why
The cost-tracking audit on 2026-04-23 ([Platform System
Credentials](https://www.notion.so/auto-gpt/4d251f343fe146bcb91b6a037d1bfc3c))
surfaced three gaps where the user wallet was silently subsidising
third-party spend:
1. **Ayrshare (13 blocks)** — zero charge on every social post. No
`BLOCK_COSTS` entry, no SDK `.with_base_cost` registration. Platform
absorbs the entire ~$149/mo Business plan.
2. **Bannerbear** — flat 1 credit/call below the ~$0.025/image unit cost
on the Starter tier ($49/mo / 2K images).
3. **JinaChunkingBlock** — wallet-free; siblings (`JinaEmbeddingBlock`,
`SearchTheWebBlock`) are charged.
## What
- New `backend/blocks/ayrshare/_cost.py` with two-tier
`AYRSHARE_POST_COSTS` (5 credits when `is_video=True`, 2 credits
otherwise — first-match wins in `block_usage_cost`).
- All 13 `PostTo*Block` classes decorated with
`@cost(*AYRSHARE_POST_COSTS)`.
- `BannerbearTextOverlayBlock` floor: 1 → 3 credits in
`bannerbear/_config.py`.
- `JinaChunkingBlock` added to `BLOCK_COSTS` with a flat 1-credit floor.
- `cost(...)` decorator generic-ized via `TypeVar`, so pyright retains
`PostToXBlock.Input/Output` narrowing.
## How
Ayrshare uses a decorator-based registration (not a direct `BLOCK_COSTS`
entry) because each `post_to_*.py` block imports from `backend.sdk`, and
`backend.sdk.cost_integration` imports `BLOCK_COSTS` — listing the
blocks in `block_cost_config.py` would create a circular import. The
`@cost` decorator defined in `sdk/cost_integration.py` was already the
approved escape hatch for this exact shape.
cost_filter in `block_usage_cost` already supports boolean-field
matching (see Apollo's `enrich_info` tier), so `{"is_video": True}` and
`{"is_video": False}` select the right tier at execution time.
`is_video` defaults to `False` on `BaseAyrshareInput`, so posts that
omit the field still land on the 2-credit default.
## Test plan
- [x] `poetry run pytest backend/data/block_cost_config_test.py` — new
6-test suite covers Ayrshare video/non-video/default tiers, the
Bannerbear floor, and the Jina chunking floor
- [x] `poetry run pytest backend/executor/manager_cost_tracking_test.py`
— no regressions (45 pre-existing tests still pass)
- [x] `poetry run ruff format` + `poetry run isort` + `poetry run ruff
check --fix`
- [x] `poetry run pyright` on touched files — 0 errors, 0 warnings
(pre-existing `LlmModel.KIMI_K2_*` errors are on dev and unrelated)
- [ ] Manual: run an Ayrshare post through the builder and confirm 2cr
(text/image) vs 5cr (video) charge
## Why
The "Thought for 1m 46s" label under assistant replies has been
misleading
because the backend persists the whole-turn wall clock (from turn start
to
stream end) — which includes tool execution, browser sessions, graph
runs,
etc. Users also had no way to see when a message was actually sent /
received.
## What
- **Per-message timestamps** — `ChatMessage.created_at` (already on the
DB row)
is now serialised through the pydantic model and the
`SessionDetailResponse`,
then plumbed into the UI. Hovering the "Thought for X" label now shows
the
absolute local date/time via a tooltip.
- **Accurate reasoning duration** — new
`ChatMessage.reasoningDurationMs`
column. Backend accumulates time between `reasoning-start` and
`reasoning-end` SSE events inside `publish_chunk` (via the session meta
hash). `mark_session_completed` reads the total and persists it
alongside
the existing `durationMs`. Frontend prefers `reasoning_duration_ms` when
present, falls back to `duration_ms` for legacy rows.
## How
- `schema.prisma` gains `reasoningDurationMs Int?`; migration
`20260423120000_add_reasoning_duration_ms` adds the column.
- `publish_chunk` gains a side-effect that writes `reasoning_started_at`
/
`reasoning_ms_total` into the existing per-session Redis meta hash when
reasoning events pass through. No extra IO path, no extra Redis key.
- `set_turn_duration` accepts an optional `reasoning_duration_ms` arg
and
patches both the DB row and the cached session in place, mirroring the
existing behaviour for `duration_ms`.
- Frontend: `convertChatSessionMessagesToUiMessages` now returns
`durations`, `reasoningDurations`, and `timestamps` maps. `TurnStatsBar`
picks the best available value and wraps the label in the design-system
`BaseTooltip` so hover reveals the local timestamp.
## Test plan
- [x] `poetry run pytest
backend/copilot/db_test.py::test_set_turn_duration_*`
- [x] `poetry run pytest backend/copilot/stream_registry_test.py`
- [x] `pnpm format` / `pnpm lint` / `pnpm types` (copilot area)
- [x] `pnpm test:unit src/app/\(platform\)/copilot` — 705 tests pass (4
pre-existing `jszip` module resolution failures unrelated to this
change)
- [ ] Manual: open a session with a long tool run and confirm the new
"Thought for X" reflects only reasoning time (falls back for old rows)
and the tooltip surfaces the local timestamp.
## Why
Investigation of two reported sessions
([85804387](https://dev-builder.agpt.co/copilot?sessionId=85804387-7708-4fdc-8ec9-64283cdd902d),
[19d69dec](https://dev-builder.agpt.co/copilot?sessionId=19d69dec-210f-4439-a94b-2d7d443b9909))
where Kimi K2.6 via OpenRouter was running ~30 min per turn with no
actions completed (Discord report from Toran). Langfuse traces showed:
- 31 generation calls per turn at p90 = 151s, max = 415s
- 2.57M uncached tokens, `cache_create=0`, ~4% cache_read — Moonshot's
OpenRouter endpoint silently drops Anthropic-style cache writes
- **3 SDK-internal compactions per turn** — each compaction is itself a
slow LLM round-trip
- Reconciled OpenRouter cost was being recorded to a DB row but never
surfaced on the Langfuse trace, leaving operators to grep pod logs
## What
Four commits, split by concern.
### 1. `fix(backend/copilot): skip CLAUDE_AUTOCOMPACT_PCT_OVERRIDE for
Moonshot/Kimi` (`5fd9c5aa`)
`env.py` was unconditionally setting
`CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=50` (introduced in #12747 to cap
cache-creation cost on Anthropic where context >200K = 54% of total
cost). On Kimi where `cache_create=0` silently, the cache-cost rationale
doesn't apply — but the 50% threshold still made the bundled CLI
auto-compact at ~100K tokens, triggering 3+ compactions per turn against
Kimi's larger effective window. Each compaction added a slow LLM
round-trip (one in our test ran 166s and burned the budget cap before
the user got any output).
Threads the resolved `sdk_model` (and `fallback_model`) into
`build_sdk_env` and skips the env var when the model matches
`is_moonshot_model(...)`. The CLI then uses its default ~93% threshold,
cutting compaction passes to 0–1.
### 2. `feat(backend/copilot): backfill OpenRouter reconciled cost to
Langfuse trace` (`f3de3624` + follow-ups `5ce3d038`, `d2c1a2cd`,
`d8e08525`, `d243bf6c9`)
`record_turn_cost_from_openrouter` runs as a fire-and-forget task after
the OTel span closes, so the Langfuse trace UI showed the SDK CLI's
rate-card estimate only — for non-Anthropic OpenRouter routes that
estimate is Sonnet pricing on Kimi tokens (~5x too high).
The backfill captures `langfuse.get_current_trace_id()` and threads it
into the reconcile task, which emits an `openrouter-cost-reconcile`
child event with the authoritative cost + token usage. **Bug caught
during /pr-test:** `propagate_attributes` only annotates an existing
OTel span, it doesn't create one — by the time the `finally` block runs,
SDK-emitted spans have ended and `get_current_trace_id()` returns None.
Fixed in `d8e08525` by wrapping the turn in
`langfuse.start_as_current_span(name="copilot-sdk-turn")`. Also tags
fallback-path events with `cost_source` so operators can distinguish
reconciled vs estimated turns.
### 3. `feat(backend/copilot): expose CLAUDE_AUTOCOMPACT_PCT_OVERRIDE as
a config knob` (`72416f73`)
The previously-hardcoded `50` is now
`claude_agent_autocompact_pct_override` (default 50, env
`CHAT_CLAUDE_AGENT_AUTOCOMPACT_PCT_OVERRIDE`). Setting to 0 omits the
env var entirely so the CLI uses its native ~93% threshold — useful when
the post-compact floor (system prompt + tool defs ≈ 65–110K) sits close
to an aggressive trigger and operators see back-to-back compaction
cascades. Moonshot routes still skip the env var unconditionally
regardless of config.
### 4. `fix(backend/copilot): align SDK retry compaction target with CLI
autocompact threshold` (`730ad256`)
`_reduce_context` was calling `compact_transcript` without an explicit
`target_tokens`, so it fell back to `get_compression_target(model) =
context_window - 60K`. For Sonnet 200K that's 140K — well above the
CLI's PCT=50 trigger of 90K — and for Kimi 256K it's 196K, above the
CLI's default 167K trigger. Result: a successful retry compaction landed
at 140K/196K and the CLI immediately re-compacted on the next call →
**two compactions per recovered turn**.
New `_compaction_target_tokens(model)` mirrors the CLI's `i6_()` formula
(`min(window * pct/100, window - 13K)`) with a 20K safety buffer so the
post-compact context sits comfortably below the CLI's trigger.
## How — empirical validation against the actual long Kimi transcript
Replayed the 199-message transcript from session 85804387 through the
bundled CLI in two configurations:
| | Post-fix (no override) | Pre-fix (`PCT_OVERRIDE=50`) |
|---|---|---|
| `autocompact: tokens=` | 126,312 | 126,341 |
| `threshold=` | **167,000** | **90,000** |
| Decision | 126K < 167K → **skip** | 126K > 90K → **COMPACTION FIRES**
|
| Duration | 21s | **166s** (8x slower) |
| Cost | $0.34 | **$0.82** (2.4x more) |
| Output | PONG (success) | empty (hit $0.50 budget cap, exit 1) |
The pre-fix configuration burned $0.82 of compaction work over 166s and
never produced a user response — exactly the failure mode reported.
**Why cascade happens at 50%, not at 93%:** post-compaction context is
`summary (~5–10K) + system_prompt + tool_definitions + skills + active
TodoWrite + memory ≈ 65–110K floor`. With trigger at 90K, post-compact
floor sits AT or above the trigger → next assistant message tips over →
immediate re-compaction → cascade until the CLI's rapid-refill breaker
trips at 3 attempts. With trigger at 167K, the same floor sits
comfortably below trigger → no cascade.
## Considered but not done
- **Force `cache_control` markers to reach Moonshot**: bundled CLI sends
them by default; Moonshot silently drops them per their own docs (uses
`X-Msh-Context-Cache` headers, not body markers). Real fix needs
bypassing OpenRouter — out of scope.
- **Slim the system prompt + tool definitions** to lower the
post-compact floor: real win but separate refactor with tool-use
accuracy A/B.
- **LD-driven auto-fallback to Sonnet on Kimi degradation**:
`claude_agent_fallback_model` already wires `--fallback-model` for
overload (529); auto-flipping on slowness needs latency aggregation
infra that doesn't exist yet.
## Test plan
- [x] `poetry run pytest backend/copilot/sdk/env_test.py
backend/copilot/sdk/openrouter_cost_test.py
backend/copilot/sdk/service_helpers_test.py` — 111 passed (37 env + 23
cost + 51 helpers, including 6 new env tests, 3 backfill tests, 6 new
compaction-target tests)
- [x] `poetry run pytest backend/copilot/sdk/` — 970+ passed
- [x] `poetry run pyright .` — 0 errors
- [x] `poetry run format` — clean
- [x] /pr-test --fix end-to-end against dev — 5/5 scenarios PASS,
including Anthropic route ($0.0174 cost +0.0% delta) and Moonshot route
($0.028 vs $0.018 → +58.2% delta validates reconcile rationale)
- [x] Transcript replay validation: pre-fix vs post-fix on real
126K-token transcript → 8x slower / 2.4x more expensive / fails entirely
on pre-fix; clean PONG on post-fix
## Why
On prod, longer copilot runs (complex feature implementations, multi-bug
fix chains) error out with `Exceeded 30 tool-call rounds without a final
response`, lose mid-stream assistant output, and the UI appears to
re-dispatch an older prompt. Reported by @itsababseh in #breakage for
session `661ba0cc-a905-4c66-bf11-61eb5423d775`.
Langfuse trace of that session shows 52 turns / 344 LLM calls; **two
turns hit exactly 30 rounds** (Turn 38: implementing kill-cam/headshot
juice pass; Turn 42: fixing multi-bug list). Both were legitimate,
non-looping work that simply needed more rounds to complete. Round 30
fired `bash_exec`, the loop cut off cold, no summary was ever produced,
and the stream surfaced `baseline_tool_round_limit`. Frontend
subsequently re-dispatched the same user message several times (turns
39–41 × 3, turns 43–47 × 5 with identical prompt), which is what the
user perceives as "falling back into acting on an older command."
Root cause: [`_MAX_TOOL_ROUNDS =
30`](https://github.com/Significant-Gravitas/AutoGPT/blob/cf6d7034f/autogpt_platform/backend/backend/copilot/baseline/service.py#L125)
has been unchanged since the baseline path was introduced (#12276).
Modern agent turns with Claude Code / Kimi / Sonnet routinely need more.
## What
- Raise `_MAX_TOOL_ROUNDS` from 30 → 100.
- Pass `last_iteration_message` to `tool_call_loop` so the final round
receives a "stop calling tools, wrap up" system hint. The model now
produces a graceful summary on the last round instead of being cut off
mid-tool.
## How
Two-line change in
[`backend/copilot/baseline/service.py`](https://github.com/Significant-Gravitas/AutoGPT/blob/fix/copilot-baseline-tool-round-limit/autogpt_platform/backend/backend/copilot/baseline/service.py):
- Bump the module-level constant.
- Define `_LAST_ITERATION_HINT` and wire it via the existing
`last_iteration_message` kwarg on
[`tool_call_loop`](https://github.com/Significant-Gravitas/AutoGPT/blob/cf6d7034f/autogpt_platform/backend/backend/util/tool_call_loop.py#L188).
The shared loop already handles appending it only on the final iteration
(see `tool_call_loop_test.py::test_last_iteration_message_appended`).
Frontend retry cascade on `baseline_tool_round_limit` is a separate UX
issue — logging it as a follow-up.
## Checklist
- [x] My code follows the project's style guidelines
- [x] I have performed a self-review
- [x] Existing `tool_call_loop_test.py` covers `last_iteration_message`
behavior (10/10 passing)
- [x] No new migrations
- [x] No breaking changes (constant/kwarg only)
## Why
A 25-min-old copilot turn ended up a zombie in Redis (`status=running`
for 60+ min, queued user messages never drained) after a rolling deploy
of `autogpt-copilot-executor`. Root cause:
1. Cluster churn during the rollout broke a Redis call mid-turn.
2. `_execute_async`'s `finally` tried to publish the failure via
`mark_session_completed` on the same (now-broken) event loop +
thread-local Redis client.
3. That Redis call *also* failed; the exception was caught and logged
but never reached Redis — so the session meta stayed `running`.
4. `on_run_done` then completed the future normally, `active_tasks`
drained, the pod exited.
5. The zombie persisted until the 65-min stale-session watchdog reaped
it. While it was live, queued-message pushes succeeded (HTTP only checks
`status=running`), so the UI showed "Queued" bubbles that never drained.
## What
The fix is **one small addition** in the per-turn lifecycle:
### `sync_fail_close_session` — last line of defense in
`processor.execute`'s `finally`
Invoked from `CoPilotProcessor.execute()`'s `finally` on every turn
exit. Submits the CAS coroutine to the processor's long-lived
`self.execution_loop` via `asyncio.run_coroutine_threadsafe` — the same
pattern `ExecutionProcessor.on_graph_execution` uses at
[executor/manager.py:881-892](autogpt_platform/backend/backend/executor/manager.py#L881-L892)
to bridge sync→async through `node_execution_loop`.
- Calls `mark_session_completed(session_id,
error_message=SHUTDOWN_ERROR_MESSAGE)`, which is a CAS on `status ==
"running"`. If the async path already wrote a terminal state the CAS
no-ops; otherwise we mark `failed` and the UI transitions cleanly.
- Bounded by inner `asyncio.wait_for(timeout=10s)` and outer
`future.result(timeout=12s)` so a genuinely unreachable Redis can't hang
the safety net.
- Reuses the long-lived execution loop (no per-turn TCP connect, no
`@thread_cached` thrashing).
The outer `future.result()` in `_execute()` is bounded by
`_CANCEL_GRACE_SECONDS` (5s) so a wedged event loop can't trap the flow
before the safety net fires.
### `cleanup()` stays aligned with agent-executor
Mirrors the pattern from `backend.executor.manager.cleanup` — a single
method that:
1. Flags + tells the broker to stop consuming.
2. Passively waits for `active_tasks` to drain (up to
`GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS`).
3. Worker / executor / lock teardown.
No pre-emptive cancellation of healthy turns, no fail-close step for
stuck turns. Same proven shape agent-executor uses.
### Timeout alignment
Raised both `COPILOT_CONSUMER_TIMEOUT_SECONDS` and
`GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS` to 6h so a rolling deploy can let
the longest legitimate turn finish via its own lifecycle path. Matched
in infra at `terminationGracePeriodSeconds: 21600`
(Significant-Gravitas/AutoGPT_cloud_infrastructure#311).
### RabbitMQ policy — deploy prep
The `x-consumer-timeout` queue argument is changing from 1h → 6h. Tested
empirically on dev's RabbitMQ 4.1.4: `queue_declare` is tolerant of
`x-consumer-timeout` mismatches, so no queue delete is needed. To make
the new timeout **immediately effective for running consumers** (so pods
mid-shutdown don't have their consumer cancelled at the old 1h limit),
apply a policy before deploying:
```bash
rabbitmqctl set_policy copilot-consumer-timeout \
"^copilot_execution_queue$" \
'{"consumer-timeout": 21600000}' \
--apply-to queues
```
Already applied on dev. Apply on prod before the PR's prod deploy.
### Incidental rename
- `_clear_pending_messages_unsafe` → `clear_pending_messages_unsafe`
(keeps the `_unsafe` warning suffix; importable without the
leading-underscore private marker).
## How
Before: transient Redis failure → async finally silently fails → zombie
session → queued messages never drain.
After: transient Redis failure → `execute()`'s sync finally runs
`mark_session_completed` on the processor's long-lived loop → session
correctly marked failed → UI sees terminal state immediately.
SIGTERM path unchanged from the "let in-flight work finish" design: old
pod stops taking new work, existing turns complete naturally.
## Test plan
- [x] `TestSyncFailCloseSession` unit tests — invokes
`mark_session_completed` with the shutdown error, swallows Redis
failures, bounded timeout fires when Redis hangs.
- [x] `TestExecuteSafetyNet` — verifies the `finally` always fires,
including SIGTERM-interrupted and zombie-Redis scenarios.
- [x] Existing `TestExecuteAsyncAclose` + pending_messages tests still
pass (18 passed).
- [x] `pyright` on touched files: 0 errors.
- [x] Manual E2E on native dev stack: sent a `sleep 300 && echo hewwo`
task, SIGTERMed mid-turn at +40s, observed:
- `[CoPilotExecutor] [cleanup N] Starting graceful shutdown...`
- Drain-wait ran for ~4.5 min ("1 tasks still active, waiting...")
- Turn finished with `result=Done! The command finished after 5 minutes
and printed: hewwo`
- `Cleaned up completed session` → `Graceful shutdown completed`
- No zombie.
- [x] `poetry run format` applied.
- [x] RabbitMQ policy verified on dev. Apply on prod before prod deploy.
- [ ] Verified behavior on next production rolling deploy.
### Why / What / How
**Why:** The in-conversation Question GUI is unreliable in production —
users submitting answers can get their messages dropped and the agent
gets stuck on the auto-generated "please proceed" step with no way to
make progress. Discord report:
https://discord.com/channels/1126875755960336515/1496474512966029472/1496537943287005365
(see attached video). Pause/queue semantics still need a rework; until
then, the right call is to stop the model from reaching for this tool.
**What:** Removes `ask_question` from the copilot tool registry so the
model never sees or calls it. Historical sessions that already contain
`ask_question` tool calls still render (frontend renderers + response
model untouched), so this is non-destructive to existing chats.
Re-enabling once UX is reworked is a small revert.
**How:**
- Drop the `AskQuestionTool` import + registry entry from
`backend/copilot/tools/__init__.py`.
- Drop `"ask_question"` from the `ToolName` literal in
`backend/copilot/permissions.py` — required because a runtime
consistency check asserts the literal matches `TOOL_REGISTRY.keys()`.
- Delete the "Clarifying — Before or During Building" section from
`backend/copilot/sdk/agent_generation_guide.md` so the SDK-mode system
prompt no longer instructs the model to call `ask_question`.
- Drop the three `prompting_test.py` tests that asserted the guide
mentions that section.
- Keep `ask_question.py`, its unit test, `ClarificationNeededResponse`,
and the frontend `AskQuestion`/`ClarificationQuestionsCard` components
untouched so old sessions still render and re-enabling is a small
revert.
### Changes 🏗️
- `backend/copilot/tools/__init__.py` — remove `AskQuestionTool` import
and `"ask_question"` entry in `TOOL_REGISTRY`.
- `backend/copilot/permissions.py` — remove `"ask_question"` from the
`ToolName` literal.
- `backend/copilot/sdk/agent_generation_guide.md` — remove the
"Clarifying — Before or During Building" section.
- `backend/copilot/prompting_test.py` — remove
`TestAgentGenerationGuideContainsClarifySection` and the now-unused
`Path` import.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [x] `poetry run pytest backend/copilot/tools/
backend/copilot/permissions_test.py backend/copilot/prompting_test.py` —
805+78 tests pass, consistency check between `ToolName` literal and
`TOOL_REGISTRY` still holds.
- [ ] Smoke-test in dev: start a copilot session and confirm the model
no longer lists/calls `ask_question` (its OpenAI tool schema is gone
from `get_available_tools()` and from the SDK `allowed_tools`).
- [ ] Load a historical session that contains an `ask_question` tool
call in its transcript — confirm the frontend still renders the question
card (no regression on legacy sessions).
## Why
Several loose ends from the Kimi SDK-default merge (#12878), plus
follow-ups surfaced during review + E2E testing:
1. **Kimi-specific pricing lived inline in `sdk/service.py`** alongside
unrelated SDK plumbing — any future non-Anthropic vendor would have
piled onto the same file.
2. **Moonshot's Anthropic-compat endpoint honours `cache_control: {type:
ephemeral}`**, but the baseline cache-marking gate
(`_is_anthropic_model`) was narrow enough to exclude it → Moonshot fell
back to automatic prefix caching, which drifts readily between turns.
3. **Kimi reasoning rendered AFTER the answer text** on dev because the
summary-walk hoist only reorders within one `AssistantMessage.content`
list, and Moonshot splits each turn into multiple sequential
AssistantMessages (text-only, then thinking-only).
4. **Title generation's LLM call bypassed cost tracking** — admin
dashboard under-reported total provider spend by the aggregate of those
per-session calls.
5. **Cost override** was using the requested primary model, not the
actually-executed model — when the SDK fallback activates the override
mis-routes pricing.
## What
### Moonshot module
New `backend/copilot/moonshot.py`:
- `is_moonshot_model(model)` — prefix check against `moonshotai/`
- `rate_card_usd(model)` — published Moonshot rates, default `(0.60,
2.80)` per MTok with per-slug override slot
- `override_cost_usd(...)` — moved from `sdk/service.py`, replaces CLI's
Sonnet-rate estimate with real rate card
- `moonshot_supports_cache_control(model)` — narrow gate for cache
markers
Rate card is **not canonical** — authoritative cost comes from the
OpenRouter `/generation` reconcile; this module only improves the
in-turn estimate and the reconcile's lookup-fail fallback. Signal
authority: reconcile >> rate card >> CLI.
### Baseline cache-control widened to Moonshot
- New `_supports_prompt_cache_markers` = `_is_anthropic_model OR
is_moonshot_model`
- Both call sites (system-message cache dict, last-tool cache marker)
switched to the wider gate
- OpenAI / Grok / Gemini still return `false` — those endpoints 400 on
the unknown field
**Measured impact in /pr-test:** baseline Kimi continuation turns jumped
to ~98% cache hit (334 uncached + 12.8K cache_read on a 13.1K prompt).
### SDK partial-messages default-on (fixes the reasoning-order bug)
- `CHAT_SDK_INCLUDE_PARTIAL_MESSAGES` flipped from `default=False` →
`default=True`
- Kimi stream now emits `reasoning-start → reasoning-delta* →
reasoning-end → text-start → text-delta*` in the correct order —
verified in /pr-test
- Kill-switch: set `CHAT_SDK_INCLUDE_PARTIAL_MESSAGES=false` to fall
back to summary-only emission
### SDK cost override scoped to Moonshot
- Call site now explicitly gates `if _is_moonshot_model(active_model)` —
Anthropic turns trust CLI's number directly
- Added `_RetryState.observed_model` populated from
`AssistantMessage.model`, preferred over `state.options.model` so
fallback-model turns bill correctly (addresses CodeRabbit review)
### Title cost capture
- `_generate_session_title` now returns `(title, ChatCompletion)` so the
caller controls cost persistence
- `_update_title_async` runs title-persist and cost-record as
independent best-effort steps
- `_title_usage_from_response` helper reads `prompt_tokens /
completion_tokens / cost_usd` (OR's `usage.cost` off `model_extra`)
- Provider label derived from `ChatConfig.base_url` (`open_router` /
`openai`)
- No exception suppressors — `isinstance(cost_raw, (int, float))` check
replaces the inner `float()` try/except
### Misc
- Kimi tool-name whitespace strip in the response adapter — Kimi
occasionally emits tool names with leading spaces the CLI dispatcher
can't resolve
- TODO marker on the rate-card for post-prod-soak removal
## How
- Detection is **prefix-based** (`moonshotai/`) — future Kimi SKUs
transparently inherit rate card + cache-control gate
- Baseline cache-marking was already structured; only the gate changes
- Partial-messages default-on relies on the adapter's diff-based
reconcile (shipped in #12878) which has soaked stable
- Title cost path mirrors `tools/web_search.py`'s pattern for reading
OR's `usage.cost`
## Test plan
- [x] `pytest backend/copilot/moonshot_test.py` — 21 tests
- [x] `pytest backend/copilot/baseline/service_unit_test.py` — updated
for widened gate
- [x] `pytest backend/copilot/sdk/*_test.py
backend/copilot/service_test.py` — no regressions
- [x] Full E2E on local native stack — 10/10 scenarios pass (see
test-report comment)
- [x] Measured: baseline Kimi ~98% cache hit on continuation, SDK Kimi
~62% (capped by Moonshot's prefix ceiling)
## Deferred
SDK-path Moonshot cache hit rate stays at ~62% on long prompts.
`native_tokens_cached=18432` regardless of turn/session suggests a
Moonshot-side cap on cached prefix size. Not fixable by our code —
requires proxy rewriting requests or upstream Moonshot change.
## Summary
Per-user model routing for the copilot via LaunchDarkly. Replaces the
pure-env-var pick on every `(mode, tier)` cell of the model matrix with
an LD-first resolver that falls back to the `ChatConfig` default. Lets
us roll out non-default routes (e.g. Kimi K2.6 on baseline standard) to
a user cohort without shipping a deploy.
| | standard | advanced |
|----------|------------------------------------|------------------------------------|
| fast | `copilot-fast-standard-model` | `copilot-fast-advanced-model` |
| thinking | `copilot-thinking-standard-model` |
`copilot-thinking-advanced-model` |
All four flags are **string-valued** — the value IS the model identifier
(e.g. `"anthropic/claude-sonnet-4-6"` or `"moonshotai/kimi-k2.6"`).
## What ships
- **New module `backend/copilot/model_router.py`** with a single
`resolve_model(mode, tier, user_id, *, config)` coroutine. That's the
one place both paths consult.
- **4 new `Flag` enum values** in `backend/util/feature_flag.py`
(reusing the existing `get_feature_flag_value` helper which already
supports arbitrary return types).
- **`baseline/service.py::_resolve_baseline_model`** → async, takes
`user_id`.
- **`sdk/service.py::_resolve_sdk_model_for_request`** → takes
`user_id`, consults LD for both standard and advanced thinking cells.
- **Default flip**: `fast_standard_model` default goes back to
`anthropic/claude-sonnet-4-6`. Non-Anthropic routes now ship via LD
targeting — safer rollback, per-user cohort control, no redeploy
required to flip.
## Behavior preserved
- `config.claude_agent_model` explicit override still wins
unconditionally (existing escape hatch for ops).
- `use_claude_code_subscription=true` on the standard thinking tier
still returns `None` so the CLI picks the model tied to the user's
Claude Code subscription.
- All legacy env var aliases (`CHAT_MODEL`, `CHAT_ADVANCED_MODEL`,
`CHAT_FAST_MODEL`) still bind to their cells.
- LD client exceptions / misconfigured (non-string) flag values fall
back silently to config default with a single warning log — never fails
the request.
## Files
| File | Change |
|---|---|
| `backend/copilot/model_router.py` | new — `resolve_model` +
`_config_default` + `_FLAG_BY_CELL` map |
| `backend/copilot/model_router_test.py` | new — 11 cases |
| `backend/util/feature_flag.py` | add 4 string-valued `Flag` entries |
| `backend/copilot/config.py` | flip `fast_standard_model` default to
Sonnet |
| `backend/copilot/baseline/service.py` | `_resolve_baseline_model` →
async + LD resolver |
| `backend/copilot/sdk/service.py` | `_resolve_sdk_model_for_request` →
LD resolver + user_id |
| `backend/copilot/baseline/transcript_integration_test.py` | update
tests for new signature + default |
## Test plan
- [x] `poetry run pytest backend/copilot/model_router_test.py
backend/copilot/baseline/transcript_integration_test.py
backend/copilot/sdk/service_test.py backend/copilot/config_test.py` —
**112 passing**
- [x] 11 resolver cases: missing user → fallback, LD string wins,
whitespace stripped, non-string value → fallback, empty string →
fallback, LD exception → fallback + warn, each of 4 cells routes to its
distinct flag
- [x] Legacy env aliases still bind to their new fields
- [ ] Manual dev-env smoke: flip `copilot-fast-standard-model` LD
targeting to `moonshotai/kimi-k2.6` for one user and confirm baseline
uses Kimi while other users stay on Sonnet
- [ ] Confirm SDK path still honors subscription mode (LD not consulted
when `use_claude_code_subscription=true` + standard tier)
## Rollout
1. Merge this PR → default stays Sonnet / Opus across the matrix, no
behavior change.
2. Create the 4 LD flags as string-typed in the LaunchDarkly console
(defaults matching config, so no drift if targeting empty).
3. Add per-user / per-cohort targeting in LD for the routes we want to
roll out (Kimi on baseline standard for a percentage, etc.).
## Summary
Make Kimi K2.6 the default for the SDK (extended-thinking) copilot path,
mirroring the baseline default landed in #12871. The SDK already routes
through OpenRouter (see
[`build_sdk_env`](autogpt_platform/backend/backend/copilot/sdk/env.py) —
`ANTHROPIC_BASE_URL` is set to OpenRouter's Anthropic-compatible
`/v1/messages` endpoint), but the model resolver was unconditionally
stripping the vendor prefix, which prevented routing to anything except
Anthropic models. This PR unblocks Kimi (and any other non-Anthropic
OpenRouter vendor) on the SDK fast tier and flips the default to match
the baseline path.
## Why
After #12871 the baseline (`fast_*`) path runs Kimi K2.6 by default —
~5x cheaper than Sonnet at SWE-Bench parity — but the SDK (`thinking_*`)
path was still pinned to Sonnet because:
1. **Model name normalization stripped the vendor prefix.**
`_normalize_model_name("moonshotai/kimi-k2.6")` returned `"kimi-k2.6"`,
which OpenRouter cannot route — the unprefixed form only resolves for
Anthropic models. The docstring on `thinking_standard_model` claimed
"the Claude Agent SDK CLI only speaks to Anthropic endpoints", but the
env builder shows the CLI happily talks to OpenRouter's `/messages`
endpoint, which routes to any vendor in the catalog.
2. **The default was `anthropic/claude-sonnet-4-6`.** Same model on a
more expensive route.
3. **Cost label was hardcoded to `provider="anthropic"`** on the SDK
path's `persist_and_record_usage` call, making cost-analytics rows
misleading once Kimi runs.
## What
1. **`_normalize_model_name`**
([sdk/service.py](autogpt_platform/backend/backend/copilot/sdk/service.py))
— when `config.openrouter_active` is True, the canonical `vendor/model`
slug is preserved unchanged so OpenRouter can route to the correct
provider. Direct-Anthropic mode keeps the existing strip-prefix +
dot-to-hyphen conversion (Anthropic API requires both) and now **raises
`ValueError`** when paired with a non-Anthropic vendor slug — silent
strip would have sent `kimi-k2.6` to the Anthropic API and produced an
opaque `model_not_found`.
2. **`thinking_standard_model`**
([config.py](autogpt_platform/backend/backend/copilot/config.py)) —
default flipped from `anthropic/claude-sonnet-4-6` to
`moonshotai/kimi-k2.6`. Field description rewritten; rollback to Sonnet
is one env var
(`CHAT_THINKING_STANDARD_MODEL=anthropic/claude-sonnet-4.6`).
3. **`@model_validator(mode="after")`** on `ChatConfig`
([config.py:_validate_sdk_model_vendor_compatibility](autogpt_platform/backend/backend/copilot/config.py))
— fail at config load when `use_openrouter=False` is paired with a
non-Anthropic SDK slug. The runtime guard in `_normalize_model_name` is
kept as defence-in-depth, but the validator turns a per-request 500 into
a boot-time error message the operator sees once, before any traffic
lands. Covers `thinking_standard_model`, `thinking_advanced_model`, and
`claude_agent_fallback_model`. Subscription mode is exempt (resolver
returns `None` and never normalizes). The credential-missing case
(`use_openrouter=True` + no `api_key`) is intentionally NOT a boot-time
error so CI builds and OpenAPI-schema export jobs that construct
`ChatConfig()` without secrets keep working — the runtime guard still
catches it on the first SDK turn.
4. **Cost provider attribution**
([sdk/service.py:stream_chat_completion_sdk](autogpt_platform/backend/backend/copilot/sdk/service.py))
— `persist_and_record_usage` now passes `provider="open_router" if
config.openrouter_active else "anthropic"` instead of hardcoded
`"anthropic"`. The dollar value still comes from
`ResultMessage.total_cost_usd`; this just fixes the analytics label.
5. **Baseline rollback example** ([config.py:fast_standard_model
description](autogpt_platform/backend/backend/copilot/config.py)) — same
dot-vs-hyphen footgun fix (CodeRabbit catch).
6. **Tests** — `TestNormalizeModelName` (sdk/) monkeypatches a
deterministic config per case (the helper-test variants were passing
accidentally based on ambient env). New
`TestSdkModelVendorCompatibility` class in `config_test.py` covers all
five validator shapes (default-Kimi + direct-Anthropic raises, anthropic
override succeeds, openrouter mode succeeds, subscription mode skips
check, advanced+fallback tier also validated, empty fallback skipped).
`_ENV_VARS_TO_CLEAR` extended to all model/SDK/subscription env aliases
so a leftover dev `.env` value can't mask validator behaviour. New
`_make_direct_safe_config` helper for direct-Anthropic tests.
## Test plan
- [x] `poetry run pytest backend/copilot/config_test.py
backend/copilot/sdk/service_test.py
backend/copilot/sdk/service_helpers_test.py
backend/copilot/sdk/env_test.py
backend/copilot/sdk/p0_guardrails_test.py` — 238 pass
- [x] `poetry run pytest backend/copilot/` — 2560 pass + 5 pre-existing
integration failures (need real API keys / browser env, unrelated)
- [x] CI green on `feat/copilot-sdk-kimi-default` (35 pass / 0 fail / 1
neutral)
- [x] Manual: SDK extended_thinking turn against Kimi K2.6 via
OpenRouter on the native dev stack — request lands with
`model=moonshotai/kimi-k2.6`, response streams back, multi-turn
`--resume` recalls facts across turns. Backend log: `[SDK] Per-request
model override: standard (moonshotai/kimi-k2.6)`.
- [x] Manual: rollback path —
`CHAT_THINKING_STANDARD_MODEL=anthropic/claude-sonnet-4.6` resumes
Sonnet routing.
## Known follow-ups (not in this PR)
These surfaced during manual testing and will need separate PRs:
- **SDK CLI cost is wrong for non-Anthropic models.**
`ResultMessage.total_cost_usd` comes from a static Anthropic pricing
table baked into the CLI binary; for Kimi K2.6 it falls back to Sonnet
rates, **over-billing ~5x** ($0.089 vs the real ~$0.018 for ~30K prompt
+ ~80 completion). The `provider` label is now correct but the dollar
value isn't. Needs either a per-model rate card override on our side or
a CLI patch upstream.
- **Mid-session model switch (Kimi → Opus) breaks.** Kimi's
`ThinkingBlock`s have no Anthropic `signature` field; when the user
toggles standard → advanced after a Kimi turn, Opus rejects the replayed
transcript with `Invalid signature in thinking block`. Needs transcript
scrubbing on model switch (similar to existing
`TestStripStaleThinkingBlocks` pattern).
- **Reasoning UI ordering on Kimi.** Moonshot/OpenRouter places
`reasoning` AFTER text in the response; the SDK's
`AssistantMessage.content` reflects that order, and `response_adapter`
emits SSE events in the same order — so reasoning lands BELOW the answer
in the UI instead of above. Needs `ThinkingBlock` hoisting in
`response_adapter.py`.
## Summary
Add `TodoWrite` to baseline copilot so the "task checklist" UI works on
non-Claude models (Kimi, GPT, Grok, etc.) the same way it works on the
SDK path. Baseline previously had no `TodoWrite` tool at all — only SDK
mode did via the Claude Code CLI's built-in — so models on baseline just
couldn't reach for a planning checklist.
This closes the last clear feature gap blocking baseline from being the
primary copilot path without giving up model flexibility.
## What ships
- **New MCP tool `TodoWrite`** in `TOOL_REGISTRY`, schema matching the
one the frontend's `GenericTool.helpers.ts` (`getToolCategory → "todo"`)
already renders as the **Steps** accordion. The tool is a stateless echo
— the canonical list lives in the model's latest tool-call args and
replays from transcript on subsequent turns.
- **Prompt guidance** in `SHARED_TOOL_NOTES` teaching the model when to
use it (3+ step tasks; always send the full list; exactly one
`in_progress` at a time).
- **Sharpened `run_sub_session` guidance** in the same prompt section —
framed explicitly as the context-isolation primitive for baseline.
Clearer for the model, no dual-primitive confusion.
## How the SDK path stays untouched
- SDK mode keeps using the CLI-native `TodoWrite` built-in.
- `BASELINE_ONLY_MCP_TOOLS = {"TodoWrite"}` in `sdk/tool_adapter.py`
filters the baseline MCP wrapper out of SDK's `allowed_tools` — no name
shadowing.
- `SDK_BUILTIN_TOOL_NAMES` is now an explicit allowlist (not
auto-derived from capitalization) so the classification stays coherent
when a capitalized tool is platform-owned.
## Files
| File | Change |
|---|---|
| `backend/copilot/tools/todo_write.py` | new — `TodoWriteTool` |
| `backend/copilot/tools/__init__.py` | register in `TOOL_REGISTRY` |
| `backend/copilot/tools/models.py` | add `TodoItem` +
`TodoWriteResponse` + `ResponseType.TODO_WRITE` |
| `backend/copilot/permissions.py` | explicit `SDK_BUILTIN_TOOL_NAMES`;
`apply_tool_permissions` maps baseline-only tools to CLI name for SDK |
| `backend/copilot/sdk/tool_adapter.py` | `BASELINE_ONLY_MCP_TOOLS`
filter |
| `backend/copilot/prompting.py` | `TodoWrite` + sharpened
`run_sub_session` guidance |
| `backend/api/features/chat/routes.py` | add `TodoWriteResponse` to
`ToolResponseUnion` |
| `backend/copilot/tools/todo_write_test.py` | new — schema + execute
tests |
| `frontend/src/app/api/openapi.json` | regenerated |
| `tools/tool_schema_test.py` | budget bumped `32_800 → 34_000` (actual
33_865, +1_065 headroom) |
## Test plan
- [x] `poetry run pytest backend/copilot/
backend/api/features/chat/routes_test.py` — **1010 passing**
- [x] Tool schema char budget regression gate passes
- [x] `_assert_tool_names_consistent` passes
- [x] **E2E on local native stack (Kimi K2.6 via OpenRouter,
`CHAT_USE_CLAUDE_AGENT_SDK=false`)**: baseline called `TodoWrite` on a
3-step prompt, SSE stream carried the exact `{content, activeForm,
status}` shape the UI expects, "Steps" dialog renders `Task list — 0/3
completed` with all three items (see test-report comment below).
- [x] Negative cases covered: two `in_progress` → rejected, missing
`activeForm` → rejected, non-list `todos` → rejected.
### Why / What / How
**Why.** Three problems on the baseline copilot path that compound:
extended-thinking turns froze the UI for minutes because Kimi K2.6
events were buffered in `state.pending_events: list` until the full
`tool_call_loop` iteration finished (reasoning arrived in one lump at
the end); the SSE stream replayed 1000 events on every reconnect and the
frontend opened multiple SSE streams in quick succession on tab-focus
thrash (reconnect storm → UI flickers, tab freezes); the `web_search`
tool hit Anthropic's server-side beta directly via a dispatch-model
round-trip that fed entire page contents back through the model for a
second inference pass (observed $0.072 on a 74K-token call); and the
simulator dry-run path ran on Gemini Flash without any cost tracking at
all, so every dry-run was free on the platform's microdollar ledger.
**What.** Grouped deltas, all targeting reliability, cost, and UX of the
copilot live-answer pipeline:
- **Live per-token baseline streaming.** `state.pending_events` is now
an `asyncio.Queue` drained concurrently by the outer async generator.
The tool-call loop runs as a background task; reasoning / text / tool
events reach the SSE wire during the upstream OpenRouter stream, not
after it. `None` is the close sentinel; inner-task exceptions are
re-raised via `await loop_task` once the sentinel arrives. An
`emitted_events: list` mirror preserves post-hoc test inspection.
Coalescing widened 32/40 → 64/50 ms to halve the React re-render rate on
extended-thinking turns while staying under the ~100 ms perceptual
threshold.
- **Reasoning render flag** — `ChatConfig.render_reasoning_in_ui: bool =
True` wired through both `BaselineReasoningEmitter` and
`SDKResponseAdapter`. When False the wire `StreamReasoning*` events are
suppressed while the persisted `ChatMessage(role='reasoning')` rows
always survive (decoupled from the render flag so audit/replay is
unaffected); the service-layer yield filter does the gating. Tokens are
still billed upstream; operator kill-switch for UI-level flicker
investigations.
- **Reconnect storm mitigations** — `ChatConfig.stream_replay_count: int
= 200` (was hard-coded 1000) caps `stream_registry.subscribe_to_session`
XREAD size. Frontend `useCopilotStream::handleReconnect` adds a 1500 ms
debounce via `lastReconnectResumeAtRef`, so tab-focus thrash doesn't fan
out into 5–6 parallel replays in the same second.
- **web_search rewritten to Perplexity Sonar via OpenRouter** — single
unified credential, real `usage.cost` flows through
`persist_and_record_usage(provider='open_router')`. Two tiers via a
`deep` param: `perplexity/sonar` (~$0.005/call quick) and
`perplexity/sonar-deep-research` (~$0.50–$1.30/call multi-step
research). Replaces the Anthropic-native + server-tool dispatches; drops
the hardcoded pricing constants entirely.
- **Synthesised answer surfaced end-to-end** — Sonar already writes a
web-grounded answer on the same call we pay for; the new
`WebSearchResponse.answer` field passes it through and the accordion UI
renders it above citations so the agent doesn't re-fetch URLs that are
usually bot-protected anyway.
- **Deep-tier cost warning + UI affordances** — `deep` param description
is explicit that it's ~100× pricier; UI labels read "Researching /
Researched / N research sources" when `deep=true` so users know what's
running.
- **Simulator cost tracking + cheaper default** —
`google/gemini-2.5-flash` → `google/gemini-2.5-flash-lite` (3× cheaper
tokens) and every dry-run now hits
`persist_and_record_usage(provider='open_router')` with real
`usage.cost`. Previously each sim was free against the user's
microdollar budget.
- **Typed access everywhere** — cost extractors now use
`openai.types.CompletionUsage.model_extra["cost"]` and
`openai.types.chat.ChatCompletion` / `Annotation` /
`AnnotationURLCitation` with no `getattr` / duck typing. Mirrors the
baseline service's `_extract_usage_cost` pattern; keep in sync.
**How.** Key file touches:
1. `copilot/config.py` — `render_reasoning_in_ui`,
`stream_replay_count`, `simulation_model` default.
2. `copilot/baseline/service.py` — `_BaselineStreamState.pending_events:
asyncio.Queue`, `_emit` / `_emit_all` helpers, outer generator runs
`tool_call_loop` as a background task + yields from queue concurrently.
3. `copilot/baseline/reasoning.py` —
`BaselineReasoningEmitter(render_in_ui=...)`, coalescing bumped to 64
chars / 50 ms.
4. `copilot/sdk/service.py` — `state.adapter.render_reasoning_in_ui`
threaded through every adapter construction.
5. `copilot/sdk/response_adapter.py` — `render_reasoning_in_ui` wiring +
service-layer yield filter gating for wire suppression while persistence
stays intact.
6. `copilot/stream_registry.py` — `count=config.stream_replay_count`.
7. `frontend/.../useCopilotStream.ts::handleReconnect` — 1500 ms
debounce.
8. `copilot/tools/web_search.py` + `models.py` — Sonar quick/deep paths,
`WebSearchResponse.answer` + typed extractors.
9. `frontend/.../GenericTool/*` — `answer` render + deep-aware labels /
accordion titles.
10. `executor/simulator.py` + `executor/manager.py` +
`copilot/config.py` — cost tracking + model swap + `user_id` threading.
### Changes
- `copilot/config.py` — new `render_reasoning_in_ui`,
`stream_replay_count`; `simulation_model` default flipped to Flash-Lite.
- `copilot/baseline/service.py` — `pending_events: asyncio.Queue`
refactor; outer gen runs loop as task, yields from queue live.
- `copilot/baseline/reasoning.py` —
`BaselineReasoningEmitter(render_in_ui=...)` + 64/50 coalesce.
- `copilot/sdk/service.py` + `response_adapter.py` —
`render_reasoning_in_ui` wire suppression (persistence preserved).
- `copilot/stream_registry.py` — replay cap from config.
- `copilot/tools/web_search.py` + `models.py` — Sonar quick/deep +
`answer` field + typed extractors.
- `copilot/tools/helpers.py` — tool description tightens `deep=true`
cost warning.
- `frontend/.../useCopilotStream.ts` — reconnect debounce.
- `frontend/.../GenericTool/GenericTool.tsx` + `helpers.ts` + tests —
render `answer`, deep-aware verbs / titles.
- `executor/simulator.py` + `simulator_test.py` + `executor/manager.py`
— cost tracking + model swap + user_id plumbing.
### Follow-up (deferred to a separate PR)
SDK per-token streaming via `include_partial_messages=True` was
attempted (commits `599e83543` + `530fa8f95`) and reverted here. The
two-signal model (StreamEvent partial deltas + AssistantMessage summary)
needs proper per-block diff tracking — when the partial stream delivers
a subset of the final block content, emit only
`summary.text[len(already_emitted):]` from the summary rather than
gating on a binary flag. Binary gating truncated replies in the field
when the partial stream delivered less than the summary (observed: "The
analysis template you" cut off mid-sentence because partial had streamed
that much and the rest only lived in the summary). SDK reasoning still
renders end-of-phase (as today); this PR's baseline per-token streaming
is unaffected.
### Checklist
For code changes:
- [x] Changes listed above
- [x] Test plan below
- [x] Tested according to the test plan:
- [x] `poetry run pytest backend/copilot/baseline/ backend/copilot/sdk/
backend/copilot/tools/web_search_test.py
backend/executor/simulator_test.py` — all pass (155 baseline + 927 SDK +
web_search + simulator)
- [x] `pnpm types && pnpm vitest run
src/app/(platform)/copilot/tools/GenericTool/` — pass
- [x] Manual: baseline live-streaming — Kimi K2.6 reasoning arrives
token-by-token, coalesced (no end-of-stream burst).
- [x] Manual: quick web_search via copilot UI — ~$0.005/call, answer +
citations rendered, cost logged as `provider=open_router`.
- [x] Manual: deep web_search — dispatched only on explicit research
phrasing; `sonar-deep-research` billed, UI labels say "Researched" / "N
research sources".
- [x] Manual: simulator dry-run — Gemini Flash-Lite, `[simulator] Turn
usage` log entry, PlatformCostLog row visible.
- [x] Manual: reconnect debounce — tab-focus thrash no longer produces
parallel XREADs in backend log.
- [ ] Manual: `CHAT_RENDER_REASONING_IN_UI=false` smoke-check —
reasoning collapse absent, no persisted reasoning row on reload.
For configuration changes:
- [x] `.env.default` — new config knobs fall back to pydantic defaults;
existing `CHAT_MODEL`/`CHAT_FAST_MODEL`/`CHAT_ADVANCED_MODEL` legacy
envs still honored upstream (unchanged by this PR).
### Companion PR
PR #12876 closes the `run_block`-via-copilot cost-leak gap (registers
`PerplexityBlock` / `FactCheckerBlock` in `BLOCK_COSTS`; documents the
credit/microdollar wallet boundary). Separate because the credit-wallet
side is orthogonal to the copilot microdollar / rate-limit surface this
PR ships.
### Why / What / How
**Why:** The old `CompactionTracker` set a `_done` flag after the first
completion and short-circuited every subsequent compaction in the same
turn. That blocked the SDK-internal compaction from running after a
pre-query compaction had already fired, so prompt-too-long errors
couldn't actually recover — retries saw the flag, bailed, and we re-hit
the context limit.
**What:** Drop the `_done` flag, track attempts and completions as
separate lists, and expose counters + an observability metadata builder
so callers can record compaction activity per turn.
**How:**
- Remove `_done` and `_compact_start` short-circuits.
- Track `_attempted_sources` / `_completed_sources` /
`_completed_count`.
- Expose `attempt_count`, `completed_count`, and
`get_observability_metadata()` / `get_log_summary()` for downstream
instrumentation (no caller change required in this PR).
### Changes 🏗️
- `backend/copilot/sdk/compaction.py` — rewritten `CompactionTracker`
internals; adds properties + observability helpers.
- `backend/copilot/sdk/compaction_test.py` — tests for multi-compaction
flow + new counters.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] `poetry run pytest backend/copilot/sdk/compaction_test.py -xvs`
passes
- [ ] Local chat that hits prompt-too-long now recovers via SDK
compaction instead of failing the turn
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Changes core streaming compaction state transitions and persistence
timing, which could affect UI event sequencing or compaction completion
behavior under concurrency; coverage is improved with new
multi-compaction tests.
>
> **Overview**
> Fixes `CompactionTracker` so compaction is no longer single-shot per
turn: removes the `_done`/event-gate behavior, queues multiple
`on_compact()` hook firings via a pending transcript-path deque, and
allows subsequent SDK-internal compactions after a pre-query compaction
within the same query.
>
> Adds lightweight instrumentation by tracking attempt/completion
sources and counts, plus `get_observability_metadata()` and
`get_log_summary()` (including source summaries like `sdk_internal:2`).
Updates/expands tests to cover multi-compaction flows, transcript-path
handling, and the new counters/metadata.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
9bf8cdd367. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: majdyz <zamil.majdy@agpt.co>
### Why / What / How
**Why.** Three unrelated but interlocking problems on the baseline
(OpenRouter) copilot path, all blocking us from making Kimi K2.6 the
default fast model:
1. **Cost / capability gap on the default.** Kimi K2.6 prices at $0.60 /
$2.80 per MTok — ~5x cheaper input and ~5.4x cheaper output than Sonnet
4.6 — while tying Opus on SWE-Bench Verified (80.2% vs 80.8%) and
beating it on SWE-Bench Pro (58.6% vs 53.4%). OpenRouter exposes the
same `reasoning` / `include_reasoning` extension on Moonshot endpoints
that #12870 plumbed for Anthropic, so the reasoning collapse lights up
end-to-end without per-provider code.
2. **Kimi reasoning deltas freeze the UI.** K2.6 emits ~4,700
reasoning-delta SSE events per turn vs ~28 on Sonnet — the AI SDK v6
Reasoning UIMessagePart can't keep up and the tab locks. Needs a
coalescing buffer upstream.
3. **Kimi loops on `require_guide_read`.** The guide-guard checks
`session.messages` for a prior `agent_building_guide` call, but tool
calls aren't flushed to `session.messages` until the end of the turn —
mid-turn the check keeps returning False and Kimi calls the guide-load
tool repeatedly in the same turn. Needs an in-flight tracker that lives
on `ChatSession`.
4. **No `web_search` tool on either path.** Kimi doesn't have a native
web-search equivalent and the SDK path's native `WebSearch` (the Claude
Code CLI's built-in) doesn't carry cost accounting. We need one
implementation that both paths share and that reports cost through the
same tracker as every other tool call.
**What.** Five grouped deltas on the baseline service, tool layer, and
config:
- **Kimi K2.6 default.** `fast_standard_model` defaults to
`moonshotai/kimi-k2.6`. Full 2×2 model matrix below. Rollback is one env
var.
- **4-config model matrix.** `fast_standard_model` /
`fast_advanced_model` / `thinking_standard_model` /
`thinking_advanced_model`. Each cell independent so baseline can run a
cheap provider at the standard tier without leaking into the SDK path
(which is Anthropic-only by CLI contract). Legacy env vars
(`CHAT_MODEL`, `CHAT_FAST_MODEL`, `CHAT_ADVANCED_MODEL`) stay aliased
via `validation_alias` so live deployments keep resolving to the same
effective cell.
- **Reasoning delta coalescing.** `BaselineReasoningEmitter` buffers
deltas and flushes on a char-count OR time-interval threshold (32 chars
/ 40 ms). ~4,700 → ~150 SSE events per turn on Kimi; no perceptible
change on Sonnet (which was already well under the threshold).
- **In-flight tool-call tracker.** `ChatSession._inflight_tool_calls`
PrivateAttr is populated when a tool-call block is emitted and cleared
at turn end. `session.has_tool_been_called_this_turn(name)` now returns
True mid-turn, not just after the tool-result lands in
`session.messages` — which is what `require_guide_read` needs to cut the
loop.
- **New `web_search` copilot tool.** Wraps Anthropic's server-side
`web_search_20250305` beta via `AsyncAnthropic` (direct — OpenRouter
can't proxy server-side tool execution). Dispatches through
`claude-haiku-4-5` with `max_uses=1`. Cost estimated from published
rates ($0.010 per search + Haiku tokens) since the Anthropic Messages
API doesn't report cost on the response; reported to
`persist_and_record_usage(provider='anthropic')` on both paths. SDK
native `WebSearch` moved from `_SDK_BUILTIN_ALWAYS` into
`SDK_DISALLOWED_TOOLS` so both paths now dispatch through
`mcp__copilot__web_search`.
**How.**
1. `copilot/config.py` — 2×2 model fields with `AliasChoices` preserving
legacy env var names. `populate_by_name = True` so
`ChatConfig(fast_standard_model=...)` works in tests.
2. `copilot/baseline/service.py::_resolve_baseline_model` — resolves the
active baseline cell from `mode` + `tier`, no longer delegates to the
SDK resolver.
3. `copilot/baseline/reasoning.py` — `BaselineReasoningEmitter` gains
`_pending_delta` / `_last_flush_monotonic` and flushes on
`len(_pending_delta) >= _COALESCE_MIN_CHARS` OR `monotonic() -
_last_flush_monotonic >= _COALESCE_MAX_INTERVAL_MS / 1000`.
`_is_reasoning_route` rewritten as an anchored prefix match covering
`anthropic/`, `anthropic.`, `moonshotai/`, and `openrouter/kimi-` —
split from the narrower `_is_anthropic_model` gate that still governs
`cache_control` markers (which Kimi doesn't support).
4. `copilot/model.py::ChatSession` — `_inflight_tool_calls: set[str] =
PrivateAttr(default_factory=set)` plus `announce_inflight_tool_call` /
`clear_inflight_tool_calls` / `has_tool_been_called_this_turn`.
5. `copilot/tools/helpers.py::require_guide_read` — check
`session.has_tool_been_called_this_turn(_AGENT_GUIDE_TOOL_NAME)` before
falling back to scanning `session.messages`.
6. `copilot/tools/web_search.py` — new `WebSearchTool` +
`_extract_results` + `_estimate_cost_usd`. `is_available` gated on
`Settings().secrets.anthropic_api_key` so the deployment can roll back
just by unsetting the key.
7. `copilot/tools/__init__.py` — registers `web_search` in
`TOOL_REGISTRY` so it becomes `mcp__copilot__web_search` in the SDK
path.
8. `copilot/sdk/tool_adapter.py` — `WebSearch` moves to
`SDK_DISALLOWED_TOOLS`.
### Changes
- `copilot/config.py` — 2×2 model matrix with legacy env alias
preservation; `populate_by_name=True`.
- `copilot/baseline/service.py::_resolve_baseline_model` — resolves
against the new matrix.
- `copilot/baseline/reasoning.py` — `BaselineReasoningEmitter`
coalescing buffer; `_is_reasoning_route` rewritten as anchored prefix
match (covers `anthropic/`, `anthropic.`, `moonshotai/`,
`openrouter/kimi-`).
- `copilot/model.py::ChatSession` — `_inflight_tool_calls` PrivateAttr +
helpers.
- `copilot/baseline/service.py::_baseline_tool_executor` — calls
`announce_inflight_tool_call` after emitting `StreamToolInputAvailable`;
`clear_inflight_tool_calls` in the outer `finally` before persist.
- `copilot/tools/helpers.py::require_guide_read` — reads the new tracker
first.
- `copilot/tools/web_search.py` (new) — Anthropic `web_search_20250305`
wrapper + cost estimator.
- `copilot/tools/web_search_test.py` (new) — extractor / cost / dispatch
/ registry tests (12 total).
- `copilot/tools/models.py` — `WebSearchResponse` + `WebSearchResult` +
`ResponseType.WEB_SEARCH`.
- `copilot/tools/__init__.py` — registers `web_search`.
- `copilot/sdk/tool_adapter.py` — moves native `WebSearch` to
`SDK_DISALLOWED_TOOLS`.
### Checklist
For code changes:
- [x] Changes listed above
- [x] Test plan below
- [ ] Tested according to the test plan:
- [x] `poetry run pytest backend/copilot/baseline/` — all pass
- [x] `poetry run pytest backend/copilot/sdk/` — all pass (SDK resolver
untouched)
- [x] `poetry run pytest backend/copilot/tools/web_search_test.py` — 12
pass
- [ ] Manual: send a multi-step prompt on fast mode with default config;
confirm backend routes to `moonshotai/kimi-k2.6`, SSE stream carries
`reasoning-start/delta/end` (coalesced), Reasoning collapse renders +
survives hard reload.
- [ ] Manual: 43-tool payload reliability on Kimi — watch for malformed
tool-call JSON or wrong-tool selection.
- [ ] Manual: `CHAT_FAST_STANDARD_MODEL=anthropic/claude-sonnet-4-6`
restarts confirm Sonnet routing (rollback path works).
- [ ] Manual: SDK path (`CHAT_USE_CLAUDE_AGENT_SDK=true`) still selects
the SDK service and uses `thinking_standard_model` = Sonnet (no Kimi
leaked into extended thinking).
- [ ] Manual: prompt that forces `web_search` — confirm results render,
`persist_and_record_usage(provider='anthropic')` runs, cost lands in the
per-user ledger.
- [ ] Manual: ask Kimi a question that would require
`agent_building_guide` — confirm the guide loads exactly once per turn
(no loop).
For configuration changes:
- [x] `.env.default` — all four model fields fall back to the pydantic
defaults; legacy `CHAT_MODEL` / `CHAT_FAST_MODEL` /
`CHAT_ADVANCED_MODEL` remain honored via `AliasChoices`.
### Why / What / How
**Why:** The share page was unbranded (no logo/navigation) and images
from workspace files couldn't render because the proxy didn't handle
public share URLs. Zip downloads also had several gaps — no size limits,
no workspace file support, silent failures on data URLs, and single
files got wrapped in unnecessary zips.
**What:** Adds AutoGPT branding to the share page, secure public access
to workspace files via a SharedExecutionFile allowlist, and a hardened
zip download module.
**How:** Backend scans execution outputs for `workspace://` URIs on
share-enable and persists an allowlist in a new `SharedExecutionFile`
table. A new unauthenticated endpoint serves files validated against
this allowlist. Frontend proxy routing is extended (with UUID
validation) to handle the 7-segment public share download path as a
binary response. Download logic is consolidated into a shared module
with size limits, parallel fetches, filename sanitization, and
single-file direct download.
### Changes 🏗️
**Share page branding:**
- AutoGPT logo header centered at top, linking to `/`
- Dark/light mode variants with correct `priority` on visible variant
only
**Secure public workspace file access (backend):**
- New `SharedExecutionFile` Prisma model with `@@unique([shareToken,
fileId])` constraint
- `_extract_workspace_file_ids()` scans outputs for `workspace://` URIs
(handles nested dicts/lists)
- `create_shared_execution_files()` / `delete_shared_execution_files()`
manage allowlist lifecycle
- Re-share cleans up stale records before creating new ones (prevents
old token access)
- `GET /public/shared/{token}/files/{id}/download` — validates against
allowlist, uniform 404 for all failures
- `Content-Disposition: inline` for share page rendering
- Hand-written Prisma migration
(`20260417000000_add_shared_execution_file`)
**Frontend proxy fix:**
- `isWorkspaceDownloadRequest` extended to match public share path
(7-segment)
- UUID format validation on dynamic path segments (file IDs, share
tokens)
- 30+ adversarial security tests: path traversal, SQL injection, SSRF
payloads, unicode homoglyphs, null bytes, prototype pollution, etc.
**Download module (`download-outputs.ts`):**
- Consolidated from two divergent copies into single shared module
- `fetchFileAsBlob` with content-length pre-check before buffering
- `sanitizeFilename` strips path traversal, leading dots, falls back to
"file"
- `getUniqueFilename` deduplicates with counter suffix
- `fetchInParallel` with configurable concurrency (5)
- 50 MB per-file limit, 200 MB aggregate limit
- Data URL try-catch, relative URL support (`/api/proxy/...`)
- Single-file downloads skip zip, go directly to browser download
- Dynamic JSZip import for bundle optimization
- 26 unit tests
**Share page file rendering:**
- `WorkspaceFileRenderer` builds public share URLs when `shareToken` is
in metadata
- `RunOutputs` propagates `shareToken` to renderer metadata
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Share page renders with centered AutoGPT logo
- [x] Logo links to `/` and shows correct dark/light variant
- [x] Workspace images render inline on share page
- [x] Download all produces zip with workspace images included
- [x] Single-file download skips zip, downloads directly
- [x] Re-sharing generates new token and cleans up old allowlist records
- [x] Public file download returns 404 for files not in allowlist
- [x] All frontend tests pass (122 tests across 3 suites)
- [x] Backend formatter + pyright pass
- [x] Frontend format + lint + types pass
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
> Note: New Prisma migration required. No env/docker changes needed.
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Adds a new unauthenticated file download path gated by a database
allowlist plus a new Prisma model/migration; mistakes here could expose
workspace files or break sharing. Frontend download behavior also
changes significantly (zipping/fetching), which could impact
large-output performance and edge cases.
>
> **Overview**
> Enables **public rendering and downloading of workspace files on
shared execution pages** by introducing a `SharedExecutionFile`
allowlist tied to the share token and populating it when sharing is
enabled (and clearing it on disable/re-share).
>
> Adds `GET /public/shared/{share_token}/files/{file_id}/download` (no
auth) that validates the requested file against the allowlist and
returns a uniform 404 on failure; workspace download responses now
support `inline` `Content-Disposition` via the exported
`create_file_download_response` helper.
>
> Frontend updates the share page to pass `shareToken` into output
renderers so `WorkspaceFileRenderer` can build public-share download
URLs; the proxy matcher is extended/strictly UUID-validated for both
workspace and public-share download paths with extensive adversarial
tests. Output downloading is consolidated into `download-outputs.ts`
using dynamic `jszip` import, filename sanitization/deduping,
concurrency + size limits, and a single-file non-zip fast path.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
e2f5bd9b5a. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <ntindle@users.noreply.github.com>
Co-authored-by: Otto <otto@agpt.co>
## Why
AutoPilot (CoPilot) needs to reach users across chat platforms — Discord
first, Telegram / Slack / Teams / WhatsApp next. To make usage and
billing coherent, every conversation resolves to one AutoGPT account.
There are two independent linking flows:
- **SERVER links**: the first person to claim a server (Discord guild,
Telegram group, …) becomes its owner. Anyone in the server can chat with
the bot; all usage bills to the owner.
- **USER links**: an individual links their 1:1 DMs with the bot to
their own AutoGPT account. Independent from server links — a server
owner still has to link their DMs separately.
## What
Backend for platform linking, split cleanly by trust boundary:
- **Bot-facing operations** run over cluster-internal RPC via a new
`PlatformLinkingManager(AppService)`. No shared bearer token; trust is
the cluster network itself.
- **User-facing operations** stay on REST under JWT auth (the same
pattern as every other feature).
### REST endpoints (JWT auth)
- `GET /api/platform-linking/tokens/{token}/info` — non-sensitive
display info for the link page
- `POST /api/platform-linking/tokens/{token}/confirm` — confirm a SERVER
link
- `POST /api/platform-linking/user-tokens/{token}/confirm` — confirm a
USER link
- `GET /api/platform-linking/links` / `DELETE /links/{id}` — manage
server links
- `GET /api/platform-linking/user-links` / `DELETE /user-links/{id}` —
manage DM links
### `PlatformLinkingManager` `@expose` methods (internal RPC)
- `resolve_server_link(platform, platform_server_id) -> ResolveResponse`
- `resolve_user_link(platform, platform_user_id) -> ResolveResponse`
- `create_server_link_token(req) -> LinkTokenResponse`
- `create_user_link_token(req) -> LinkTokenResponse`
- `get_link_token_status(token) -> LinkTokenStatusResponse`
- `start_chat_turn(req) -> ChatTurnHandle` — resolves the owner,
persists the user message, creates the stream-registry session, enqueues
the turn; returns `(session_id, turn_id, user_id, subscribe_from="0-0")`
so the caller subscribes directly to the per-turn Redis stream.
### New DB models
- `PlatformLink` — `(platform, platformServerId)` → owner's AutoGPT
`userId`
- `PlatformUserLink` — `(platform, platformUserId)` → AutoGPT `userId`
(for DMs)
- `PlatformLinkToken` — one-time token with `linkType` discriminator
(SERVER | USER) and 30-min TTL
## How
- **New `backend/platform_linking/` package**: `models.py` (Pydantic
types), `links.py` (link CRUD helpers — pure business logic), `chat.py`
(`start_chat_turn` orchestration), `manager.py`
(`PlatformLinkingManager(AppService)` + `PlatformLinkingManagerClient`).
Pattern matches `backend/notifications/` + `backend/data/db_manager.py`.
- **Exception translation at the edge**. Helpers raise domain exceptions
(`NotFoundError`, `LinkAlreadyExistsError`, `LinkTokenExpiredError`,
`LinkFlowMismatchError`, `NotAuthorizedError` — all `ValueError`
subclasses in `backend.util.exceptions` so they auto-register with the
RPC exception-mapping). REST routes translate to HTTP codes via a 7-line
`_translate()` helper.
- **Independent scopes, no DM fallback**. `find_server_link()` and
`find_user_link()` each query their own table. A user who owns a linked
server does not leak that identity into their DMs.
- **Race-safe token consumption**. Confirm paths do atomic `update_many`
with `usedAt = None` + `expiresAt > now` in the WHERE clause;
`create_*_token` invalidates pending tokens before issuing a new one.
- **Bug fix**: `start_chat_turn` persists the user message via
`append_and_save_message` before enqueueing the executor turn — mirrors
`backend/api/features/chat/routes.py`. The previous `chat_proxy.py`
skipped this and ran the executor with no user message in history.
- **Streaming**. Copilot streaming lives on Redis Streams (persistent,
replayable). The bot subscribes directly with `subscribe_from="0-0"`, so
late subscribers replay the full stream; no HTTP SSE proxy needed.
- **No PII in logs**: logs reference `session_id`, `turn_id`,
`server_id`, and AutoGPT `user_id` (last 8 chars), but never raw
platform user IDs.
- **New pod**. `PlatformLinkingManager` runs as its own `AppProcess` on
port `8009`; client via `get_platform_linking_manager_client()`. The
infra chart lands in
[cloud-infrastructure#310](https://github.com/Significant-Gravitas/AutoGPT_cloud_infrastructure/pull/310).
## Tests
- **Models** (`models_test.py`) — Platform / LinkType enums, request
validation (CreateLinkToken / ResolveServer / BotChat), response
schemas.
- **Helpers** (`links_test.py`) — resolve, token create (both flows, 409
on already-linked), token status (pending / linked / expired /
superseded-with-no-link), token info (404 / 410), confirm (404 / wrong
flow / already used / expired / same-user / other-user), delete authz.
- **AppService wiring** (`manager_test.py`) — `@expose` methods delegate
to helpers; client surface covers bot-facing ops and excludes
user-facing ones.
- **Adversarial** (`manager_test.py`, `routes_test.py`):
- `asyncio.gather` double-confirm with same user and with two different
users — exactly one winner, other gets clean `LinkTokenExpiredError`, no
double `PlatformLink.create`.
- Server- and user-link confirm races.
- `TokenPath` regex guard: rejects `%24`, URL-encoded path traversal,
>64 chars; accepts `secrets.token_urlsafe` shape.
- DELETE `link_id` with SQL-injection-style and path-traversal inputs
returns 404 via `NotFoundError`.
## Stack
- #12618 — bot service (rebased onto this so it can consume
`PlatformLinkingManagerClient`)
- #12624 — `/link/{token}` frontend page
-
[cloud-infrastructure#310](https://github.com/Significant-Gravitas/AutoGPT_cloud_infrastructure/pull/310)
— Helm chart for `copilot-bot` + new `platform-linking-manager`
Merge order: this → #12618 → #12624, infra whenever.
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
### Why / What / How
https://github.com/user-attachments/assets/ca26e0b0-d35d-4a5b-b95f-2421b9907742
**Why** — The Artifact & Side Task List project
(https://linear.app/autogpt/project/artifact-and-side-task-list-ef863c93da3c)
accumulated seven related bugs in the copilot artifact panel. The user
kept seeing panels stuck open, previews broken, clicks not registering —
each ticket was small but they all lived in the same small surface area,
so one review pass is easier than five.
Closes SECRT-2254, SECRT-2223, SECRT-2220, SECRT-2255, SECRT-2224,
SECRT-2256, SECRT-2221.
**What** — Five independent fixes, each in its own commit, shipped
together:
1. **Fragment-link interceptor + render error boundary** (SECRT-2255
crash when clicking `<a href="#x">` in HTML artifacts). Sandboxed srcdoc
iframes resolve fragment links against the parent's URL, so clicking
`#activation` in a Plotly TOC tried to navigate the copilot page into
the iframe. Inject a click-capture script into every artifact iframe;
also wrap the renderer in `ArtifactErrorBoundary` so any future render
throw surfaces with a copyable error instead of a blank panel.
2. **Close panel on copilot page unmount** (SECRT-2254 / 2223 / 2220 —
panel stays open, reopens on unrelated navigation, opens by default on
session switch). The Zustand store outlived page unmounts, so `isOpen:
true` survived `/profile` → `/home` → back. One `useEffect` cleanup in
`useAutoOpenArtifacts` calls `resetArtifactPanel()` on unmount.
3. **Sync loading flip on Try Again** (SECRT-2224 "try again doesn't do
anything"). Retry was correct but the loading-state flip was deferred to
an effect, so a retry that re-failed was visually indistinguishable from
a no-op. `retry()` now sets `isLoading: true` / `error: null`
synchronously with the click so the skeleton flashes every time.
4. **Pointer capture on resize drag** (SECRT-2256 "can't drag right when
expanded far left, click doesn't stop it"). The sandboxed iframe was
eating `pointermove`/`pointerup` events when the cursor drifted over it,
freezing the drag and never delivering the release. `setPointerCapture`
on the handle routes all subsequent pointer events through it regardless
of what's under the cursor.
5. **Stop size-gating natively-rendered artifacts + cache-bust retry**
(SECRT-2221 "broken hi-res PNG preview"). The blanket >10 MB size gate
pushed large images / videos / PDFs into `download-only`, so clicking a
hi-res PNG offered a download instead of a preview. Split the gate so it
only applies to content we actually render in JS (text/html/code/etc).
Image and video retries also append a cache-bust query so the browser
can't silently reuse a negative-cached failure.
**How** — Five commits, one concern each, preserved in the order they
were written. Every fix lands with a regression test that fails on the
unfixed code and passes after.
### Changes 🏗️
- `iframe-sandbox-csp.ts` + usage sites —
`FRAGMENT_LINK_INTERCEPTOR_SCRIPT` injected into all three srcdoc iframe
templates (HTML artifact, inline HTMLRenderer, React artifact).
- `ArtifactErrorBoundary.tsx` (new) — class error boundary local to the
artifact panel with a copyable error fallback.
- `useAutoOpenArtifacts.ts` — unmount cleanup calls
`resetArtifactPanel()`.
- `useArtifactContent.ts` — `retry()` flips loading state synchronously.
- `ArtifactDragHandle.tsx` — `setPointerCapture` /
`releasePointerCapture`; `touch-action: none`.
- `helpers.ts` — split classifier; `NATIVELY_RENDERED` exempts
image/video/pdf from the size gate.
- `ArtifactContent.tsx` — image/video carry a retry nonce that appends
`?_retry=N` on Try Again.
- Test files — new
`ArtifactErrorBoundary`/`ArtifactDragHandle`/`HTMLRenderer` tests, plus
regression cases added to `ArtifactContent.test.tsx`, `helpers.test.ts`,
`iframe-sandbox-csp.test.ts`, `reactArtifactPreview.test.ts`,
`useAutoOpenArtifacts.test.ts`.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `pnpm vitest run src/app/\(platform\)/copilot
src/components/contextual/OutputRenderers
src/lib/__tests__/iframe-sandbox-csp.test.ts` — 247/247 pass
- [x] `pnpm format && pnpm types` clean
- [x] Manual: open the Plotly-style TOC HTML artifact (SECRT-2255
repro), click each anchor — iframe scrolls internally, browser URL bar
stays put
- [x] Manual: open panel → navigate to /profile → navigate back → panel
closed (SECRT-2254)
- [x] Manual: panel open in session A → click different session → panel
closed (SECRT-2223)
- [ ] Manual: simulate a failed artifact fetch → click Try Again →
skeleton flashes before result (SECRT-2224)
- [x] Manual: expand panel to near-full width → drag back right,
crossing over the iframe → drag keeps working and release ends it
(SECRT-2256)
- [x] Manual: upload a ~25 MB PNG → clicking it previews in an `<img>`,
not a download button (SECRT-2221)
Replaces #12836, #12837, #12838, #12839, #12840 — same fixes, bundled
for review.
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Touches artifact rendering and iframe `srcDoc` generation (including
injected scripts) plus panel state/drag interactions; regressions could
break previews or resizing, but changes are scoped to the copilot
artifact UI with broad test coverage.
>
> **Overview**
> Improves Copilot’s artifact panel resilience and UX by **resetting
panel state on page unmount/session changes**, making content retries
immediately show the loading skeleton, and fixing resize drags via
pointer capture so iframes can’t “steal” pointer events.
>
> Hardens artifact rendering by adding a local `ArtifactErrorBoundary`
that reports to Sentry and shows a copyable error fallback instead of a
blank/crashed panel.
>
> Fixes iframe-based previews by injecting a
`FRAGMENT_LINK_INTERCEPTOR_SCRIPT` into HTML and React artifact `srcDoc`
so `#anchor` clicks scroll within the iframe rather than navigating the
parent URL, and adjusts artifact classification/retry behavior so large
images/videos/PDFs remain previewable and image/video retries cache-bust
failed URLs.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
bde37a13fd. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
### Why
The flow builder had no AI assistance. Users had to switch to a separate
Copilot session to ask about or modify the agent they were looking at,
and that session had no context on the graph — so the LLM guessed, or
the user had to describe the graph by hand.
### What
An AI chat panel anchored to the `/build` page. Opens with a chat-circle
button (bottom-right), binds to the currently-opened agent, and offers
**only** two tools: `edit_agent` and `run_agent`. Per-agent session is
persisted server-side, so a refresh resumes the same conversation. Gated
behind `Flag.BUILDER_CHAT_PANEL` (default off;
`NEXT_PUBLIC_FORCE_FLAG_BUILDER_CHAT_PANEL=true` to enable locally).
### How
**Frontend — new**:
- `(platform)/build/components/BuilderChatPanel/` — panel shell +
`useBuilderChatPanel.ts` coordinator. Renders the shared Copilot
`ChatMessagesContainer` + `ChatInput` (thought rendering, pulse chips,
fast-mode toggle — all reused, no parallel chat stack). Auto-creates a
blank agent when opened with no `flowID`. Listens for `edit_agent` /
`run_agent` tool outputs and wires them to the builder in-place: edit →
`flowVersion` URL param + canvas refetch; run → `flowExecutionID` URL
param → builder's existing execution-follow UI opens.
**Frontend — touched (minimal)**:
- `copilot/components/CopilotChatActionsProvider` — new `chatSurface:
"copilot" | "builder"` flag so cards can suppress "Open in library" /
"Open in builder" / "View Execution" buttons when the chat is the
builder panel (you're already there).
- `copilot/tools/RunAgent/components/ExecutionStartedCard` — title is
now status-aware (`QUEUED → "Execution started"`, `COMPLETED →
"Execution completed"`, `FAILED → "Execution failed"`, etc.).
- `build/components/FlowEditor/Flow/Flow.tsx` — mount the panel behind
the feature flag.
**Backend — new**:
- `copilot/builder_context.py` — the builder-session logic module. Holds
the tool whitelist (`edit_agent`, `run_agent`), the permissions
resolver, the session-long system-prompt suffix (graph id/name + full
agent-building guide — cacheable across turns), and the per-turn
`<builder_context>` prefix (live version + compact nodes/links
snapshot).
- `copilot/builder_context_test.py` — covers both builders, ownership
forwarding, and cap behavior.
**Backend — touched**:
- `api/features/chat/routes.py` — `CreateSessionRequest` gains
`builder_graph_id`. When set, the endpoint routes through
`get_or_create_builder_session` (keyed on `user_id`+`graph_id`, with a
graph-ownership check). No new route; the former `/sessions/builder` is
folded into `POST /sessions`.
- `copilot/model.py` — `ChatSessionMetadata.builder_graph_id`;
`get_or_create_builder_session` helper.
- `data/graph.py` — `GraphSettings.builder_chat_session_id` (new typed
field; stores the builder-chat session pointer per library agent).
- `api/features/library/db.py` —
`update_library_agent_version_and_settings` preserves
`builder_chat_session_id` across graph-version bumps.
- `copilot/tools/edit_agent.py`, `run_agent.py` — builder-bound guard:
default missing `agent_id` to the bound graph, reject any other id.
`run_agent` additionally inlines `node_executions` into dry-run
responses so the LLM can inspect per-node status in the same turn
instead of a follow-up `view_agent_output`. `wait_for_result` docs now
explain the two dispatch modes.
- `copilot/tools/helpers.py::require_guide_read` — bypassed for
builder-bound sessions (the guide is already in the system-prompt
suffix).
- `copilot/tools/agent_generator/pipeline.py` + `tools/models.py` —
`AgentSavedResponse.graph_version` so the frontend can flip
`flowVersion` to the newly-saved version.
- `copilot/baseline/service.py` + `sdk/service.py` — inject the builder
context suffix into the system prompt and the per-turn prefix into the
current user message.
- `blocks/_base.py` — `validate_data(..., exclude_fields=)` so dry-run
can bypass credential required-checks for blocks that need creds in
normal mode (OrchestratorBlock). `blocks/perplexity.py` override
signature matches.
- `executor/simulator.py` — OrchestratorBlock dry-run iteration cap `1 →
min(original, 10)` so multi-role patterns (Advocate/Critic) actually
close the loop; `manager.py` synthesizes placeholder creds in dry-run so
the block's schema validation passes.
### Session lookup
The builder-chat session pointer lives on
`LibraryAgent.settings.builder_chat_session_id` (typed via
`GraphSettings`). `get_or_create_builder_session` reads/writes it
through `library_db().get_library_agent_by_graph_id` +
`update_library_agent(settings=...)` — no raw SQL or JSON-path filter.
Ownership is enforced by the library-agent query's `userId` filter. The
per-session builder binding still lives on
`ChatSession.metadata.builder_graph_id` (used by
`edit_agent`/`run_agent` guards and the system-prompt injection).
### Scope footnotes
- Feature flag defaults **false**. Rollout gate lives in LaunchDarkly.
- No schema migration required: `builder_chat_session_id` slots into the
existing `LibraryAgent.settings` JSON column via the typed
`GraphSettings` model.
- Commits that address review / CI cycles are interleaved with feature
commits — see the commit log for the per-change rationale.
### Test plan
- [x] `pnpm test:unit` + backend `poetry run test` for new and touched
modules
- [x] Agent-browser pass: panel toggle / auto-create / real-time edit
re-render / real-time exec URL subscribe / queue-while-streaming /
cross-graph reset / hard-refresh session persist
- [x] Codecov patch ≥ 80% on diff
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### Why / What / How
**Why:** Every request that went through Next's rewrite proxy broke
distributed tracing. The browser Sentry SDK emitted `sentry-trace` and
`baggage`, but `createRequestHeaders` only forwarded impersonation + API
key, so the backend started a disconnected transaction. The frontend →
backend lineage never appeared in Sentry. Same gap on
direct-from-browser requests: the custom mutator never attached the
trace headers itself, so even non-proxied paths lost the link.
**What:**
- **Server side:** forward `sentry-trace` and `baggage` from
`originalRequest.headers` alongside the existing impersonation/API key
forwarding.
- **Client side:** the custom mutator pulls trace data via
`Sentry.getTraceData()` and attaches it to outgoing headers when running
on the client.
**How:** Inline additions — no new observability module, no new
dependencies beyond `@sentry/nextjs` which the frontend already uses for
Sentry init.
### Changes 🏗️
- `src/lib/autogpt-server-api/helpers.ts` — forward `sentry-trace` +
`baggage` in `createRequestHeaders`.
- `src/app/api/mutators/custom-mutator.ts` — import `@sentry/nextjs`,
attach `Sentry.getTraceData()` on client-side requests.
- `src/app/api/mutators/__tests__/custom-mutator.test.ts` — three new
tests: trace-data present, trace-data empty, server-side no-op.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [x] `pnpm vitest run
src/app/api/mutators/__tests__/custom-mutator.test.ts` passes (6/6
locally)
- [x] `pnpm format && pnpm lint` clean
- [x] `pnpm types` clean for touched files (pre-existing unrelated type
errors on dev are untouched)
- [ ] In a local session with Sentry enabled, a `/copilot` chat turn
produces a distributed trace that spans frontend transaction → backend
transaction (single trace ID in Sentry)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Low Risk**
> Low risk: header-only changes to request construction for
observability, with added tests; primary risk is unintended header
propagation affecting upstream/proxy behavior.
>
> **Overview**
> Restores **Sentry distributed tracing continuity** for
frontend→backend calls by propagating `sentry-trace`/`baggage` headers.
>
> On the client, `customMutator` now reads `Sentry.getTraceData()` and
attaches string trace headers to outgoing requests (guarded for
server-side and older Sentry builds). On the server/proxy path,
`createRequestHeaders` now forwards `sentry-trace` and `baggage` from
the incoming `originalRequest` alongside existing impersonation/API-key
forwarding, with new unit tests covering these cases.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
0f6946b776. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
### Changes 🏗️
This PR adds a comprehensive admin diagnostics dashboard for monitoring
system health and managing running executions.
https://github.com/user-attachments/assets/f7afa3ed-63d8-4b5c-85e4-8756d9e3879e
#### Backend Changes:
- **New data layer** (backend/data/diagnostics.py): Created a dedicated
diagnostics module following the established data layer pattern
- get_execution_diagnostics() - Retrieves execution metrics (running,
queued, completed counts)
- get_agent_diagnostics() - Fetches agent-related metrics
- get_running_executions_details() - Lists all running executions with
detailed info
- stop_execution() and stop_executions_bulk() - Admin controls for
stopping executions
- **Admin API endpoints**
(backend/server/v2/admin/diagnostics_admin_routes.py):
- GET /admin/diagnostics/executions - Execution status metrics
- GET /admin/diagnostics/agents - Agent utilization metrics
- GET /admin/diagnostics/executions/running - Paginated list of running
executions
- POST /admin/diagnostics/executions/stop - Stop single execution
- POST /admin/diagnostics/executions/stop-bulk - Stop multiple
executions
- All endpoints secured with admin-only access
#### Frontend Changes:
- **Diagnostics Dashboard**
(frontend/src/app/(platform)/admin/diagnostics/page.tsx):
- Real-time system metrics display (running, queued, completed
executions)
- RabbitMQ queue depth monitoring
- Agent utilization statistics
- Auto-refresh every 30 seconds
- **Execution Management Table**
(frontend/src/app/(platform)/admin/diagnostics/components/ExecutionsTable.tsx):
- Displays running executions with: ID, Agent Name, Version, User
Email/ID, Status, Start Time
- Multi-select functionality with checkboxes
- Individual stop buttons for each execution
- "Stop Selected" and "Stop All" bulk actions
- Confirmation dialogs for safety
- Pagination for handling large datasets
- Toast notifications for user feedback
#### Security:
- All admin endpoints properly secured with requires_admin_user
decorator
- Frontend routes protected with role-based access controls
- Admin navigation link only visible to admin users
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verified admin-only access to diagnostics page
- [x] Tested execution metrics display and auto-refresh
- [x] Confirmed RabbitMQ queue depth monitoring works
- [x] Tested stopping individual executions
- [x] Tested bulk stop operations with multi-select
- [x] Verified pagination works for large datasets
- [x] Confirmed toast notifications appear for all actions
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
(no changes needed)
- [x] `docker-compose.yml` is updated or already compatible with my
changes (no changes needed)
- [x] I have included a list of my configuration changes in the PR
description (no config changes required)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Adds new admin-only endpoints that can stop, requeue, and bulk-mark
executions as `FAILED`, plus schedule deletion, which can directly
impact production workload and data integrity if misused or buggy.
>
> **Overview**
> Introduces a **System Diagnostics** admin feature spanning backend +
frontend to monitor execution/schedule health and perform remediation
actions.
>
> On the backend, adds a new `backend/data/diagnostics.py` data layer
and `diagnostics_admin_routes.py` with admin-secured endpoints to fetch
execution/agent/schedule metrics (including RabbitMQ queue depths and
invalid-state detection), list problem executions/schedules, and perform
bulk operations like `stop`, `requeue`, and `cleanup` (marking
orphaned/stuck items as `FAILED` or deleting orphaned schedules). It
also extends `get_graph_executions`/`get_graph_executions_count` with
`execution_ids` filtering, pagination, started/updated time filters, and
configurable ordering to support efficient bulk/admin queries.
>
> On the frontend, adds an admin diagnostics page with summary cards and
tables for executions and schedules (tabs for
orphaned/failed/long-running/stuck-queued/invalid, plus confirmation
dialogs for destructive actions), wires it into admin navigation, and
adds comprehensive unit tests for both the new API routes and UI
behavior.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
15b9ed26f9. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <ntindle@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
### Why / What / How
**Why:** Fast-mode autopilot never renders a Reasoning block. The
frontend already has `ReasoningCollapse` wired up and the wire protocol
already carries `StreamReasoning*` events (landed for SDK mode in
#12853), but the baseline (OpenRouter OpenAI-compat) path never asks
Anthropic for extended thinking and never parses reasoning deltas off
the stream. Result: users on fast/standard get a good answer with no
visible chain-of-thought, while SDK users see the full Reasoning
collapse.
**What:** Plumb reasoning end-to-end through the baseline path by opting
into OpenRouter's non-OpenAI `reasoning` extension, parsing the
reasoning delta fields off each chunk, and emitting the same
`StreamReasoningStart/Delta/End` events the SDK adapter already uses.
**How:**
- **New config:** `baseline_reasoning_max_tokens` (default 8192; 0
disables). Sent as `extra_body={"reasoning": {"max_tokens": N}}` only on
Anthropic routes — other providers drop the field, and
`is_anthropic_model()` already gates this.
- **Delta extraction:** `_extract_reasoning_delta()` handles all three
OpenRouter/provider variants in priority order — legacy
`delta.reasoning` (string), DeepSeek-style `delta.reasoning_content`,
and the structured `delta.reasoning_details` list (text/summary entries;
encrypted or unknown entries are skipped).
- **Event emission:** Reasoning uses the same state-machine rules the
SDK adapter uses — a text delta or tool_use delta arriving mid-stream
closes the open reasoning block first, so the AI SDK v5 transport keeps
reasoning / text / tool-use as distinct UI parts. On stream end, any
still-open reasoning block gets a matching `reasoning-end` so a
reasoning-only turn still finalises the frontend collapse.
- **Scope:** Live streaming only. Reasoning is not persisted to
`ChatMessage` rows or the transcript builder in this PR (SDK path does
so via `content_blocks=[{type: 'thinking', ...}]`, but that round-trip
requires Anthropic signature plumbing baseline doesn't have today).
Reload will still not show reasoning on baseline sessions — can follow
up if we decide it's worth the signature handling.
### Changes
- `backend/copilot/config.py` — new `baseline_reasoning_max_tokens`
field.
- `backend/copilot/baseline/service.py` — new
`_extract_reasoning_delta()` helper; reasoning block state on
`_BaselineStreamState`; `reasoning` gated into `extra_body`; chunk loop
emits `StreamReasoning*` events with text/tool_use transition rules;
stream-end closes any open reasoning block.
- `backend/copilot/baseline/service_unit_test.py` — 11 new tests
covering extractor variants (legacy string, deepseek alias, structured
list with text/summary aliases, encrypted-skip, empty), paired event
ordering (reasoning-end before text-start), reasoning-only streams, and
that the `reasoning` request param is correctly gated by model route
(Anthropic vs non-Anthropic) and by the config flag.
### Checklist
For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [x] `poetry run pytest backend/copilot/baseline/service_unit_test.py
backend/copilot/baseline/transcript_integration_test.py` — 103 passed
- [ ] Manual: with `CHAT_USE_CLAUDE_AGENT_SDK=false` and
`CHAT_MODEL=anthropic/claude-sonnet-4-6`, send a multi-step prompt on
fast mode and confirm a Reasoning collapse appears alongside the final
text
- [ ] Manual: flip `CHAT_BASELINE_REASONING_MAX_TOKENS=0` and confirm
baseline responses revert to text-only (no reasoning param, no reasoning
UI)
- [ ] Manual: with a non-Anthropic baseline model (`openai/gpt-4o`),
confirm the request does NOT include `reasoning` and nothing regresses
For configuration changes:
- [x] `.env.default` is compatible — new setting falls back to the
pydantic default
## Summary
Four cost-reduction changes for the copilot feature. Consolidated into
one PR at user request; each commit is self-contained and bisectable.
### 1. SDK: full cross-user cache on every turn (CLI 2.1.116 bump)
Previous behavior: CLI 2.1.97 crashed when `excludeDynamicSections=True`
was combined with `--resume`, so the code fell back to a raw
`system_prompt` string on resume, losing Claude Code's default prompt
and all cache markers. Every Turn 2+ of an SDK session wrote ~33K tokens
to cache instead of reading.
Fix: install `@anthropic-ai/claude-code@2.1.116` in the backend Docker
image and point the SDK at it via
`CHAT_CLAUDE_AGENT_CLI_PATH=/usr/bin/claude`. CLI 2.1.98+ fixes the
crash, so we can use the preset with `exclude_dynamic_sections=True` on
every turn — Turn 1, 2, 3+ all share the same static prefix and hit the
**cross-user** prompt cache.
**Local dev requirement:** if `CHAT_CLAUDE_AGENT_CLI_PATH` is unset, the
bundled 2.1.97 fallback will crash on `--resume`. Install the CLI
globally (`npm install -g @anthropic-ai/claude-code@2.1.116`) or set the
env var.
### 2. Baseline: add `cache_control` markers (commit `756b3ecd9` +
follow-ups)
Baseline path had zero `cache_control` across `backend/copilot/**`.
Every turn was full uncached input (~18.6K tokens, ~$0.058). Two
ephemeral markers — on the system message (content-blocks form) and the
last tool schema — plus `anthropic-beta: prompt-caching-2024-07-31` via
`extra_headers` as defense-in-depth. Helpers split into `_mark_tools_*`
(precomputed once per session) and `_mark_system_*` (per-round, O(1)).
Repeat hellos: ~$0.058 → ~$0.006.
### 3. Drop `get_baseline_supplement()` (commit `6e6c4d791`)
`_generate_tool_documentation()` emitted ~4.3K tokens of `(tool_name,
description)` pairs that exactly duplicated the tools array already in
the same request. Deleted. `SHARED_TOOL_NOTES` (cross-tool workflow
rules) is preserved. Baseline "hello" input: ~18.7K → ~14.4K tokens.
### 4. Langfuse "CoPilot Prompt" v26 (published under `review` label)
Separate, out-of-repo change. v25 had three duplicate "Example Response"
blocks + a 10-step "Internal Reasoning Process" section. v26 collapses
to one example + bullet-form reasoning. Char count 20,481 → 7,075 (rough
4 chars/token → ~5,100 → ~1,770 tokens).
- v26 is published with label `review` (NOT `production`); v25 remains
active.
- Promote via `mcp__langfuse__updatePromptLabels(name="CoPilot Prompt",
version=26, newLabels=["production"])` after smoke-test.
- Rollback: relabel v25 `production`.
## Test plan
- [x] Unit tests for `_build_system_prompt_value` (fresh vs resumed
turns emit identical preset dict)
- [x] SDK compat tests pass including
`test_bundled_cli_version_is_known_good_against_openrouter`
- [x] `cli_openrouter_compat_test.py` passes against CLI 2.1.116
(locally verified with
`CHAT_CLAUDE_AGENT_CLI_PATH=/opt/homebrew/bin/claude`)
- [x] 8 new `_mark_*` unit tests + identity regression test for
`_fresh_*` helpers
- [x] `SHARED_TOOL_NOTES` public-constant test passes; 5 old tool-docs
tests removed
- [ ] **Manual cost verification (commit 1):** send two consecutive SDK
turns; Turn 2 and Turn 3 should both show `cacheReadTokens` ≈ 33K (full
cross-user cache hits).
- [ ] **Manual cost verification (commit 2):** send two "hello" turns on
baseline <5 min apart; Turn 2 reports `cacheReadTokens` ≈ 18K and cost ≈
$0.006.
- [ ] **Regression sweep for commit 3:** one turn per tool family —
`search_agents`, `run_agent`,
`add_memory`/`forget_memory`/`search_memory`, `search_docs`,
`read_workspace_file` — to verify no tool-selection regression from
dropping the prose tool docs.
- [ ] **Langfuse v26 smoke test:** 5-10 varied turns after relabelling
to `production`; compare responses vs v25 for regression on persona,
concision, capability-gap handling, credential security flows.
## Deployment notes
- Production Docker image now installs CLI 2.1.116 (~20 MB added).
- `CHAT_CLAUDE_AGENT_CLI_PATH=/usr/bin/claude` set in the Dockerfile;
runtime can override via env.
- First deploy after this merge needs a fresh image rebuild to pick up
the new CLI.
## Summary
Fixes a bug where a chat session gets silently stuck after the user
presses Stop mid-turn.
**Root cause:** the cancel endpoint marks the session `failed` after
polling 5s, but the cluster lock held by the still-running task is only
released by `on_run_done` when the task actually finishes. If the task
hangs past the 5s poll (slow LLM call, agent-browser step, etc.), the
lock lingers for up to 5 min — `stream_chat_post`'s `is_turn_in_flight`
check sees the flipped meta (`failed`) and enqueues a new turn, but the
run handler sees the stale lock and drops the user's message at
`manager.py:379` (`reject+requeue=False`). The new SSE stream hangs
until its 60s idle timeout.
### Fix
Two cooperating changes:
1. **`mark_session_completed` force-releases the cluster lock** in the
same transaction that flips status to `completed`/`failed`.
Unconditional delete — by the time we're declaring the session dead, we
don't care who the current lock holder is; the lock has to go so the
next enqueued turn can acquire. This is what closes the stuck-session
window.
2. **`ClusterLock.release()` is now owner-checked** (Lua CAS — `GET ==
token ? DEL : noop` atomically). Force-release means another pod may
legitimately own the key by the time the original task's `on_run_done`
eventually fires. Without the CAS, that late `release()` would wipe the
successor's lock. With it, the late `release()` is a safe no-op when the
owner has changed.
Together: prompt release on completion (via force-delete) + safe cleanup
when on_run_done catches up (via CAS). That re-syncs the API-level
`is_turn_in_flight` check with the actual lock state, so the contention
window disappears.
No changes to the worker-level contention handler: `stream_chat_post`
already queues incoming messages into the pending buffer when a turn is
in flight (via `queue_pending_for_http`). With these fixes, the worker
never sees contention in the common case; if it does (true multi-pod
race), the pre-existing `reject+requeue=False` behaviour still applies —
we'll revisit that path with its own PR if it becomes a production
symptom.
### Verification
- Reproduced the original stuck-session symptom locally (Stop mid-turn →
send new message → backend logs `Session … already running on pod …`,
user message silently lost, SSE stream idle 60s then closes).
- After the fix: cancel → new message → turn starts normally (lock
released by `mark_session_completed`).
- `poetry run pyright` — 0 errors on edited files.
- `pytest backend/copilot/stream_registry_test.py
backend/executor/cluster_lock_test.py` — 33 passed (includes the
successor-not-wiped test).
## Changes
- `autogpt_platform/backend/backend/copilot/executor/utils.py` — extract
`get_session_lock_key(session_id)` helper so the lock-key format has a
single source of truth.
- `autogpt_platform/backend/backend/copilot/executor/manager.py` — use
the helper where the cluster lock is created.
- `autogpt_platform/backend/backend/copilot/stream_registry.py` —
`mark_session_completed` deletes the lock key after the atomic status
swap (force-release).
- `autogpt_platform/backend/backend/executor/cluster_lock.py` —
`ClusterLock.release()` (sync + async) uses a Lua CAS to only delete
when `GET == token`, protecting against wiping a successor after a
force-release.
## Test plan
- [ ] Send a message in /copilot that triggers a long turn (e.g.
`run_agent`), press Stop before it finishes, then send another message.
Expect: new turn starts promptly (no 5-min wait for lock TTL).
- [ ] Happy path regression — send a normal message, verify turn
completes and the session lock key is deleted after completion.
- [ ] Successor protection — unit test
`test_release_does_not_wipe_successor_lock` covers: A acquires, external
DEL, B acquires, A.release() is a no-op, B's lock intact.
## Why
After d7653acd0 removed cost estimation, most baseline turns log with
`tracking_type="tokens"` and no authoritative USD figure (see: dashboard
flipped from `cost_usd` to `tokens` after 4/14/2026). Rate-limit
counters were also token-weighted with hand-rolled cache discounts
(cache_read @ 10%, cache_create @ 25%) and a 5× Opus multiplier — a
proxy for cost that drifts from real OpenRouter billing.
This PR wires real generation cost from OpenRouter into both the
cost-tracking log and the rate limiter, and hides raw spend figures from
the user-facing API so clients can't reverse-engineer per-turn cost or
platform margins.
## What
1. **Real cost from OpenRouter** — baseline passes `extra_body={"usage":
{"include": True}}` and reads `chunk.usage.cost` from the final
streaming chunk. `x-total-cost` header path removed. Missing cost logs
an error and skips the counter update (vs the old estimator that
silently under-counted).
2. **Cost-based rate limiting** — `record_token_usage(...)` →
`record_cost_usage(cost_microdollars)`. The weighted-token math, cache
discount factors, and `_OPUS_COST_MULTIPLIER` are gone; real USD already
reflects model + cache pricing.
3. **Redis key migration** — `copilot:usage:*` → `copilot:cost:*` so
stale token counters can't be misinterpreted as microdollars.
4. **LD flags + config** — renamed to
`copilot-daily-cost-limit-microdollars` /
`copilot-weekly-cost-limit-microdollars` (unit in the LD key so values
can't accidentally be set in dollars or cents).
5. **Public `/usage` hides raw $$** — new `CoPilotUsagePublic` /
`UsageWindowPublic` schemas expose only `percent_used` (0-100) +
`resets_at` + `tier` + `reset_cost`. Admin endpoint keeps raw
microdollars for debugging.
6. **Admin API contract** — `UserRateLimitResponse` fields renamed
`daily/weekly_token_limit` → `daily/weekly_cost_limit_microdollars`,
`daily/weekly_tokens_used` → `daily/weekly_cost_used_microdollars`.
Admin UI displays `$X.XX`.
## How
- `baseline/service.py` — pass `extra_body`, extract cost from
`chunk.usage.cost`, drop the `x-total-cost` header fallback entirely.
- `rate_limit.py` — rewritten around `record_cost_usage`,
`check_rate_limit(daily_cost_limit, weekly_cost_limit)`, new Redis key
prefix. Adds `CoPilotUsagePublic.from_status()` projector for the public
API.
- `token_tracking.py` — converts `cost_usd` → microdollars via
`usd_to_microdollars` and calls `record_cost_usage` only when cost is
present.
- `sdk/service.py` — deletes `_OPUS_COST_MULTIPLIER` and simplifies
`_resolve_model_and_multiplier` to `_resolve_sdk_model_for_request`.
- Chat routes: `/usage` and `/usage/reset` return `CoPilotUsagePublic`.
Internal server-side limit checks still use the raw microdollar
`CoPilotUsageStatus`.
- Admin routes: unchanged response shape (renamed fields only).
- Frontend: `UsagePanelContent`, `UsageLimits`, `CopilotPage`,
`BriefingTabContent`, `credits/page.tsx` consume the new public schema
and render "N% used" + progress bar. Admin `RateLimitDisplay` /
`UsageBar` keep `$X.XX`. Helper `formatMicrodollarsAsUsd` retained for
admin use.
- Tests + snapshots rewritten; new assertions explicitly check that raw
`used`/`limit` keys are absent from the public payload.
## Deploy notes
1. **Before rolling this out, create the new LD flags:**
`copilot-daily-cost-limit-microdollars` (default `500000`) and
`copilot-weekly-cost-limit-microdollars` (default `2500000`). Old
`copilot-*-token-limit` flags can stay in LD for rollback.
2. **One-time Redis cleanup (optional):** token-based counters under
`copilot:usage:*` are orphaned and will TTL out within 7 days. Safe to
ignore or delete manually.
## Test plan
- [x] `poetry run test` — all impacted backend tests pass (182/182 in
targeted scope)
- [x] `pnpm test:unit` — all 1628 integration tests pass
- [x] `poetry run format` / `pnpm format` / `pnpm types` clean
- [x] Manual sanity against dev env — Baseline turn logged $0.1221 for
40K/139 tokens on Sonnet 4 (matches expected pricing)
- [ ] `/pr-test --fix` end-to-end against local native stack
### Why / What / How
**Why:** Only downgrades to FREE were scheduled at period end; paid→paid
downgrades (e.g. BUSINESS→PRO) applied immediately via Stripe proration.
The asymmetry meant users lost their higher tier mid-cycle in exchange
for a Stripe credit voucher only redeemable on a future subscription — a
confusing pattern that produces negative-value paths for users actually
cancelling. There was also no way to cancel a pending downgrade or
paid→FREE cancellation once scheduled.
**What:** Standardize on "upgrade = immediate, downgrade = next cycle"
and let users cancel a pending change by clicking their current tier.
Harden the new code against conflicting subscription state, concurrent
tab races, flaky Stripe calls, and hot-path latency regressions.
**How:**
Subscription state machine:
- **Upgrade** (PRO→BUSINESS) — `stripe.Subscription.modify` with
immediate proration (unchanged). If a downgrade schedule is already
attached, release it first so the upgrade wins.
- **Paid→paid downgrade** (BUSINESS→PRO) — creates a
`stripe.SubscriptionSchedule` with two phases (current tier until
`current_period_end`, target tier after). No mid-cycle tier demotion.
Defensive pre-clear: existing schedule → release;
`cancel_at_period_end=True` → set to False.
- **Paid→FREE** — unchanged: `cancel_at_period_end=True`.
- **Same-tier update** — reuses the existing `POST
/credits/subscription` route. When `target_tier == current_tier`,
backend calls `release_pending_subscription_schedule` (idempotent) and
returns status. No dedicated cancel-pending endpoint — "Keep my current
tier" IS the cancel operation.
- `release_pending_subscription_schedule` is idempotent on
terminal-state schedules and clears both `schedule` and
`cancel_at_period_end` atomically per call.
API surface:
- New fields on `SubscriptionStatusResponse`: `pending_tier` +
`pending_tier_effective_at` (pulled from the schedule's next-phase
`start_date` so dashboard-authored schedules report the correct
timestamp).
- `POST /credits/subscription` now returns `SubscriptionStatusResponse`
(previously `SubscriptionCheckoutResponse`); the response still carries
`url` for checkout flows and adds the status fields inline.
- `get_pending_subscription_change` is cached with a 30s TTL — avoids
hammering Stripe on every home-page load.
- Webhook dispatches
`subscription_schedule.{released,completed,updated}` through the main
`sync_subscription_from_stripe` flow so both event sources converge to
the same DB state.
Implementation notes:
- New Stripe calls use native async (`stripe.Subscription.list_async`
etc.) and typed attribute access — no `run_in_threadpool` wrapping in
the new helpers.
- Shared `_get_active_subscription` helper collapses the "list
active/trialing subs, take first" pattern used by 4 callers.
Frontend:
- `PendingChangeBanner` sub-component above the tier grid with formatted
effective date + "Keep [CurrentTier]" button. `aria-live="polite"` for
screen readers; locale pinned to `en-US` to avoid SSR/CSR hydration
mismatch.
- "Keep [CurrentTier]" also available as a button on the current tier
card.
- Other tier buttons disabled while a change is pending — user must
resolve pending first to prevent stacked schedules.
- `cancelPendingChange` reuses `useUpdateSubscriptionTier` with `tier:
current_tier`; awaits `refetch()` on both success and error paths so the
UI reconciles even if the server succeeded but the client didn't receive
the response.
### Changes
**Backend (`credit.py`, `v1.py`)**
- Tier-ordering helpers (`is_tier_upgrade`/`is_tier_downgrade`).
- `modify_stripe_subscription_for_tier` routes downgrades through
`_schedule_downgrade_at_period_end`; upgrade path releases any pending
schedule first.
- `_schedule_downgrade_at_period_end` defensively releases pre-existing
schedules and clears `cancel_at_period_end` before creating the new
schedule.
- `release_pending_subscription_schedule` idempotent on terminal-state
schedules; logs partial-failure outcomes.
- `_next_phase_tier_and_start` returns both tier and phase-start
timestamp; warns on unknown prices.
- `get_pending_subscription_change` cached (30s TTL), narrow exception
handling.
- `sync_subscription_schedule_from_stripe` delegates to
`sync_subscription_from_stripe` for convergence with the main webhook
path.
- Shared `_get_active_subscription` +
`_release_schedule_ignoring_terminal` helpers.
- `POST /credits/subscription` absorbs the same-tier "cancel pending
change" branch.
**Frontend (`SubscriptionTierSection/*`)**
- `PendingChangeBanner` new sub-component (a11y, locale-pinned date,
paid→FREE vs paid→paid copy split, non-null effective-date assertion, no
`dark:` utilities).
- "Keep [CurrentTier]" button on current tier card.
- `useSubscriptionTierSection` — `cancelPendingChange` reuses the
update-tier mutation.
- Copy: downgrade dialog + status hint updated.
- `helpers.ts` extracted from the main component.
**Tests**
- Backend: +24 tests (95/95 passing): upgrade-releases-pending-schedule,
schedule-releases-existing-schedule, cancel-at-period-end collision,
terminal-state release idempotency, unknown-price logging, status
response population, same-tier-POST-with-pending, webhook delegation.
- Frontend: +5 integration tests (21/21 passing): banner render/hide,
Keep-button click from banner + current card, paid→paid dialog copy.
### Checklist
- [x] Backend unit tests: 95 pass
- [x] Frontend integration tests: 21 pass
- [x] `poetry run format` / `poetry run lint` clean
- [x] `pnpm format` / `pnpm lint` / `pnpm types` clean
- [ ] Manual E2E on live Stripe (dev env) — pending deploy: BUSINESS→PRO
creates schedule, DB tier unchanged until period end
- [ ] Manual E2E: "Keep BUSINESS" in banner releases schedule
- [ ] Manual E2E: cancel pending paid→FREE flips `cancel_at_period_end`
back to false
- [ ] Manual E2E: BUSINESS→PRO (scheduled) then attempt BUSINESS→FREE
clears the PRO schedule, sets cancel_at_period_end
- [ ] Manual E2E: BUSINESS→PRO (scheduled) then upgrade back to BUSINESS
releases the schedule
Closes#12861 · [OPEN-3096](https://linear.app/autogpt/issue/OPEN-3096)
## Why
Four related copilot UX / stability issues surfaced on dev once action
tools started rendering inline in the chat (see #12813):
### 1. Duplicate bash_exec row
`GenericTool` rendered two rows saying the same thing for every
completed tool call — a muted subtitle line ("Command exited with code
1" / "Ran: sleep 20") **and** a `ToolAccordion` with the command echoed
in its description. Previously hidden inside the "Show reasoning" /
"Show steps" collapse, now visibly duplicated.
### 2. `bash_exec` capped at 120s via advisory text
The tool schema said `"Max seconds (default 30, max 120)"`; the model
obeyed, so long-running scripts got clipped at 120s with a vague `Timed
out after 120s` even though the E2B sandbox has no such limit. Confirmed
via Langfuse traces — the model picks `120` for long scripts because
that's what the schema told it the max was. E2B path never had a
server-side clamp.
Originally added in #12103 (default 30) and tightened to "max 120"
advisory in #12398 (token-reduction pass).
### 3. 30s default was too aggressive
`pip install`, small data-processing scripts, etc. routinely cross 30s
and got killed before the model thought to retry with a bigger timeout.
### 4. Stop + edit + resend → "The assistant encountered an error"
([OPEN-3096](https://linear.app/autogpt/issue/OPEN-3096))
Two independent bugs both land on the same banner — fixing only one
leaves the other visible on the next action.
**4a. Stream lock never released on Stop** *(the error in the ticket
screenshot)*. The executor's `async for chunk in
stream_and_publish(...)` broke out on `cancel.is_set()` without calling
`aclose()` on the wrapper. `async for` does NOT auto-close iterators on
`break`, so `stream_chat_completion_sdk` stayed suspended at its current
`await` — still holding the per-session Redis lock (TTL 120s) until GC
eventually closed it. The next `POST /stream` hit `lock.try_acquire()`
at
[sdk/service.py](autogpt_platform/backend/backend/copilot/sdk/service.py)
and yielded `StreamError("Another stream is already active for this
session. Please wait or stop it.")`. The `except GeneratorExit →
lock.release()` handler written exactly for this case never fired
because nothing sent GeneratorExit.
**4b. Orphan `tool_use` after stop-mid-tool.** Even with the lock
released, the stop path persists the session ending on an assistant row
whose `tool_calls` have no matching `role="tool"` row. On the next turn,
`_session_messages_to_transcript` hands Claude CLI `--resume` a JSONL
with a `tool_use` and no paired `tool_result`, and the SDK raises a
vague error — same banner. The ticket's "Open questions" explicitly
flags this.
## What
**Frontend — `GenericTool.tsx`** split responsibilities between the two
rows so they don't duplicate:
- **Subtitle row** (always visible, muted): *what ran* — `Ran: sleep
120`. Never the exit code.
- **Accordion description**: *how it ended* — `completed` / `status code
127 · bash: missing-bin: command not found` / `Timed out after 120s` /
(fallback to command preview for legacy rows missing `exit_code` /
`timed_out`). Pulled from the first non-empty line of `stdout` /
`stderr` when available.
- **Expanded accordion**: full command + stdout + stderr code blocks
(unchanged).
**Backend — `bash_exec.py`**:
- Drop the "max 120" advisory from the schema description.
- Bump default `timeout: 30 → 120`.
- Clean up the result message — `"Command executed with status code 0"`
(no "on E2B", no parens).
**Backend — `executor/processor.py` + `stream_registry.py` (OPEN-3096
#4a)**: wrap the consumer `async for` in `try/finally: await
stream.aclose()`. Close now propagates through `stream_and_publish` into
`stream_chat_completion_sdk`, whose existing `except GeneratorExit →
lock.release()` releases the Redis lock immediately on cancel. Stream
types tightened to `AsyncGenerator[StreamBaseResponse, None]` so the
defensive `getattr(stream, "aclose", None)` goes away.
**Backend — `session_cleanup.py` (OPEN-3096 #4b)**: new
`prune_orphan_tool_calls()` helper walks the trailing session tail and
drops any trailing assistant row whose `tool_calls` have unresolved ids
(plus everything after it) and any trailing `STOPPED_BY_USER_MARKER`
system-stop row. Single backward pass — tolerates the marker being
present or absent. Called from the existing turn-start cleanup in both
`sdk/service.py` and `baseline/service.py`; takes an optional
`log_prefix` so both paths emit the same INFO log when something was
popped. In-memory only — the DB save path is append-only via
`start_sequence`.
## Test plan
- [x] `pnpm exec vitest run src/app/(platform)/copilot/tools/GenericTool
src/app/(platform)/copilot/components/ChatMessagesContainer` — 105 pass
(6 new for GenericTool subtitle/description variants + legacy-fallback
case).
- [x] `pnpm format` / `pnpm lint` / `pnpm types` — clean.
- [x] `poetry run pytest
backend/copilot/sdk/session_persistence_test.py` — 17 pass (6 + 3 new
covering the orphan-tool-call prune and its optional-log-prefix branch).
- [x] `poetry run pytest backend/copilot/stream_registry_test.py
backend/copilot/executor/processor_test.py` — 19 pass (2 for aclose
propagation on the `stream_and_publish` wrapper, 2 for `_execute_async`
aclose propagation on both exit paths, 1 for publish_chunk RedisError
warning ladder).
- [x] `poetry run ruff check` / `poetry run pyright` on touched files —
clean.
- [x] Manual: fire a `bash_exec` — one labelled row, accordion
description reads sensibly (`completed` / `status code 1 · …` / `Timed
out after 120s`).
- [x] Manual: script that needs >120s — no longer clipped.
- [x] Manual: Stop mid-tool + edit + resend — Autopilot resumes without
"Another stream is already active" and without the vague SDK error.
## Scope note
Does not touch `splitReasoningAndResponse` — re-collapsing action tools
back into "Show steps" is #12813's responsibility.
- Add WriteWorkspaceFileTool tests for quota, virus, scan, and conflict
error handling (the PR's stated motivation was untested)
- Add ZeroDivisionError guard on storage_limit in quota check
- Assert scan_content_safe not called on quota rejection (fast-path)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### Why / What / How
<img width="900" alt="Screenshot 2026-04-20 at 19 52 22"
src="https://github.com/user-attachments/assets/c30d5f18-2842-4a8a-ac3d-5bfee18fcd56"
/>
**Why:** The "Spent this month" tile in the Agent Briefing Panel on the
Library page always showed `$0`, even for users with real execution
usage. The tile is meant to give a quick sense of monthly spend across
all agents.
**What:** Compute `monthlySpend` from actual execution data and format
it as currency.
**How:**
- `useLibraryFleetSummary` now sums `stats.cost` (cents) across every
execution whose `started_at` falls within the current calendar month.
Previously `monthlySpend` was hardcoded to `0`.
- `FleetSummary.monthlySpend` is documented as being in cents
(consistent with backend + `formatCents`).
- `StatsGrid` now uses `formatCents` from the copilot usage helpers to
render the tile (e.g. `$12.34` instead of the broken `$0`).
### Changes 🏗️
-
`autogpt_platform/frontend/src/app/(platform)/library/hooks/useLibraryFleetSummary.ts`:
aggregate `stats.cost` across executions started in the current calendar
month; add `toTimestamp` and `startOfCurrentMonth` helpers.
-
`autogpt_platform/frontend/src/app/(platform)/library/components/AgentBriefingPanel/StatsGrid.tsx`:
format the "Spent this month" tile via shared `formatCents` helper.
- `autogpt_platform/frontend/src/app/(platform)/library/types.ts`:
document that `FleetSummary.monthlySpend` is in cents.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Load `/library` with the `AGENT_BRIEFING` flag enabled and at
least one completed execution in the current month — the "Spent this
month" tile shows the correct cumulative cost.
- [ ] With no executions this month, the tile shows `$0.00`.
- [ ] Type-check (`pnpm types`), lint (`pnpm lint`), and integration
tests (`pnpm test:unit`) pass locally.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Why
Four related issues that surfaced when queued follow-ups hit an
extended_thinking turn:
1. **Mid-turn promote stalled the SSE stream.** `pollBackendAndPromote`
used `setMessages((prev) => [...prev, bubble])` — Vercel AI SDK's
`useChat` streams SSE deltas into `messages[-1]`, so once a user bubble
ended up there, every subsequent chunk silently landed on the wrong
message. Chat sat frozen until a page refresh, even though the backend's
stream completed cleanly.
2. **Thinking-only final turn looked identical to a frozen UI.** When
Claude's last LLM call after a tool_result produced only a
`ThinkingBlock` (no `TextBlock`, no `ToolUseBlock`), the response
adapter silently dropped it and the UI hung on "Thought for Xs" with no
response text.
3. **Reasoning was invisible.** `ThinkingBlock` was dropped live and
never persisted in a way the frontend could render — sessions on reload
/ shared links showed no thinking, a confusing UX gap ("display for
nothing").
4. **Cross-pod Redis replay dropped reasoning events.** The
`stream_registry._reconstruct_chunk` type map had no entries for
`reasoning-*` types, so any client that subscribed mid-stream (share,
reload, cross-pod) silently dropped them with `Unknown chunk type:
reasoning-delta`.
## What
### Mid-turn promote — splice before the trailing assistant
In `useCopilotPendingChips.ts::pollBackendAndPromote`:
```ts
setMessages((prev) => {
const bubble = makePromotedUserBubble(drained, "midturn", crypto.randomUUID());
const lastIdx = prev.length - 1;
if (lastIdx >= 0 && prev[lastIdx].role === "assistant") {
return [...prev.slice(0, lastIdx), bubble, prev[lastIdx]];
}
return [...prev, bubble];
});
```
Streaming assistant stays at `messages[-1]`, AI SDK deltas keep routing
correctly. `useHydrateOnStreamEnd` snaps the bubble to the DB-canonical
position when the stream ends.
### Reasoning — end-to-end visibility (live + persisted)
- **Wire protocol**: new `StreamReasoningStart` / `StreamReasoningDelta`
/ `StreamReasoningEnd` events matching AI SDK v5's `reasoning-*` wire
names, so `useChat` accumulates them into a `type: 'reasoning'`
UIMessage part natively.
- **Response adapter**: every `ThinkingBlock` now emits reasoning
events; text/tool_use transitions close the open reasoning block so AI
SDK doesn't merge distinct parts.
- **Stream registry**: added `reasoning-*` types to
`_reconstruct_chunk`'s type_to_class map so Redis replay no longer drops
them on cross-pod / reload / share.
- **Persistence** (new): each `StreamReasoningStart` opens a
`ChatMessage(role="reasoning")` row in `session.messages`; deltas
accumulate into its content; `StreamReasoningEnd` closes it. No schema
migration — `ChatMessage.role` is already `String`.
`extract_context_messages` filters `role="reasoning"` out of LLM context
(the `--resume` CLI session already carries thinking separately) so the
model never re-ingests prior reasoning.
- **Frontend conversion**: `convertChatSessionMessagesToUiMessages` maps
`role="reasoning"` DB rows into `{type: "reasoning", text}` parts on the
surrounding assistant bubble, so reload / shared-link sessions render
reasoning identically to live stream.
### Steps / Reasoning UX — modal + accordion split
- **`StepsCollapse`** (new): a Dialog-backed "Show steps" modal wraps
the pre-final-answer group (tool timeline + per-block reasoning). Modal
keeps the steps visually grouped and out of the reading flow.
- **`ReasoningCollapse`** (rewritten): inline accordion with "Show
reasoning" / "Hide reasoning" toggle — no longer a modal, so it expands
*inside* the Steps modal without stacking two dialogs. Reasoning text
appears indented with a left border.
- **`splitReasoningAndResponse`**: reasoning parts now stay in the
reasoning group (instead of being pinned out), so they show up inside
the Steps modal alongside the tool-use timeline.
### Thinking-only final turn — synthesize a closing line
(belt-and-suspenders)
- **Prompt rule** (`_USER_FOLLOW_UP_NOTE`): "Every turn MUST end with at
least one short user-facing text sentence."
- **Adapter fallback**: tracks `_text_since_last_tool_result`; at
`ResultMessage success` with tools run + zero text since, opens a fresh
step (`UserMessage` already closed the previous one) and injects `"(Done
— no further commentary.)"` before `StreamFinish`. Only fires for the
pathological case — pure-text turns untouched.
## Test plan
- [x] `pnpm vitest run` on copilot files — all 638 prior tests pass;
**17 new tests** added covering:
- `convertChatSessionToUiMessages`: reasoning row alone / merged with
assistant text / multi-row / empty skip / duration capture
- `ReasoningCollapse`: initial collapsed, toggle, `rotate-90`,
`aria-expanded`
- `StepsCollapse`: trigger + dialog open renders children
- `MessagePartRenderer`: reasoning → `<pre>` inside collapse,
whitespace/missing text → null
- `splitReasoningAndResponse`: reasoning-stays-in-reasoning regression
- [x] `poetry run pytest backend/copilot/sdk/response_adapter_test.py` —
36 pass (7 new: 4 reasoning streaming, 3 thinking-only fallback)
- [x] Manual: reasoning streams live and persists across reload on a
fresh session
- [x] Manual: previously-created sessions (pre-persistence) don't have
`role="reasoning"` rows — behaves as a clean no-op (no reasoning shown,
no error), new sessions render reasoning inside Steps modal
## Notes
- No DB migration — `ChatMessage.role` is already an open `String`;
`role="reasoning"` is simply filtered out of LLM context builds but
rendered by the frontend.
- Addresses /pr-review blockers: (a) stream_registry missing reasoning
types in Redis round-trip, (b) fallback text emitted outside a step, (c)
dead `case "thinking"` in renderer (now uses the live `reasoning` type
uniformly).
## Why
Users and tools can target a copilot session that already has a turn
running. Before this PR there was no uniform behaviour for that case —
the UI manually routed to a separate queue endpoint, `run_sub_session`
and the AutoPilot block raced the cluster lock, and in-turn follow-ups
only reached the model at turn-end via auto-continue. Outcome: dropped
messages, duplicate tool rows, missed mid-turn intent, latent
correctness bugs in block execution.
## What
A single "message arrived → turn already running?" primitive, shared by
every caller:
1. **POST `/stream`** (UI chat): self-defensive. Session idle → SSE as
today; session busy → `202 application/json` with `{buffer_length,
max_buffer_length, turn_in_flight}`. The deprecated `POST
/messages/pending` endpoint is removed (`GET /messages/pending` peek
stays).
2. **`run_copilot_turn_via_queue`** (shared primitive from #12841, used
by `run_sub_session` + `AutoPilotBlock`): gains the same busy-check.
Busy session → push to pending buffer, return `("queued",
SessionResult(queued=True, pending_buffer_length=N))` without creating a
stream registry session or enqueueing a RabbitMQ job. All callers
inherit queueing.
3. **Mid-turn delivery**: drained follow-ups are attached to every
tool_result's `additionalContext` via the SDK's `PostToolUse` hook —
covers both MCP and built-in tools (WebSearch/Read/Agent/etc.), not just
`run_block`. Claude reads the queued text on the next LLM round of the
same turn.
4. **UI observability**: chips promote to a proper user bubble at the
correct chronological position (after the tool_result row that consumed
them). Auto-continue handles end-of-turn drainage; mid-turn backend poll
handles the tool-boundary drainage path.
## How
**Data plane**
- `backend/copilot/pending_messages.py` — Redis list per session
(LPOP-count for atomic drain), TTL, fire-and-forget pub/sub notify. MAX
10 per session.
- `backend/copilot/pending_message_helpers.py` — `is_turn_in_flight`,
`queue_user_message`, `drain_and_format_for_injection`,
`persist_pending_as_user_rows` (shared persist+rollback used by both
baseline and SDK paths).
- `backend/data/redis_helpers.py` — centralised `incr_with_ttl`,
`capped_rpush`, `hash_compare_and_set`; every Lua script and pipeline
atomicity lives in one place.
**Injection sites**
- `backend/copilot/sdk/security_hooks.py::post_tool_use_hook` — drains +
returns `additionalContext`. Single hook covers built-in + MCP tools.
- `backend/copilot/sdk/service.py` — `StreamToolOutputAvailable`
dispatch persists the drained follow-up as a real user row right after
the tool_result (UI bubble at the right index).
`state.midturn_user_rows` keeps the CLI upload watermark honest.
- `backend/copilot/baseline/service.py` — same drain at round
boundaries, uses the shared `persist_pending_as_user_rows` helper so
baseline + SDK code paths don't diverge.
**Dispatch**
- `backend/copilot/sdk/session_waiter.py::run_copilot_turn_via_queue` —
`is_turn_in_flight` short-circuit; `SessionResult` gains `queued` +
`pending_buffer_length`; `SessionOutcome` gains `"queued"`.
- `backend/api/features/chat/routes.py::stream_chat_post` — busy-check
returns 202 with `QueuePendingMessageResponse`; `POST /messages/pending`
deleted.
- `backend/copilot/tools/run_sub_session.py` / `models.py` —
`SubSessionStatusResponse.status` gains `"queued"`;
`response_from_outcome` renders a clear queued-state message with the
pending-buffer depth and a link to watch live.
- `backend/blocks/autopilot.py::execute_copilot` — surfaces queued state
as descriptive response text + empty `tool_calls`/history when
`result.queued`.
**Frontend**
- `src/app/(platform)/copilot/useCopilotPendingChips.ts` — hook owning
the chip lifecycle: backend peek on session load, auto-continue
promotion when a second assistant id appears, mid-turn poll that
promotes when the backend count drops.
- `src/app/(platform)/copilot/useHydrateOnStreamEnd.ts` —
force-hydrate-waits-for-fresh-reference dance extracted.
- `src/app/(platform)/copilot/helpers/stripReplayPrefix.ts` — pure
function with drop / strip / streaming-catch-up cases + helper
decomposition.
- `src/app/(platform)/copilot/helpers/makePromotedBubble.ts` — one-line
helper for the promoted bubble shape.
- `src/app/(platform)/copilot/helpers/queueFollowUpMessage.ts` — thin
`fetch` wrapper for the 202 path (AI SDK's `useChat` fetcher only
handles SSE, so we can't reuse `sendMessage` for the queued response).
## Test plan
Backend unit + integration (`poetry run pytest backend/copilot
backend/api/features/chat`):
- [x] 107 tests pass — pending buffer, drain helpers, routes,
session_waiter queue branch, run_sub_session outcome rendering,
autopilot block
- [x] New `session_waiter_test.py` proves the queue branch
short-circuits `stream_registry.create_session` + `enqueue_copilot_turn`
- [x] Mid-turn persist has a rollback-and-re-queue path tested for when
`session.messages` persist silently fails to back-fill sequences
Frontend unit (`pnpm vitest run`):
- [x] 630 tests pass incl. 22 new for extracted helpers + hooks
- [x] Frontend coverage on touched copilot files: 91%+ (patch 87.37%)
Manual (once merged):
- [ ] Queue two chips while a tool is running; Claude acknowledges both
on the next round, UI shows bubbles in typing order after the tool
output
- [ ] Hand AutoPilot block an existing session_id that has a live turn;
block returns queued status, in-flight turn drains the message on its
next round
- [ ] `run_sub_session` against a busy sub — status=`queued`,
`sub_autopilot_session_link` lets user watch live
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### Why / What / How
**Why:** the 10-min stream-level idle timeout was killing legitimate
long-running tool calls — notably sub-AutoPilot runs via
`run_block(AutoPilotBlock)`, which routinely take 15–45 min. The symptom
users saw was `"A tool call appears to be stuck"` even though AutoPilot
was actively working. A second long-standing rough edge was shipped
alongside: agents often skipped `get_agent_building_guide` when
generating agent JSON, producing schemas that failed validation and
burned turns on auto-fix loops.
**What:** three threaded pieces.
1. **Async sub-AutoPilot via `run_sub_session`.** New copilot tool that
delegates a task to a fresh (or resumed) sub-AutoPilot, and its
companion `get_sub_session_result` for polling/cancelling. The agent
starts with `run_sub_session(prompt, wait_for_result≤300s)` and, if the
sub isn't done inside the cap, receives a handle + polls via
`get_sub_session_result(wait_if_running≤300s)`. No single MCP call ever
blocks the stream for more than 5 min, so the 10-min stream-idle timer
stays simple and effective (derived as `MAX_TOOL_WAIT_SECONDS * 2`).
2. **Queue-backed copilot turn dispatch** — one code path for all three
callers.
- `run_sub_session` enqueues a `CoPilotExecutionEntry` on the existing
`copilot_execution` exchange instead of spawning an in-process
`asyncio.Task`.
- `AutoPilotBlock.execute_copilot` (graph block) now uses the **same
queue** instead of `collect_copilot_response` inline.
- The HTTP SSE endpoint was already queue-backed.
- All three share a single primitive: `run_copilot_turn_via_queue` →
`create_session` → `enqueue_copilot_turn` → `wait_for_session_result`.
The event-aggregation logic (`EventAccumulator`/`process_event`) is a
shared module used by both the direct-stream path and the cross-process
waiter.
- Benefits: **deploy/crash resilience** (RabbitMQ redelivery survives
worker restarts), **natural load balancing** across copilot_executor
workers, **sessions as first-class resources** (UI users can
`/copilot?sessionId=<inner>` into any sub or AutoPilot block's session),
and every future stream-level feature (pending-messages drain #12737,
compaction policies, etc.) applies uniformly instead of bypassing
graph-block sessions.
3. **Guide-read gate on agent-generation tools.** `create_agent` /
`edit_agent` / `validate_agent_graph` / `fix_agent_graph` refuse until
the session has called `get_agent_building_guide`. The pre-existing soft
hint was routinely ignored; the gate makes the dependency enforceable.
All four tool descriptions advertise the requirement in one tightened
sentence ("Requires get_agent_building_guide first (refuses
otherwise).") that stays under the 32000-char schema budget.
**How:**
#### Queue-backed sub-AutoPilot + AutoPilotBlock
- `sdk/session_waiter.py` — new module. `SessionResult` dataclass
mirrors `CopilotResult`. `wait_for_session_result` subscribes to
`stream_registry`, drains events via shared `process_event`, returns
`(outcome, result)`. `wait_for_session_completion` is the cheaper
outcome-only variant. `run_copilot_turn_via_queue` is the canonical
three-step dispatch. Every exit path unsubscribes the listener.
- `sdk/stream_accumulator.py` — new module. `EventAccumulator`,
`ToolCallEntry`, `process_event` extracted from `collect.py`. Both the
direct-stream and cross-process paths now use the same fold logic.
- `tools/run_sub_session.py` / `tools/get_sub_session_result.py` —
rewritten around the shared primitive. `sub_session_id` is now the sub's
`ChatSession` id directly (no separate registry handle). Ownership
re-verified on every call via `get_chat_session`. Cancel via
`enqueue_cancel_task` on the existing `copilot_cancel` fan-out exchange.
- `blocks/autopilot.py` — `execute_copilot` replaced its inline
`collect_copilot_response` with `run_copilot_turn_via_queue`.
`SessionResult` carries response text, tool calls, and token usage back
from the worker so no DB round-trip is needed. The block's public I/O
contract (inputs, outputs, `ToolCallEntry` shape) is unchanged.
- `CoPilotExecutionEntry` gains a `permissions: CopilotPermissions |
None` field forwarded to the worker's `stream_fn` so the sub's
capability filter survives the queue hop. The processor passes it
through to `stream_chat_completion_sdk` /
`stream_chat_completion_baseline`.
- **Deleted**: `sdk/sub_session_registry.py` (module-level dict,
done-callback, abandoned-task cap, `notify_shutdown_and_cancel_all`,
`_reset_for_test`), plus the shutdown-notifier hook in
`copilot_executor.processor.cleanup` — redundant under queue-backed
execution.
#### Run_block single-tool cap (3)
- `tools/helpers.execute_block` caps block execution at
`MAX_TOOL_WAIT_SECONDS = 5 min` via `asyncio.wait_for` around the
generator consumption.
- On timeout: logs `copilot_tool_timeout tool=run_block block=…
block_id=… input_keys=… user=… session=… cap_s=…` (grep-friendly) and
returns an `ErrorResponse` that redirects the LLM to `run_agent` /
`run_sub_session`.
- Billing protection: `_charge_block_credits` is called in a `finally`
guarded by `asyncio.shield` and marked `charge_handled` **before** the
await so cancel-mid-charge doesn't double-bill and
cancel-mid-generator-before-charge still settles via the finally.
#### Guide-read gate
- `helpers.require_guide_read(session, tool_name)` scans
`session.messages` for any prior assistant tool call named
`get_agent_building_guide` (handles both OpenAI and flat shapes).
Applied at the top of `_execute` in `create_agent`, `edit_agent`,
`validate_agent_graph`, `fix_agent_graph`. Tool descriptions advertise
the requirement.
#### Shared timing constants
- `MAX_TOOL_WAIT_SECONDS = 5 * 60` + `STREAM_IDLE_TIMEOUT_SECONDS = 2 *
MAX_TOOL_WAIT_SECONDS` in `constants.py`. Every long-running tool
(`run_agent`, `view_agent_output`, `run_sub_session`,
`get_sub_session_result`, `run_block`) imports from one place; no more
hardcoded 300 / `10*60` literals drifting apart. Stream-idle invariant
("no single tool blocks close to the idle timeout") holds by
construction.
### Frontend
- Friendlier tool-card labels: `run_sub_session` → "Sub-AutoPilot",
`get_sub_session_result` → "Sub-AutoPilot result", `run_block` →
"Action" (matches the builder UI's own naming), `run_agent` → "Agent".
Fixes the double-verb "Running Run …" phrasing.
- `SubSessionStatusResponse.sub_autopilot_session_link` surfaces
`/copilot?sessionId=<inner>` so users can click into any sub's session
from the tool-call card — same pattern as `run_agent`'s
`library_agent_link`.
### Changes 🏗️
- **New modules**: `sdk/session_waiter.py`, `sdk/stream_accumulator.py`,
`tools/run_sub_session.py`, `tools/get_sub_session_result.py`,
`tools/sub_session_test.py`, `tools/agent_guide_gate_test.py`.
- **New response types**: `SubSessionStatusResponse`,
`SubSessionProgressSnapshot`, `SessionResult`.
- **New gate helper**: `require_guide_read` in `tools/helpers.py`.
- **Queue protocol**: `permissions` field on `CoPilotExecutionEntry`,
threaded through `processor.py` → `stream_fn`.
- **Hidden**: `AUTOPILOT_BLOCK_ID` in `COPILOT_EXCLUDED_BLOCK_IDS`
(run_block can't execute AutoPilotBlock; agents use `run_sub_session`
instead).
- **Deleted**: `sdk/sub_session_registry.py`, processor
shutdown-notifier hook.
- **Regenerated**: `openapi.json` for the new response types; block-docs
for the updated `ToolName` Literal.
- **Tool descriptions**: tightened the guide-gate hint across the four
agent-builder tools to stay under the 32000-char schema budget.
- **40+ tests** across sub_session, execute_block cap + billing races,
stream_accumulator, agent_guide_gate, frontend helpers.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Unit suite green on the full copilot tree; `poetry run format` +
`pyright` clean
- [x] Schema character budget test passes (tool descriptions trimmed to
stay under 32000)
- [x] Native UI E2E (`poetry run app` + `pnpm dev`):
`run_sub_session(wait_for_result=60)` returns `status="completed"` +
`sub_autopilot_session_link` inline;
`run_sub_session(wait_for_result=1)` returns `status="running"` +
handle, `get_sub_session_result(wait_if_running=60)` observes `running →
completed` transition
- [x] AutoPilotBlock (graph) goes through `copilot_executor` queue
end-to-end (verified via logs: ExecutionManager's AutoPilotBlock node
spawned session `f6de335b-…`, a different `CoPilotExecutor` worker
acquired its cluster lock and ran the SDK stream)
- [x] Guide gate: `create_agent` without a prior
`get_agent_building_guide` returns the refusal; agent reads the guide
and retries successfully
- Fix prettier formatting in SubscriptionTierSection
- Add tests for WorkspaceStorageSection rendering (shown/hidden states)
- Mock useWorkspaceStorage in UsagePanelContent test suite
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Resolve merge conflict in SubscriptionTierSection tier descriptions
(merge dev's improved copy with our storage limit additions)
- Use asyncio.gather for sequential awaits in get_storage_usage endpoint
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Problem
The CoPilot system prompt contains a `gh auth status` instruction in the
E2B-specific `GitHub CLI` section, but models pattern-match to
`connect_integration` from the **Tool Discovery Priority** section —
which is where the actual decision to call an external service is made.
Because the GitHub auth check lives in a separate, later section, it's
not salient at the point of decision-making. This causes the model to
call `connect_integration(provider='github')` even when `gh` is already
authenticated via `GH_TOKEN`, unnecessarily prompting the user.
## Fix
Add a 3-line callout directly inside the **Tool Discovery Priority**
section:
```
> 🔑 **GitHub exception:** Before calling `connect_integration` for GitHub,
> always run `gh auth status` first. If it shows `Logged in`, proceed
> directly with `gh`/`git` — no integration connection needed.
```
This places the rule at the exact location where the model decides which
tool path to take, preventing the miss.
## Why this works
- **Placement over repetition**: The existing instruction isn't wrong —
it's just in the wrong spot relative to where the decision is made
- **Negative framing**: Explicitly says "before calling
`connect_integration`" which directly intercepts the incorrect reflex
- **Minimal change**: 4 lines added, zero removed
Co-authored-by: Toran Bruce Richards <22963551+Torantulino@users.noreply.github.com>
## Summary
Fixes#11041
When an `AgentExecutorBlock` is placed in the builder, it initially
displays the agent's name (e.g., "Researcher v2"). After saving and
reloading the page, the title reverts to the generic "Agent Executor."
## Root Cause
The backend correctly persists `agent_name` and `graph_version` in
`hardcodedValues` (via `input_default` in `AgentExecutorBlock`).
However, `NodeHeader.tsx` always resolves the display title from
`data.title` (the generic block name), ignoring the persisted agent
name.
## Fix
Modified the title resolution chain in `NodeHeader.tsx` to check
`data.hardcodedValues.agent_name` between the user's custom name and the
generic block title:
1. `data.metadata.customized_name` (user's manual rename) — highest
priority
2. `agent_name` + ` v{graph_version}` from `hardcodedValues` — **new**
3. `data.title` (generic block name) — fallback
This is a frontend-only change. No backend modifications needed.
## Files Changed
-
`autogpt_platform/frontend/src/app/(platform)/build/components/FlowEditor/nodes/CustomNode/components/NodeHeader.tsx`
(+11, -1)
## Test Plan
- [x] Place an AgentExecutorBlock, select an agent — title shows agent
name
- [x] Save graph, reload page — title still shows agent name (was "Agent
Executor" before)
- [x] Double-click to rename — custom name takes priority over agent
name
- [x] Clear custom name — falls back to agent name
- [x] Non-AgentExecutor blocks — unaffected, show generic title as
before
---------
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
## Why / What / How
**Why:** PR #12796 changed completed copilot sessions to load messages
from sequence 0 forward (ascending), which broke the standard chat UX —
users now land at the beginning of the conversation instead of the most
recent messages. Reported in Discord.
**What:** Reverts the forward pagination approach and replaces it with a
visibility guarantee that ensures every page contains at least one
user/assistant message.
**How:**
- **Backend**: Removed after_sequence, from_start, forward_paginated,
newest_sequence — always use backward (newest-first) pagination. Added
_expand_for_visibility() helper: after fetching, if the entire page is
tool messages (invisible in UI), expand backward up to 200 messages
until a visible user/assistant message is found.
- **Frontend**: Removed all forwardPaginated/newestSequence plumbing
from hooks and components. Removed bottom LoadMoreSentinel. Simplified
message merge to always prepend paged messages.
### Changes
- routes.py: Reverted to simple backward pagination, removed TOCTOU
re-fetch logic
- db.py: Removed forward mode, extracted _expand_tool_boundary() and
added _expand_for_visibility()
- SessionDetailResponse: Removed newest_sequence and forward_paginated
fields
- openapi.json: Removed after_sequence param and forward pagination
response fields
- Frontend hooks/components: Removed forward pagination props and logic
(-1000 lines)
- Updated all tests (backend: 63 pass, frontend: 1517 pass)
### Checklist
- [x] I have clearly listed my changes in the PR description
- [x] Backend unit tests: 63 pass
- [x] Frontend unit tests: 1517 pass
- [x] Frontend lint + types: clean
- [x] Backend format + pyright: clean
### Why / What / How
<!-- Why: Why does this PR exist? What problem does it solve, or what's
broken/missing without it? -->
This PR converts inline Python comments in code examples within
`block-sdk-guide.md` into MkDocs `!!! note` admonitions. This makes code
examples cleaner and more copy-paste friendly while preserving all
explanatory content.
<!-- What: What does this PR change? Summarize the changes at a high
level. -->
Converts inline comments in code blocks to admonitions following the
pattern established in PR #12396 (new_blocks.md) and PR #12313.
<!-- How: How does it work? Describe the approach, key implementation
details, or architecture decisions. -->
- Wrapped code examples with `!!! note` admonitions
- Removed inline comments from code blocks for clean copy-paste
- Added explanatory admonitions after each code block
### Changes 🏗️
- Provider configuration examples (API key and OAuth)
- Block class Input/Output schema annotations
- Block initialization parameters
- Test configuration
- OAuth and webhook handler implementations
- Authentication types and file handling patterns
### Checklist 📋
#### For documentation changes:
- [x] Follows the admonition pattern from PR #12396
- [x] No code changes, documentation only
- [x] Admonition syntax verified correct
#### For configuration changes:
- [ ] `.env.default` is updated or already compatible with my changes
- [ ] `docker-compose.yml` is updated or already compatible with my
changes
---
**Related Issues**: Closes#8946
Co-authored-by: slepybear <slepybear@users.noreply.github.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
## Why
Debug console.log statements were left in production code, which can
leak
sensitive information and pollute browser developer consoles.
## What
Removed console.log from 4 non-legacy frontend components:
- useNavbar.ts: isLoggedIn debug log
- WalletRefill.tsx: autoRefillForm debug log
- EditAgentForm.tsx: category field debug log
- TimezoneForm.tsx: currentTimezone debug log
## How
Simply deleted the console.log lines as they served no purpose
other than debugging during development.
## Checklist
- [x] Code follows project conventions
- [x] Only frontend changes (4 files, 6 lines removed)
- [x] No functionality changes
Co-authored-by: slepybear <slepybear@users.noreply.github.com>
## Why
Scheduled agents weren't well-surfaced in the Library and Copilot
briefings:
- The Library fleet summary didn't count agents that are scheduled
purely via the scheduler (only those with a `recommended_schedule_cron`
set at the agent level).
- Sitrep items didn't distinguish scheduled or listening (trigger-based)
agents, so they often fell back to a generic "idle" state.
- Scheduled chips showed a generic message with no indication of when
the next run would happen.
- The Copilot Agent Briefing surfaced every scheduled agent regardless
of how far out the next run was — an agent scheduled a month away would
take a slot from something actually happening soon.
- Long sitrep messages overflowed the row.
## What
- Add `is_scheduled` to `LibraryAgent` (sourced from the scheduler) so
the frontend can reliably detect schedule-only agents.
- Count scheduled agents in `useLibraryFleetSummary`.
- Include scheduled and listening agents in sitrep items, with a
priority ordering (error → running → stale → success → listening →
scheduled → idle).
- Show a relative next-run time on scheduled sitrep chips (e.g.
"Scheduled to run in 2h" / "in 3d").
- Filter the Copilot Agent Briefing to scheduled agents whose next run
is within the next 3 days.
- Truncate long sitrep messages to 1 line with `OverflowText` and show
the full text in a tooltip on hover.
## How
- Scheduler → `LibraryAgent` mapping populates `is_scheduled` /
`next_scheduled_run`.
- `useSitrepItems` gains an optional `scheduledWithinMs` parameter.
Copilot's `usePulseChips` passes `3 * 24 * 60 * 60 * 1000`; the Library
briefing omits it to keep its existing (unbounded) behavior.
- Scheduled config-based sitrep items are skipped when
`next_scheduled_run` is missing or outside the window.
- `SitrepItem` wraps the message in `OverflowText` so a single-line
ellipsis + hover tooltip replaces raw overflow.
## Test plan
- [ ] `/library` — scheduled and listening agents appear in the sitrep
with accurate copy; fleet summary counts scheduled agents correctly;
long messages truncate with a tooltip on hover.
- [ ] `/copilot` — on an empty session with the `AGENT_BRIEFING` flag
on, the briefing only shows scheduled agents whose next run is within 3
days; agents scheduled further out no longer appear as "scheduled"
chips.
- [ ] Scheduled chip text reads "Scheduled to run in {Nm|Nh|Nd}"
matching `next_scheduled_run`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Why
The Redis dedup lock (`chat:msg_dedup:{session}:{content_hash}`, 30s
TTL) was solving the wrong problem:
- Its purpose: block infra/nginx retries from calling
`append_and_save_message` twice after a client disconnect, writing a
duplicate user message to the DB.
- The approach: deliberately hold the lock for 30s on `GeneratorExit`.
- Why unnecessary: the executor's cluster lock already prevents
duplicate *execution*. The only real gap was duplicate *DB writes* in
the ~1s before the executor picks up the turn.
## What
- **Deleted** `message_dedup.py` and `message_dedup_test.py` (~150 lines
removed).
- **Removed** all dedup lock code from `routes.py` (~40 lines removed).
- **`append_and_save_message`** is now idempotent and self-contained:
- Uses redis-py's built-in `Lock(timeout=10, blocking_timeout=2)` —
Lua-script atomic acquire/release, no manual poll/sleep loop.
- Lock context manager yields `bool` (`True` = acquired, `False` =
degraded). When degraded (Redis down or 2s timeout), reads from DB
directly instead of cache to avoid stale-state duplicates.
- Idempotency check: if `session.messages[-1]` already matches the
incoming role+content, returns `None` instead of the session.
- Lock released explicitly as soon as the write completes; `try/except`
in `finally` so a cleanup error after a successful write never surfaces
a false 500.
- On cache-write failure, the stale cache entry is invalidated so future
reads fall back to the authoritative DB.
- **`routes.py`** uses the `None` signal: `is_duplicate_message = (await
append_and_save_message(...)) is None`
- Skips `create_session` and `enqueue_copilot_turn` for duplicates —
client re-attaches to the existing turn's Redis stream.
- `track_user_message` and `turn_id` generation only happen when
`is_duplicate_message` is false.
- **`subscribe_to_session`** retry window increased from 1×50ms to
3×100ms — covers the window where a duplicate request subscribes before
the original's `create_session` hset completes.
- **Cleaned up** `routes_test.py`: removed 5 dedup-specific tests and
the `mock_redis` setup from `_mock_stream_internals`; added
duplicate-skips-enqueue test.
## How
The idempotency guard distinguishes legit same-text messages from
retries via the **assistant turn between them**: if the user said "yes",
got a response, and says "yes" again, `session.messages[-1]` is the
assistant reply, so the role check fails and the second message goes
through. A retry (no response yet) sees the user message as the last
entry and is blocked.
```python
if (
session.messages
and session.messages[-1].role == message.role
and session.messages[-1].content == message.content
):
return None # duplicate — caller skips enqueue
```
The Redis lock ensures this check always sees authoritative state even
in multi-replica deployments. When the lock is unavailable (Redis down
or contention), reading from DB directly (bypassing potentially stale
cache) provides the same safety guarantee at the cost of a DB
round-trip.
## Checklist
- [x] PR targets `dev`
- [x] Conventional commit title with scope
- [x] Tests added/updated (duplicate detection, lock degradation, DB
error, cache invalidation paths)
- [x] `poetry run format` and `poetry run pyright` pass clean
- [x] No new linter suppressors
### Why / What / How
**Why:** Completed copilot sessions with many messages showed a
completely empty chat view. A user reported a 158-message session that
appeared blank on reload.
**What:** Two bugs fixed:
1. **Backend** — initial page load always returned the newest 50
messages in DESC order. For sessions heavy in tool calls, the user's
original messages (seq 0–5) were never included; all 50 slots consumed
by mid-session tool outputs.
2. **Frontend** — convertChatSessionToUiMessages silently dropped user
messages with null/empty content.
**How:** For completed sessions (no active stream), the backend now
loads from sequence 0 in ASC order. Active/streaming sessions keep
newest-first for streaming context. A new after_sequence forward cursor
enables infinite-scroll for subsequent pages (sentinel moves to bottom).
The frontend wires forward_paginated + newest_sequence end-to-end.
### Changes 🏗️
- db.py: added from_start (ASC) and after_sequence (forward cursor)
modes; added newest_sequence to PaginatedMessages
- routes.py: detect completed vs active on initial load; pass
from_start=True for completed; expose newest_sequence +
forward_paginated; accept after_sequence param
- convertChatSessionToUiMessages.ts: never drop user messages with empty
content
- useLoadMoreMessages.ts: forward pagination via after_sequence; append
pages to end
- ChatMessagesContainer.tsx: LoadMoreSentinel at bottom for
forward-paginated sessions
- Wire newestSequence + forwardPaginated end-to-end through
useChatSession/useCopilotPage/ChatContainer
- openapi.json: add after_sequence + newest_sequence/forward_paginated;
regenerate types
- db_test.py: 9 new unit tests for from_start and after_sequence modes
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Open a completed session with many messages — first user message
visible on initial load
- [x] Scroll to bottom of completed session — load more appends next
page
- [x] Open active/streaming session — newest messages shown first,
streaming unaffected
- [x] Backend unit tests: all 28 pass
- [x] Frontend lint/format: clean, no new type errors
---------
Co-authored-by: chernistry <73943355+chernistry@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
## Summary
- **LibrarySortMenu / AgentFilterMenu**: Force `!bg-transparent` and
neutralise legacy `SelectTrigger` styles (`m-0.5`, `ring-offset-white`,
`shadow-sm`) that caused a white background around the trigger
- **EditNameDialog**: Replace client-side `supabase.auth.updateUser()`
with server-side `PUT /api/auth/user` route — fixes "Auth session
missing!" error caused by `httpOnly` cookies being inaccessible to
browser JS
- **StatsGrid**: Swap label `Text` for `OverflowText` so tile labels
truncate with `…` and show a tooltip instead of wrapping when the grid
is squeezed
- **PulseChips**: Set fixed `15rem` chip width with `shrink-0`,
horizontal scroll, and styled thin scrollbar
- **Tests**: Updated `EditNameDialog` tests to use MSW instead of
mocking Supabase client; added 7 new `PulseChips` integration tests
## Test plan
- [x] `pnpm test:unit` — all 1495 tests pass (91 files)
- [x] `pnpm format && pnpm lint` — clean
- [x] `pnpm types` — no new errors (pre-existing only)
- [ ] QA `/library?sort=updatedAt` — sort menu trigger has no white bg
- [ ] QA `/library` — StatsGrid labels truncate with tooltip on narrow
viewports
- [ ] QA `/copilot` — PulseChips scroll horizontally at fixed width
- [ ] QA `/copilot` — Edit name dialog saves successfully (no "Auth
session missing!")
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Requested by @Torantulino
Adds the 2 xAI Grok 4.20 models available on OpenRouter that are missing
from the platform.
## Why
`x-ai/grok-4.20` and `x-ai/grok-4.20-multi-agent` are xAI's current
flagship models (released March 2026) and are available via OpenRouter,
but weren't accessible from the platform's LLM blocks.
## Changes
**`autogpt_platform/backend/backend/blocks/llm.py`**
- Added `GROK_4_20` and `GROK_4_20_MULTI_AGENT` enum members
- Added corresponding `MODEL_METADATA` entries (open_router provider, 2M
context window, price tier 3)
**`autogpt_platform/backend/backend/data/block_cost_config.py`**
- Added `MODEL_COST` entries at 5 credits each (flagship tier, $2/M in)
**`docs/integrations/block-integrations/llm.md`**
- Added new model IDs to all LLM block tables
| Model | Pricing | Context |
|-------|---------|---------|
| `x-ai/grok-4.20` | $2/M in, $6/M out | 2M |
| `x-ai/grok-4.20-multi-agent` | $2/M in, $6/M out | 2M |
Both models use the standard OpenRouter chat completions API — no
special handling needed.
Resolves: SECRT-2196
---------
Co-authored-by: Torantulino <22963551+Torantulino@users.noreply.github.com>
Co-authored-by: Toran Bruce Richards <Torantulino@users.noreply.github.com>
Co-authored-by: Otto (AGPT) <otto@agpt.co>
## Why
Introducing paid subscription tiers (PRO, BUSINESS) so we can charge for
AutoPilot capacity beyond the free tier. Without a billing integration,
all users share the same rate limits regardless of their willingness to
pay for additional capacity.
## What
End-to-end subscription billing system using Stripe Checkout Sessions:
**Backend:**
- `SubscriptionTier` enum (`FREE`, `PRO`, `BUSINESS`, `ENTERPRISE`) on
the `User` model
- `POST /credits/subscription` — creates a Stripe Checkout Session for
paid upgrades; for FREE tier or when `ENABLE_PLATFORM_PAYMENT` is off,
sets tier directly
- `GET /credits/subscription` — returns current tier, monthly cost
(cents), and all tier costs
- `POST /credits/stripe_webhook` — handles
`customer.subscription.created/updated/deleted`,
`checkout.session.completed`, `charge.dispute.*`, `refund.created`
- `sync_subscription_from_stripe()` — keeps `User.subscriptionTier` in
sync from webhook events; guards against out-of-order delivery
(cancelled event after new sub created), ENTERPRISE overwrite, and
duplicate webhook replay
- Open-redirect protection on `success_url`/`cancel_url` via
`_validate_checkout_redirect_url()`
- `_cancel_customer_subscriptions()` — cancels both active and trialing
subs; propagates errors so callers can avoid updating DB tier on Stripe
failure
- `_cleanup_stale_subscriptions()` — best-effort cancellation of old
subs when a new one becomes active (paid-to-paid upgrade), to prevent
double-billing
- `get_stripe_customer_id()` with idempotency key to prevent duplicate
Stripe customers on concurrent requests
- `cache_none=False` sentinel fix in `@cached` decorator so Stripe price
lookups retry on transient error instead of poisoning the cache with
`None`
- Stripe Price IDs read from LaunchDarkly (`stripe-price-id-pro`,
`stripe-price-id-business`). If not configured, upgrade returns 422.
**Frontend:**
- `SubscriptionTierSection` component on billing page: tier cards
(FREE/PRO/BUSINESS), upgrade/downgrade buttons, per-tier cost display,
Stripe redirect on upgrade
- Confirmation dialog for downgrades
- ENTERPRISE users see a read-only admin-managed banner
- Success toast on return from Stripe Checkout (`?subscription=success`)
- Uses generated `useGetSubscriptionStatus` /
`useUpdateSubscriptionTier` hooks
## How
- Paid upgrades use Stripe Checkout Sessions (not server-side
subscription creation) — Stripe handles PCI-compliant card collection
and the subscription lifecycle
- Tier is synced back via webhook on
`customer.subscription.created/updated/deleted`
- Downgrade to FREE cancels via Stripe API immediately; a
`stripe.StripeError` during cancellation returns 502 with a generic
message (no Stripe detail leakage)
- LaunchDarkly flags: `stripe-price-id-pro` (string),
`stripe-price-id-business` (string), `enable-platform-payment` (bool)
- `ENABLE_PLATFORM_PAYMENT=false` bypasses Stripe for beta/internal
access (sets tier directly)
## Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `ENABLE_PLATFORM_PAYMENT=false` → tier change updates directly, no
Stripe redirect
- [x] `ENABLE_PLATFORM_PAYMENT=true` with price IDs configured → paid
upgrade redirects to Stripe Checkout
- [x] Stripe webhook `customer.subscription.created` →
`User.subscriptionTier` updated
- [x] Unrecognised price ID in webhook → logs warning, tier unchanged
- [x] ENTERPRISE user webhook event → tier not overwritten
- [x] Empty `STRIPE_WEBHOOK_SECRET` → 503 (prevents HMAC bypass)
- [x] Open-redirect attack on `success_url`/`cancel_url` → 422
#### For configuration changes:
- [x] No `.env` or `docker-compose.yml` changes
- [x] LaunchDarkly flags to create: `stripe-price-id-pro` (string),
`stripe-price-id-business` (string), `enable-platform-payment` (bool)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: majdyz <majdy.zamil@gmail.com>
## Why
After PR #12804 was squashed into dev, two module-level helper functions
in `backend/copilot/sdk/service.py` remained private (`_`-prefixed)
while being directly imported by name in `sdk/transcript_test.py`.
Pyright reports `reportAttributeAccessIssue` when tests (even those
excluded from CI lint) import private symbols from outside their
defining module.
## What
Rename two helpers to remove the underscore prefix:
- `_process_cli_restore` → `process_cli_restore`
- `_read_cli_session_from_disk` → `read_cli_session_from_disk`
Update call sites in `service.py` and imports/calls/docstrings in
`sdk/transcript_test.py`.
## How
Pure rename — no logic change. Both functions were already module-level
helpers with no reason to be private; the underscore was convention
carried over during the refactor but they are directly unit-tested and
should be public.
All 66 `sdk/transcript_test.py` tests pass after the rename.
## Checklist
- [x] Tests pass (`poetry run pytest
backend/copilot/sdk/transcript_test.py`)
- [x] No `_`-prefixed symbols imported across module boundaries
- [x] No linter suppressors added
Conflicts in baseline/service.py, baseline/transcript_integration_test.py,
and transcript.py arose because dev-only commit 0cd0a76305
(baseline upload fix) overlapped with the same fix in PR #12804 which
landed in master. Took master's version for all three files — it is the
complete, reviewed implementation.
### Why / What / How
**Why:** The copilot had two separate GCS paths (`cli-sessions/` and
`chat-transcripts/`), redundant function names
(`upload_cli_session`/`restore_cli_session`), and no shared context
strategy between modes. When switching from baseline→SDK or
SDK→baseline, the receiving mode discarded the stored transcript and
fell back to full DB reconstruction — loading all raw messages instead
of the compacted form — causing inflated context, wasted tokens, and
loss of CLI compaction summaries.
**What:**
- Single GCS path (`cli-sessions/`) for both modes — `chat-transcripts/`
removed
- Unified public API: `upload_transcript` / `download_transcript` /
`TranscriptDownload`
- `TranscriptMode = Literal["sdk", "baseline"]` persisted in
`.meta.json` — SDK skips `--resume` when `mode != "sdk"`
(baseline-written JSONL has stripped fields / synthetic IDs)
- `extract_context_messages(download, session_messages)` — shared
context primitive used by **both SDK and baseline**: reads compacted
transcript content + fills only the DB gap (messages after watermark),
so CLI compaction summaries are preserved across mode switches
- Watermark fix: `_jsonl_covered = transcript_msg_count + 2` when a real
transcript is present, preventing false gap detection after `--resume`
- Baseline gap-fill: `_append_gap_to_builder` converts `ChatMessage` →
JSONL entries; no more silently discarded stale transcripts
**How:**
```
SDK turn (mode="sdk" transcript available):
──► --resume [full CLI session restored natively]
──► inject gap prefix if DB has messages after watermark
SDK turn (mode="baseline" transcript available):
──► cannot --resume (synthetic CLI IDs)
──► extract_context_messages(download, session_messages):
returns transcript JSONL (compacted, isCompactSummary preserved) + gap
excludes session_messages[-1] (current turn — caller injects it separately)
──► format as <conversation_history> + "Now, the user says: {current}"
Baseline turn (any transcript):
──► _load_prior_transcript → TranscriptDownload
──► extract_context_messages(download, session_messages) + session_messages[-1]
replaces full session.messages DB read
──► LLM messages: [compacted history + gap] + [current user turn]
Transcript unavailable — both SDK (use_resume=False) and baseline:
──► extract_context_messages(None, session_messages) returns session_messages[:-1]
(all prior DB messages except the current user turn at [-1])
──► graceful fallback — no crash, no empty context
──► covers: first turn, GCS error, corrupt JSONL, missing .meta.json
──► next successful response uploads a fresh transcript
```
`extract_context_messages` is the shared primitive — both modes call the
same function, which handles:
- `download=None` (first turn, GCS unavailable) → falls back to
`session_messages[:-1]`
- Empty/corrupt content → falls back to `session_messages[:-1]`
- `bytes` content (raw GCS) or `str` content (pre-decoded baseline path)
- `isCompactSummary=True` entries → preserved so CLI compaction survives
mode switches
- Missing/corrupt `.meta.json` → `message_count` defaults to `0`, `mode`
defaults to `"sdk"`
**Why `[:-1]` and not all messages?** `session_messages[-1]` is always
the current user turn being handled right now. Both callers inject it
separately — SDK wraps it as `"Now, the user says: ..."`, baseline
appends it as the final message in the LLM array. Returning it inside
`extract_context_messages` would double-inject it.
### Changes 🏗️
- **`transcript.py`**: `CliSessionRestore` → `TranscriptDownload` +
`mode` field; `upload_cli_session` → `upload_transcript`;
`restore_cli_session` → `download_transcript`; add `TranscriptMode`,
`detect_gap`, `extract_context_messages`; import `ChatMessage` via
relative path to match `service.py` style
- **`sdk/service.py`**: mode-check before `--resume`; `_RestoreResult`
carries `baseline_download` + `context_messages` + `transcript_content`;
`_build_query_message` accepts `prior_messages` override;
`_restore_cli_session_for_turn` populates `context_messages` via
`extract_context_messages` and sets `transcript_content` to prevent
duplicate DB reconstruction; watermark fix (`_jsonl_covered =
transcript_msg_count + 2`)
- **`baseline/service.py`**: `_load_prior_transcript` returns `(bool,
TranscriptDownload | None)`; LLM context replaced with
`extract_context_messages(download, messages)`; `_append_gap_to_builder`
+ `detect_gap` call; `upload_transcript(mode="baseline")`
- **`sdk/transcript.py`**: updated re-exports, old aliases removed
- **`scripts/download_transcripts.py`**: updated for `bytes | str`
content type
- **Test files**: 179 tests total; `transcript_test.py`,
`baseline/transcript_integration_test.py`,
`sdk/service_helpers_test.py`, `sdk/test_transcript_watermark.py`,
`test/copilot/test_transcript_watermark.py` all updated/added
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] 179 unit tests pass — `transcript_test`,
`baseline/transcript_integration_test`, `sdk/service_helpers_test`,
`sdk/test_transcript_watermark`
- [x] pyright 0 errors on all changed files
- [x] SDK `--resume` path still works when `mode="sdk"` transcript is
present
- [x] SDK fallback uses `extract_context_messages` (compacted baseline
content + gap) when `mode="baseline"` transcript is stored — no more
full DB reconstruction
- [x] Baseline uses `extract_context_messages` per turn instead of full
`session.messages` DB read
- [x] `isCompactSummary=True` entries preserved across mode switches
- [x] Watermark (`_jsonl_covered`) fix prevents false gap detection
after `--resume`
- [x] Baseline gap detection no longer silently discards stale
transcripts
- [x] `TranscriptDownload.content` accepts `bytes | str` — backward
compatible
- [x] Transcript unavailable (GCS error, first turn, corrupt file)
gracefully falls back to `session_messages[:-1]` without crash — applies
to both SDK and baseline paths
---------
Co-authored-by: chernistry <73943355+chernistry@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
_load_prior_transcript was returning False for missing/invalid transcripts,
which caused should_upload_transcript to suppress the upload. The original
intent was to protect against overwriting a *newer* GCS version — but a
missing or corrupt file is not 'newer'. Only stale (watermark ahead) and
download errors (unknown GCS state) should suppress upload.
Also renames transcript_covers_prefix → transcript_upload_safe throughout
to accurately describe what the flag means.
Added instruction to check GitHub authentication status before prompting
user. This prevents repeated, unnecessary asking of the user to add
their GitHub credentials when they're already added, which is currently
a prevalent bug.
### Changes 🏗️
- Added one line to
`autogpt_platform/backend/backend/copilot/prompting.py` instructing
AutoPilot to run `gh auth status` before prompting the user to connect
their GitHub account.
Co-authored-by: Toran Bruce Richards <22963551+Torantulino@users.noreply.github.com>
Reduced minZoom on the builder canvas from 0.1 to 0.05 to allow zooming
out further when working with large agent graphs.
Fixes#9325
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
- Use asyncio.gather for independent storage_limit + current_usage lookups
- Fix formatBytes unit boundary rollup to use 1024 threshold (binary units)
- Add parametrized formatBytes tests covering all unit boundaries
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### Why / What / How
**Why:** `fffbe0aad8` changed both `ChatConfig.model` and
`ChatConfig.claude_agent_fallback_model` to `claude-sonnet-4-6`. The
Claude Code CLI rejects this with `Error: Fallback model cannot be the
same as the main model`, causing every standard-mode copilot turn to
fail with exit code 1 — the session "completes" in ~30s but produces no
response and drops the transcript.
**What:** Set `claude_agent_fallback_model` default to `""`.
`_resolve_fallback_model()` already returns `None` on empty string,
which means the `--fallback-model` flag is simply not passed to the CLI.
On 529 overload errors the turn will surface normally instead of
silently retrying with a fallback.
**How:** One-line config change + test update.
### Changes 🏗️
- `ChatConfig.claude_agent_fallback_model` default:
`"claude-sonnet-4-6"` → `""`
- Update `test_fallback_model_default` to assert the empty default
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `poetry run pytest backend/copilot/sdk/p0_guardrails_test.py`
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
### Why / What / How
Why: Copilot/Autopilot standard requests were still defaulting to Claude
Sonnet 4, while the expected default for this path is Sonnet 4.6.
What: This PR updates the backend Copilot defaults so the
standard/default path and fast path use Sonnet 4.6, and aligns the SDK
fallback model and related test expectations.
How: It changes `ChatConfig.model`, `ChatConfig.fast_model`, and
`ChatConfig.claude_agent_fallback_model` to Sonnet 4.6 values, then
updates backend tests that assert the default Sonnet model strings.
### Changes 🏗️
- Switch `ChatConfig.model` from `anthropic/claude-sonnet-4` to
`anthropic/claude-sonnet-4-6`
- Switch `ChatConfig.fast_model` from `anthropic/claude-sonnet-4` to
`anthropic/claude-sonnet-4-6`
- Switch `ChatConfig.claude_agent_fallback_model` from
`claude-sonnet-4-20250514` to `claude-sonnet-4-6`
- Update backend Copilot tests that assert the default Sonnet model
strings
- Configuration changes:
- No new environment variables or docker-compose changes are required
- Existing `.env.default` and compose files remain compatible because
this only changes backend default model values in code
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `poetry run format`
- [x] `poetry run pytest
backend/copilot/baseline/transcript_integration_test.py`
- [x] `poetry run pytest backend/copilot/sdk/service_helpers_test.py`
- [x] `poetry run pytest backend/copilot/sdk/service_test.py`
- [x] `poetry run pytest backend/copilot/sdk/p0_guardrails_test.py`
<details>
<summary>Example test plan</summary>
- [ ] Create from scratch and execute an agent with at least 3 blocks
- [ ] Import an agent from file upload, and confirm it executes
correctly
- [ ] Upload agent to marketplace
- [ ] Import an agent from marketplace and confirm it executes correctly
- [ ] Edit an agent from monitor, and confirm it executes correctly
</details>
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
<details>
<summary>Examples of configuration changes</summary>
- Changing ports
- Adding new services that need to communicate with each other
- Secrets or environment variable changes
- New or infrastructure changes such as databases
</details>
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Changes default/fallback LLM model identifiers for Copilot requests,
which can affect runtime behavior, cost, and availability
characteristics across both baseline and SDK paths. Risk is mitigated by
being a small, config-only change with updated tests.
>
> **Overview**
> Updates Copilot backend defaults so both the standard (`model`) and
fast (`fast_model`) paths use `anthropic/claude-sonnet-4-6`, and aligns
the Claude Agent SDK fallback model to `claude-sonnet-4-6`.
>
> Adjusts related test expectations in baseline transcript integration
and SDK helper tests to match the new Sonnet 4.6 model strings.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
563361ac11. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
The Claude Code CLI auto-compacts its native session JSONL when the context
approaches the model's token limit (~200K for Sonnet). After compaction the
detailed conversation history is replaced by a ~27K-token summary, causing
the silent context loss users see as memory failures in long sessions.
Root cause identified from production logs for session 93ecf7c9:
- T6 CLI session: 233KB / ~207K tokens (near Sonnet limit)
- T7 CLI compacted session -> ~167KB / ~47K tokens (PreCompact hook missed)
- T12 second compaction -> ~176KB / ~27K tokens (just system prompt + summary)
- T14-T21: cache_read=26714 constantly -- only system prompt visible to Claude
The same stripping we already apply to our transcript (stale thinking blocks,
progress/metadata entries) now also runs on the CLI native session file. At
~2x the size of the stripped transcript, unstripped sessions routinely hit the
compaction threshold within 6-10 turns of a heavy Opus/thinking session.
After stripping:
- same-pod turns reuse the stripped local file (no compaction trigger)
- cross-pod turns restore the stripped GCS file (same benefit)
When a user switches from baseline (fast) mode to SDK (extended_thinking)
mode mid-session, the first SDK turn has has_history=True (prior baseline
messages in DB) but no CLI session file in storage.
The old code gated session_id on `not has_history`, so mode-switch T1
never received a session_id — the CLI generated a random ID that wasn't
uploaded under the expected key. Every subsequent SDK turn would fail to
restore the CLI session and run without --resume, injecting the full
compressed history on each turn, causing model confusion.
Fix: set session_id whenever not using --resume (the `else` branch),
covering T1 fresh, mode-switch T1, and T2+ fallback turns. The retry
path is updated to use `"session_id" in sdk_options_kwargs` as the
discriminator (instead of `not has_history`) so mode-switch T1 retries
also keep the session_id while T2+ retries (where T1 restored a session
file via restore_cli_session) still remove it to avoid "Session ID
already in use".
### Why / What / How
**Why:** CoPilot's Graphiti memory system needed structured metadata to
distinguish memory types (rules, procedures, facts, preferences),
support scoped retrieval, enable targeted deletion, and track memory
costs under the AutoPilot billing account separately from the platform.
**What:** Adds the MemoryEnvelope metadata model, structured
rule/procedure memory types, a derived-finding lane for
assistant-distilled knowledge, two-step forget tools, scope-aware
retrieval filtering, AutoPilot-dedicated API key routing, and several
reliability fixes (streaming socket leaks, event-loop-scoped caches,
ingestion hardening).
**How:** MemoryEnvelope wraps every stored episode with typed metadata
(source_kind, memory_kind, scope, status, confidence) serialized as
JSON. Retrieval filters by scope at the context layer. The forget flow
uses a search-then-confirm two-step pattern. Ingestion queues and client
caches are scoped per event loop via WeakKeyDictionary to prevent
cross-loop RuntimeErrors in multi-worker deployments. API key resolution
falls back to AutoPilot-dedicated keys (CHAT_API_KEY,
CHAT_OPENAI_API_KEY) before platform-wide keys.
### Changes 🏗️
**New: MemoryEnvelope metadata model** (`memory_model.py`)
- Typed memory categories: fact, preference, rule, finding, plan, event,
procedure
- Source tracking: user_asserted, assistant_derived, tool_observed
- Scope namespacing: `real:global`, `project:<name>`, `book:<title>`,
`session:<id>`
- Status lifecycle: active, tentative, superseded, contradicted
- Structured `RuleMemory` and `ProcedureMemory` models for complex
instructions
**New: Targeted forget tools** (`graphiti_forget.py`)
- `memory_forget_search`: returns candidate facts with UUIDs for user
confirmation
- `memory_forget_confirm`: deletes specific edges by UUID after
confirmation
**New: Architecture test** (`architecture_test.py`)
- Validates no new `@cached(...)` usage around event-loop-bound async
clients
- Allowlists pre-existing violations for future cleanup
**Enhanced: memory_store tool** (`graphiti_store.py`)
- Accepts MemoryEnvelope metadata fields (source_kind, scope,
memory_kind, rule, procedure)
- Wraps content in MemoryEnvelope before ingestion
**Enhanced: memory_search tool** (`graphiti_search.py`)
- Scope-aware retrieval with hard filtering on group_id
**Enhanced: Ingestion pipeline** (`ingest.py`)
- Derived-finding lane: distills substantive assistant responses into
tentative findings
- Event-loop-scoped queues and workers via WeakKeyDictionary (fixes
multi-worker RuntimeError)
- Improved error handling and dropped-episode reporting
**Enhanced: Client cache** (`client.py`)
- Per-loop client cache and lock via WeakKeyDictionary (fixes "Future
attached to a different loop")
**Enhanced: Warm context** (`context.py`)
- Filters out non-global-scope episodes from warm context
**Fix: Streaming socket leak** (`baseline/service.py`)
- try/finally around async stream iteration to release httpx connections
on early exit
**Config: AutoPilot key routing** (`config.py`, `.env.default`)
- LLM key fallback: GRAPHITI_LLM_API_KEY → CHAT_API_KEY →
OPEN_ROUTER_API_KEY
- Embedder key fallback: GRAPHITI_EMBEDDER_API_KEY → CHAT_OPENAI_API_KEY
→ OPENAI_API_KEY
- Backwards-compatible: existing behavior unchanged until new keys are
provisioned
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `poetry run pytest backend/copilot/graphiti/config_test.py` — 16
tests pass (key fallback priority)
- [x] `poetry run pytest backend/copilot/tools/graphiti_store_test.py` —
store envelope tests pass
- [x] `poetry run pytest backend/copilot/graphiti/ingest_test.py` —
ingestion tests pass
- [x] `poetry run pytest backend/util/architecture_test.py` — structural
validation passes
- [x] Verify memory store/retrieve/forget cycle via copilot chat
- [x] Run AgentProbe multi-session memory benchmark (31 scenarios x3
repeats)
- [x] Confirm no CLOSE_WAIT socket accumulation under sustained
streaming load
- [x] Verify multi-worker deployment doesn't produce loop-binding errors
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- Configuration changes:
- New optional env var `CHAT_OPENAI_API_KEY` — AutoPilot-dedicated
OpenAI key for Graphiti embeddings (falls back to `OPENAI_API_KEY` if
not set)
- `CHAT_API_KEY` now used as first fallback for Graphiti LLM calls (was
`OPEN_ROUTER_API_KEY`)
- Infra action needed: add `CHAT_OPENAI_API_KEY` sealed secret in
`autogpt-shared-config` values (dev + prod)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Touches Graphiti memory ingestion/retrieval and introduces hard-delete
capabilities plus event-loop–scoped caching/queues; failures could
affect memory correctness or delete the wrong edges. Also changes
streaming resource cleanup and key routing, which could surface as
connection or billing/cost attribution issues if misconfigured.
>
> **Overview**
> **Graphiti memory is upgraded from plain text episodes to a structured
JSON `MemoryEnvelope`.** `memory_store` now wraps content with typed
metadata (source, kind, scope, status) and optional structured
`rule`/`procedure` payloads, and ingestion supports JSON episodes.
>
> **Memory retrieval and lifecycle controls are expanded.**
`memory_search` adds optional scope hard-filtering to prevent
cross-scope leakage, warm-context formatting drops non-global scoped
episodes (and avoids empty wrappers), and new two-step tools
(`memory_forget_search` → `memory_forget_confirm`) enable targeted soft-
or hard-deletion of specific graph edges by UUID.
>
> **Reliability and multi-worker safety improvements.** Graphiti client
caching and ingestion worker registries are now per-event-loop (avoiding
cross-loop `Future` errors), streaming chat completions explicitly close
async streams to prevent `CLOSE_WAIT` socket leaks, warm-context is
injected into the first user message to keep the system prompt
cacheable, and a new `architecture_test.py` blocks future process-wide
caching of event-loop–bound async clients. Config updates route Graphiti
LLM/embedder keys to AutoPilot-specific env vars first, and OpenAPI
schema exports include the new memory response types.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
5fb4bd0a43. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
## Changes
**Root cause:** When a copilot session ends with a tool result as the
last saved message (`last_role=tool`), the next assistant response is
never persisted. This happens when:
1. An intermediate flush saves the session with `last_role=tool` (after
a tool call completes)
2. The Claude Agent SDK generates a text response for the next turn
3. The client disconnects (`GeneratorExit`) at the `yield
StreamStartStep` — the very first yield of the new turn
4. `_dispatch_response(StreamTextDelta)` is never called, so the
assistant message is never appended to `ctx.session.messages`
5. The session `finally` block persists the session still with
`last_role=tool`
**Fix:** In `_run_stream_attempt`, after `convert_message()` returns the
full list of adapter responses but *before* entering the yield loop,
pre-create the assistant message placeholder in `ctx.session.messages`
when:
- `acc.has_tool_results` is True (there are pending tool results)
- `acc.has_appended_assistant` is True (at least one prior message
exists)
- A `StreamTextDelta` is present in the batch (confirms this is a text
response turn)
This ensures that even if `GeneratorExit` fires at the first `yield`,
the placeholder assistant message is already in the session and will be
persisted by the `finally` block.
**Tests:** Added `session_persistence_test.py` with 7 unit tests
covering the pre-create condition logic and delta accumulation behavior.
**Confirmed:** Langfuse trace `e57ebd26` for session
`465bf5cf-7219-4313-a1f6-5194d2a44ff8` showed the final assistant
response was logged at 13:06:49 but never reached DB — session had 51
messages with `last_role=tool`.
## Checklist
- [x] My code follows the code style of this project
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation (N/A)
- [x] My changes generate no new warnings (Pyright warnings are
pre-existing)
- [x] I have added tests that prove my fix is effective
- [x] New and existing unit tests pass locally with my changes
---------
Co-authored-by: Zamil Majdy <zamilmajdy@gmail.com>
### Why / What / How
**Why:** Anthropic prompt caching keys on exact system prompt content.
Two sources of per-session dynamic data were leaking into the system
prompt, making it unique per session/user — causing a full 28K-token
cache write (~$0.10 on Sonnet) on *every* first message for *every*
session instead of once globally per model.
**What:**
1. `get_sdk_supplement` was embedding the session-specific working
directory (`/tmp/copilot-<uuid>`) in the system prompt text. Every
session has a different UUID, making every session's system prompt
unique, blocking cross-session cache hits.
2. Graphiti `warm_ctx` (user-personalised memory facts fetched on the
first turn) was appended directly to the system prompt, making it unique
per user per query.
**How:**
- `get_sdk_supplement` now uses the constant placeholder
`/tmp/copilot-<session-id>` in the supplement text and memoizes the
result. The actual `cwd` is still passed to `ClaudeAgentOptions.cwd` so
the CLI subprocess uses the correct session directory.
- `warm_ctx` is now injected into the first user message as a trusted
`<memory_context>` block (prepended before `inject_user_context` runs),
following the same pattern already used for business understanding. It
is persisted to DB and replayed correctly on `--resume`.
- `sanitize_user_supplied_context` now also strips user-supplied
`<memory_context>` tags, preventing context-spoofing via the new tag.
After this change the system prompt is byte-for-byte identical across
all users and sessions for a given model.
### Changes 🏗️
- `backend/copilot/prompting.py`: `get_sdk_supplement` ignores `cwd` and
uses a constant working-directory placeholder; result is memoized in
`_LOCAL_STORAGE_SUPPLEMENT`.
- `backend/copilot/sdk/service.py`: `warm_ctx` is saved to a local
variable instead of appended to `system_prompt`; on the first turn it is
prepended to `current_message` as a `<memory_context>` block before
`inject_user_context` is called.
- `backend/copilot/service.py`: `sanitize_user_supplied_context`
extended to strip `<memory_context>` blocks alongside `<user_context>`.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `poetry run pytest backend/copilot/prompting_test.py
backend/copilot/prompt_cache_test.py` — all passed
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
---------
Co-authored-by: Zamil Majdy <zamilmajdy@gmail.com>
### Why / What / How
Fixes two SECRT-2226 bugs in copilot chat pagination.
**Bug 1 — can't load older messages when the newest page fits on
screen.** The `IntersectionObserver` in `LoadMoreSentinel` bailed when
`scrollHeight <= clientHeight`, which happens routinely once reasoning +
tool groups collapse. With no scrollbar and no button, users were stuck.
Fix: remove the guard, cap auto-fill at 3 non-scrollable rounds (keeps
the original anti-loop intent), and add a manual "Load older messages"
button as the always-available escape hatch.
**Bug 2 — older loaded pages vanish after a new turn, then reloading
them produces duplicates.** After each stream `useCopilotStream`
invalidates the session query; the refetch returns a shifted
`oldest_sequence`, which `useLoadMoreMessages` used as a signal to wipe
`olderRawMessages` and reset the local cursor. Scroll-back history was
lost on every turn, and the next load fetched a page that overlapped
with AI SDK's retained `currentMessages` — the "loops" users reported.
Fix: once any older page is loaded, preserve `olderRawMessages` and the
local cursor across same-session refetches. Only reset on session
change. The gap between the new initial window and older pages is
covered by AI SDK's retained state.
### Changes 🏗️
- `ChatMessagesContainer.tsx`: drop the scrollability guard; add
`MAX_AUTO_FILL_ROUNDS = 3` counter; add "Load older messages" button
(`ghost`/`small`); distinguish observer-triggered vs. button-triggered
loads so the button bypasses the cap; export `LoadMoreSentinel` for
testing.
- `useLoadMoreMessages.ts`: remove the wipe-and-reset branch on
`initialOldestSequence` change; preserve local state mid-session; still
mirror parent's cursor while no older page is loaded.
- New integration test `__tests__/LoadMoreSentinel.test.tsx`.
No backend changes.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Short/collapsed newest page: "Load older messages" button loads
older pages, preserves scroll
- [x] Full-viewport newest page: scroll-to-top auto-pagination still
works (no regression)
- [x] `has_more_messages=false` hides the button; `isLoadingMore=true`
shows spinner instead
- [x] Bug 2 reproduced locally with temporary `limit=5`: before fix
older page vanished and next load duplicated AI SDK messages; after fix
older page stays and next load fetches cleanly further back
- [x] `pnpm format`, `pnpm lint`, `pnpm types`, `pnpm test:unit` all
pass (1208/1208)
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**) — N/A
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Why
Two fixes bundled together:
1. **ModelToggleButton styling**: after merging the ModelToggleButton
feature, the "Standard" state was invisible — no background, no label —
while "Advanced" had a colored pill. This was inconsistent with
`ModeToggleButton` where both states (Fast / Thinking) always show a
colored background + label.
2. **Execution ID filter on platform cost admin page**: admins needed to
look up cost rows for a specific agent run but had no way to filter by
`graph_exec_id`. All other identifiers (user, model, provider, block,
tracking type) were already filterable.
## What
- **ModelToggleButton**: inactive (Standard) state now uses
`bg-neutral-100 text-neutral-700 hover:bg-neutral-200` (same palette as
ModeToggleButton inactive), always shows the "Standard" label.
- **Platform cost admin page**: added `graph_exec_id` query filter
across the full stack — backend service functions, FastAPI route
handlers, generated TypeScript params types, `usePlatformCostContent`
hook, and the filter UI in `PlatformCostContent`.
## How
### ModelToggleButton
Changed the inactive-state class from hover-only transparent to
always-visible neutral background, and added the "Standard" text label
(was empty before — only the CPU icon showed).
### Execution ID filter
Added `graph_exec_id: str | None = None` parameter to:
- `_build_prisma_where` — applies `where["graphExecId"] = graph_exec_id`
- `get_platform_cost_dashboard`, `get_platform_cost_logs`,
`get_platform_cost_logs_for_export`
- All three FastAPI route handlers (`/dashboard`, `/logs`,
`/logs/export`)
- Generated TypeScript params types
- `usePlatformCostContent`: new `executionIDInput` /
`setExecutionIDInput` state, wired into `filterParams`, `handleFilter`,
and `handleClear`
- `PlatformCostContent`: new Execution ID input field in the filter bar
## Changes
- [x] I have explained why I made the changes, not just what I changed
- [x] There are no unrelated changes in this PR
- [x] I have run the relevant linters and tests before submitting
---------
Co-authored-by: Zamil Majdy <zamilmajdy@gmail.com>
## Why
When a user switches from **baseline** (fast) mode to **SDK**
(extended_thinking) mode mid-session, every subsequent SDK turn started
fresh with no memory of prior conversation.
Root cause: two complementary bugs on mode-switch T1 (first SDK turn
after baseline turns):
1. `session_id` was gated on `not has_history`. On mode-switch T1,
`has_history=True` (prior baseline turns in DB) so no `session_id` was
set. The CLI generated a random ID and could not upload the session file
under a predictable path → `--resume` failed on every following SDK
turn.
2. Even if `session_id` were set, the upload guard `(not has_history or
state.use_resume)` would block the session file upload on mode-switch T1
(`has_history=True`, `use_resume=False`), so the next turn still cannot
`--resume`.
Together these caused every SDK turn to re-inject the full compressed
history, causing model confusion (proactive tool calls, forgetting
context) observed in session `8237a27b-45d0-4688-af20-c185379e926f`.
## What
- **`service.py`**: Change `elif not has_history:` → `else:` for the
`session_id` assignment — set it whenever `--resume` is not active.
Covers T1 fresh, mode-switch T1 (`has_history=True` but no CLI session
exists), and T2+ fallback turns where restore failed.
- **`service.py` retry path**: Replace `not has_history` with
`"session_id" in sdk_options_kwargs` as the discriminator, so
mode-switch T1 retries also keep `session_id` while T2+ retries (where
`restore_cli_session` put a file on disk) correctly remove it to avoid
"Session ID already in use".
- **`service.py` upload guard**: Remove `and not skip_transcript_upload`
and `and (not has_history or state.use_resume)` from the
`upload_cli_session` guard. The CLI session file is independent of the
JSONL transcript; and upload must run on mode-switch T1 so the next turn
can `--resume`. `upload_cli_session` silently skips when the file is
absent, so unconditional upload is always safe.
## How
| Scenario | Before | After |
|---|---|---|
| T1 fresh (`has_history=False`) | `session_id` set ✓ | `session_id` set
✓ |
| Mode-switch T1 (`has_history=True`, no CLI session) | ❌ not set —
**bug** | `session_id` set ✓ |
| T2+ with `--resume` | `resume` set ✓ | `resume` set ✓ |
| T2+ retry after `--resume` failed | `session_id` removed ✓ |
`session_id` removed ✓ |
| Mode-switch T1 retry | `session_id` removed ❌ | `session_id` kept ✓ |
| Upload on mode-switch T1 | ❌ blocked by guard — **bug** | uploaded ✓ |
7 new unit tests in `TestSdkSessionIdSelection` document all session_id
cases.
6 new tests in `mode_switch_context_test.py` cover transcript bridging
for both fast→SDK and SDK→fast switches.
## Checklist
- [x] I have read the contributing guidelines
- [x] My changes are covered by tests
- [x] `poetry run format` passes
---------
Co-authored-by: Zamil Majdy <zamilmajdy@gmail.com>
## Why
After merging #12782 to dev, a k8s rolling deployment triggered
infrastructure-level POST retries — nginx detected the old pod's
connection reset mid-stream and resent the same POST to a new pod. Both
pods independently saved the user message and ran the executor,
producing duplicate entries in the DB (seq 159, 161, 163) and a
duplicate response in the chat. The model saw the same question 3× in
its context window and spent its response commenting on that instead of
answering.
Two compounding issues:
1. **No backend idempotency**: `append_and_save_message` saves
unconditionally — k8s/nginx retries silently produce duplicate turns.
2. **Frontend dedup cleared after success**:
`lastSubmittedMsgRef.current = null` after every completed turn wipes
the dedup guard, so any rapid re-submit of the same text (from a stalled
UI or user double-click) slips through.
## What
**Backend** — Redis idempotency gate in `stream_chat_post`:
- Before saving the user message, compute `sha256(session_id +
message)[:16]` and `SET NX ex=30` in Redis
- If key already exists → duplicate: return empty SSE (`StreamFinish +
[DONE]`) immediately, skip save + executor enqueue
- User messages only (`is_user_message=True`); system/assistant messages
bypass the check
**Frontend** — Keep `lastSubmittedMsgRef` populated after success:
- Remove `lastSubmittedMsgRef.current = null` on stream complete
- `getSendSuppressionReason` already has a two-condition check: `ref ===
text AND lastUserMsg === text` — so legitimate re-asks (after a
different question was answered) still work; only rapid re-sends of the
exact same text while it's still the last user message are blocked
## How
- 30 s Redis TTL covers infrastructure retry windows (k8s SIGTERM →
connection reset → ingress retry typically < 5 s)
- Empty SSE response is well-formed (StreamFinish + [DONE]) — frontend
AI SDK marks the turn complete without rendering a ghost message
- Frontend ref kept live means: submit "foo" → success → submit "foo"
again instantly → suppressed. Submit "foo" → success → submit "bar" →
proceeds (different text updates the ref).
## Tests
- 3 new backend route tests: duplicate blocked, first POST proceeds,
non-user messages bypass
- 5 new frontend `getSendSuppressionReason` unit tests: fresh ref,
reconnecting, duplicate suppressed, different-turn re-ask allowed,
different text allowed
## Checklist
- [x] I have read the [AutoGPT Contributing
Guide](https://github.com/Significant-Gravitas/AutoGPT/blob/master/CONTRIBUTING.md)
- [x] I have performed a self-review of my code
- [x] I have added tests that prove the fix is effective
- [x] I have run `poetry run format` and `pnpm format` + `pnpm lint`
## Summary
Fixes stream disconnection bugs where the UI shows "running" with no
output when users switch between copilot chat sessions. The root cause
is that the old SSE fetch is not aborted and backend XREAD listeners
keep running until timeout when switching sessions.
### Changes
**Frontend (`useCopilotStream.ts`, `helpers.ts`)**
- Call `sdkStop()` on session switch to abort the in-flight SSE fetch
from the old session's transport
- Fire-and-forget `DELETE` to new backend disconnect endpoint so
server-side listeners release immediately
- Store `resumeStream` and `sdkStop` in refs to fix stale closure bugs
in:
- Wake re-sync visibility handler (could call stale `resumeStream` after
tab sleep)
- Reconnect timer callback (could target wrong session's transport)
- Resume effect (captured stale `resumeStream` during rapid session
switches)
**Backend (`stream_registry.py`, `routes.py`)**
- Add `disconnect_all_listeners(session_id)` to stream registry —
iterates active listener tasks, cancels any matching the session
- Add `DELETE /sessions/{session_id}/stream` endpoint — auth-protected,
calls `disconnect_all_listeners`, returns 204
### Why
Reported by multiple team members: when using Autopilot for anything
serious, the frontend loses the SSE connection — particularly when
switching between conversations. The backend completes fine (refreshing
shows full output), but the UI gets stuck showing "running". This is the
worst UX bug we have right now because real users will never know to
refresh.
### How to test
1. Start a long-running autopilot task (e.g., "build a snake game")
2. While it's streaming, switch to a different chat session
3. Switch back — the UI should correctly show the completed output or
resume the stream
4. Verify no "stuck running" state
## Test plan
- [ ] Manual: switch sessions during active stream — no stuck "running"
state
- [ ] Manual: background tab for >30s during stream, return — wake
re-sync works
- [ ] Manual: trigger reconnect (kill network briefly) — reconnects to
correct session
- [ ] Verify: `pnpm lint`, `pnpm types`, `poetry run lint` all pass
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: majdyz <zamil.majdy@agpt.co>
## Why
Users have different task complexity needs. Sonnet is fast and cheap for
most queries; Opus is more capable for hard reasoning tasks. Exposing
this as a simple toggle gives users control without requiring
infrastructure complexity.
Opus costs 5× more than Sonnet per Anthropic pricing ($15/$75 vs $3/$15
per M tokens). Rather than adding a separate entitlement gate, the
rate-limit multiplier (5×) ensures Opus turns deplete the daily/weekly
quota proportionally faster — users self-limit via their existing
budget.
## What
- **Standard/Advanced model toggle** in the chat input toolbar (sky-blue
star icon, label only when active — matches the simulation
DryRunToggleButton pattern but visually distinct)
- **`CopilotLlmModel = Literal["standard", "advanced"]`** —
model-agnostic tier names (not tied to Anthropic model names)
- **Backend model resolution**: `"advanced"` → `claude-opus-4-6`,
`"standard"` → `config.model` (currently Sonnet)
- **Rate-limit multiplier**: Opus turns count as 5× in Redis token
counters (daily + weekly limits). Does **not** affect `PlatformCostLog`
or `cost_usd` — those use real API-reported values
- **localStorage persistence** via `Key.COPILOT_MODEL` so preference
survives page refresh
- **`claude_agent_max_budget_usd`** reduced from $15 to $10
## How
### Backend
- `CopilotLlmModel` type added to `config.py`, imported in
routes/executor/service
- `stream_chat_completion_sdk` accepts `model: CopilotLlmModel | None`
- Model tier resolved early in the SDK path; `_normalize_model_name`
strips the OpenRouter provider prefix
- `model_cost_multiplier` (1.0 or 5.0) computed from final resolved
model name, passed to `persist_and_record_usage` → `record_token_usage`
(Redis only)
- No separate LD flag needed — rate limit is the gate
### Frontend
- `ModelToggleButton` component: sky-blue, star icon, "Advanced" label
when active
- `copilotModel` state in `useCopilotUIStore` with localStorage
hydration
- `copilotModelRef` pattern in `useCopilotStream` (avoids recreating
`DefaultChatTransport`)
- Toggle gated behind `showModeToggle && !isStreaming` in `ChatInput`
## Checklist
- [x] Tests added/updated (ModelToggleButton.test.tsx,
service_helpers_test.py, token_tracking_test.py)
- [x] Rate-limit multiplier only affects Redis counters, not cost
tracking
- [x] No new LD flag needed
## Why
When the agent generator LLM builds a graph, it may need to look up
schema details for graph-only blocks like `AgentInputBlock`,
`AgentOutputBlock`, or `OrchestratorBlock`. These blocks are correctly
hidden from regular CoPilot `find_block` results (they can't run
standalone), but that same filter was also preventing the LLM from
discovering them when composing an agent graph.
## What
Added a `for_agent_generation: bool = False` parameter to
`FindBlockTool`.
## How
- `for_agent_generation=false` (default): existing behaviour unchanged —
graph-only blocks are filtered from both UUID lookups and text search
results.
- `for_agent_generation=true`: bypasses `COPILOT_EXCLUDED_BLOCK_TYPES` /
`COPILOT_EXCLUDED_BLOCK_IDS` so the LLM can find and inspect schemas for
INPUT, OUTPUT, ORCHESTRATOR, WEBHOOK, etc. blocks when building agent
JSON.
- MCP_TOOL blocks are still excluded even with
`for_agent_generation=true` (they go through `run_mcp_tool`, not
`find_block`).
## Checklist
- [x] No new dependencies
- [x] Backward compatible (default `false` preserves existing behaviour)
- [x] No frontend changes
## Why
OpenRouter occasionally returns `null` (not `0`) for
`cache_read_input_tokens` and `cache_creation_input_tokens` on the
initial streaming event, before real token counts are available.
Python's `dict.get(key, 0)` only falls back to `0` when the key is
**missing** — when the key exists with a `null` value, `.get(key, 0)`
returns `None`. This causes `TypeError: unsupported operand type(s) for
+=: 'int' and 'NoneType'` in the usage accumulator on the first
streaming chunk from OpenRouter models.
## What
- Replace `.get(key, 0)` with `.get(key) or 0` for all four token fields
in `_run_stream_attempt`
- Add `TestTokenUsageNullSafety` unit tests in `service_helpers_test.py`
## How
Minimal targeted fix — only the four `+=` accumulation lines changed. No
behaviour change for Anthropic-native models (they never emit null
values).
## Checklist
- [x] Tests cover null event, real event, absent keys, and multi-turn
accumulation
- [x] No behaviour change for Anthropic-native models
- [x] No API changes
### Why / What / How
**Why:** The OrchestratorBlock in agent mode makes multiple LLM calls in
a single node execution (one per iteration of the tool-calling loop),
but the executor was only charging the user once per run via
`_charge_usage`. Tools spawned by the orchestrator also bypassed
`_charge_usage` entirely — they execute via `on_node_execution()`
directly without going through the main execution queue, producing free
internal block executions.
**What:**
1. Charge `base_cost * (llm_call_count - 1)` extra credits after the
orchestrator block completes — covers the additional iterations beyond
the first (which is already paid for upfront).
2. Charge user credits for tools executed inside the orchestrator, the
same way queue-driven node executions are charged.
**How:**
**1. Per-iteration LLM charging**
- New `Block.extra_runtime_cost(execution_stats)` virtual method
(default returns `0`)
- `OrchestratorBlock` overrides it to return `max(0, llm_call_count -
1)`
- New `resolve_block_cost` free function in `billing.py` centralises the
block-lookup + cost-calculation pattern (used by both `charge_usage` and
`charge_extra_runtime_cost`)
- New `billing.charge_extra_runtime_cost(node_exec, extra_count)`
function that debits `base_cost * min(extra_count,
_MAX_EXTRA_RUNTIME_COST)` via `spend_credits()`, running synchronously
in a thread-pool worker
- After `_on_node_execution` completes with COMPLETED status,
`on_node_execution` calls `charge_extra_runtime_cost` if
`extra_runtime_cost > 0` and not a dry run
- `InsufficientBalanceError` from post-hoc charging is treated as a
billing leak: logged at ERROR with `billing_leak: True` structured
fields, user is notified via `_handle_insufficient_funds_notif`, but the
run status stays COMPLETED (work already done)
**2. Tool execution charging**
- New public async `ExecutionProcessor.charge_node_usage(node_exec)`
wrapper around `charge_usage` (with `execution_count=0` to avoid
inflating execution-tier counters); also calls `_handle_low_balance`
internally
- `OrchestratorBlock._execute_single_tool_with_manager` calls
`charge_node_usage` after successful tool execution (skipped for dry
runs and failed/cancelled tool runs)
- Tool cost is added to the orchestrator's `extra_cost` so it shows up
in graph stats display
- `InsufficientBalanceError` from tool charging is re-raised (not
downgraded to a tool error) in all three execution paths:
`_execute_single_tool_with_manager`, `_agent_mode_tool_executor`, and
`_execute_tools_sdk_mode`
**3. Billing module extraction**
- All billing logic extracted from `ExecutionProcessor` into
`backend/executor/billing.py` as free functions — keeps `manager.py` and
`service.py` focused on orchestration
- `ExecutionProcessor` retains thin delegation methods
(`charge_node_usage`, `charge_extra_runtime_cost`) for backward
compatibility with blocks that call them
**4. Structured error signalling**
- Tool error detection replaced brittle `text.startswith("Tool execution
failed:")` string check with a structured `_is_error` boolean field on
the tool response dict
### Changes
- `backend/blocks/_base.py`: Add
`Block.extra_runtime_cost(execution_stats) -> int` virtual method
(default `0`)
- `backend/blocks/orchestrator.py`: Override `extra_runtime_cost`; add
tool charging in `_execute_single_tool_with_manager`; add
`InsufficientBalanceError` re-raise carve-outs in all three execution
paths; replace string-prefix error detection with `_is_error` flag
- `backend/executor/billing.py` (new): Free functions
`resolve_block_cost`, `charge_usage`, `charge_extra_runtime_cost`,
`charge_node_usage`, `handle_post_execution_billing`,
`clear_insufficient_funds_notifications` — extracted from
`ExecutionProcessor`
- `backend/executor/manager.py`: Thin delegation to `billing.*`; remove
~500 lines of billing methods from `ExecutionProcessor`
- `backend/data/credit.py`: Update lazy import source from `manager` to
`billing`
- `backend/blocks/test/test_orchestrator.py`: Add `charge_node_usage`
mock + assertion
- `backend/blocks/test/test_orchestrator_dynamic_fields.py`: Add
`charge_node_usage` async mock
- `backend/blocks/test/test_orchestrator_responses_api.py`: Add
`charge_node_usage` async mock
- `backend/blocks/test/test_orchestrator_per_iteration_cost.py`: New
test file — `extra_runtime_cost` hook, `charge_extra_runtime_cost` math
(positive/zero/negative/capped/zero-cost/block-not-found/IBE),
`charge_node_usage` delegation, `on_node_execution` gate conditions
(COMPLETED/FAILED/zero-charges/dry-run/IBE), tool charging guards
(dry-run/failed/cancelled/IBE propagation)
### Checklist
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Run `poetry run pytest
backend/blocks/test/test_orchestrator_per_iteration_cost.py`
- [ ] Verify on dev: an OrchestratorBlock run with
`agent_mode_max_iterations=5` and 5 actual iterations is charged 5x the
base cost
- [ ] Verify tool executions inside the orchestrator are charged
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: majdyz <majdy.zamil@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: majdyz <majdyz@users.noreply.github.com>
## Why
During a live production session, the copilot lost all conversation
context mid-session. The model stated \"I don't see any implementation
plan in our conversation\" despite 9 prior turns of context. Three
compounding bugs:
**Bug 1 — Self-perpetuating upload gate:** When `restore_cli_session`
fails on a T2+ turn, `state.use_resume=False`. The old gate `and (not
has_history or state.use_resume)` then skips the CLI session upload —
even though the T1 file may exist. Each turn without `use_resume` skips
upload → next turn can't restore → also skips → etc.
**Bug 2 — Blunt message-count cap on retries:** On `prompt-too-long`,
`_reduce_context` retried 3× but rebuilt the same oversized query each
time (transcript was empty, so all 3 attempts were identical). The
`max_fallback_messages` count-cap was a blunt instrument — it threw away
middle turns blindly instead of letting the compressor summarize
intelligently.
**Bug 3 — Gap-empty path returned zero context:** When a transcript
exists but no `--resume` (CLI session unavailable), and the gap is empty
(transcript is current), the code fell through to `return
current_message, False` — the model got no history at all.
## What
1. **Remove upload gate** — upload is attempted after every successful
turn; `upload_cli_session` silently skips when the file is absent.
2. **`transcript_msg_count` set on `cli_restored=False`** — enables the
gap path on the very next turn without waiting for a full upload cycle.
3. **Token-budget compression instead of message-count cap** —
`_reduce_context` now returns `target_tokens` (50K → 15K across
retries). `compress_context` decides what to drop via LLM summarize →
content truncate → middle-out delete → first/last trim. More context
preserved at any budget vs. blindly slicing the list.
4. **Fix gap-empty case** — when transcript is current but `--resume`
unavailable, fall through to full-session compression with the token
budget instead of returning no context.
5. **Transcript seeding after fallback** — after `use_resume=False` with
no stored transcript, compress DB messages to 30K tokens and serialise
as JSONL into `transcript_builder`. Next turn uses the gap path (inject
only new messages) instead of re-compressing full history. Only fires
once per broken session (`not transcript_content` guard).
6. **Seeding guard** — seeding skips when `skip_transcript_upload=True`
(avoids wasted compression work when the result won't be saved).
7. **Structured logging** — INFO/WARNING at every branch of
`_build_query_message` with path variables, context_bytes, compression
results.
## How
**Upload gate** (`sdk/service.py` finally-block): removed `and (not
has_history or state.use_resume)`; added INFO log showing
`use_resume`/`has_history` before upload.
**`transcript_msg_count`**: set from `dl.message_count` in the
`cli_restored=False` branch.
**`_build_query_message`**: `max_fallback_messages: int | None` →
`target_tokens: int | None`; gap-empty case falls through to
full-session compression rather than returning bare message.
**`_reduce_context`**: `_FALLBACK_MSG_LIMITS` → `_RETRY_TARGET_TOKENS =
(50_000, 15_000)`; returns `ReducedContext.target_tokens`.
**`_compress_messages` / `_run_compression`**: both now accept
`target_tokens: int | None` and thread it through to `compress_context`.
**Seeding block**: added `not skip_transcript_upload` guard; uses
`_SEED_TARGET_TOKENS = 30_000` so the seeded JSONL is always compact
enough to pass `validate_transcript`.
## Checklist
- [x] `poetry run format` passes
- [x] No new lint errors introduced (pre-existing pyright errors
unrelated)
- [x] Tests added for `attempt` parameter and `target_tokens` in
`_reduce_context`
- Use explicit except clauses for VirusDetectedError and VirusScanError
in CoPilot WriteWorkspaceFileTool instead of isinstance check
- Log VirusDetectedError as security event (warning level)
- Fix formatBytes rounding at unit boundaries (1024 KB → 1.0 MB)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace custom fetch with generated useGetWorkspaceStorageUsage hook
and StorageUsageResponse type from Orval-generated API client.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add file storage bar to the CoPilot usage limits popover showing
used/limit bytes, percentage, and file count. Fetches from the
existing GET /workspace/storage/usage endpoint.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move duplicate-path check before virus scan for non-overwrite writes
(avoids slow scan on already-doomed requests)
- Add VirusScanError logging in CoPilot WriteWorkspaceFileTool so
infrastructure failures (ClamAV outage) are logged at ERROR level
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix VirusScanError (ClamAV outage) returning 409 instead of 500
- Fix duplicate virus scan in CoPilot WriteWorkspaceFileTool
- Fix overwrite quota miscalculation (subtract old file size from usage)
- Add parametrized tests for get_workspace_storage_limit_bytes (all tiers)
- Add test for quota rejection path in WorkspaceManager.write_file()
- Add test for 80% warning log
- Add test for VirusScanError → 500 mapping
- Add test for get_storage_usage endpoint with tier-based limit
- Extract projected_usage variable, remove redundant truthiness guard
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the global `max_workspace_storage_mb` config with per-tier limits:
- FREE: 250 MB, PRO: 1 GB, BUSINESS: 5 GB, ENTERPRISE: 15 GB
Move quota enforcement into WorkspaceManager.write_file() so both REST
uploads and CoPilot agent writes are covered (previously only REST had it).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### Why / What / How
This PR simplifies frontend PR validation to one Playwright E2E suite,
moves redundant page-level browser coverage into Vitest integration
tests, and switches Playwright auth to deterministic seeded QA accounts.
It also folds in the follow-up fixes that came out of review and CI:
lint cleanup, CodeQL feedback, PR-local type regressions, and the flaky
Library run helper.
The approach is:
- keep Playwright focused on real browser and cross-page flows that
integration tests cannot prove well
- keep page-level render and mocked API behavior in Vitest
- remove the old PR-vs-full Playwright split from CI and run one
deterministic PR suite instead
- seed reusable auth states for fixed QA users so the browser suite is
less flaky and faster to bootstrap
### Changes 🏗️
- Removed the workflow indirection that selected different Playwright
suites for PRs vs other events
- Standardized frontend CI on a single command: `pnpm test:e2e:no-build`
- Consolidated the PR-gating Playwright suite around these happy-path
specs:
- `auth-happy-path.spec.ts`
- `settings-happy-path.spec.ts`
- `api-keys-happy-path.spec.ts`
- `builder-happy-path.spec.ts`
- `library-happy-path.spec.ts`
- `marketplace-happy-path.spec.ts`
- `publish-happy-path.spec.ts`
- `copilot-happy-path.spec.ts`
- Added the missing browser-only confidence checks to the PR suite:
- settings persistence across reload and re-login
- API key create, copy, and revoke
- schedule `Run now` from Library
- activity dropdown visibility for a real run
- creator dashboard verification after publish submission
- Increased Playwright CI workers from `6` to `8`
- Migrated redundant page-level browser coverage into Vitest
integration/unit tests where appropriate, including marketplace,
profile, settings, API keys, signup behavior, agent dashboard row
behavior, agent activity, and utility/auth helpers
- Seeded deterministic Playwright QA users in
`backend/test/e2e_test_data.py` and reused auth states from
`frontend/src/tests/credentials/`
- Fixed CodeQL insecure randomness feedback by replacing insecure
randomness in test auth utilities
- Fixed frontend lint issues in marketplace image rendering
- Fixed PR-local type regressions introduced during test migration
- Stabilized the Library E2E run helper to support the current Library
action states: `Setup your task`, `New task`, `Rerun task`, and `Run
now`
- Removed obsolete Playwright specs and the temporary migration planning
doc once the consolidation was complete
- Reverted unintended non-test backend source changes; only backend test
fixture changes remain in scope
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `pnpm lint`
- [x] `pnpm types`
- [x] `pnpm test:unit`
- [x] `pnpm exec playwright test --list`
- [x] `pnpm test:e2e:no-build` locally
- [ ] PR CI green after the latest push
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
Notes:
- Current local Playwright run on this branch: `28 passed`, `0 flaky`,
`0 retries`, `3m 25s`.
- Latest Codecov report on this PR showed overall coverage `63.14% ->
63.61%` (`+0.47%`), with frontend coverage up `+2.32%` and frontend E2E
coverage up `+2.10%`.
- The backend change in this PR is limited to deterministic E2E test
data setup in `backend/test/e2e_test_data.py`.
- Playwright retries remain enabled in CI; this branch does not add
fail-on-flaky behavior.
---------
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
Co-authored-by: Zamil Majdy <majdy.zamil@gmail.com>
## Summary
- **Token breakdown in provider table**: Added separate Input Tokens and
Output Tokens columns to the By Provider table, making it easy to see
whether costs are driven by large contexts (input) or verbose
responses/thinking (output)
- **New summary cards (8 total)**: Added Avg Cost/Request, Avg Input
Tokens, Avg Output Tokens, and Total Tokens (in/out split) cards plus
P50/P75/P95/P99 cost percentile cards at the top of the dashboard for
at-a-glance cost analysis
- **Cost distribution histogram**: Added a cost distribution section
showing request count across configurable price buckets ($0–0.50,
$0.50–1, $1–2, $2–5, $5–10, $10+)
- **Per-user avg cost**: Added Avg Cost/Req column to the By User table
to identify users with unusually expensive requests
- **Backend aggregations**: Extended `PlatformCostDashboard` model with
`total_input_tokens`, `total_output_tokens`,
`avg_input_tokens_per_request`, `avg_output_tokens_per_request`,
`avg_cost_microdollars_per_request`,
`cost_p50/p75/p95/p99_microdollars`, and `cost_buckets` fields
- **Correct denominators**: Avg cost uses cost-bearing requests only;
avg token stats use token-bearing requests only — no artificial dilution
from non-cost/non-token rows
## Test plan
- [x] Verify the admin cost dashboard loads without errors at
`/admin/platform-costs`
- [x] Check that the new summary cards display correct values
- [x] Verify Input/Output Tokens columns appear in the By Provider table
- [x] Verify Avg Cost/Req column appears in the By User table
- [x] Confirm existing functionality (filters, export, rate overrides)
still works
- [x] Verify backward compatibility — new fields have defaults so old
API responses still work
## Why
`src/app/api/__generated__/` is listed in `.gitignore` but 4 model files
were committed before that rule existed, so git kept tracking them and
they showed up in every PR that touched the API schema.
## What
Run `git rm --cached` on all 4 tracked files so the existing gitignore
rule takes effect. No gitignore content changes needed — the rule was
already correct.
## How
The `check API types` CI job only diffs `openapi.json` against the
backend's exported schema — it does not diff the generated TypeScript
models. So removing these from tracking does not break any CI check.
After this merges, `pnpm generate:api` output will be gitignored
everywhere and future API-touching PRs won't include generated model
diffs.
## Summary
- Use `SystemPromptPreset` with `exclude_dynamic_sections=True` in the
SDK path so the Claude Code default prompt serves as a cacheable prefix
shared across all users, reducing input token cost by ~90%
- Add `claude_agent_cross_user_prompt_cache` config field (default
`True`) to make this configurable, with fallback to raw string when
disabled
- Extract `_build_system_prompt_value()` helper for testability, with
`_SystemPromptPreset` TypedDict for proper type annotation
> **Depends on #12747** — requires SDK >=0.1.58 which adds
`SystemPromptPreset` with `exclude_dynamic_sections`. Must be merged
after #12747.
## Changes
- **`config.py`**: New `claude_agent_cross_user_prompt_cache: bool =
True` field on `ChatConfig`
- **`sdk/service.py`**: `_SystemPromptPreset` TypedDict for type safety;
`_build_system_prompt_value()` helper that constructs the preset dict or
returns the raw string; call site uses the helper
- **`sdk/service_test.py`**: Tests exercise the production
`_build_system_prompt_value()` helper directly — verifying preset dict
structure (enabled), raw string fallback (disabled), and default config
value
## How it works
The Claude Code CLI supports `SystemPromptPreset` which uses the
built-in Claude Code default prompt as a static prefix. By setting
`exclude_dynamic_sections=True`, per-user dynamic sections (working dir,
git status, auto-memory) are stripped from that prefix so it stays
identical across users and benefits from Anthropic's prompt caching. Our
custom prompt (tool notes, supplements, graphiti context) is appended
after the cacheable prefix.
## Test plan
- [x] CI passes (formatting, linting, unit tests)
- [x] Verify `_build_system_prompt_value()` returns correct preset dict
when enabled
- [x] Verify fallback to raw string when
`CHAT_CLAUDE_AGENT_CROSS_USER_PROMPT_CACHE=false`
## Why
The baseline copilot path (OpenAI-compatible / OpenRouter) did not
record any cost when the `x-total-cost` response header was absent, even
though token counts were always available. The admin cost dashboard also
lacked cache token columns.
## What
- **`x-total-cost` header extraction**: Reads the OpenRouter cost header
per LLM call in the `finally` block (so cost is captured even when the
stream errors mid-way). Accumulated across multi-round tool-calling
turns.
- **Cache token extraction**: Extracts
`prompt_tokens_details.cached_tokens` and `cache_creation_input_tokens`
from streaming usage chunks and passes
`cache_read_tokens`/`cache_creation_tokens` through to
`persist_and_record_usage` for storage in `PlatformCostLog`.
- **Dashboard cache token display**: Adds cache read/write columns to
the Raw Logs and By User tables on the admin platform costs dashboard.
Adds `total_cache_read_tokens` and `total_cache_creation_tokens` to
`UserCostSummary`.
- **No cost estimation**: When `x-total-cost` is absent, `cost_usd` is
left as `None` and `persist_and_record_usage` records the entry under
`tracking_type="tokens"`. Token-based cost estimation was removed — the
platform dashboard already handles per-token cost display, and estimates
would introduce inaccuracy in the reported figures.
## How
- In `_baseline_llm_caller`: extract the `x-total-cost` header in the
`finally` block; accumulate to `state.cost_usd`.
- In `_BaselineStreamState`: add `turn_cache_read_tokens` /
`turn_cache_creation_tokens` counters, populated from streaming usage
chunks.
- In `persist_and_record_usage` / `record_cost_log`: pass through
`cache_read_tokens` and `cache_creation_tokens` to `PlatformCostEntry`.
- Frontend: add `total_cache_read_tokens` /
`total_cache_creation_tokens` fields to `UserCostSummary` and render
them as columns in the cost dashboard.
## Test plan
- [x] Verify baseline copilot sessions log cost when `x-total-cost`
header is present
- [x] Verify `cost_usd` stays `None` and token count is logged when
header is absent
- [x] Verify cache tokens appear in the dashboard logs table for
sessions using prompt caching
- [x] Verify the By User tab shows Cache Read and Cache Write columns
- [x] Unit tests: `test_cost_usd_extracted_from_response_header`,
`test_cost_usd_remains_none_when_header_missing`,
`test_cache_tokens_extracted_from_usage_details`
### Why / What / How
**Why:** The Claude Agent SDK's built-in Write and Edit tools have no
defence against output-token truncation. When the LLM generates a large
`content` or `new_string` argument, the API truncates the response
mid-JSON, causing Ajv to reject it with the opaque `"'file_path' is a
required property"` error. The user's work is silently lost, and
retrying with the same approach loops infinitely.
**What:** Replaces the SDK's built-in Write and Edit tools with unified
MCP equivalents that detect truncation and return actionable recovery
guidance. Adds a new `read_file` MCP tool with offset/limit pagination.
Consolidates all file-tool handlers into a single module
(`e2b_file_tools.py`) covering both E2B (sandbox) and non-E2B (local SDK
working directory) modes.
**How:**
- `file_path` is placed first in every JSON schema so truncation is more
likely to preserve the path
- `"required"` is intentionally omitted from all MCP schemas so the MCP
SDK delivers empty/truncated args to the handler instead of rejecting
them with an opaque error
- Handlers detect two truncation patterns: complete (`{}`) and partial
(other fields present but `file_path` missing), returning actionable
error messages in both cases
- Edit uses a per-path `asyncio.Lock` (keyed by resolved absolute path)
to prevent parallel read-modify-write races when MCP tools are
dispatched concurrently
- Both E2B and non-E2B paths validate via `is_allowed_local_path()` /
`is_within_allowed_dirs()` to block directory traversal
- The SDK built-in Write and Edit are added to `SDK_DISALLOWED_TOOLS`;
the SDK built-in Read remains allowed only for workspace-scoped paths
(tool-results/tool-outputs) via `WORKSPACE_SCOPED_TOOLS`
- E2B write/edit tools are registered with `readOnlyHint=False`
(`_MUTATING_ANNOTATION`) to prevent parallel dispatch
- `bridge_to_sandbox` copies host-side tool-result files into the E2B
sandbox on read so `bash_exec` can process them
### Changes 🏗️
- **`e2b_file_tools.py`** — unified file-tool handlers for Write, Read
(`read_file`), Edit, Glob, Grep covering both E2B and non-E2B modes;
per-path edit locking; truncation detection; sandbox symlink-escape
check; `bridge_to_sandbox` for SDK→E2B file bridging
- **`tool_adapter.py`** — registers unified Write/Edit/read_file MCP
tools (non-E2B only); adds `Read` tool for workspace-scoped SDK-internal
reads (both modes); E2B tools use `_MUTATING_ANNOTATION`;
`get_copilot_tool_names` / `get_sdk_disallowed_tools` updated for both
modes
- **`security_hooks.py`** — `WORKSPACE_SCOPED_TOOLS` checked before
`BLOCKED_TOOLS` so SDK internal Read is allowed on tool-results paths;
Write/Edit removed from workspace scope
- **`prompting.py`** — improved wording for large-file truncation
warning
- **`e2b_file_tools_test.py`** — comprehensive tests for non-E2B
Write/Read/Edit (path validation, truncation detection, offset/limit,
binary rejection, schema validation); E2B sandbox symlink-escape,
`bridge_to_sandbox`, and `_sandbox_write` tests
- **`security_hooks_test.py`** — updated tests for revised tool-blocking
and workspace-scoped Read behaviour
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Read: normal read, offset/limit, file not found, path traversal
blocked, binary file handling, truncation detection
- [x] Edit: normal edit, old_string not found, old_string not unique,
replace_all, partial truncation, path traversal blocked
- [x] Write: existing tests unchanged; truncation detection, path
validation, large-content warning
- [x] Schema validation: file_path first, required fields intentionally
absent
- [x] CLI built-in Write and Edit are in `SDK_DISALLOWED_TOOLS`; Read is
workspace-scoped only
- [x] E2B write/edit use `_MUTATING_ANNOTATION` (not parallel)
- [x] `black`, `ruff`, `pyright` pass on all modified files
- [ ] CI pipeline passes
## Summary
- Hide the mode toggle button while streaming (instead of disabling it)
to avoid confusing partial-toggle UI
- Remove localStorage mode persistence — mode is now transient in-memory
state only (no stale overrides across sessions)
- The copilot mode indicator now correctly reflects the active session's
mode because it reads from Zustand store which is updated on session
switch
## Changes
- `ChatInput.tsx` — hide `<ModeToggleButton>` when `isStreaming` instead
of passing `isStreaming` prop and showing a disabled button
- `ModeToggleButton.tsx` — remove `isStreaming` prop, disabled state,
and streaming-specific tooltip
- `store.ts` — remove localStorage read/write for `copilotMode`; mode
now defaults to `extended_thinking` and resets on page load
- `local-storage.ts` — keep `COPILOT_MODE` enum entry for backward
compatibility; remove unused `COPILOT_SESSION_MODES`
- `store.test.ts` — update tests to assert mode is NOT persisted to
localStorage
- `ChatInput.test.tsx` / `ModeToggleButton.stories.tsx` — update to
match hide-not-disable behavior
## Test plan
- [x] Create a session in fast mode, create another in extended_thinking
mode
- [x] Switch between sessions and verify the mode indicator updates
correctly
- [x] Mode toggle is hidden (not disabled) while a response is streaming
- [x] Refreshing the page resets mode to extended_thinking (no stale
localStorage override)
### Why / What / How
**Why:** AutoPilot sessions were silently dying with no response. Root
cause: `AsyncSandbox.create()` in the E2B SDK uses
`httpx.AsyncClient(timeout=None)` — infinite wait. When the E2B API
stalled during sandbox provisioning, executor goroutines hung
indefinitely. After 1h42m the RabbitMQ consumer timeout
(`COPILOT_CONSUMER_TIMEOUT_SECONDS = 3600`) killed the pod and all
in-flight sessions were orphaned — user sees no response, no error.
**What:**
1. Added per-attempt timeout + retry loop to `AsyncSandbox.create()`
calls in `e2b_sandbox.py` — 30s/attempt × 3 retries with exponential
backoff (~93s worst case vs infinite)
2. Added recovery enqueue in `AutoPilotBlock.run()` — on unexpected
failure, re-enqueues the session to RabbitMQ so a fresh executor pod
picks it up on the next turn
3. Added `_is_deliberate_block()` guard so recursion-limit errors are
not re-enqueued (they are expected terminations)
4. Unit tests for both new mechanisms
**How:**
- `asyncio.wait_for(AsyncSandbox.create(), timeout=30)` wraps each
attempt; `TimeoutError` triggers retry
- Redis creation sentinel TTL bumped 60→120s to cover the full retry
window (prevents concurrent callers from seeing stale sentinel)
- `_enqueue_for_recovery` calls `enqueue_copilot_turn()` with the
original prompt so the session resumes where it left off; dry-run
sessions are skipped; enqueue failures are logged but never mask the
original error
- `CancelledError` is re-raised after yielding the error output
(cooperative cancellation)
### Changes 🏗️
**`backend/copilot/tools/e2b_sandbox.py`**
- Added `_SANDBOX_CREATE_TIMEOUT_SECONDS = 30`,
`_SANDBOX_CREATE_MAX_RETRIES = 3`
- Bumped `_CREATION_LOCK_TTL` 60 → 120s
- Replaced bare `AsyncSandbox.create()` with `asyncio.wait_for` + retry
loop
**`backend/blocks/autopilot.py`**
- Added `_is_deliberate_block(exc)` — returns True for recursion-limit
RuntimeErrors
- Added `_enqueue_for_recovery(session_id, user_id, message, dry_run)` —
re-enqueues to RabbitMQ; no-ops on dry_run
- Exception handler in `run()` calls `_enqueue_for_recovery` for
transient failures; inner try/except prevents enqueue failure from
masking the original error
**`backend/blocks/test/test_autopilot.py`**
- `TestIsDeliberateBlock` — 4 unit tests for `_is_deliberate_block`
- `TestRecoveryEnqueue` — 5 tests: transient error triggers enqueue,
recursion limit skips, dry_run passes flag through, enqueue failure
doesn't mask original error, `ctx.dry_run` is OR-ed in
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `poetry run pytest backend/blocks/test/test_autopilot.py -xvs` —
24/24 pass
- [x] Verified retry logic constants: 30s × 3 retries + 1s + 2s = 93s
worst case, sentinel TTL 120s covers it
- [x] Verified `_enqueue_for_recovery` is no-op for dry_run=True (no
RabbitMQ publish)
- [x] Verified `CancelledError` re-raises after yield
## Why
The Claude CLI 2.1.97 (bundled in `claude-agent-sdk 0.1.58`) changed the
`--resume` flag to accept a **session UUID**, not a file path. Our
service was incorrectly passing a temp file path (from
`write_transcript_to_tempfile`), causing the CLI subprocess to crash
with exit code 1 on every T2+ message — breaking all multi-turn CoPilot
conversations.
Additionally, using a file-per-pod approach meant pod affinity was
required for `--resume` to work (the file only existed on the pod that
handled T1).
## What
- Add `upload_cli_session()` to `transcript.py`: after each turn, upload
the CLI's native session JSONL (at
`{projects_base}/{encoded_cwd}/{session_id}.jsonl`) to remote storage
- Add `restore_cli_session()` to `transcript.py`: before T2+, download
and restore the CLI native session file to the expected path
- Pass `--session-id {app_uuid}` via `ClaudeAgentOptions` so the CLI
uses the app session UUID as its session ID → predictable file path
- On T2+: call `restore_cli_session()` and if successful, pass `--resume
{session_uuid}` (UUID, not file path)
- Remove `write_transcript_to_tempfile` from the resume path in
service.py (it only exists in transcript.py for compaction use)
- Keep DB reconstruction as last-resort fallback (populates builder
state only, no `--resume`)
- Compaction retry path now runs without `--resume` (compacted content
cannot be written in CLI native format)
## How
**Normal multi-turn flow (fixed):**
1. T1: SDK runs with `--session-id {app_uuid}` → CLI writes session to
predictable path
2. T1 finally: `upload_cli_session()` uploads native session to storage
(GCS/local)
3. T2+: `restore_cli_session()` downloads and writes the native session
back to disk
4. T2+: `--resume {app_uuid}` → CLI reads the restored session → full
context preserved
**Cross-pod benefit:**
The native session file is now in remote storage, so any pod can restore
it before a turn. Pod affinity for CoPilot is no longer required.
**Backward compatibility:**
- First turn: no native session in storage → runs without `--resume`
(same as before)
- If `restore_cli_session` fails: falls back gracefully to no
`--resume`, logs a warning
- DB reconstruction still available as last resort when no transcript
exists at all
## Checklist
- [x] Tests updated (service_helpers_test, retry_scenarios_test,
transcript_test all pass)
- [x] `poetry run ruff check` clean
- [x] `poetry run black --check` clean
- [x] `poetry run pyright` 0 errors on changed files
## Why
`platform_cost_test.py` had an extra blank line between
`TestUsdToMicrodollars.test_large_value` and `class TestMaskEmail`,
causing black to flag it. This failure was appearing in the CI merge
checks of unrelated PRs that target `dev`.
## What
Remove the extra blank line (3 → 2) to satisfy black's formatting rules.
## How
Single-character diff — no logic changes.
## Why
When the AutoPilot copilot needed to connect credentials for an existing
agent, it was routing users to the Builder — flagged by @Pwuts in [the
AutoPilot Credential UX
thread](https://discord.com/channels/1126875755960336515/1492203735034892471/1492204936056930304).
Two root causes:
1. **Credential race-condition on the run/schedule path.**
`_check_prerequisites` only catches missing creds *before* the
executor/scheduler call. If creds are deleted (or drift) between the
prereq check and the actual call, the executor/scheduler raises
`GraphValidationError`. The tool returned a plain `ErrorResponse`, and
the LLM fell back to `create_agent`/`edit_agent` — whose
`AgentSavedResponse.agent_page_link=/build?flowID=...` is exactly the
Builder redirect the user saw.
2. **`GraphValidationError.node_errors` lost over RPC.** The scheduler
call goes through `get_scheduler_client()` (RPC). The server-side error
handler only preserved `exc.args` — the structured `node_errors` mapping
was stripped, making it impossible for the copilot to distinguish
credential failures from other validation errors on the schedule path.
## What
- **Race-condition handling for both run and schedule paths.**
`_run_agent` and `_schedule_agent` now catch `GraphValidationError`,
detect credential-flavoured node errors, and rebuild the inline
`SetupRequirementsResponse` so the credential setup card renders inline
without leaving chat. Mixed credential+structural errors fall through to
plain `ErrorResponse` so structural errors aren't hidden.
- **`GraphValidationError` round-trips over RPC.** `service.py` now
packs `node_errors` into a typed `RemoteCallExtras` field on
`RemoteCallError`, and the client-side handler re-threads it back into
the reconstructed exception.
- **Shared credential-error matcher.** The credential-string matching
logic is extracted to `is_credential_validation_error_message()` in
`backend/executor/utils.py`, backed by `CRED_ERR_*` module-level
constants that are referenced at both raise sites and in the matcher —
so adding a new credential error string doesn't silently break the
copilot fallback.
- **Tool-description guardrails.** `create_agent` and `edit_agent`
descriptions now explicitly say "Do NOT use this to connect credentials
— call run_agent instead." `agent_generation_guide.md` has the same
guardrail for the agent-building context.
## How
- `backend/copilot/tools/run_agent.py`: new
`_build_setup_requirements_from_validation_error()` helper; try/except
around `add_graph_execution` and `add_execution_schedule` in the
respective `_run_agent`/`_schedule_agent` paths; race-condition warnings
logged.
- `backend/executor/utils.py`: `CRED_ERR_*` constants +
`_CREDENTIAL_ERROR_MARKERS` typed tuple + public
`is_credential_validation_error_message()` exported; old private
`_is_credential_error` lambda replaced.
- `backend/util/service.py`: `RemoteCallExtras` Pydantic model with
`node_errors: Optional[dict[str, dict[str, str]]]`; server handler packs
it for `GraphValidationError`; client handler re-threads it;
`exception_class is GraphValidationError` identity check (not
`issubclass`).
- `backend/copilot/tools/create_agent.py`, `edit_agent.py`: added
credential-routing guardrail to tool descriptions.
- `backend/copilot/sdk/agent_generation_guide.md`: added
credential-routing guardrail.
## Test plan
- [x] Unit tests for `is_credential_validation_error_message` (all four
error templates matched, case-insensitive, non-credential messages
rejected).
- [x] Parity tests in `utils_test.py` that pin all `CRED_ERR_*`
constants against `is_credential_validation_error_message` — drift when
a new credential error is added fails immediately.
- [x] Unit tests for `_build_setup_requirements_from_validation_error`:
credential error → `SetupRequirementsResponse`; non-credential error →
`None`; mixed errors → `None`.
- [x] E2E test for `_schedule_agent` race path:
`get_scheduler_client().add_execution_schedule` mocked to raise
credential `GraphValidationError` → response is `setup_requirements`,
not generic error.
- [x] E2E test for `_run_agent` race path:
`execution_utils.add_graph_execution` mocked with `AsyncMock` to raise
credential `GraphValidationError` → response is `setup_requirements`.
- [x] `RemoteCallError` round-trip tests in `service_test.py`: server
handler packs `node_errors` into `extras`; client handler unpacks; full
round-trip preserves `node_errors`.
- [x] Backwards-compat test: old `RemoteCallError` without `extras`
still deserializes to `GraphValidationError` with empty `node_errors`.
## Summary
- Extract `ThinkingStripper` from `baseline/service.py` into a shared
`copilot/thinking_stripper.py` module
- Apply thinking-tag stripping to the SDK streaming path
(`_dispatch_response`) so `<internal_reasoning>` and `<thinking>` tags
emitted by non-extended-thinking models (e.g. Sonnet) are stripped
before reaching the SSE client
- Flush any buffered text from the stripper at stream end so no content
is lost
- Add unit tests for the shared `ThinkingStripper` and integration tests
for the SDK dispatch path
## Problem
When using Claude Sonnet (which doesn't have extended thinking), the
model sometimes outputs `<internal_reasoning>...</internal_reasoning>`
tags as visible text in the response stream. The baseline path already
stripped these, but the SDK path did not.
## Test plan
- [ ] CI passes (unit tests for ThinkingStripper and SDK dispatch
stripping)
- [ ] Manual test: send a message via Sonnet and verify no
`<internal_reasoning>` tags appear in the response
## Summary
- Fixes duplicate message content shown in CoPilot during SSE
reconnections (page visibility change, network hiccups, wake-resync)
- The `resume_session_stream` backend always replays from `"0-0"`
(beginning of Redis stream), and replayed `UIMessage` objects get new
generated IDs from `useChat`, bypassing the old adjacent-only content
dedup
- Extends `deduplicateMessages` to track all seen `role +
preceding-user-context + content` fingerprints globally, catching
replayed messages regardless of different IDs or position in the list
- Scopes fingerprints by preceding user message text to avoid false
positives when the assistant legitimately gives the same answer to
different prompts
## Test plan
- [ ] Verify new unit tests pass in CI (`helpers.test.ts` - 7 new dedup
test cases)
- [ ] Manual: start a long tool-use session, switch tabs, return - no
duplicate content
- [ ] Manual: refresh page during active session - content loads from DB
without duplicates
- [ ] Manual: ask the same question twice in different turns - both
answers preserved
## Why
We've been pinned at `claude-agent-sdk==0.1.45` (bundled CLI 2.1.63)
since PR #12294 because newer versions had two OpenRouter
incompatibilities:
1. **`tool_reference` content blocks** (CLI 2.1.69+) — OpenRouter's Zod
validation rejects them
2. **`context-management-2025-06-27` beta header** (CLI 2.1.91+) —
OpenRouter returns 400
Both are now resolved:
- **`tool_reference`: Fixed by CLI's built-in proxy detection.** CLI
2.1.70+ detects `ANTHROPIC_BASE_URL` pointing to a non-Anthropic
endpoint and disables `tool_reference` blocks automatically. Verified
working in CLI 2.1.97 — the bare CLI test only XFAILs on the beta
header, NOT on tool_reference.
- **`context-management` beta: Fixed by
`CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1` env var.** Injected via
`build_sdk_env()` for all SDK subprocess calls. Verified in CI.
## What
- Upgrades `claude-agent-sdk` from **0.1.45 → 0.1.58** (bundled CLI
2.1.63 → 2.1.97)
- Injects `CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1` in
`build_sdk_env()` (all modes)
- Adds `claude_agent_cli_path` config override with executable
validation
- Adds `claude_agent_max_thinking_tokens=8192` (was unlimited — 54% of
$14K/5-day spend was thinking tokens at $75/M)
- Lowers `max_budget_usd` from $100 → $15 and `max_turns` from 1000 → 50
### Features unlocked by the upgrade
| Feature | SDK | Impact |
|---|---|---|
| `exclude_dynamic_sections` | 0.1.57 | Cross-user prompt cache hits
(see #12758) |
| `AssistantMessage.usage` per-turn | 0.1.49 | Cost attribution per LLM
call |
| `task_budget` | 0.1.51 | Per-task cost ceiling at SDK level |
| `get_context_usage()` | 0.1.52 | Live context-window monitoring |
| MCP large-tool-result fix | 0.1.55 | No more silent truncation >50K
chars |
| MCP HTTP/SSE buffer leak fix | CLI 2.1.97 | Production memory creep
~50 MB/hr |
| 429 retry exponential backoff | CLI 2.1.97 | Rate-limit recovery (was
burning all retries in ~13s) |
| `--resume` cache miss fix | CLI 2.1.90 | Prompt cache works after
resume |
| SDK session quadratic-write fix | CLI 2.1.90 | No more slowdown on
long sessions |
| `max_thinking_tokens` | 0.1.57 | Cap extended thinking cost |
## How
- `build_sdk_env()` in `env.py` injects the env var unconditionally (all
3 auth modes)
- `service.py` passes `max_thinking_tokens` to `ClaudeAgentOptions`
- `config.py` adds 3 new fields with env var overrides
- Regression tests verify both OpenRouter compat issues are handled
## Test plan
- [x] CI green on all test matrices (3.11, 3.12, 3.13)
- [x] `test_disable_experimental_betas_env_var_strips_headers` passes —
verifies env var strips both patterns
- [x] `test_bare_cli_*` correctly XFAILs — documents the CLI regression
exists
- [x] `test_sdk_exposes_max_thinking_tokens_option` guards the new param
- [x] Config validation tests use real temp executables
### Why
LLM token costs are significant, especially for the copilot feature. The
system prompt and tool definitions are the two largest static components
of every request — caching them dramatically reduces input token costs
(cache reads cost 10% of the base input price).
Previously, user-specific context (business understanding) was embedded
directly in the system prompt, making it unique per user and preventing
cache sharing across users or sessions. Every request paid full price
for the system prompt even when the content was functionally identical.
A secondary security concern was identified during review: because the
LLM is instructed to parse `<user_context>` blocks, a user could type a
literal `<user_context>…</user_context>` tag in any message and
potentially spoof or suppress their own personalisation context. This PR
includes a full defence-in-depth fix for that injection vector on the
first turn (including new users with no stored understanding), plus
GET-endpoint stripping so injected context is never surfaced back to the
client.
### What
- **`copilot/service.py`**: Added `USER_CONTEXT_TAG` constant (shared by
writer and reader). Added `_USER_CONTEXT_ANYWHERE_RE` /
`_USER_CONTEXT_PREFIX_RE` regexes, `format_user_context_prefix`,
`strip_user_context_prefix`, `sanitize_user_supplied_context`, and
`_sanitize_user_context_field` helpers. Replaced the old
`_build_cacheable_system_prompt` / `_build_system_prompt` pair with a
single `_build_system_prompt` that returns `(static_prompt,
understanding)`. Added `inject_user_context` which sanitizes user input,
optionally wraps trusted understanding, and persists the result to DB.
- **`copilot/sdk/service.py`**: On first turn calls
`inject_user_context` before `_build_query_message` so the query sees
the prefixed content. Passes `user_id if not has_history else None` to
avoid redundant DB lookups on subsequent turns.
- **`copilot/baseline/service.py`**: Same pattern —
`inject_user_context` called before transcript append and OpenAI message
list construction; `openai_messages` loop patches the first user entry
after injection.
- **`blocks/llm.py`**: System prompt sent as a structured block with
`cache_control: {"type": "ephemeral"}`. `cache_control` placed on the
last tool in the tool list. Guards against empty/whitespace-only system
blocks (Anthropic rejects them). Fixed `anthropic.omit` →
`anthropic.NOT_GIVEN` sentinel for the no-tools case.
- **`api/features/chat/routes.py`**: Added `_strip_injected_context`
which returns a shallow copy of each message with the server-injected
`<user_context>` prefix stripped before the GET `/sessions/{id}`
response, so the prefix is invisible to the frontend.
- **`copilot/db.py`**: Added defence-in-depth `result > 1` error log in
`update_message_content_by_sequence`. Added authorization note
documenting why a `userId` join is not required.
- **`data/db_manager.py`**: Registered
`update_message_content_by_sequence` on both the sync and async DB
manager clients.
### How it works
**Static system prompt**: The system prompt is now identical for every
user. The LLM is instructed to look for a `<user_context>` block in the
first user message when present, and to greet new users warmly when no
context is provided.
**User context injection**: On the first turn of a new session, the
caller's business understanding is prepended to the user's message as
`<user_context>…</user_context>`. The prefixed content is also persisted
to the DB so resumed sessions and page reloads retain personalisation.
**`<user_context>` tag sanitization (security)**: `inject_user_context`
calls `sanitize_user_supplied_context` unconditionally — even when
`understanding` is `None` — so new users cannot smuggle a
`<user_context>` tag to the LLM on the first turn. Fields from the
stored `BusinessUnderstanding` object are escaped with
`_sanitize_user_context_field` so user-controlled free-text cannot break
out of the trusted block. The GET endpoint strips the injected prefix
before returning message history to the client.
**All-turn sanitization**: `strip_user_context_tags` (a public alias of
`sanitize_user_supplied_context`) is called unconditionally on every
incoming message in both the SDK and baseline paths — before
`maybe_append_user_message` — so `<user_context>` tags typed by a user
on any turn (not just the first) are stripped before reaching the LLM.
Lone unpaired tags (e.g. `<user_context>spoof` without a closing tag)
are also caught by a second-pass `_USER_CONTEXT_LONE_TAG_RE`
substitution. The system prompt explicitly states the tag is
server-injected, only trusted on the first message, and must be ignored
on subsequent turns.
**Cache placement**: Per Anthropic's caching model, placing
`cache_control` on the system prompt block caches everything up to and
including it. Placing `cache_control` on the last tool definition caches
all tool schemas as a single prefix. Both cache points are set so
repeated requests from any user can hit both caches.
**Langfuse compatibility**: `_build_system_prompt` calls
`prompt.compile(users_information="")` so existing Langfuse prompt
templates remain static and cacheable.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verify system prompt no longer contains user-specific information
- [x] Verify `<user_context>` block appears in the first user message on
new sessions
- [x] Verify returning users still receive personalised responses via
user context
- [x] Verify Langfuse-sourced prompts compile correctly with empty
`users_information`
- [x] Verify Anthropic API calls include `cache_control` on system block
and last tool
- [x] Verify user-supplied `<user_context>` tags are stripped on the
first turn (including when understanding is None)
- [x] Verify user-supplied `<user_context>` tags are stripped on all
turns (turn 2+ sanitization via `strip_user_context_tags`)
- [x] Verify lone unpaired `<user_context>` tags (no closing tag) are
also stripped
- [x] Verify GET `/sessions/{id}` does not expose the injected
`<user_context>` prefix to the client
---------
Co-authored-by: majdyz <majdy.zamil@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Why
The platform cost tracking system had several gaps that made the admin
dashboard less accurate and harder to reason about:
**Q: Do we have per-model granularity on the provider page?**
The `model` column was stored in `PlatformCostLog` but the SQL
aggregation grouped only by `(provider, tracking_type)`, so all models
for a given provider collapsed into one row. Now grouped by `(provider,
tracking_type, model)` — each model gets its own row.
**Q: Why does Anthropic show `per_run` for OrchestratorBlock?**
Bug: `OrchestratorBlock._call_llm()` was building `NodeExecutionStats`
with only `input_token_count` and `output_token_count` — it dropped
`resp.provider_cost` entirely. For OpenRouter calls this silently
discarded the `cost_usd`. For the SDK (autopilot) path,
`ResultMessage.total_cost_usd` was never read. When `provider_cost` is
None and token counts are 0 (e.g. SDK error path), `resolve_tracking`
falls through to `per_run`. Fixed by propagating all cost/cache fields.
**Q: Why can't we get `cost_usd` for Anthropic direct API calls?**
The Anthropic Messages API does not return a dollar amount — only token
counts. OpenRouter returns cost via response headers, so it uses
`cost_usd` directly. The Claude Agent SDK *does* compute
`total_cost_usd` internally, so SDK-mode OrchestratorBlock runs now get
`cost_usd` tracking. For direct Anthropic LLM blocks the estimate uses
per-token rates (see cache section below).
**Q: What about labeling by source (autopilot vs block)?**
Already tracked: `block_name` stores `copilot:SDK`, `copilot:Baseline`,
or the actual block name. Visible in the raw logs table. Not added to
the provider group-by (would explode row count); use the logs table
filter instead.
**Q: Is there double-counting between `tokens`, `per_run`, and
`cost_usd`?**
No. `resolve_tracking()` uses a strict preference hierarchy — exactly
one tracking type per execution: `cost_usd` > `tokens` > provider
heuristics > `per_run`. A single execution produces exactly one
`PlatformCostLog` row.
**Q: Should we track Anthropic prompt cache tokens (PR #12725)?**
Yes — PR #12725 adds `cache_control` markers to Anthropic API calls,
which causes the API to return `cache_read_input_tokens` and
`cache_creation_input_tokens` alongside regular `input_tokens`. These
have different billing rates:
- Cache reads: **10%** of base input rate (much cheaper)
- Cache writes: **125%** of base input rate (slightly more expensive,
one-time)
- Uncached input: **100%** of base rate
Without tracking them separately, a flat-rate estimate on
`total_input_tokens` would be wrong in both directions.
## What
- **Per-model provider table**: SQL now groups by `(provider,
tracking_type, model)`. `ProviderCostSummary` and the frontend
`ProviderTable` show a model column.
- **Cache token columns**: New `cacheReadTokens` and
`cacheCreationTokens` columns in `PlatformCostLog` with matching
migration.
- **LLM block cache tracking**: `LLMResponse` captures
`cache_read_input_tokens` / `cache_creation_input_tokens` from Anthropic
responses. `NodeExecutionStats` gains `cache_read_token_count` /
`cache_creation_token_count`. Both propagate to `PlatformCostEntry` and
the DB.
- **Copilot path**: `token_tracking.persist_and_record_usage` now writes
cache tokens as dedicated `PlatformCostEntry` fields (was
metadata-only).
- **OrchestratorBlock bug fix**: `_call_llm()` now includes
`resp.provider_cost`, `resp.cache_read_tokens`,
`resp.cache_creation_tokens` in the stats merge. SDK path captures
`ResultMessage.total_cost_usd` as `provider_cost`.
- **Accurate cost estimation**: `estimateCostForRow` uses
token-type-specific rates for `tokens` rows (uncached=100%, reads=10%,
writes=125% of configured base rate).
## How
`resolve_tracking` priority is unchanged. For Anthropic LLM blocks the
tracking type remains `tokens` (Anthropic API returns no dollar amount).
For OrchestratorBlock in SDK/autopilot mode it now correctly uses
`cost_usd` because the Claude Agent SDK computes and returns
`total_cost_usd`. For OpenRouter through OrchestratorBlock it now
correctly uses `cost_usd` (was silently dropped before).
## Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `ProviderCostSummary` SQL updated
- [x] Cache token fields present in `PlatformCostEntry` and
`PlatformCostLogCreateInput`
- [x] Prisma client regenerated — all type checks pass
- [x] Frontend `helpers.test.ts` updated for new `rateKey` format
- [x] Pre-commit hooks pass (Black, Ruff, isort, tsc, Prisma generate)
### Why / What / How
**Why:** The `ask_question` copilot tool previously only accepted a
single question per invocation. When the LLM needs to ask multiple
clarifying questions simultaneously, it either crams them into one text
field (requiring users to format numbered answers manually) or makes
multiple sequential tool calls (slow and disruptive UX).
**What:** Replace the single `question`/`options`/`keyword` parameters
with a `questions` array parameter so the LLM can ask multiple questions
in one tool call, each rendered as its own input box.
**How:** Simplified the tool to accept only `questions` (array of
question objects). Each item has `question` (required), `options`, and
`keyword`. The frontend `ClarificationQuestionsCard` already supports
rendering multiple questions — no frontend changes needed.
### Changes 🏗️
- `backend/copilot/tools/ask_question.py`: Replaced dual
question/questions schema with single `questions` array. Extracted
parsing into module-level `_parse_questions` and `_parse_one` helpers.
Follows backend code style: early returns, list comprehensions, top-down
ordering, functions under 40 lines.
- `backend/copilot/tools/ask_question_test.py`: Rewritten with 18
focused tests covering happy paths, keyword handling, options filtering,
and invalid input handling.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Run `poetry run pytest backend/copilot/tools/ask_question_test.py`
— all tests pass
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Import get_subscription_price_id in v1.py
- get_subscription_status now calls stripe.Price.retrieve for PRO/BUSINESS
tiers to return actual unit_amount instead of hardcoded zeros
- UI will now show correct monthly costs when LD price IDs are configured
- Fix Button import from __legacy__ to design system in SubscriptionTierSection
- Update subscription status tests to mock the new Stripe price lookup
Drop the dual question/questions schema in favor of a single
`questions` array parameter. This removes ~175 lines of complexity
(the _execute_single path, duplicate params, precedence logic).
Restructured per backend code style rules:
- Top-down ordering: public _execute first, helpers below
- Early return with guard clauses, no deep nesting
- List comprehensions via walrus operator in _parse_questions
- Helpers extracted as module-level functions (not methods)
- Functions under 40 lines each
The frontend ClarificationQuestionsCard already renders arrays of
any length — no UI changes needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add isinstance narrowing in test_execute_multiple_questions_ignores_single_params
to fix Pyright type-check CI failure (reportAttributeAccessIssue).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests that access `result.questions` without first narrowing the type
from `ToolResponseBase` to `ClarificationNeededResponse` cause Pyright
type-check failures. Added `assert isinstance(result,
ClarificationNeededResponse)` before accessing `.questions` in 4 tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When navigating back to a cached session, appliedActionKeys was reset to empty
but messages were preserved. This caused previously applied actions to reappear
as unapplied in the UI, allowing them to be re-applied and creating duplicate
undo entries. Clearing messages unconditionally on navigation ensures the
displayed action buttons always reflect the actual applied state.
- Restore top-level `required: ["question"]` in schema for LLM tool-
calling compatibility; validation handles the questions-only path
- Fix keyword null bug: `item.get("keyword")` returning None now
correctly falls back to `question-{idx}` instead of producing "None"
- Filter empty-string options in _build_question (`str(o).strip()`)
to avoid artifacts like "Email, , Slack"
- Revert session type hint to `ChatSession` to match base class contract
- Add tests for null keyword and empty-string options filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove top-level `required: ["question"]` from schema so the
`questions`-only calling convention is valid for schema-compliant LLMs
- Move logger assignment below all imports (PEP 8 / isort)
- Remove duplicated option filtering in `_execute_single`; let
`_build_question` own that responsibility
- Fix `session` type hint to `ChatSession | None` to match the guard
- Add test for `questions` as non-list type (falls back to single path)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix falsy option filtering: use `if o is not None` instead of `if o`
so valid values like "0" are preserved
- Improve multi-question `message` field: join all questions with ";"
instead of only using the first question's text
- Add logging warnings for skipped invalid items in multi-question path
instead of silently dropping them
- Simplify schema: use `"required": ["question"]` instead of empty
required + anyOf (more LLM-friendly)
- Add missing test cases: session=None, single-item questions array,
duplicate keywords, falsy option values
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ask_question tool previously only accepted a single question per
invocation, forcing the LLM to cram multiple queries into one text box
or make multiple sequential tool calls. This adds a `questions` parameter
(list of question objects) so multiple input fields render at once.
Backward-compatible: the existing `question`/`options`/`keyword` params
still work. When `questions` (plural) is provided, they take precedence.
The frontend ClarificationQuestionsCard already supports rendering
multiple questions — no frontend changes needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests for GET/POST /credits/subscription covering:
- GET returns current tier (PRO, FREE default when None)
- POST FREE skips Stripe when payment disabled
- POST PRO sets tier directly for beta users (payment disabled)
- POST paid tier rejects missing success_url/cancel_url with 422
- POST paid tier creates Stripe Checkout Session and returns URL
- POST FREE with payment enabled cancels active Stripe subscription
- Remove useCallback from changeTier (not needed per project guidelines)
- Block self-service tier changes for ENTERPRISE users (admin-managed)
- Preserve current tier on unrecognized Stripe price_id instead of
defaulting to FREE (prevents accidental downgrades during price migration)
Tests for:
- Unknown/mismatched Stripe price_id defaults to FREE (not early return)
- None from LaunchDarkly price flags defaults to FREE
- BUSINESS tier mapping
- StripeError during cancel_stripe_subscription is logged, not raised
When sync_subscription_from_stripe encounters an unrecognized price_id
(e.g. LD flags unconfigured or price changed), it no longer returns early
leaving the user on a stale tier. Instead it defaults to FREE and logs a
warning, keeping the DB state consistent with Stripe's subscription status.
Also guard against None pro_price/biz_price from LaunchDarkly before
comparison to avoid silent mismatches.
EditAgentTool and RunAgentTool call useCopilotChatActions() which throws
if no provider is in the tree. Wrap the panel content with
CopilotChatActionsProvider wired to sendRawMessage so tool components
can send retry prompts without crashing.
The user message was saved to DB before the <user_context> prefix was added
to session.messages. Subsequent upsert_chat_session calls only append new
messages (slicing by existing_message_count), so the prefixed content was
never written to the DB. On page reload or --resume, the unprefixed version
was loaded, losing personalisation.
Fix: add update_message_content_by_sequence to db.py and call it after
injecting the prefix in both sdk/service.py and baseline/service.py.
Add customer.subscription.created to the sync handler so user tier is
upgraded immediately when the subscription is first created (not just on
subsequent updates/deletions).
dev branch already creates SubscriptionTier enum and subscriptionTier column in
20260326200000_add_rate_limit_tier. Remove duplicate DDL from our migration and
only add SUBSCRIPTION to CreditTransactionType using IF NOT EXISTS guard.
Add cancel_stripe_subscription() which lists and cancels all active Stripe
subscriptions for the customer, preventing continued billing after downgrade.
Call it from update_subscription_tier() when tier == FREE and payment is
enabled. Add two unit tests covering active and empty subscription scenarios.
Add tests for the /logs/export endpoint (success, truncated, filters, auth) and
fix missing import of get_platform_cost_logs_for_export in platform_cost_test.py.
Add upfront 422 validation when upgrading to a paid tier without providing
redirect URLs. Also catch stripe.StripeError alongside ValueError to return
a proper 422 instead of a 500 on Stripe API errors.
Pressing Escape while drafting a message was silently discarding the
user's text. Guard the handler so it only closes the panel when focus is
outside an editable element.
Remove get_subscription_cost (referenced deleted flags SUBSCRIPTION_COST_PRO/BUSINESS).
Subscription pricing is now handled by Stripe. Add GRAPHITI_MEMORY flag from dev.
- Remove `# type: ignore[attr-defined]` suppressors from `set_auto_top_up`
and `set_subscription_tier` — pyright resolves `CachedFunction.cache_delete`
through the import boundary without the suppressor
- Add `max(0, ...)` guard to `get_subscription_cost` to prevent negative
LaunchDarkly flag values from yielding negative costs
- Change `SubscriptionTierRequest.tier` from `str` to
`Literal["FREE", "PRO", "BUSINESS"]` so Pydantic rejects ENTERPRISE and
any unknown tier with a 422 at the schema layer
- Move `SubscriptionTier` and feature-flag imports from local function scope
to module-level in v1.py (top-level imports policy)
- Fix `test_sync_subscription_from_stripe_active` mock to use a proper async
`side_effect` function instead of calling an `AsyncMock` inline
POST /credits/subscription now returns {url} when Stripe checkout is needed.
Redirect user to Stripe on non-empty URL, refresh tier on empty URL (beta/FREE).
Remove credit-based tier validation; Stripe handles payment gating.
- Restore CreditTransactionType to original enum without SUBSCRIPTION
- Restore input/ctx fields in ValidationError schema
These changes were accidentally included from workspace drift; they are
not part of this PR and should come from their own respective PRs.
- Replace type: ignore suppressor with monkeypatch.setattr in AIConditionBlock test
- Replace bare sentry_sdk module with typed API imports in metrics/service/manager
- Fix CostLogRow field order: cache_read/creation_tokens after model
- Move /logs/export endpoint to correct position in paths (before analytics)
- Add model, block_name, tracking_type params to export endpoint schema
- Add PlatformCostExportResponse in correct schema position
- Add SUBSCRIPTION to CreditTransactionType enum
- Remove input/ctx from ValidationError schema
- Add model/block/type filter UI inputs and wire to hook/URL
- Make AnthropicIntegration and LaunchDarklyIntegration optional imports in metrics.py
- Add export CSV button wired to handleExport in LogsTable
The previous commit accidentally included SUBSCRIPTION in CreditTransactionType
via the local export-api-schema hook which used a Prisma client generated
from a different worktree schema. Restore to the correct pre-commit state.
- Add /api/admin/platform-costs/logs/export endpoint (100K row cap)
- Add cache_read_tokens and cache_creation_tokens to CostLogRow model
- Add CSV export button to LogsTable with buildCostLogsCsv helper
- Fix llm.py: persist total_provider_cost to stats even when all retries fail
- Update openapi.json: add PlatformCostExportResponse and export endpoint
- Guard Anthropic system block behind sysprompt.strip() to avoid 400 errors
when sysprompt is empty (Anthropic rejects empty text blocks with 400)
- Fix anthropic.omit -> anthropic.NOT_GIVEN in convert_openai_tool_fmt_to_anthropic
- Persist <user_context> prefix into session.messages and transcript on first
turn in both baseline and SDK paths so personalisation survives resume/reload
- Add test for empty-sysprompt -> system key omitted in Anthropic API call
- Re-add describe block for seed message sending (removed in 8b8eb80480):
- verifies sendMessage is called with buildSeedPrompt when isGraphLoaded=true
- verifies sendMessage is NOT called when isGraphLoaded=false (default)
- verifies the hasSentSeedMessageRef guard fires only once per session
- Add test for empty messages guard in prepareSendMessagesRequest
- Guard messages.at(-1) in prepareSendMessagesRequest with an early throw
so a runtime TypeError cannot occur if the AI SDK contract is violated
### Why
Admin user impersonation was silently broken for the copilot/autopilot
chat feature. The SSE stream requests and message feedback requests made
direct HTTP calls to the backend with only a Bearer token — missing the
`X-Act-As-User-Id` header that the impersonation feature requires.
This meant that when an admin impersonated a user and used copilot chat,
messages were processed and feedback was recorded under the admin's
identity, not the impersonated user's. The impersonation header was also
read inconsistently: `custom-mutator.ts` accessed `sessionStorage`
directly (breaking cross-tab impersonation), while other callers had no
impersonation support at all.
### What
- **`src/lib/impersonation.ts`**: Added `getSystemHeaders()` — a single
function that returns all cross-cutting request headers, currently
`X-Act-As-User-Id` when impersonation is active. Uses
`ImpersonationState.get()` which handles both `sessionStorage`
(same-tab) and cookie fallback (cross-tab). Added
`IMPERSONATION_COOKIE_NAME` constant to `constants.ts` to replace the
previously hardcoded local string.
- **`src/app/(platform)/copilot/helpers.ts`**: Added
`getCopilotAuthHeaders()` — combines `getWebSocketToken()` (JWT) with
`getSystemHeaders()` (impersonation) into a single async call for direct
backend requests.
- **`src/app/(platform)/copilot/useCopilotStream.ts`**: Replaced local
`getAuthHeaders()` (JWT only) with shared `getCopilotAuthHeaders()` in
both `prepareSendMessagesRequest` and `prepareReconnectToStreamRequest`.
-
**`src/app/(platform)/copilot/components/ChatMessagesContainer/useMessageFeedback.ts`**:
Switched from `getWebSocketToken()` to `getCopilotAuthHeaders()` for
feedback POST requests.
- **`src/app/api/mutators/custom-mutator.ts`**: Replaced raw
`sessionStorage.getItem(IMPERSONATION_STORAGE_KEY)` with
`getSystemHeaders()` (fixes cross-tab support for all generated API
calls).
- **Tests**: New unit tests for `getCopilotAuthHeaders` (4 cases),
`customMutator` impersonation header propagation (2 cases), and
`ImpersonationState`/`ImpersonationCookie`/`ImpersonationSession` (full
coverage across 3 describe blocks, 18 cases).
### How it works
`getSystemHeaders()` calls `ImpersonationState.get()` which reads
`sessionStorage` first and falls back to the impersonation cookie when
`sessionStorage` is empty (cross-tab scenario). The returned header map
is spread into every outbound request, so a single update to
`getSystemHeaders()` propagates to all callers automatically.
`getCopilotAuthHeaders()` wraps both the JWT fetch and the impersonation
header into one `async` call. Callers no longer need to know about
impersonation — they just spread the returned headers into their fetch
options.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] As admin, impersonate a user and open copilot/autopilot chat —
messages processed in the context of the impersonated user
- [x] As admin, impersonate a user and submit feedback (upvote/downvote)
— feedback recorded against the impersonated user
- [x] Without impersonation active, copilot chat works normally
- [x] Frontend unit tests pass: `pnpm test:unit`
### Why
During autopilot sessions with \`dry_run=True\`, the LLM was leaking
awareness of simulation mode through three channels:
1. \`dry_run\` appeared as a required parameter in \`RunBlockTool\`'s
schema — the LLM could see and pass it.
2. \`is_dry_run: true\` appeared in the serialized MCP tool result JSON
the LLM received, causing it to narrate that execution was simulated.
3. The \`[DRY RUN]\` prefix on response messages told the LLM explicitly
that credentials were absent or execution was skipped.
This broke the illusion of a seamless preview experience: users watching
an autopilot dry-run would see the LLM comment on simulation rather than
treating the run as real.
### What
**Backend:**
- \`copilot/model.py\`: \`ChatSessionInfo.dry_run\` is the single source
of truth, stored in the \`metadata\` JSON column (no migration needed).
Set at session creation; never changes.
- \`copilot/tools/run_block.py\`: Removed \`dry_run\` from the tool
schema and \`_execute\` params entirely. Block always reads
\`session.dry_run\`.
- \`copilot/tools/run_agent.py\`: Kept \`dry_run\` as an **optional**
schema parameter (LLM may request a per-call test run in normal
sessions), but \`session.dry_run=True\` unconditionally forces it True.
Removed from \`required\`.
- \`copilot/tools/models.py\`: \`BlockOutputResponse.is_dry_run: bool |
None = None\` — field is absent from normal-run output (was always
\`false\`).
- \`copilot/tools/base.py\`: \`model_dump_json(exclude_none=True)\` —
omits \`None\` fields from serialized output, keeping payloads clean.
- \`copilot/sdk/tool_adapter.py\`: \`_strip_llm_fields\` removes
\`is_dry_run\` from MCP tool result JSON **after** stashing for the
frontend SSE stream. Stripping is conditional on \`session.dry_run\` —
in normal sessions \`is_dry_run\` remains visible so the LLM can reason
about individual simulated calls. Extracted \`_make_truncating_wrapper\`
(was \`_truncating\`) for direct unit testing.
- \`blocks/autopilot.py\`: \`dry_run\` propagates from
\`execution_context.dry_run\` so nested AutoPilot sessions inherit the
parent's simulation mode.
**Frontend:**
- \`useCopilotUIStore\`: Added \`isDryRun\` / \`setIsDryRun\` state
persisted to localStorage (\`COPILOT_DRY_RUN\` key).
- \`useChatSession\`: Accepts \`dryRun\` option; creates session with
\`dry_run: true\` when enabled; resets session when the toggle changes.
- \`DryRunToggleButton\`: New UI control for toggling dry_run mode.
- \`RunAgent.tsx\` / \`helpers.tsx\`: Added \`AgentOutputResponse\` type
handling and \`ExecutionStartedCard\` rendering for the \`agent_output\`
response type.
- OpenAPI: \`is_dry_run\` on \`BlockOutputResponse\` changed to
\`boolean | null\` (was \`boolean\`).
### How it works
**Three-layer defense:**
1. **Schema layer**: \`run_block\` exposes no \`dry_run\` parameter.
\`run_agent\` keeps it optional so the LLM can request test runs in
normal sessions, but \`session.dry_run\` always wins.
2. **Response layer**: \`is_dry_run: bool | None = None\` +
\`exclude_none=True\` means the field is absent from the serialized JSON
in non-dry-run mode — no leakage at rest.
3. **Transport layer**: When \`session.dry_run=True\`,
\`_strip_llm_fields\` removes \`is_dry_run\` from the MCP result before
the LLM sees it, while the stashed copy (for the frontend SSE stream)
retains the full payload.
**Stash-before-strip ordering**: \`_make_truncating_wrapper\` stashes
the full tool output *before* calling \`_strip_llm_fields\`. This
ensures \`StreamToolOutputAvailable\` events carry the complete payload
— so the frontend's "Simulated" badge renders correctly — while the LLM
only ever sees the stripped version.
**Session-level flag**: \`ChatSessionInfo.dry_run\` is set at session
creation and never changes. No LLM tool call can alter it.
**\`_strip_llm_fields\` fast path**: Stripping is skipped when none of
the \`_STRIP_FROM_LLM\` field names appear in the raw text (string scan
before JSON parse), keeping the common non-dry-run path allocation-free.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] \`poetry run pytest backend/copilot/tools/test_dry_run.py\` — all
tests pass
- [x] \`poetry run pytest backend/copilot/sdk/tool_adapter_test.py\` —
all tests pass (including new \`TestStripLlmFields\` suite)
- [x] Pre-commit hooks pass (Ruff, Black, isort, pyright, tsc, OpenAPI
export + orval generate)
- [x] Verify LLM tool result JSON for a dry_run session does not contain
\`is_dry_run\`
- [x] Verify frontend SSE stream still delivers \`is_dry_run: true\` for
"Simulated" badge rendering
## Summary
Add Graphiti temporal knowledge graph memory to CoPilot, giving
AutoPilot persistent cross-session memory with entities, relationships,
and temporal validity tracking.
- **3 new CoPilot tools** (`graphiti_store`, `graphiti_search`,
`graphiti_delete_user_data`) as BaseTool implementations — automatically
available in both SDK and baseline/fast modes via existing TOOL_REGISTRY
bridge
- **FalkorDB** as graph database backend with per-user physical
isolation via `driver.clone(database=group_id)`
- **graphiti-core** Python library for in-process knowledge graph
operations (no separate MCP server needed)
- **MemoryEpisodeLog** append-only replay table for migration safety
- **LaunchDarkly flag** `graphiti-memory` for per-user rollout
- **OpenRouter** for extraction LLM, direct OpenAI for embeddings
### Memory Quality
- Episode body uses `"Speaker: content"` format matching graphiti's
extraction prompt expectations
- Only user messages ingested (Zep Cloud `ignore_roles` approach) —
assistant responses excluded from graph
- `custom_extraction_instructions` suppress meta-entity pollution (no
more "assistant", "human", block names as entities)
- `ep.content` attribute correctly surfaced in search results and warm
context
- Per-user asyncio.Queue serializes ingestion (graphiti-core
requirement)
### Architecture Decision
Custom BaseTool implementations over MCP — the existing
`create_copilot_mcp_server()` in `tool_adapter.py` already wraps every
BaseTool as MCP for the SDK path. One implementation serves both
execution paths with zero extra infrastructure.
## Test plan
- [x] Set LaunchDarkly flag `graphiti-memory` to true for test user
- [x] Verify FalkorDB is healthy: `docker compose up falkordb`
- [x] S1: Send message with user facts ("my assistant is Sarah, CC her
on client stuff, CRM is HubSpot")
- [x] Verify agent calls `graphiti_store` to save memories
- [x] S2 (new session): Ask "Who should I CC on outgoing client
proposals?"
- [x] Verify agent calls `graphiti_search` before answering
- [x] Verify agent answers correctly from memory (Sarah)
- [x] Verify graph entities are clean (no "assistant"/"human"/block
names)
- [x] Verify MemoryEpisodeLog has replay entries
- [ ] Verify `GRAPHITI_MEMORY=false` in LaunchDarkly → tools return "not
enabled" error
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Adds a new persistence layer and background ingestion flow for chat
memory plus new dependencies/services (FalkorDB, `graphiti-core`) and
prompt/tooling changes; rollout is gated by a LaunchDarkly flag but
failures could impact chat latency or resource usage.
>
> **Overview**
> Enables **optional, per-user Graphiti temporal memory** for CoPilot
(gated by LaunchDarkly `graphiti-memory`), including warm-start recall
on the first turn and background ingestion of user messages after each
turn in both `baseline` and SDK chat paths.
>
> Adds Graphiti infrastructure: new `memory_search`/`memory_store` tools
and response types, a per-user cached Graphiti client with safe
`group_id` derivation, a FalkorDB driver tweak for full-text queries,
and a serialized per-user ingestion queue with graceful failure/timeout
handling.
>
> Introduces new runtime configuration and local dev support
(`GRAPHITI_*` env vars, new `falkordb` docker service/volume), updates
permissions/OpenAPI enums, and adds dependencies (`graphiti-core`,
`falkordb`, `cachetools`) plus unit tests for the new modules.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
81eb14e30a. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### Why
The copilot's Claude Code CLI integration had several production
reliability gaps reported from live deployments:
- **No transient retry**: 429 rate-limit errors, 5xx server errors, and
ECONNRESET connection resets surfaced immediately as failures — there
was no retry mechanism.
- **Subagent permission errors**: CLI subprocesses wrote temp files to
`/tmp/claude-0/` which was inaccessible inside E2B sandboxes, causing
subagent spawning to report "agent completed" without actually running.
- **Missing security hardening in non-OpenRouter modes**: Security env
vars (`CLAUDE_CODE_DISABLE_CLAUDE_MDS`,
`CLAUDE_CODE_SKIP_PROMPT_HISTORY`, `CLAUDE_CODE_DISABLE_AUTO_MEMORY`,
`CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC`) were only applied in the
OpenRouter path, leaving subscription and direct Anthropic modes
unprotected in multi-tenant deployment.
- **No resource guardrails**: No per-query budget cap, turn limit, or
fallback model meant a single runaway query could burn unlimited
tokens/spend.
- **Lossy transcript reconstruction**: When no transcript file was
available (storage failure or compaction drop), the old code injected a
truncated plain-text summary that cut tool results at 500 chars and
dropped `tool_use`/`tool_result` structural linkage, causing the LLM to
lose conversation context.
### What
- **SDK guardrails** (`config.py`, `sdk/service.py`): Added
`fallback_model` (auto-failover on 529 overloaded), `max_turns=1000`
(runaway prevention), `max_budget_usd=100.0` (per-query cost cap). All
configurable via env-backed `ChatConfig` fields.
- **Transient retry** (`sdk/service.py`, `constants.py`): Exponential
backoff (1s, 2s, 4s) for 429/5xx/ECONNRESET errors, retried only when
`events_yielded == 0` to avoid breaking partial streams.
`_TRANSIENT_ERROR_PATTERNS` extended with status-code-specific patterns
to avoid false positives.
- **Workspace isolation** (`sdk/env.py`): `CLAUDE_CODE_TMPDIR` now set
in all auth modes so CLI subprocesses write to the per-session workspace
directory rather than `/tmp/`.
- **Security hardening** (`sdk/env.py`): Security env vars applied
uniformly across all three auth modes (subscription, direct Anthropic,
OpenRouter) via restructured `build_sdk_env()`.
- **Transcript reconstruction** (`sdk/service.py`):
`_session_messages_to_transcript()` converts `ChatMessage.tool_calls`
and `ChatMessage.tool_call_id` to proper `tool_use`/`tool_result` JSONL
blocks for `--resume`, restoring full structural fidelity.
- **Model normalization refactor** (`sdk/service.py`):
`_resolve_fallback_model()` and `_normalize_model_name()` extracted to
share prefix-stripping and dot→hyphen conversion logic between primary
and fallback model resolution.
### How it works
**Transient retry**: `_can_retry_transient()` checks the retry budget
and returns the next backoff delay (or `None` when exhausted). Retries
are gated on `events_yielded == 0` — if any events were already streamed
to the client, we cannot retry without breaking the SSE stream
mid-response. After all retries are exhausted, `FRIENDLY_TRANSIENT_MSG`
is surfaced to the user.
**Transcript reconstruction**: When `--resume` has no on-disk session
file, `_session_messages_to_transcript()` builds a JSONL transcript from
`session.messages`, emitting `tool_use` blocks for assistant tool calls
and `tool_result` blocks (with matching IDs) for their results. This
gives Claude CLI the same structural fidelity as an on-disk session —
preserving tool call/result pairing that the old plain-text injection
lost.
**`build_sdk_env()` restructure**: The three auth modes now share a
common "epilogue" block that applies workspace isolation and security
hardening env vars regardless of which mode is active, eliminating the
previous pattern of repeating `if sdk_cwd: env["CLAUDE_CODE_TMPDIR"] =
sdk_cwd` in each branch.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] 729 unit tests passing: `env_test.py`, `p0_guardrails_test.py`,
`retry_scenarios_test.py` (incl. integration tests for both transient
retry paths), `service_test.py`, `sdk_compat_test.py`,
`response_adapter_test.py`
- [x] E2E tested: live copilot session (API + UI), multi-turn, security
env vars verified in all 3 auth modes, guardrail defaults confirmed
- [x] `_session_messages_to_transcript()`: 7 unit tests covering empty
input, tool_use blocks, tool_result blocks, no truncation (10K chars
preserved), parent UUID chain, malformed argument handling
- test_llm: rename test_retry_cost_uses_last_attempt_only → test_retry_cost_accumulates_across_attempts
and update assertion to expect sum of all attempt costs (0.03) instead of last-only (0.02).
- platform_cost_test: add 4th mock side effect for the separate total_agg_rows query
added in the previous commit; update await_count assertion from 3 → 4.
- test_orchestrator_dynamic_fields: explicitly set cache_read_tokens=0,
cache_creation_tokens=0, provider_cost=None on the mock LLM response to avoid
Pydantic validation errors when NodeExecutionStats is constructed from it.
Each retry attempt that gets a response from the provider incurs a cost.
Token counts were already accumulated per attempt, but provider_cost was
overwritten (last value only). Now total_provider_cost accumulates across
all attempts so no billed USD is lost when validation retries occur.
- Wrap ensure_subscription_paid in spend_credits with try/except (fails open like check_rate_limit)
- Invalidate get_user_by_id cache in set_auto_top_up to prevent stale auto top-up data
- Block ENTERPRISE tier self-service upgrades from POST /credits/subscription API
- Format error messages as \$X.XX/mo instead of raw cents
- Move get_feature_flag_value import to module level in credit.py
- Add explicit operation_id to subscription FastAPI routes
- Pass autoTopUpConfig as prop to SubscriptionTierSection (avoid duplicate fetch)
- Display fetch error in SubscriptionTierSection instead of silent null
- Add cache hit comment to rate_limit.py hot path
- Add tests: idempotency, free tier no-op, beta grant offset, tier upgrade validation
- Add SubscriptionTier enum (FREE/PRO/BUSINESS/ENTERPRISE) to schema
- Add SUBSCRIPTION CreditTransactionType for monthly charges
- Lazy monthly deduction via ensure_subscription_paid() — idempotent,
called from spend_credits() and rate-limit checks
- BetaUserCredit grant includes subscription offset so beta usage credits
are not reduced by subscription cost
- Auto top-up enforced >= subscription cost on tier upgrade and config update
- Subscription cost configurable via LaunchDarkly (subscription-cost-pro,
subscription-cost-business); 0 = feature off, no separate flag needed
- New endpoints: GET/POST /credits/subscription for tier management
- No proration: full month charged on upgrade, downgrade takes next cycle
- Frontend: SubscriptionTierSection component on billing page with tier
cards, upgrade/downgrade flow, and auto top-up guard
`_build_system_prompt` was renamed to `_build_cacheable_system_prompt`
in the SDK path as part of the prompt caching PR. Update the patch
target in `retry_scenarios_test.py` to match the new name so the tests
can find the attribute.
- Remove redundant cache_read/creation_tokens from metadata dict in
cost_tracking.py — now stored in dedicated DB columns only.
- Fix total_cost_usd accumulation in OrchestratorBlock SDK path: use
assignment not addition (ResultMessage is emitted once per run, so
summing double-counts if emitted multiple times).
- trackingValue now shows both read and write cache token counts:
"+Xr/Yw cached" instead of "+X cached".
- Add cache-aware estimateCostForRow test: validates 0.1x reads and
1.25x writes multipliers for Anthropic tokens.
Aligns fetch logic with injection logic: `should_inject_user_context`
now requires both `is_first_turn` and `is_user_message`, so
assistant-role calls (e.g. tool-result submissions) on the first turn
no longer trigger a needless `_build_cacheable_system_prompt(user_id)`
DB lookup.
Addresses coderabbitai nitpick from review 4082258841.
Keep caching changes (static system prompt + cache_control markers)
on top of dev's new features: transcript support, file attachments,
URL context in baseline path, and _update_title_async in SDK path.
- Group provider cost table by (provider, tracking_type, model) so each
model gets its own row with accurate usage and estimated cost.
- Add cacheReadTokens / cacheCreationTokens columns to PlatformCostLog.
- Capture Anthropic cache_read_input_tokens / cache_creation_input_tokens
from LLM block responses; propagate through NodeExecutionStats and
PlatformCostEntry to the DB.
- Use per-token-type rates in cost estimation: uncached=100%, reads=10%,
writes=125% of base rate — prevents overestimation when prompt caching
is active (PR #12725).
In the SDK path, pass user_id to _build_cacheable_system_prompt only
when has_history is False, matching the baseline path. Previously
user understanding was fetched from the DB on every turn even though
it is only injected into the first user message, causing an N+1 query.
Also add a defensive logger.warning in the baseline path when no user
message is found for context injection (guarded by is_first_turn, so
this edge case is nearly impossible but surfaces unexpected states).
Move user-specific context out of the system prompt into the first user
message, making the system prompt fully static across all users. Add
explicit Anthropic cache_control markers on both system prompt and tool
definitions in the direct API path (blocks/llm.py).
Import SEED_PROMPT_PREFIX in BuilderChatPanel and extend the
visibleMessages filter to exclude any user message whose text starts
with the prefix. Adds a regression test for the new filter.
- Fix visibleMessages filter: assistant messages with only dynamic-tool parts
(no text) were silently hidden — now included when any dynamic-tool part exists
- Normalize dynamic-tool parts to tool-{toolName} before rendering so
MessagePartRenderer routes them correctly: edit_agent and run_agent get their
existing copilot renderers, all other tools fall through to GenericTool
(collapsed accordion with icon, status text, expandable output)
Track effectFlowID at session creation start and compare against currentFlowIDRef
after the async postV2CreateSession resolves. If the user navigated to a different
graph before the response arrived, the old session ID is discarded instead of
being committed to the new graph's state, preventing chat history from being
crossed between graphs.
- Add DANGEROUS_KEYS blocklist (__proto__, constructor, prototype) checked before
the schema guard in handleApplyAction so schema-less nodes cannot be polluted
via AI-supplied keys
- Validate execution_id from run_agent tool output with /^[\w-]+$/i before
passing to setQueryStates, preventing URL-special characters from entering
query state
- Add tests for DANGEROUS_KEYS blocklist on schema-less nodes (three cases)
- Add tests for edit_agent tool call detection: verifies onGraphEdited fires on
output-available state, is suppressed during streaming, and is not called twice
for the same toolCallId (processedToolCallsRef deduplication)
- Add tests for session ID validation: verifies that path-traversal IDs
(../../admin) and IDs with spaces set sessionError and leave sessionId null
- Extract EMPTY_NODES module-level constant to give useShallow a stable
reference when the panel is closed, preventing spurious re-renders
- Wire isInitialLoadComplete as isGraphLoaded prop in Flow.tsx so the seed
message effect in useBuilderChatPanel actually fires once the graph is ready
- Add panelRef to BuilderChatPanel and pass it to the hook so the Escape key
listener only closes the panel when focus is inside it, preventing conflicts
with other dialogs or canvas keyboard handlers
- Update BuilderChatPanel test to use objectContaining for the hook call
assertion, accommodating the new panelRef argument
Replace the simplified ReactMarkdown block in BuilderChatPanel's MessageList
with MessagePartRenderer from the copilot panel, enabling proper rendering of
tool invocations, error markers, and system markers in addition to text parts.
- Restore isGraphLoaded prop and hasSentSeedMessageRef seed-message effect that
were removed in a prior external modification; all seed-message tests now pass
- Apply Object.prototype.hasOwnProperty.call() guard in inline handleApplyAction
for input-schema and handle validation (three sites), matching the extracted
helper functions; prototype-pollution tests now pass
- Export clearGraphSessionCacheForTesting() and call it in beforeEach to prevent
stale module-level graphSessionCache from leaking across tests (fixes flowID
reset test)
- Update BuilderChatPanel test to expect isGraphLoaded in useBuilderChatPanel call
- Remove unused Dispatch, SetStateAction, CustomEdge, CustomNode imports
- Remove auto-send seed message on chat open (user initiates context manually)
- Cache chat session per graph ID (module-level Map) so reopening the panel for
the same graph reuses the existing session and preserves conversation history
- Detect edit_agent tool completion → trigger graph refetch via onGraphEdited callback
- Detect run_agent tool completion → update flowExecutionID in URL to auto-follow run
- retrySession now evicts the stale cache entry so a fresh session is created
- Flow.tsx passes refetchGraph as onGraphEdited to BuilderChatPanel
- Fix prototype pollution bypass: use Object.prototype.hasOwnProperty.call instead of `in` operator for schema key validation, preventing __proto__/constructor injection through schema-validated nodes
- Extract applyUpdateNodeInput and applyConnectNodes as module-level helpers to reduce handleApplyAction from 165 lines to a 20-line dispatcher
- Add JSDoc to useBuilderChatPanel documenting session lifecycle, transport, seed message, action parsing, undo, and input responsibilities
- Add maxLength=4000 to PanelInput textarea to cap token usage
- Add prototype pollution tests (__proto__ and constructor keys rejected when inputSchema is present)
- Strengthen Send-button-disabled assertion in component test
Shows three bouncing dots in an assistant-style bubble while waiting
for the first response token (status submitted, no assistant text yet).
Disappears once streaming begins and text appears.
- Overlapping placeholders: add !seedMessage guard to empty-state block so the
"Ask me to explain…" and "Graph context sent" banners are mutually exclusive
- aria-modal without focus trap: replace role="dialog"/aria-modal="true" with
role="complementary" since this is a side panel, not a blocking modal
- Stale closure in handleApplyAction: use useNodeStore/useEdgeStore.getState()
for both validation and mutation so rapid applies see live data
- Gate nodes/edges Zustand subscriptions behind isOpen to prevent chat-panel
hook re-running on every node drag/resize when panel is closed
- inputValue not cleared on flowID change: add setInputValue("") to flowID reset
- ReactMarkdown links: add custom <a> component with target="_blank" and rel="noopener noreferrer"
- XML sanitization: apply sanitizeForXml() to n.id and edge handle names
- Regex statefulness: move JSON_BLOCK_REGEX inside parseGraphActions() to avoid
shared lastIndex state (eliminates fragile lastIndex=0 reset)
- Type guard soundness: add typeof p.text === "string" to extractTextFromParts filter
- Session ID validation: validate format before interpolating into streaming URL
- Shallow-copy undo snapshots: spread prevNodes/prevEdges so closures hold
independent arrays
- Set spread optimisation: use new Set(prev).add(key) instead of new Set([...prev, key])
- Tests: remove dead getGetV1GetSpecificGraphQueryKey mock, add markerEnd assertion
to connect_nodes tests, add transport prepareSendMessagesRequest coverage,
add Enter-with-empty-input and inputValue-reset-on-flowID-change tests
Closes branch gaps in platform_cost.py (lines 29-31 and 312→314) that
were introduced via the dev merge but not exercised by existing tests.
This also forces the backend CI to run so Codecov uploads fresh coverage
instead of carrying forward stale data from before the cost-tracking
feature landed on dev.
The AI SDK can return messages with undefined parts in certain error
scenarios. Accept null/undefined in extractTextFromParts and fall back
to an empty array to prevent a TypeError and component crash.
Adds a useEffect in useBuilderChatPanel that calls setMessages([]) whenever
the flowID query param changes, preventing old technical seed prompts from
the prior session briefly appearing when switching between agents.
Chat panel used setEdges directly without the markerEnd property that edgeStore.addEdge
sets automatically. Added MarkerType.ArrowClosed with strokeWidth=2, color="#555" to
match the standard edge appearance.
- Guard against duplicate connect_nodes edges: check prevEdges before applying,
mark as already-applied without duplicating if edge exists
- Cap undo stack at MAX_UNDO=20 to prevent unbounded memory growth for large graphs
- Fix React anti-pattern: call restore() before setUndoStack updater instead of
inside it (state updaters must be pure — no side effects)
- Add aria-modal="true" to dialog panel and aria-expanded to toggle button
- Extract IIFE nodeMap into ActionList sub-component (cleaner render path)
- Add 18 new tests: handleSend when canSend=false, Shift+Enter no-send,
schema-absent permissive paths (update + connect_nodes), sequential multi-undo
LIFO order, duplicate edge guard, undo stack size cap, empty stack no-op
Apply chat panel changes via setNodes/setEdges (bypassing history store)
so Ctrl+Z cannot revert them and leave the "Applied" badge stale.
Also hoist jsonBlockRegex to module scope, cap node description length
at 500 chars, and remove useShallow from single-value selectors.
Use setNodes/setEdges directly in undo restore closures instead of
updateNodeData/removeEdge which push to the history store. This prevents
the global Ctrl+Z from re-applying changes that the user already undid via
the chat panel's own undo button.
Also removes unused removeEdge selector from the hook.
- Replace fragile setTimeout double-toggle retry with dedicated retrySession()
callback that resets sessionError and lets the session-creation effect re-run
- Remove invalidateQueries after apply actions — caused server refetch to
overwrite local Zustand state changes (sentry HIGH severity bug)
- Deep-clone prevHardcoded before undo capture so sequential applies to the
same node each have an independent snapshot
- Remove unsolicited "What does this agent do?" question from seed prompt;
invite user to initiate instead
- Remove useCallback from handleUndoLastAction per project convention
- Remove unused sendMessage and status from hook return
- Remove JSDoc comment from BuilderChatPanel per project convention
- Hoist nodeMap construction from ActionItem to parent parsedActions.map
to avoid N identical Maps per render cycle
- Make useChat mock configurable (mockChatMessages/mockChatStatus) and add
tests for parsedActions integration, Escape key handler, retrySession,
and handleSend input-clearing behavior
Requested by @majdyz
## Why
As we enforce patch coverage targets via Codecov (see #12694), the
`pr-address` skill needs to guide agents to verify test coverage when
they write new code while addressing review comments. Without this, an
agent could address a comment by adding untested code and create a new
CI failure to fix.
## What
Adds a **Coverage** section to `.claude/skills/pr-address/SKILL.md`
with:
- The `pytest --cov` command to check coverage locally on changed files
- Clear rules: new code needs tests, don't remove existing tests, clean
up dead test code when deleting code
## Impact
Agents using `/pr-address` will now run coverage checks as part of their
workflow and won't land untested new code.
Linear: SECRT-2217
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
## Why
`POST /api/graphs` was returning **500** when an agent graph contained
an Agent Input block without a `name` field.
Root cause: `GraphModel._generate_schema` calls
`model_construct(**input_default)` (which skips Pydantic validation) to
build a list of field objects. If `input_default` doesn't include
`name`, the constructed `Input` object has no `name` attribute. The
subsequent dict comprehension (`p.name: {...}`) then raises
`AttributeError`, which is not handled and falls through to the generic
`Exception → 500` catch-all in `rest_api.py`. The `ValueError → 400`
handler already exists but is never reached.
## What
- In `_generate_schema`, wrap the `return {…}` block in `try/except
AttributeError` and re-raise as `ValueError`.
- Added a unit test that directly exercises
`GraphModel._generate_schema` with a nameless `AgentInputBlock.Input`
and asserts `ValueError` is raised.
## How
`rest_api.py` already has:
```python
app.add_exception_handler(ValueError, handle_internal_http_error(400))
```
The only change needed was to ensure `AttributeError` gets converted
before it propagates. The fix is a single `try/except` block — no new
exception types, no new handlers.
**Note:** In Pydantic v2, `ValidationError` is _not_ a subclass of
`ValueError` — they are separate hierarchies. `pydantic.ValidationError`
inherits directly from `Exception`. The existing separate handler for
`pydantic.ValidationError` is correct and unrelated to this fix.
## Checklist
- [x] My changes follow the project coding style
- [x] I've written/updated tests for the changes
- [x] Tests pass locally (`poetry run pytest
backend/data/graph_test.py::test_generate_schema_raises_value_error_when_name_missing`)
- sanitizeForXml now escapes &, ", ' in addition to < and >
- connect_nodes actions now push an undo snapshot (removeEdge) so they can be reverted like update_node_input
- useBuilderChatPanel.test.ts adds removeEdge mock and test for undo of connect_nodes
## Changes
- Wrap `metadata` field in `SafeJson()` when calling
`PrismaLog.prisma().create()` in `log_platform_cost`
- Add `platform_cost_integration_test.py` with DB round-trip tests for
the fix
## Why
`PrismaLog.prisma().create()` was silently failing with a `DataError`
because passing a plain Python `dict` to a `Json?`-typed Prisma field is
not allowed:
```
DataError: Invalid argument type. `metadata` should be of type NullableJsonNullValueInput or Json
```
The error was swallowed silently by `logger.exception` in the background
task, so **no rows ever landed in `PlatformCostLog`** — which is why the
dev admin cost dashboard showed no data after #12696 was merged.
## How
Wrap `entry.metadata` in `SafeJson()` (already used throughout the
codebase, lives in `backend/util/json.py`) before passing it to the
Prisma create call. `SafeJson` extends `prisma.Json`, sanitizes
PostgreSQL-incompatible control characters, and handles Pydantic-model
conversion.
Add two integration tests in `platform_cost_integration_test.py`
(following the `credit_integration_test.py` pattern) that write a record
to a real DB and read it back — confirming both metadata round-trip and
NULL metadata work correctly.
## Test plan
- [x] Integration tests verify metadata persists/reads correctly via
Prisma
- [x] Unit tests updated: `isinstance(data["metadata"], Json)` confirms
the field is wrapped
- [x] Verified on dev executor pod: cost rows now appear in the admin
dashboard after fix
## Why
A Sentry `AttributeError: 'dict' object has no attribute 'timezone'` was
traced to the scheduler accessing `user.timezone` on a value that was a
raw `dict` instead of a typed `User` model.
**Root cause (two-part):**
1. `User.model_config` had `extra='forbid'`. During a rolling deploy,
the database manager (newer pod) can return fields that the client
(older pod) doesn't yet know about. `extra='forbid'` caused
`TypeAdapter(User).validate_python()` to raise `ValidationError` on
those unknown fields.
2. `DynamicClient._get_return` had a silent `try/except` that swallowed
the `ValidationError` and fell back to returning the raw `dict`. The
scheduler then received a `dict` and crashed on `.timezone`.
## What
- **`backend/data/model.py`**: Change `User.model_config`
`extra='forbid'` → `extra='ignore'`. Unknown fields from a newer
database manager are silently dropped, making the RPC layer
forward-compatible during rolling deploys. This is the primary fix.
- **`backend/util/service.py`**: Restore the `try/except` fallback in
`_get_return`, but make it **observable**: log the full error message at
`WARNING` (so ValidationError details — field name, value — appear in
logs) and call `sentry_sdk.capture_exception(e)` so every fallback is
tracked and alerted without crashing the caller. The raw result is still
returned as before (continuity).
- **`backend/util/service_test.py`**: Add `TestGetReturn` with two
direct unit tests: valid dict (including an unknown future field) →
typed `User` returned; invalid dict (missing required fields) → fallback
returns raw dict (no crash). Uses a typed `_SupportsGetReturn` Protocol
+ `cast` instead of `# type: ignore` suppressors.
- **`backend/executor/utils_test.py`**: Fix misleading docstring; move
inner imports to module top level per code style.
## How
`extra='ignore'` is the standard Pydantic pattern for forward-compatible
models at service boundaries. It means a rolling deploy where the DB
manager has a new column will not break older client pods — the extra
field is simply dropped on deserialization.
The restored `_get_return` fallback preserves continuity (callers don't
crash) while the `logger.warning` + `sentry_sdk.capture_exception`
ensure no schema mismatch goes undetected. Silent degradation is
replaced by observable degradation.
## Checklist
- [x] Changes are backward-compatible (unknown fields ignored, not
rejected)
- [x] Regression tests added for `_get_return` typed deserialization
contract
- [x] Fallback preserved with observable logging and Sentry capture (no
silent degradation)
- [x] `extra='ignore'` is consistent with forward-compatibility
requirements at service boundaries
- [x] No `# type: ignore` suppressors introduced
## Why
`OnboardingProvider` was generating a Sentry alert (BUILDER-7ME:
"Authorization header is missing") on every behave test run. The root
cause: when a user's session expires mid-flow, they get redirected to
`/login`. The provider remounts on the login page, calls
`getV1CheckIfOnboardingIsCompleted()` while unauthenticated, and the 401
falls into the catch block which calls `console.error`. Sentry's
`captureConsole` integration auto-captures all `console.error` calls as
events, triggering the alert.
This is expected behavior — the auth middleware handles the redirect,
there's nothing broken. It was just noisy.
## What
- In `OnboardingProvider`'s `initializeOnboarding` catch block, return
early and silently on `ApiError` with status 401 — no `console.error`,
no toast
- Only unexpected errors (non-401) still surface via `console.error` and
the destructive toast
## How
```ts
} catch (error) {
if (error instanceof ApiError && error.status === 401) {
return;
}
// ... existing error handling
}
```
## Checklist
- [x] `pnpm format && pnpm lint && pnpm types` pass
- [x] Change is minimal and scoped to the one catch block
- [x] No new test needed — this is a logging/noise fix, not a behavioral
change
## Why
Two bugs block OrchestratorBlock from working correctly:
1. **Dry-run always fails with "credentials required"** even when
`OPEN_ROUTER_API_KEY` is set on dev. The n8n conversion dry-run hits
this.
2. **Agent-mode OrchestratorBlock fails on the second LLM call** with
`Error code: 400 – Unknown parameter: 'input[2].status'` when using
OpenAI models (Responses API path).
## What
**Bug 1 — manager.py credential null** (`backend/executor/manager.py`):
The dry-run path called `input_data[field_name] = None` to "clear" the
credential slot, but `_execute` in `_base.py` filters out `None` values
before calling `input_schema(**...)`. This drops the required
`credentials` field from the schema constructor, causing a Pydantic
validation error.
Fix: Don't null out the field. If the user already has credential
metadata in `input_data` (normal case), leave it intact. If not (no
credentials configured), synthesise a minimal
`CredentialsMetaInput`-compatible placeholder from the platform
credentials so schema construction passes. The actual
`APIKeyCredentials` (platform key) is still injected via
`extra_exec_kwargs`.
**Bug 2 — Responses API `status` field**
(`backend/blocks/orchestrator.py`):
OpenAI returns output items (function calls, messages) with a `status:
"completed"` field. When `_convert_raw_response_to_dict` serialises
these items and they are stored in `conversation_history`, they are sent
back as input on the next call — but OpenAI rejects `status` as an
input-only field.
Fix: Strip `status` from each output item before it enters the history.
## How
- `manager.py` lines 311-314: removed the `input_data[field_name] =
None` nullification; added a conditional placeholder when no credential
metadata is present.
- `orchestrator.py` `_convert_raw_response_to_dict`: filter `k !=
"status"` when extracting Responses API output items.
- Tests added for both fixes.
## Checklist
- [x] Tests written and passing (94 total, all green)
- [x] Pre-commit hooks passed (Black, Ruff, isort, typecheck)
- [x] No out-of-scope changes
- Move inputValue, handleSend, handleKeyDown, isStreaming, canSend into
useBuilderChatPanel (0ubbe: keep render logic out of component)
- Add undo support: snapshot node state before apply, expose undoStack +
handleUndoLastAction, show undo button in PanelHeader
- Add toast feedback on handleApplyAction validation failures so users
know why Apply did nothing
- Fix getActionKey for update_node_input to include value so AI corrections
in later turns are not silently dropped by the dedup Set
- Add getNodeDisplayName shared helper in helpers.ts; use in both
serializeGraphForChat and ActionItem (removes duplication)
- Use Map<id, node> in serializeGraphForChat for O(1) edge lookups
- Add Retry button to session error state in MessageList
- Add graph context sent banner after seed message so AI response
does not appear unprompted (addresses confusing auto-response UX)
- Add aria-label to Apply buttons for screen-reader accessibility
- Remove hook-only test file (0ubbe: test component, not hook)
- Expand component tests: undo, retry, seed banner, action label format,
getNodeDisplayName, getActionKey value-inclusion, edge truncation
- All 1026 tests pass; lint and types clean
- Filter seed message by content prefix (SEED_PROMPT_PREFIX) instead of position
- Add exhaustiveness guard for unhandled GraphAction types
- Guard handleApplyAction against unknown keys/handles via inputSchema/outputSchema
- Add renderHook-based tests: session lifecycle, flowID reset, handleApplyAction, edge cases
- Fix session-creation effect to use isCreatingSessionRef so state-driven re-renders
don't prematurely cancel the in-flight request via the cancelled flag
- Add empty-input rejection test for BuilderChatPanel send button
- Replace auto-apply with per-action Apply buttons; users must explicitly
confirm each AI suggestion before the graph is mutated
- Accumulate parsedActions across all assistant messages so multi-turn
suggestions remain visible rather than disappearing after the next turn
- Escape < and > in node names/descriptions before embedding in XML prompt
context to prevent AI prompt injection via crafted node labels
- Add MAX_EDGES cap (200) in serializeGraphForChat to mirror the MAX_NODES
limit and prevent token overruns on dense graphs
- Add Escape key handler in the hook to close the chat panel
- Add helpers.test.ts with unit tests for buildSeedPrompt,
extractTextFromParts, and XML sanitization
## Why
When system-managed credentials are used (AutoGPT pays the API bills),
there was no visibility into which providers were being called, how much
each costs, or which users were driving usage. This makes it impossible
to set appropriate per-user limits or reconcile expenses with actual API
invoices.
## What
End-to-end platform cost tracking for all 22 system-credential providers
+ both copilot modes:
- Every block execution that uses system credentials records a
`PlatformCostLog` row (provider, cost, tokens, user, execution IDs)
- Copilot turns (SDK + baseline) are tracked with model name, token
counts, and actual USD cost
- Admin dashboard at `/admin/platform-costs` shows cost breakdown by
provider and user with date/provider/user filters and paginated raw logs
- Admin API endpoints with 30s TTL cache: `GET
/platform-costs/dashboard` and `GET /platform-costs/logs`
## How
### Core hook
`cost_tracking.py` calls `log_system_credential_cost()` after each block
node execution. It reads `NodeExecutionStats.provider_cost` (set by
`merge_stats()` inside each block) and dispatches a fire-and-forget
`INSERT` via `log_platform_cost_safe()`.
### Per-block tracking
Each block calls `self.merge_stats(NodeExecutionStats(provider_cost=...,
provider_cost_type=...))`:
| Tracking type | Providers | Amount |
|---|---|---|
| `cost_usd` | OpenRouter, Exa | Actual USD from API response |
| `tokens` | OpenAI, Anthropic, Groq, Ollama, Jina | Token count from
response.usage |
| `characters` | Unreal Speech, ElevenLabs, D-ID | Input text length |
| `sandbox_seconds` | E2B | Walltime |
| `walltime_seconds` | FAL, Revid, Replicate | Walltime |
| `per_run` | Google Maps, Apollo, SmartLead, etc. | 1 per execution |
OpenRouter cost: extracted via `with_raw_response.create()` and
`raw.headers.get("x-total-cost")` with `math.isfinite` + `>= 0`
validation (replaces private `_response` access).
### Copilot tracking
`token_tracking.py` writes a `PlatformCostLog` row per copilot LLM turn
via an async fire-and-forget queue bounded by a `Semaphore(50)`. SDK
path uses `sdk_msg.total_cost_usd`; baseline path uses the
`x-total-cost` header from OpenRouter streaming responses.
### Executor drain
`drain_pending_cost_logs()` is called before `executor.shutdown()` using
a module-level loop registry (`_active_node_execution_loops`) so that
pending log tasks from each worker thread's event loop are awaited
before the process exits. Tasks are filtered by `task.get_loop() is
current_loop` to avoid cross-loop `RuntimeError` in Python ≥ 3.10.
### CoPilot executor lifecycle
Worker threads connect Prisma on startup and disconnect on cleanup (even
on failure). If `db.connect()` fails during `@func_retry`, the event
loop is stopped and joined before re-raising so no loop is leaked across
retry attempts.
### Schema
```prisma
model PlatformCostLog {
id String @id @default(uuid())
createdAt DateTime @default(now())
userId String?
graphExecId String?
nodeExecId String?
blockName String
provider String
trackingType String
costMicrodollars BigInt @default(0)
inputTokens Int?
outputTokens Int?
duration Float?
model String?
}
```
### Admin dashboard
React page with three tabs (By Provider / By User / Raw Logs) driven by
two generated Orval hooks (`useGetV2GetPlatformCostDashboard`,
`useGetV2GetPlatformCostLogs`). Filters are URL-based (`searchParams`)
for bookmarkability. Pagination for raw logs. Per-provider estimated
totals using configurable cost-per-unit multipliers.
## Test plan
- [x] Migration applies cleanly
- [x] Block execution with system credentials creates PlatformCostLog
row
- [x] Copilot conversation records cost log with tokens + model
- [x] `/admin/platform-costs` dashboard renders with correct data
- [x] Date/provider/user filters work correctly
- [x] Non-admin users get 403 on cost endpoints
- [x] Executor drain completes before process exit (no lost logs)
---------
Co-authored-by: Zamil Majdy <majdyz@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
### Why / What / How
**Why:** A series of production failures exposed gaps in the agent fleet
tooling:
1. Agents using `_wait_idle`/`wait_for_claude_idle` would time out
waiting for `❯` while a settings-error dialog blocked progress — because
the dialog can appear above the last 3 captured lines.
2. The run-loop's adaptive backoff used `POLL_CURRENT * 3 / 2` which
stalls at 1 forever in bash integer arithmetic, and printed the interval
*before* recomputing it.
3. `pr-address` agents were silently missing review threads when a PR
had >100 threads across multiple pages — they'd stop at page 1, address
69/111 threads, and falsely report "done".
4. `resolveReviewThread` was being called without a committed fix —
producing false "0 unresolved" signals that bypassed verification.
5. The onboarding bypass in `/pr-test` had no timeout on curl calls, so
the step could hang forever if the backend wasn't ready yet.
6. The orchestrator's own verification query used `first: 1` which can't
reliably count unresolved threads across all pages.
**What:**
- Idle detection hardened in both `spawn-agent.sh` and `run-loop.sh` —
full-pane check for 'Enter to confirm' so the dialog is never missed
- Adaptive backoff arithmetic fixed (`POLL_CURRENT + POLL_CURRENT/2 + 1`
always increments); log ordering corrected; `POLL_IDLE_MAX` made
env-configurable
- `pr-address/SKILL.md`: mandatory cursor-pagination loop collecting ALL
thread IDs before addressing anything; prominent ⚠️ warning with the PR
#12636 incident (142 threads, 2 pages, agent stopped at 69)
- `pr-address/SKILL.md`: new "Parallel thread resolution" section —
batch by file, one commit per file group, concurrent reply subshells
with 3s gaps, sequential resolves
- `pr-address/SKILL.md`: "Verify actual count" section now uses
paginated loop (not single first:100 query)
- `orchestrate/SKILL.md`: verification query fixed to paginate all
pages; new "Thread resolution integrity" section with anti-patterns;
fake-resolution detection query; state-staleness recovery; RUNNING-count
confusion explained
- `/pr-test` onboarding bypass: `--max-time 30` on curl calls; hard-fail
on bypass failure
**How:** All changes are to DX skill files and orchestration scripts —
no production code modified. Each fix is a separate commit so the change
history is readable.
### Changes 🏗️
**Scripts:**
- `run-loop.sh`: `wait_for_claude_idle` — add 'Enter to confirm' dialog
check (reset elapsed on dialog); fix backoff arithmetic stall; fix log
ordering; make `POLL_IDLE_MAX` env-configurable; reset poll interval
when `waiting_approval` agents present
- `spawn-agent.sh`: `_wait_idle` — capture full pane (not just `tail
-3`) for 'Enter to confirm' check; wait-for-idle before sending agent
objective to prevent stuck pasted-text
**SKILL.md files:**
- `pr-address/SKILL.md`:
- ⚠️ WARNING + totalCount step + cursor-pagination loop before
addressing any threads
- "Parallel thread resolution" section: group by file, batch commits,
concurrent replies, sequential resolves
- "Verify actual count" section: full paginated loop instead of single
first:100 query
- "What counts as a valid resolution" with explicit anti-patterns
(Acknowledged, Accepted, no-commit resolves)
- Rate limits table (403 secondary vs 429 primary), 2-3 min recovery
- `git rev-parse HEAD` pattern with `${FULL_SHA:0:9}` short SHA
- `orchestrate/SKILL.md`:
- Thread resolution integrity section + fake-resolution detection query
- Verification query fixed to paginate all pages
- State file staleness recovery (stale `loop_window`, closed windows,
repair recipes)
- RUNNING count confusion: explains `waiting_approval` included in regex
- Idle check before re-briefing agents
- `pr-test/SKILL.md`:
- `--max-time 30` on onboarding bypass curl calls
- Hard-fail (`exit 1`) if bypass verification fails
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verified adaptive backoff increments correctly (no longer stalls
at 1)
- [x] Verified 'Enter to confirm' dialog handled in both wait functions
- [x] Verified pagination loop collects all thread IDs across pages
- [x] Verified PR #12636 onboarding bypass works end-to-end (11/11
scenarios PASS)
---------
Co-authored-by: Zamil Majdy <majdy.zamil@gmail.com>
The initialization prompt ("I'm building an agent in the AutoGPT flow
builder...") was sent as a visible user message, exposing raw prompt
engineering instructions to end users. Track its ID via seedMessageId
and exclude it from the rendered message list.
handleApplyAction was defined and exported but never called, so the
"AI applied these changes" panel was displaying actions that had no
effect. Wire up a handleApplyActionRef so the status-change effect
can safely apply each parsed action to the local Zustand stores once
per completed AI turn, before the canvas refetch resolves.
Rejects update_node_input keys not present in inputSchema.properties and
connect_nodes handles not present in outputSchema/inputSchema.properties,
preventing AI from writing arbitrary fields that blocks do not support.
Validation is permissive when schema is undefined (backwards-compatible).
- Add useBuilderChatPanel.test.ts with direct tests for handleApplyAction:
update_node_input (merges hardcodedValues, no-ops for unknown node),
connect_nodes (calls addEdge with correct args, no-ops if either node missing)
- Add panel open/close state tests for useBuilderChatPanel
- Add session error UI test to BuilderChatPanel.test.tsx
- Extract `getActionKey(action)` to helpers.ts, removing duplicated key
computation from BuilderChatPanel.tsx and useBuilderChatPanel.ts
- Wire `textareaRef` through PanelInputProps so focus-on-open works
- Add `getActionKey` tests covering both action types
Merge resolution keeps:
- buildSeedPrompt helper (prompt injection mitigation with XML tags)
- extractTextFromParts naming (aligned with remote)
- cancelled flag pattern for session creation cleanup
- streamError display and empty/welcome state (new in this branch)
- Static Applied badge (span, no dead toggle logic)
- ARIA roles: role=dialog, role=log
- react-markdown for assistant messages
- Placeholder hint for Enter/Shift+Enter
- All new tests: keyboard, multi-action, customized_name, truncation,
primitive validation, stream error, ARIA assertions
Security:
- Wrap graph context in <graph_context> XML tags and label as untrusted to
mitigate prompt injection from node names/descriptions
- Add comment confirming backend validates session ownership before streaming
- Restrict update_node_input value to string|number|boolean primitives to
prevent prototype-pollution from crafted AI responses
- Add MAX_NODES=100 cap in serializeGraphForChat to prevent token overruns
- Add source/target node existence check before addEdge in handleApplyAction
Correctness:
- Add `ignore` flag to session-creation effect to prevent state updates after
unmount or effect re-run
- Add nodes+edges to initialization effect deps (hasSentSeedMessageRef guards
against re-firing)
- Gate parsedActions useMemo on status==='ready' to avoid hot-path regex
during streaming
Code quality:
- Rename initializedRef → hasSentSeedMessageRef for clarity
- Extract buildSeedPrompt and getMessageText helpers into helpers.ts
- Remove dead ActionItem handleApply/applied toggle (actions are auto-applied)
- Remove redundant setTimeout scroll in handleSend (useEffect already scrolls)
- Export error from useChat for stream error display
UX / accessibility:
- Add react-markdown rendering for assistant message bubbles
- Add empty/welcome state when no messages
- Add role="dialog" + aria-label to panel, role="log" + aria-live to messages
- Add streaming error display when useChat error is set
- Update placeholder to hint Enter/Shift+Enter behaviour
Tests:
- Add Enter-to-send and Shift+Enter-no-send keyboard tests
- Add multi-action block parsing test
- Add metadata.customized_name preference test
- Add MAX_NODES truncation test
- Add primitive value validation test (number, boolean)
- Add stream error display test
- Add ARIA role assertion tests
- Validate node existence before connect_nodes in handleApplyAction
- Add cleanup guard to session creation effect to prevent state updates
after unmount
- Extract extractTextFromParts helper to deduplicate text extraction
- Remove dead code in ActionItem (applied state was always true)
- Remove redundant setTimeout scroll in handleSend (useEffect handles it)
- Update test to match simplified ActionItem
## Changes
### verify-complete.sh
- CHANGES_REQUESTED reviews are now compared against the latest commit
timestamp. If the review was submitted **before** the latest commit, it
is treated as stale and does not block verification.
- Added fail-closed guard: if the `gh pr view` fetch fails, the script
exits 1 (rather than treating missing data as "no blocking reviews")
- Fixed edge case: a `CHANGES_REQUESTED` review with a null
`submittedAt` is now counted as fresh/blocking (previously silently
skipped)
- Combined two separate `gh pr view` calls into one (`--json
commits,reviews`) to reduce API calls and ensure consistency
### SKILL.md (orchestrate skill)
- Added `### /pr-test result evaluation` section with explicit
pass/partial/fail handling table
- **PARTIAL on any headline feature scenario = immediate blocker**:
re-brief the agent, fix, and re-run from scratch. Never approve or
output ORCHESTRATOR:DONE with a PARTIAL headline result.
- Concrete incident callout: PR #12699 S5 (Apply suggestions) was
PARTIAL — AI never output JSON action blocks — but was nearly approved.
This rule prevents recurrence.
- Updated `verify-complete.sh` description throughout to include "no
fresh CHANGES_REQUESTED"
- Added staleness rule documentation: a review only blocks if submitted
*after* the latest commit
## Why
Two separate incidents prompted these changes:
1. **verify-complete.sh false positive**: An automated bot
(autogpt-pr-reviewer) submitted a `CHANGES_REQUESTED` review in April.
An agent then pushed fixing commits. The old script still blocked on the
stale review, preventing the PR from being verified as done.
2. **Missed PARTIAL signal**: PR #12699 had a PARTIAL result on its
headline scenario (S5 Apply button) because the AI emitted direct
builder tool calls instead of JSON action blocks. The orchestrator
nearly approved it. The new SKILL.md rule makes PARTIAL = blocker
explicit.
## Checklist
- [x] I have read the contribution guide
- [x] My changes follow the code style of this project
- [x] Changes are limited to the scope of this PR (< 20% unrelated
changes)
- [x] All new and existing tests pass
- Feature-flag the BuilderChatPanel behind BUILDER_CHAT_PANEL flag (ntindle)
- Reset sessionId/initializedRef on flowID navigation (sentry x2)
- Block input until session is ready to prevent pre-seed messages (coderabbitai)
- Reset sessionError on panel reopen so retry works (coderabbitai)
- Gate canvas invalidation on actual graph mutations only (coderabbitai)
- Add comment explaining ActionItem applied=true is intentional (sentry)
- Rename test and assert disabled state directly (coderabbitai)
- Embed JSON action block instruction in the seed message so the AI
outputs parseable blocks after edit_agent calls, making the changes
section visible without a backend system-prompt deploy
- Auto-invalidate the graph React Query after streaming completes so
useFlow.ts re-fetches and repopulates nodeStore/edgeStore in real-time
- Start ActionItem in pre-applied state; section label reads "AI applied
these changes" since edit_agent saves immediately server-side
- Update tests to match new label and pre-applied default
### Why
When running multiple Claude Code agents in parallel worktrees, they
frequently get stuck: an agent exits and sits at a shell prompt, freezes
mid-task, or waits on an approval prompt with no human watching. Fixing
this currently requires manually checking each tmux window.
### What
Adds a `/orchestrate` skill — a meta-agent supervisor that manages a
fleet of Claude Code agents across tmux windows and spare worktrees. It
auto-discovers available worktrees, spawns agents, monitors them, kicks
idle/stuck ones, auto-approves safe confirmations, and recycles
worktrees on completion.
### How to use
**Prerequisites:**
- One tmux session already running (the skill adds windows to it; it
does not create a new session)
- Spare worktrees on `spare/N` branches (e.g. `AutoGPT3` on `spare/3`,
`AutoGPT7` on `spare/7`)
**Basic workflow:**
```
/orchestrate capacity → see how many spare worktrees are free
/orchestrate start → enter task list, agents spawn automatically
/orchestrate status → check what's running
/orchestrate add → add one more task to the next free worktree
/orchestrate stop → mark inactive (agents finish current work)
/orchestrate poll → one manual poll cycle (debug / on-demand)
```
**Worktree lifecycle:**
```text
spare/N branch → /orchestrate add → new window + feat/branch + claude running
↓
ORCHESTRATOR:DONE
↓
kill window + git checkout spare/N
↓
spare/N (free again)
```
Windows are always capped by worktree count — no creep.
### Changes
- `.claude/skills/orchestrate/SKILL.md` — skill definition with 5
subcommands, state file schema, spawn/recycle helpers, approval policy
- `.claude/skills/orchestrate/scripts/classify-pane.sh` — pane state
classifier: `idle` (shell foreground), `running` (non-shell),
`waiting_approval` (pattern match), `complete` (ORCHESTRATOR:DONE)
- `.claude/skills/orchestrate/scripts/poll-cycle.sh` — poll loop:
reads/updates state file atomically, outputs JSON action list, stuck
detection via output-hash sampling
**State detection:**
| State | Detection method |
|---|---|
| `idle` | `pane_current_command` is a shell (zsh/bash/fish) |
| `running` | `pane_current_command` is non-shell (claude/node) |
| `stuck` | pane hash unchanged for N consecutive polls |
| `waiting_approval` | pattern match on last 40 lines of pane output |
| `complete` | `ORCHESTRATOR:DONE` string present in pane output |
**Safety policy for auto-approvals:** git ops, package installs, tests,
docker compose → approve. `rm -rf` outside worktree, force push, `sudo`,
secrets → escalate to user.
State file lives at `~/.claude/orchestrator-state.json` (outside repo,
never committed).
### Checklist
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `classify-pane.sh`: idle shell → `idle`, running process →
`running`, `ORCHESTRATOR:DONE` → `complete`, approval prompt →
`waiting_approval`, nonexistent window → `error`
- [x] `poll-cycle.sh`: inactive state → `[]`, empty agents array → `[]`,
spare worktree discovery, stuck detection (3-poll hash cycle)
- [x] Real agent spawn in `autogpt1` tmux session — agent ran, output
`ORCHESTRATOR:DONE`, recycle verified
- [x] Upfront JSON validation before `set -e`-guarded jq reads
- [x] Idle timer reset only on `idle → running` transition (not stuck),
preventing false stuck-detections
- [x] Classify fallback only triggers when output is empty (no
double-JSON on classify exit 1)
- Validate required fields in parseGraphActions before emitting actions
(coderabbitai: reject malformed payloads instead of coercing to "")
- Gate chat seeding on isGraphLoaded to avoid seeding with empty graph
when panel is opened before graph finishes loading (coderabbitai)
- Deduplicate parsedActions in the hook to prevent duplicate React keys
when AI suggests the same action twice (sentry)
- Add tests for malformed action field validation
- Prevent infinite retry loop on session creation failure by tracking
sessionError state and bailing out on non-200 or thrown errors
- Remove nodes/edges from initialization effect deps (only fire once
when sessionId+transport become available)
- Show node display name instead of raw ID in action item labels
- Use stable content-based keys for action items instead of array index
Copilot chat sessions with long histories loaded all messages at once,
causing slow initial loads. This PR adds cursor-based pagination so only
the most recent messages load initially, with older messages fetched on
demand as the user scrolls up.
### Changes 🏗️
**Backend:**
- Cursor-based pagination on `GET /sessions/{session_id}` (`limit`,
`before_sequence` params)
- `user_id` relation filter on the paginated query — ownership check and
message fetch now run in parallel
- Backward boundary expansion to keep tool-call / assistant message
pairs intact at page edges
- Unit tests for paginated queries
**Frontend:**
- `useLoadMoreMessages` hook + `LoadMoreSentinel` (IntersectionObserver)
for infinite scroll upward
- `ScrollPreserver` to maintain scroll position when older messages are
prepended
- Session-keyed `Conversation` remount with one-frame opacity hide to
eliminate scroll flash on switch
- Scrollbar moved to the correct scroll container; loading spinner no
longer causes overflow
### Checklist 📋
- [x] Pagination: only recent messages load initially; older pages load
on scroll-up
- [x] Scroll position preserved on prepend; no flash on session switch
- [x] Tool-call boundary pairs stay intact across page edges
- [x] Stream reconnection still works on initial load
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
## Why
Stale feature flags add noise to the codebase and make it harder to
understand which flags are actually gating live features. Four flags
were defined but never referenced anywhere in the frontend, and the
"Share Execution Results" flag has been stable long enough to remove its
gate.
## What
- Remove 4 unused flags from the `Flag` enum and `defaultFlags`:
`NEW_BLOCK_MENU`, `GRAPH_SEARCH`, `ENABLE_ENHANCED_OUTPUT_HANDLING`,
`AGENT_FAVORITING`
- Remove the `SHARE_EXECUTION_RESULTS` flag and its conditional — the
`ShareRunButton` now always renders
## How
- Deleted enum entries and default values in `use-get-flag.ts`
- Removed the `useGetFlag` call and conditional wrapper around
`<ShareRunButton />` in `SelectedRunActions.tsx`
## Changes
- `src/services/feature-flags/use-get-flag.ts` — removed 5 flags from
enum + defaults
- `src/app/(platform)/library/.../SelectedRunActions.tsx` — removed flag
import, condition; share button always renders
### Checklist
- [x] My PR is small and focused on one change
- [x] I've tested my changes locally
- [x] `pnpm format && pnpm lint` pass
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Requested by @Abhi1992002
## Why
JSON output data in the "Complete Output Data" dialog and node output
panel gets clipped — text overflows and is hidden with no way to scroll
right. Reported by Zamil in #frontend.
## What
The `ContentRenderer` wrapper divs used `overflow-hidden` which
prevented the `JSONRenderer`'s `overflow-x-auto` from working. Changed
both wrapper divs from `overflow-hidden` to `overflow-x-auto`.
```diff
- overflow-hidden [&>*]:rounded-xlarge [&>*]:!text-xs [&_pre]:whitespace-pre-wrap [&_pre]:break-words
+ overflow-x-auto [&>*]:rounded-xlarge [&>*]:!text-xs [&_pre]:whitespace-pre-wrap [&_pre]:break-words
- overflow-hidden [&>*]:rounded-xlarge [&>*]:!text-xs
+ overflow-x-auto [&>*]:rounded-xlarge [&>*]:!text-xs
```
## Scope
- 1 file changed (`ContentRenderer.tsx`)
- 2 lines: `overflow-hidden` → `overflow-x-auto`
- CSS only, no logic changes
Resolves SECRT-2206
Co-authored-by: Abhimanyu Yadav <122007096+Abhi1992002@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
### Why / What / How
Copilot artifacts were not previewing reliably: PDFs downloaded instead
of rendering, Python code could still render like markdown, JSX/TSX
artifacts were brittle, HTML dashboards/charts could fail to execute,
and users had to manually open artifact panes after generation. The pane
also got stuck at maximized width when trying to drag it smaller.
This PR adds a dedicated copilot artifact panel and preview pipeline
across the backend/frontend boundary. It preserves artifact metadata
needed for classification, adds extension-first preview routing,
introduces dedicated preview/rendering paths for HTML/CSV/code/PDF/React
artifacts, auto-opens new or edited assistant artifacts, and fixes the
maximized-pane resize path so dragging exits maximized mode immediately.
### Changes 🏗️
- add artifact card and artifact panel UI in copilot, including
persisted panel state and resize/maximize/minimize behavior
- add shared artifact extraction/classification helpers and auto-open
behavior for new or edited assistant messages with artifacts
- add preview/rendering support for HTML, CSV, PDF, code, and React
artifact files
- fix code artifacts such as Python to render through the code renderer
with a dark code surface instead of markdown-style output
- improve JSX/TSX preview behavior with provider wrapping, fallback
export selection, and explicit runtime error surfaces
- allow script execution inside HTML previews so embedded chart
dashboards can render
- update workspace artifact/backend API handling and regenerate the
frontend OpenAPI client
- add regression coverage for artifact helpers, React preview runtime,
auto-open behavior, code rendering, and panel store behavior
- post-review hardening: correct download path for cross-origin URLs,
defer scroll restore until content mounts, gate auto-open behind the
ARTIFACTS flag, parse CSVs with RFC 4180-compliant quoted newlines + BOM
handling, distinguish 413 vs 409 on upload, normalize empty session_id,
and keep AnimatePresence mounted so the panel exit animation plays
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `pnpm format`
- [x] `pnpm lint`
- [x] `pnpm types`
- [x] `pnpm test:unit`
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Adds a new Copilot artifact preview surface that executes
user/AI-generated HTML/React in sandboxed iframes and changes workspace
file upload/listing behavior, so regressions could affect file handling
and client security assumptions despite sandboxing safeguards.
>
> **Overview**
> Adds an **Artifacts** feature (flagged by `Flag.ARTIFACTS`) to
Copilot: workspace file links/attachments now render as `ArtifactCard`s
and can open a new resizable/minimizable `ArtifactPanel` with history,
auto-open behavior, copy/download actions, and persisted panel width.
>
> Introduces a richer artifact preview pipeline with type classification
and dedicated renderers for **HTML**, **CSV**, **PDF**, **code
(Shiki-highlighted)**, and **React/TSX** (transpiled and executed in a
sandboxed iframe), plus safer download filename handling and content
caching/scroll restore.
>
> Extends the workspace backend API by adding `GET /workspace/files`
pagination, standardizing operation IDs in OpenAPI, attaching
`metadata.origin` on uploads/agent-created files, normalizing empty
`session_id`, improving upload error mapping (409 vs 413), and hardening
post-quota soft-delete error handling; updates and expands test coverage
accordingly.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
b732d10eca. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### Why / What / How
**Why:** The onboarding flow had inconsistent branding ("Autopilot" vs
"AutoPilot"), a heavy progress bar that dominated the header, an extra
click on the role screen, and no guidance on how many pain points to
select — leading to users selecting everything or nothing useful.
**What:** Copy & brand fixes, UX improvements (auto-advance, soft cap),
and visual polish (progress bar, checkmark badges, purple focus inputs).
**How:**
- Replaced all "Autopilot" with "AutoPilot" (capital P) across screens
1-3
- Removed the `?` tooltip on screen 1 (users will learn about AutoPilot
from the access email)
- Changed name label to conversational "What should I call you?"
- Screen 2: auto-advances 350ms after role selection (except "Other"
which still shows input + button)
- Screen 3: soft cap of 3 selections with green confirmation text and
shake animation on overflow attempt
- Thinned progress bar from ~10px to 3px (Linear/Notion style)
- Added purple checkmark badges on selected cards
- Updated Input atom focus state to purple ring
### Changes 🏗️
- **WelcomeStep**: "AutoPilot" branding, removed tooltip, conversational
label
- **RoleStep**: Updated subtitle, auto-advance on non-"Other" role
select, Continue button only for "Other"
- **PainPointsStep**: Soft cap of 3 with dynamic helper text and shake
animation
- **usePainPointsStep**: Added `atLimit`/`shaking` state, wrapped
`togglePainPoint` with cap logic
- **store.ts**: `togglePainPoint` returns early when at 3 and adding
- **ProgressBar**: 3px height, removed glow shadow
- **SelectableCard**: Added purple checkmark badge on selected state
- **Input atom**: Focus ring changed from zinc to purple
- **tailwind.config.ts**: Added `shake` keyframe and `animate-shake`
utility
### Checklist 📋
#### For code changes:
- [ ] I have clearly listed my changes in the PR description
- [ ] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Navigate through full onboarding flow (screens 1→2→3→4)
- [ ] Verify "AutoPilot" branding on all screens (no "Autopilot")
- [ ] Verify screen 2 auto-advances after tapping a role (non-"Other")
- [ ] Verify "Other" role still shows text input and Continue button
- [ ] Verify Back button works correctly from screen 2 and 3
- [ ] Select 3 pain points and verify green "3 selected" text
- [ ] Attempt 4th selection and verify shake animation + swap message
- [ ] Deselect one and verify can select a different one
- [ ] Verify checkmark badges appear on selected cards
- [ ] Verify progress bar is thin (3px) and subtle
- [ ] Verify input focus state is purple across onboarding inputs
- [ ] Verify "Something else" + other text input still works on screen 3
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Changes
### 1. Inline image enforcement (Step 7)
- Added `CRITICAL` warning: never post a bare directory tree link
- Added post-comment verification block that greps for `` inline
- Screenshot captions were too vague to be useful ("shows the page")
- No mechanism to catch incomplete test runs — agent could skip
scenarios and still post a passing report
## Checklist
- [x] `.claude/skills/pr-test/SKILL.md` updated
- [x] No production code changes — skill/dx only
- [x] Pre-commit hooks pass
### Why / What / How
**Why:** Some AI-category blocks do not expose a `"model"` input
property in their `inputSchema`. The `fix_ai_model_parameter` fixer was
unconditionally injecting a default model value (e.g. `"gpt-4o"`) into
any node whose block has category `"AI"`, regardless of whether that
block actually accepts a `model` input. This causes the agent JSON to
include an invalid field for those blocks.
**What:** Guard the model-injection logic with a check that `"model"`
exists in the block's `inputSchema.properties` before attempting to set
or validate the field. AI blocks that have no model selector are now
skipped entirely.
**How:** In `fix_ai_model_parameter`, after confirming `is_ai_block`,
extract `input_properties` from the block's `inputSchema.properties` and
`continue` if `"model"` is absent. The subsequent `model_schema` lookup
is also simplified to reuse the already-fetched `input_properties` dict.
A regression test is added to cover this case.
### Changes 🏗️
- `backend/copilot/tools/agent_generator/fixer.py`: In
`fix_ai_model_parameter`, skip AI-category nodes whose block
`inputSchema.properties` does not contain a `"model"` key; reuse
`input_properties` for the subsequent `model_schema` lookup.
- `backend/copilot/tools/agent_generator/fixer_test.py`: Add
`test_ai_block_without_model_property_is_skipped` to
`TestFixAiModelParameter`.
### Checklist 📋
#### For code changes:
- [ ] I have clearly listed my changes in the PR description
- [ ] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Run `poetry run pytest
backend/copilot/tools/agent_generator/fixer_test.py` — all 50 tests pass
(49 pre-existing + 1 new)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- Establish React integration tests (Vitest + RTL + MSW) as the primary
frontend testing strategy (~90% of tests)
- Update all contributor documentation (TESTING.md, CONTRIBUTING.md,
AGENTS.md) to reflect the integration-first convention
- Add `NuqsTestingAdapter` and `TooltipProvider` to the shared test
wrapper so page-level tests work out of the box
- Write 8 integration tests for the library page as a reference example
for the pattern
## Why
We had the testing infrastructure (Vitest, RTL, MSW, Orval-generated
handlers) but no established convention for page-level integration
tests. Most existing tests were for stores or small components. Since
our frontend is client-first, we need a documented, repeatable pattern
for testing full pages with mocked APIs.
## What
- **Docs**: Rewrote `TESTING.md` as a comprehensive guide. Updated
testing sections in `CONTRIBUTING.md`, `frontend/AGENTS.md`,
`platform/AGENTS.md`, and `autogpt_platform/AGENTS.md`
- **Test infra**: Added `NuqsTestingAdapter` (for `nuqs` query state
hooks) and `TooltipProvider` (for Radix tooltips) to `test-utils.tsx`
- **Reference tests**: `library/__tests__/main.test.tsx` with 8 tests
covering agent rendering, tabs, folders, search bar, and Jump Back In
## How
- Convention: tests live in `__tests__/` next to `page.tsx`, named
descriptively (`main.test.tsx`, `search.test.tsx`)
- Pattern: `setupHandlers()` → `render(<Page />)` → `findBy*` assertions
- MSW handlers from
`@/app/api/__generated__/endpoints/{tag}/{tag}.msw.ts` for API mocking
- Custom `render()` from `@/tests/integrations/test-utils` wraps all
required providers
## Test plan
- [x] All 422 unit/integration tests pass (`pnpm test:unit`)
- [x] `pnpm format` clean
- [x] `pnpm lint` clean (no new errors)
- [x] `pnpm types` — pre-existing onboarding type errors only, no new
errors
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
Co-authored-by: Reinier van der Leer <pwuts@agpt.co>
### Why / What / How
Users need a way to choose between fast, cheap responses (Sonnet) and
deep reasoning (Opus) in the copilot. Previously only the SDK/Opus path
existed, and the baseline path was a degraded fallback with no tool
calling, no file attachments, no E2B sandbox, and no permission
enforcement.
This PR adds a copilot mode toggle and brings the baseline (fast) path
to full feature parity with the SDK (extended thinking) path.
### Changes 🏗️
#### 1. Mode toggle (UI → full stack)
- Add Fast / Thinking mode toggle to ChatInput footer (Phosphor
`Brain`/`Zap` icons via lucide-react)
- Thread `mode: "fast" | "extended_thinking" | null` from
`StreamChatRequest` → RabbitMQ queue → executor → service selection
- Fast → baseline service (Sonnet 4 via OpenRouter), Thinking → SDK
service (Opus 4.6)
- Toggle gated behind `CHAT_MODE_OPTION` feature flag with server-side
enforcement
- Mode persists in localStorage with SSR-safe init
#### 2. Baseline service full tool parity
- **Tool call persistence**: Store structured `ChatMessage` entries
(assistant + tool results) instead of flat concatenated text — enables
frontend to render tool call details and maintain context across turns
- **E2B sandbox**: Wire up `get_or_create_sandbox()` so `bash_exec`
routes to E2B (image download, Python/PIL compression, filesystem
access)
- **File attachments**: Accept `file_ids`, download workspace files,
embed images as OpenAI vision blocks, save non-images to working dir
- **Permissions**: Filter tool list via `CopilotPermissions`
(whitelist/blacklist)
- **URL context**: Pass `context` dict to user message for URL-shared
content
- **Execution context**: Pass `sandbox`, `sdk_cwd`, `permissions` to
`set_execution_context()`
- **Model**: Changed `fast_model` from `google/gemini-2.5-flash` to
`anthropic/claude-sonnet-4` for reliable function calling
- **Temp dir cleanup**: Lazy `mkdtemp` (only when files attached) +
`shutil.rmtree` in finally
#### 3. Transcript support for Fast mode
- Baseline service now downloads / validates / loads / appends / uploads
transcripts (parity with SDK)
- Enables seamless mode switching mid-conversation via shared transcript
- Upload shielded from cancellation, bounded at 5s timeout
#### 4. Feature-flag infrastructure fixes
- `FORCE_FLAG_*` env-var overrides on both backend and frontend for
local dev / E2E
- LaunchDarkly context parity (frontend mirrors backend user context)
- `CHAT_MODE_OPTION` default flipped to `false` to match backend
#### 5. Other hardening
- Double-submit ref guard in `useChatInput` + reconnect dedup in
`useCopilotStream`
- `copilotModeRef` pattern to read latest mode without recreating
transport
- Shared `CopilotMode` type across frontend files
- File name collision handling with numeric suffix
- Path sanitization in file description hints (`os.path.basename`)
### Test plan
- [x] 30 new unit tests: `_env_flag_override` (12), `envFlagOverride`
(8), `_filter_tools_by_permissions` (4), `_prepare_baseline_attachments`
(6)
- [x] E2E tested on dev: fast mode creates E2B sandbox, calls 7-10
tools, generates and renders images
- [x] Mode switching mid-session works (shared transcript + session
messages)
- [x] Server-side flag gate enforced (crafted `mode=fast` stripped when
flag off)
- [x] All 37 CI checks green
- [x] Verified via agent-browser: workspace images render correctly in
all message positions
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Zamil Majdy <majdy.zamil@gmail.com>
## Why
#12604 (intermediate persistence) introduced two bugs on dev:
1. **Duplicate user messages** — `set_turn_duration` calls
`invalidate_session_cache()` which deletes the Redis key. Concurrent
`get_chat_session()` calls re-populate it from DB with stale data. The
executor loads this stale cache, misses the user message, and re-appends
it.
2. **Tool outputs lost on hydration** — Intermediate flushes save
assistant messages to DB before `StreamToolInputAvailable` sets
`tool_calls` on them. Since `_save_session_to_db` is append-only (uses
`start_sequence`), the `tool_calls` update is lost — subsequent flushes
start past that index. On page refresh / SSE reconnect, tool UIs
(SetupRequirementsCard, run_block output, etc.) are invisible.
3. **Sessions stuck running** — If a tool call hangs (e.g. WebSearch
provider not responding), the stream never completes,
`mark_session_completed` never runs, and the `active_stream` flag stays
stale in Redis.
## What
- **In-place cache update** in `set_turn_duration` — replaces
`invalidate_session_cache()` with a read-modify-write that patches the
duration on the cached session, eliminating the stale-cache repopulation
window
- **tool_calls backfill** — tracks the flush watermark and assistant
message index; when `StreamToolInputAvailable` sets `tool_calls` on an
already-flushed assistant, updates the DB record directly via
`update_message_tool_calls()`
- **Improved message dedup** — `is_message_duplicate()` /
`maybe_append_user_message()` scans trailing same-role messages (current
turn) instead of only checking `messages[-1]`
- **Idle timeout** — aborts the stream with a retryable error if no
meaningful SDK message arrives for 10 minutes, preventing hung tool
calls from leaving sessions stuck
## Changes
- `copilot/db.py` — `update_message_tool_calls()`, in-place cache update
in `set_turn_duration`
- `copilot/model.py` — `is_message_duplicate()`,
`maybe_append_user_message()`
- `copilot/sdk/service.py` — flush watermark tracking, tool_calls
backfill, idle timeout
- `copilot/baseline/service.py` — use `maybe_append_user_message()`
- `copilot/model_test.py` — unit tests for dedup
- `copilot/db_test.py` — unit tests for set_turn_duration cache update
## Checklist
- [x] My PR title follows [conventional
commit](https://www.conventionalcommits.org/) format
- [x] Out-of-scope changes are less than 20% of the PR
- [x] Changes to `data/*.py` validated for user ID checks (N/A)
- [x] Protected routes updated in middleware (N/A)
### Why / What / How
**Why:** We had no local pre-commit protection against accidentally
committing secrets. The existing `detect-secrets` hook only ran on
`pre-push`, which is too late — secrets are already in git history by
that point. GitHub's push protection only covers known provider patterns
and runs server-side.
**What:** Adds a 3-layer defense against secret leaks: local pre-commit
hooks (gitleaks + detect-secrets), and a CI workflow as a safety net.
**How:**
- Moved `detect-secrets` from `pre-push` to `pre-commit` stage
- Added `gitleaks` as a second pre-commit hook (Go binary, faster and
more comprehensive rule set)
- Added `.gitleaks.toml` config with allowlists for known false
positives (test fixtures, dev docker JWTs, Firebase public keys, lock
files, docs examples)
- Added `repo-secret-scan.yml` CI workflow using `gitleaks-action` on
PRs/pushes to master/dev
### Changes 🏗️
- `.pre-commit-config.yaml`: Moved `detect-secrets` to pre-commit stage,
added baseline arg, added `gitleaks` hook
- `.gitleaks.toml`: New config with tuned allowlists for this repo's
false positives
- `.secrets.baseline`: Empty baseline for detect-secrets to track known
findings
- `.github/workflows/repo-secret-scan.yml`: New CI workflow running
gitleaks on every PR and push
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Ran `gitleaks detect --no-git` against the full repo — only `.env`
files (gitignored) remain as findings
- [x] Verified gitleaks catches a test secret file correctly
- [x] Pre-commit hooks pass on commit (both detect-secrets and gitleaks
passed)
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
## Summary
- Adds a four-tier subscription system (FREE/PRO/BUSINESS/ENTERPRISE)
for CoPilot with configurable multipliers (1x/5x/20x/60x) applied on top
of the base LaunchDarkly/config limits
- Stores user tier in the database (`User.subscriptionTier` column as a
Prisma enum, defaults to PRO for beta testing) with admin API endpoints
for tier management
- Includes tier info in usage status responses and OTEL/Langfuse trace
metadata for observability
## Tier Structure
| Tier | Multiplier | Daily Tokens | Weekly Tokens | Notes |
|------|-----------|-------------|--------------|-------|
| FREE | 1x | 2.5M | 12.5M | Base tier (unused during beta) |
| PRO | 5x | 12.5M | 62.5M | Default on sign-up (beta) |
| BUSINESS | 20x | 50M | 250M | Manual upgrade for select users |
| ENTERPRISE | 60x | 150M | 750M | Highest tier, custom |
## Changes
- **`rate_limit.py`**: `SubscriptionTier` enum
(FREE/PRO/BUSINESS/ENTERPRISE), `TIER_MULTIPLIERS`, `get_user_tier()`,
`set_user_tier()`, update `get_global_rate_limits()` to apply tier
multiplier and return 3-tuple, add `tier` field to `CoPilotUsageStatus`
- **`rate_limit_admin_routes.py`**: Add `GET/POST
/admin/rate_limit/tier` endpoints, include `tier` in
`UserRateLimitResponse`
- **`routes.py`** (chat): Include tier in `/usage` endpoint response
- **`sdk/service.py`**: Send `subscription_tier` in OTEL/Langfuse trace
metadata
- **`schema.prisma`**: Add `SubscriptionTier` enum and
`subscriptionTier` column to `User` model (default: PRO)
- **`config.py`**: Update docs to reflect tier system
- **Migration**: `20260326200000_add_rate_limit_tier` — creates enum,
migrates STANDARD→PRO, adds BUSINESS, sets default to PRO
## Test plan
- [x] 72 unit tests all passing (43 rate_limit + 11 admin routes + 18
chat routes)
- [ ] Verify FREE tier users get base limits (2.5M daily, 12.5M weekly)
- [ ] Verify PRO tier users get 5x limits (12.5M daily, 62.5M weekly)
- [ ] Verify BUSINESS tier users get 20x limits (50M daily, 250M weekly)
- [ ] Verify ENTERPRISE tier users get 60x limits (150M daily, 750M
weekly)
- [ ] Verify admin can read and set user tiers via API
- [ ] Verify tier info appears in Langfuse traces
- [ ] Verify migration applies cleanly (creates enum, migrates STANDARD
users to PRO, adds BUSINESS, default PRO)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
### Why / What / How
**Why:** Frontend test coverage is still ramping up. The default
component status checks (project + patch at 80%) would block merges for
insufficient coverage on frontend changes, which isn't practical yet.
**What:** Override the platform-frontend component's coverage statuses
to be `informational: true`, so they report but don't block merges.
**How:** Added explicit `statuses` to the `platform-frontend` component
in `codecov.yml` with `informational: true` on both project and patch
checks, overriding the `default_rules`.
### Changes 🏗️
- **`codecov.yml`**: Added `informational: true` to platform-frontend
component's project and patch status checks
### Checklist 📋
#### For code changes:
- [ ] I have clearly listed my changes in the PR description
- [ ] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Verify Codecov frontend status checks show as informational
(non-blocking) on PRs touching frontend code
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Low Risk**
> Low risk: Codecov configuration-only change that affects merge gating
for frontend coverage statuses but does not alter runtime code.
>
> **Overview**
> Updates `codecov.yml` to override the `platform-frontend` component’s
coverage `statuses` so both **project** and **patch** checks are marked
`informational: true` (non-blocking), while leaving the default
component coverage rules unchanged for other components.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
f8e8426a31. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AutoPilot (copilot) notifications had several follow-up issues after
initial implementation: old "Otto" branding, UX quirks, a service-worker
crash, notification state that didn't persist or sync across tabs, a
broken notification sound, and noisy Sentry alerts from SSR.
### Changes 🏗️
- **Rename "Otto" → "AutoPilot"** in all notification surfaces: browser
notifications, document title badge, permission dialog copy, and
notification banner copy
- **Agent Activity icon**: changed from `Bell` to `Pulse` (Phosphor) in
the navbar dropdown
- **Centered dialog buttons**: the "Stay in the loop" permission dialog
buttons are now centered instead of right-aligned
- **Service worker notification fix**: wrapped `new Notification()` in
try-catch so it degrades gracefully in service worker / PWA contexts
instead of throwing `TypeError: Illegal constructor`
- **Persist notification state**: `completedSessionIDs` is now stored in
localStorage (`copilot-completed-sessions`) so it survives page
refreshes and new tabs
- **Cross-tab sync**: a `storage` event listener keeps
`completedSessionIDs` and `document.title` in sync across all open tabs
— clearing a notification in one tab clears it everywhere
- **Fix notification sound**: corrected the sound file path from
`/sounds/notification.mp3` to `/notification.mp3` and added a
`.gitignore` exception (root `.gitignore` has a blanket `*.mp3` ignore
rule from legacy AutoGPT agent days)
- **Fix SSR Sentry noise**: guarded the Copilot Zustand store
initialization with a client-side check so `storage.get()` is never
called during SSR, eliminating spurious Sentry alerts (BUILDER-7CB, 7CC,
7C7) while keeping the Sentry reporting in `local-storage.ts` intact for
genuinely unexpected SSR access
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verify "AutoPilot" appears (not "Otto") in browser notification,
document title, permission dialog, and banner
- [x] Verify Pulse icon in navbar Agent Activity dropdown
- [x] Verify "Stay in the loop" dialog buttons are centered
- [x] Open two tabs on copilot → trigger completion → both tabs show
badge/checkmark
- [x] Click completed session in tab 1 → badge clears in both tabs
- [x] Refresh a tab → completed session state is preserved
- [x] Verify notification sound plays on completion
- [x] Verify no Sentry alerts from SSR localStorage access
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Why
The copilot's `edit_agent` tool requires the LLM to provide a complete
agent JSON (all nodes + links), but the LLM had **no way to see the
current graph structure** before editing. It was editing blindly —
guessing/hallucinating the entire node+link structure and replacing the
graph wholesale.
## What
- Add `include_graph` boolean parameter (default `false`) to the
existing `find_library_agent` tool
- When `true`, each returned `AgentInfo` includes a `graph` field with
the full graph JSON (nodes, links, `input_default` values)
- Update the agent generation guide to instruct the LLM to always fetch
the current graph before editing
## How
- Added `graph: dict[str, Any] | None` field to `AgentInfo` model
- Added `_enrich_agents_with_graph()` helper in `agent_search.py` that
calls the existing `get_agent_as_json()` utility to fetch full graph
data
- Threaded `include_graph` parameter through `find_library_agent` →
`search_agents` → `_search_library`
- Updated `agent_generation_guide.md` to add an "if editing" step that
fetches the graph first
No new tools introduced — reuses existing `find_library_agent` with one
optional flag.
## Test plan
- [x] Unit tests: 2 new tests added
(`test_include_graph_fetches_nodes_and_links`,
`test_include_graph_false_does_not_fetch`)
- [x] All 7 `agent_search_test.py` tests pass
- [x] All pre-commit hooks pass (lint, format, typecheck)
- [ ] Verify copilot correctly uses `include_graph=true` before editing
an agent (manual test)
## Why
Dry-run block simulation is failing in production with `404 - model
gemini-2.5-flash does not exist`. The simulator's default model
(`google/gemini-2.5-flash`) is a non-OpenAI model that requires
OpenRouter routing, but the shared `get_openai_client()` prefers the
direct OpenAI key, creating a client that can't handle non-OpenAI
models. The old code also stripped the provider prefix, sending
`gemini-2.5-flash` to OpenAI's API.
## What
- Added `prefer_openrouter` keyword parameter to `get_openai_client()` —
when True, prefers the OpenRouter key (returns None if unavailable,
rather than falling back to an incompatible direct OpenAI client)
- Simulator now calls `get_openai_client(prefer_openrouter=True)` so
`google/gemini-2.5-flash` routes correctly through OpenRouter
- Removed the redundant `SIMULATION_MODEL` env var override and the
now-unnecessary provider prefix stripping from `_simulator_model()`
## How
`get_openai_client()` is decorated with `@cached(ttl_seconds=3600)`
which keys by args, so `get_openai_client()` and
`get_openai_client(prefer_openrouter=True)` are cached independently.
When `prefer_openrouter=True` and no OpenRouter key exists, returns
`None` instead of falling back — the simulator already handles `None`
with a clear error message.
### Checklist
- [x] All 24 dry-run tests pass
- [x] Test asserts `get_openai_client` is called with
`prefer_openrouter=True`
- [x] Format, lint, and pyright pass
- [x] No changes to user-facing APIs
- [ ] Deploy to staging and verify simulation works
---------
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
### Why / What / How
<img width="800" height="827" alt="Screenshot 2026-04-02 at 15 40 24"
src="https://github.com/user-attachments/assets/69a381c1-2884-434b-9406-4a3f7eec87cf"
/>
<img width="800" height="825" alt="Screenshot 2026-04-02 at 15 40 41"
src="https://github.com/user-attachments/assets/c6191a68-a8ba-482b-ba47-c06c71d69f0c"
/>
<img width="800" height="825" alt="Screenshot 2026-04-02 at 15 40 48"
src="https://github.com/user-attachments/assets/31b632b9-59cb-4bf7-a6a0-6158846fcf9a"
/>
<img width="800" height="812" alt="Screenshot 2026-04-02 at 15 40 54"
src="https://github.com/user-attachments/assets/64e38a15-2e56-4c0e-bd84-987bf6076bf7"
/>
**Why:** The existing onboarding flow was outdated and didn't align with
the new Autopilot-first experience. New users need a streamlined,
visually polished wizard that collects their role and pain points to
personalize Autopilot suggestions.
**What:** Complete redesign of the onboarding wizard as a 4-step flow:
Welcome → Role selection → Pain points → Preparing workspace. Uses the
design system throughout (atoms/molecules), adds animations, and syncs
steps with URL search params.
**How:**
- Zustand store manages wizard state (name, role, pain points, current
step)
- Steps synced to `?step=N` URL params for browser navigation support
- Pain points reordered based on selected role (e.g. Sales sees "Finding
leads" first)
- Design system components used exclusively (no raw shadcn `ui/`
imports)
- New reusable components: `FadeIn` (atom), `TypingText` (molecule) with
Storybook stories
- `AutoGPTLogo` made sizeable via Tailwind className prop, migrated in
Navbar
- Fixed `SetupAnalytics` crash (client component was rendered inside
`<head>`)
### Changes 🏗️
- **New onboarding wizard** (`steps/WelcomeStep`, `RoleStep`,
`PainPointsStep`, `PreparingStep`)
- **New shared components**: `ProgressBar`, `StepIndicator`,
`SelectableCard`, `CardCarousel`
- **New design system components**: `FadeIn` atom with stories,
`TypingText` molecule with stories
- **`AutoGPTLogo`** — size now controlled via `className` prop instead
of numeric `size`
- **Navbar** — migrated from legacy `IconAutoGPTLogo` to design system
`AutoGPTLogo`
- **Layout fix** — moved `SetupAnalytics` from `<head>` to `<body>` to
fix React hydration crash
- **Role-based pain point ordering** — top picks surfaced first based on
role selection
- **URL-synced steps** — `?step=N` search params for back/forward
navigation
- Removed old onboarding pages (1-welcome through 6-congrats, reset
page)
- Emoji/image assets for role selection cards
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Complete onboarding flow from step 1 through 4 as a new user
- [x] Verify back button navigates to previous step
- [x] Verify progress bar advances correctly (hidden on step 4)
- [x] Verify step indicator dots show for steps 1-3
- [x] Verify role selection reorders pain points on next step
- [x] Verify "Other" role/pain point shows text input
- [x] Verify typing animation on PreparingStep title
- [x] Verify fade-in animations on all steps
- [x] Verify URL updates with `?step=N` on navigation
- [x] Verify browser back/forward works with step URLs
- [x] Verify mobile horizontal scroll on card grids
- [x] Verify `pnpm types` passes cleanly
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
This PR modernizes AutoGPT Classic to make it more useful for day-to-day
autonomous agent development. Major changes include consolidating the
project structure, adding new prompt strategies, modernizing the
benchmark system, and improving the development experience.
**Note: AutoGPT Classic is an experimental, unsupported project
preserved for educational/historical purposes. Dependencies will not be
actively updated.**
## Changes 🏗️
### Project Structure & Build System
- **Consolidated Poetry projects** - Merged `forge/`,
`original_autogpt/`, and benchmark packages into a single
`pyproject.toml` at `classic/` root
- **Removed old benchmark infrastructure** - Deleted the complex
`agbenchmark` package (3000+ lines) in favor of the new
`direct_benchmark` harness
- **Removed frontend** - Deleted `benchmark/frontend/` React app (no
longer needed)
- **Cleaned up CI workflows** - Simplified GitHub Actions workflows for
the consolidated project structure
- **Added CLAUDE.md** - Documentation for working with the codebase
using Claude Code
### New Direct Benchmark System
- **`direct_benchmark` harness** - New streamlined benchmark runner
with:
- Rich TUI with multi-panel layout showing parallel test execution
- Incremental resume and selective reset capabilities
- CI mode for non-interactive environments
- Step-level logging with colored prefixes
- "Would have passed" tracking for timed-out challenges
- Copy-paste completion blocks for sharing results
### Multiple Prompt Strategies
Added pluggable prompt strategy system supporting:
- **one_shot** - Single-prompt completion
- **plan_execute** - Plan first, then execute steps
- **rewoo** - Reasoning without observation (deferred tool execution)
- **react** - Reason + Act iterative loop
- **lats** - Language Agent Tree Search (MCTS-based exploration)
- **sub_agent** - Multi-agent delegation architecture
- **debate** - Multi-agent debate for consensus
### LLM Provider Improvements
- Added support for modern **Anthropic Claude models**
(claude-3.5-sonnet, claude-3-haiku, etc.)
- Added **Groq** provider support
- Improved tool call error feedback for LLM self-correction
- Fixed deprecated API usage
### Web Components
- **Replaced Selenium with Playwright** for web browsing (better async
support, faster)
- Added **lightweight web fetch component** for simple URL fetching
- **Modernized web search** with tiered provider system (Tavily, Serper,
Google)
### Agent Capabilities
- **Workspace permissions system** - Pattern-based allow/deny lists for
agent commands
- **Rich interactive selector** for command approval with scopes
(once/agent/workspace/deny)
- **TodoComponent** with LLM-powered task decomposition
- **Platform blocks integration** - Connect to AutoGPT Platform API for
additional blocks
- **Sub-agent architecture** - Agents can spawn and coordinate
sub-agents
### Developer Experience
- **Python 3.12+ support** with CI testing on 3.12, 3.13, 3.14
- **Current working directory as default workspace** - Run `autogpt`
from any project directory
- Simplified log format (removed timestamps)
- Improved configuration and setup flow
- External benchmark adapters for GAIA, SWE-bench, and AgentBench
### Bug Fixes
- Fixed N/A command loop when using native tool calling
- Fixed auto-advance plan steps in Plan-Execute strategy
- Fixed approve+feedback to execute command then send feedback
- Fixed parallel tool calls in action history
- Always recreate Docker containers for code execution
- Various pyright type errors resolved
- Linting and formatting issues fixed across codebase
## Test Plan
- [x] CI lint, type, and test checks pass
- [x] Run `poetry install` from `classic/` directory
- [x] Run `poetry run autogpt` and verify CLI starts
- [x] Run `poetry run direct-benchmark run --tests ReadFile` to verify
benchmark works
## Notes
- This is a WIP PR for personal use improvements
- The project is marked as **unsupported** - no active maintenance
planned
- Contains known vulnerabilities in dependencies (intentionally not
updated)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> CI/build workflows are substantially reworked (runner matrix removal,
path/layout changes, new benchmark runner), so breakage is most likely
in automation and packaging rather than runtime behavior.
>
> **Overview**
> **Modernizes the `classic/` project layout and automation around a
single consolidated Poetry project** (root
`classic/pyproject.toml`/`poetry.lock`) and updates docs
(`classic/README.md`, new `classic/CLAUDE.md`) accordingly.
>
> **Replaces the old `agbenchmark` CI usage with `direct-benchmark` in
GitHub Actions**, including new/updated benchmark smoke and regression
workflows, standardized `working-directory: classic`, and a move to
**Python 3.12** on Ubuntu-only runners (plus updated caching, coverage
flags, and required `ANTHROPIC_API_KEY` wiring).
>
> Cleans up repo/dev tooling by removing the classic frontend workflow,
deleting the Forge VCR cassette submodule (`.gitmodules`) and associated
CI steps, consolidating `flake8`/`isort`/`pyright` pre-commit hooks to
run from `classic/`, updating ignores for new report/workspace
artifacts, and updating `classic/Dockerfile.autogpt` to build from
Python 3.12 with the consolidated project structure.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
de67834dac. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
## Summary
- Add a read-only SQL query block for CoPilot/AutoPilot analytics access
- Supports **multiple databases**: PostgreSQL, MySQL, SQLite, MSSQL via
SQLAlchemy
- Enforces read-only queries (SELECT only) with defense-in-depth SQL
validation using sqlparse
- SSRF protection: blocks connections to private/internal IPs
- Credentials stored securely via the platform credential system
## Changes
- New `SQLQueryBlock` in `backend/blocks/sql_query_block.py` with
`DatabaseType` enum
- SQLAlchemy-based execution with dialect-specific read-only and timeout
settings
- Connection URL validation ensuring driver matches selected database
type
- Comprehensive test suite (62 tests) including URL validation,
sanitization, serialization
- Documentation in `docs/integrations/block-integrations/data.md`
- Added `DATABASE` provider to `ProviderName` enum
### Checklist 📋
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan
#### Test plan:
- [x] Unit tests pass for query validation, URL validation, error
sanitization, value serialization
- [x] Read-only enforcement rejects INSERT/UPDATE/DELETE/DROP
- [x] Multi-statement injection blocked
- [x] SSRF protection blocks private IPs
- [x] Connection URL driver validation works for all 4 database types
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- **OrchestratorBlock & AgentExecutorBlock** now execute for real in
dry-run mode so the orchestrator can make LLM calls and agent executors
can spawn child graphs. Their downstream tool blocks and child-graph
blocks are still simulated via `simulate_block()`. Credential fields
from node defaults are restored since `validate_exec()` wipes them in
dry-run mode. Agent-mode iterations capped at 1 in dry-run.
- **All blocks** (including MCPToolBlock) are simulated via a single
generic `simulate_block()` path. The LLM prompt is grounded by
`inspect.getsource(block.run)`, giving the simulator access to the exact
implementation of each block's `run()` method. This produces realistic
mock responses for any block type without needing block-specific
simulation logic.
- Updated agent generation guide to document special block dry-run
behavior.
- Minor frontend fixes: exported `formatCents` from
`RateLimitResetDialog` for reuse in `UsagePanelContent`, used `useRef`
for stable callback references in `useResetRateLimit` to avoid stale
closures.
- 74 tests (21 existing dry-run + 53 new simulator tests covering prompt
building, passthrough logic, and special block dry-run).
## Design
The simulator (`backend/executor/simulator.py`) uses a two-tier
approach:
1. **Passthrough blocks** (OrchestratorBlock, AgentExecutorBlock):
`prepare_dry_run()` returns modified input_data so these blocks execute
for real in `manager.py`. OrchestratorBlock gets `max_iterations=1`
(agent mode) or 0 (traditional mode). AgentExecutorBlock spawns real
child graph executions whose blocks inherit `dry_run=True`.
2. **All other blocks**: `simulate_block()` builds an LLM prompt
containing:
- Block name and description
- Input/output schemas (JSON Schema)
- The block's `run()` source code via `inspect.getsource(block.run)`
- The actual input values (with credentials stripped and long values
truncated)
The LLM then role-plays the block's execution, producing realistic
outputs grounded in the actual implementation.
Special handling for input/output blocks: `AgentInputBlock` and
`AgentOutputBlock` are pure passthrough (no LLM call needed).
## Test plan
- [x] All 74 tests pass (`pytest backend/copilot/tools/test_dry_run.py
backend/executor/simulator_test.py`)
- [x] Pre-commit hooks pass (ruff, isort, black, pyright, frontend
typecheck)
- [x] CI: all checks green
- [x] E2E: dry-run execution completes with `is_dry_run=true`, cost=0,
no errors
- [x] E2E: normal (non-dry-run) execution unchanged
- [x] E2E: Create agent with OrchestratorBlock + tool blocks, run with
`dry_run=True`, verify orchestrator makes real LLM calls while tool
blocks are simulated
- [x] E2E: AgentExecutorBlock spawns child graph in dry-run, child
blocks are LLM-simulated
- [x] E2E: Builder simulate button works end-to-end with special blocks
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### Why / What / How
**Why:** The Claude Agent SDK CLI renamed the sub-agent tool from
`"Task"` to `"Agent"` in v2.x. Our security hooks only checked for
`"Task"`, so all sub-agent security controls were silently bypassed on
production: concurrency limiting didn't apply, and slot tracking was
broken. This was discovered via Langfuse trace analysis of session
`62b1b2b9` where background sub-agents ran unchecked.
Additionally, the CLI writes sub-agent output to `/tmp/claude-<uid>/`
and project state to `$HOME/.claude/` — both outside the per-session
workspace (`/tmp/copilot-<session>/`). This caused `PermissionError` in
E2B sandboxes and silently lost sub-agent results.
The frontend also had no rendering for the `Agent` / `TaskOutput` SDK
built-in tools — they fell through to the generic "other" category with
no context-aware display.
**What:**
1. Fix the sub-agent tool name recognition (`"Task"` → `{"Task",
"Agent"}`)
2. Allow `run_in_background` — the SDK handles async lifecycle cleanly
(returns `isAsync:true`, model polls via `TaskOutput`)
3. Route CLI state into the workspace via `CLAUDE_CODE_TMPDIR` and
`HOME` env vars
4. Add lifecycle hooks (`SubagentStart`/`SubagentStop`) for
observability
5. Add frontend `"agent"` tool category with proper UI rendering
**How:**
- Security hooks check `tool_name in _SUBAGENT_TOOLS` (frozenset of
`"Task"` and `"Agent"`)
- Background agents are allowed but still count against `max_subtasks`
concurrency limit
- Frontend detects `isAsync: true` output → shows "Agent started
(background)" not "Agent completed"
- `TaskOutput` tool shows retrieval status and collected results
- Robot icon and agent-specific accordion rendering for both foreground
and background agents
### Changes 🏗️
**Backend:**
- **`security_hooks.py`**: Replace `tool_name == "Task"` with `tool_name
in _SUBAGENT_TOOLS`. Remove `run_in_background` deny block (SDK handles
async lifecycle). Add `SubagentStart`/`SubagentStop` hooks.
- **`tool_adapter.py`**: Add `"Agent"` to `_SDK_BUILTIN_ALWAYS` list
alongside `"Task"`.
- **`service.py`**: Set `CLAUDE_CODE_TMPDIR=sdk_cwd` and `HOME=sdk_cwd`
in SDK subprocess env.
- **`security_hooks_test.py`**: Update background tests (allowed, not
blocked). Add test for background agents counting against concurrency
limit.
**Frontend:**
- **`GenericTool/helpers.ts`**: Add `"agent"` tool category for `Agent`,
`Task`, `TaskOutput`. Agent-specific animation text detecting `isAsync`
output. Input summaries from description/prompt fields.
- **`GenericTool/GenericTool.tsx`**: Add `RobotIcon` for agent category.
Add `getAgentAccordionData()` with async-aware title/content.
`TaskOutput` shows retrieval status.
- **`useChatSession.ts`**: Fix pre-existing TS error (void mutation
body).
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] All security hooks tests pass (background allowed + limit
enforced)
- [x] Pre-commit hooks (ruff, black, isort, pyright, tsc) all pass
- [x] E2E test: copilot agent create+run scenario PASS
- [ ] Deploy to dev and test copilot sub-agent spawning with background
mode
#### For configuration changes:
- [x] `.env.default` is updated or already compatible
- [x] `docker-compose.yml` is updated or already compatible
### Why / What / How
**Why:** When copilot tools return large outputs (e.g. 3MB+ base64
images from API calls), the agent cannot process them in the E2B
sandbox. Three compounding issues prevent seamless file access:
1. The `<tool-output-truncated path="...">` tag uses a bare `path=`
attribute that the model confuses with a local filesystem path (it's
actually a workspace path)
2. `is_allowed_local_path` rejects `tool-outputs/` directories (only
`tool-results/` was allowed)
3. SDK-internal files read via the `Read` tool are not available in the
E2B sandbox for `bash_exec` processing
**What:** Fixes all three issues so that large tool outputs can be
seamlessly read and processed in both host and E2B contexts.
**How:**
- Changed `path=` → `workspace_path=` in the truncation tag to
disambiguate workspace vs filesystem paths
- Added `save_to_path` guidance in the retrieval instructions for E2B
users
- Extended `is_allowed_local_path` to accept both `tool-results` and
`tool-outputs` directories
- Added automatic bridging: when E2B is active and `Read` accesses an
SDK-internal file, the file is automatically copied to `/tmp/<filename>`
in the sandbox
- Updated system prompting to explain both SDK tool-result bridging and
workspace `<tool-output-truncated>` handling
### Changes 🏗️
- **`tools/base.py`**: `_persist_and_summarize` now uses
`workspace_path=` attribute and includes `save_to_path` example for E2B
processing
- **`context.py`**: `is_allowed_local_path` accepts both `tool-results`
and `tool-outputs` directory names
- **`sdk/e2b_file_tools.py`**: `_handle_read_file` bridges SDK-internal
files to `/tmp/` in E2B sandbox; new `_bridge_to_sandbox` helper
- **`prompting.py`**: Updated "SDK tool-result files" section and added
"Large tool outputs saved to workspace" section
- **Tests**: Added `tool-outputs` path validation tests in
`context_test.py` and `e2b_file_tools_test.py`; updated `base_test.py`
assertion for `workspace_path`
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `poetry run pytest backend/copilot/tools/base_test.py` — all 9
tests pass (persistence, truncation, binary fields)
- [x] `poetry run format` and `poetry run lint` pass clean
- [x] All pre-commit hooks pass
- [ ] `context_test.py`, `e2b_file_tools_test.py`,
`security_hooks_test.py` — blocked by pre-existing DB migration issue on
worktree (missing `User.subscriptionTier` column); CI will validate
these
## Summary
Sets git author/committer identity in E2B sandboxes using the user's
connected GitHub account profile, so commits are properly attributed.
## Changes
### `integration_creds.py`
- Added `get_github_user_git_identity(user_id)` that fetches the user's
name and email from the GitHub `/user` API
- Uses TTL cache (10 min) to avoid repeated API calls
- Falls back to GitHub noreply email
(`{id}+{login}@users.noreply.github.com`) when user has a private email
- Falls back to `login` if `name` is not set
### `bash_exec.py`
- After injecting integration env vars, calls
`get_github_user_git_identity()` and sets `GIT_AUTHOR_NAME`,
`GIT_AUTHOR_EMAIL`, `GIT_COMMITTER_NAME`, `GIT_COMMITTER_EMAIL`
- Only sets these if the user has a connected GitHub account
### `bash_exec_test.py`
- Added tests covering: identity set from GitHub profile, no identity
when GitHub not connected, no injection when no user_id
## Why
Previously, commits made inside E2B sandboxes had no author identity
set, leading to unattributed commits. This dynamically resolves identity
from the user's actual GitHub account rather than hardcoding a default.
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Adds outbound calls to GitHub’s `/user` API during `bash_exec` runs
and injects returned identity into the sandbox environment, which could
impact reliability (network/timeouts) and attribution behavior. Caching
mitigates repeated calls but incorrect/expired tokens or API failures
may lead to missing identity in commits.
>
> **Overview**
> Sets git author/committer environment variables in the E2B `bash_exec`
path by fetching the connected user’s GitHub profile and injecting
`GIT_AUTHOR_*`/`GIT_COMMITTER_*` into the sandbox env.
>
> Introduces `get_github_user_git_identity()` with TTL caching
(including a short-lived null cache), fallback to GitHub noreply email
when needed, and ensures `invalidate_user_provider_cache()` also clears
identity caches for the `github` provider. Updates tests to cover
identity injection behavior and the new cache invalidation semantics.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
955ec81efe. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: AutoGPT <autopilot@agpt.co>
### Why / What / How
**Why:** The copilot can ask clarifying questions in plain text, but
that text gets collapsed into hidden "reasoning" UI when the LLM also
calls tools in the same turn. This makes clarification questions
invisible to users. The existing `ClarificationNeededResponse` model and
`ClarificationQuestionsCard` UI component were built for this purpose
but had no tool wiring them up.
**What:** Adds a generic `ask_question` tool that produces a visible,
interactive clarification card instead of collapsible plain text. Unlike
the agent-generation-specific `clarify_agent_request` proposed in
#12601, this tool is workflow-agnostic — usable for agent building,
editing, troubleshooting, or any flow needing user input.
**How:**
- Backend: New `AskQuestionTool` reuses existing
`ClarificationNeededResponse` model. Registered in `TOOL_REGISTRY` and
`ToolName` permissions.
- Frontend: New `AskQuestion/` renderer reuses
`ClarificationQuestionsCard` from CreateAgent. Registered in
`CUSTOM_TOOL_TYPES` (prevents collapse into reasoning) and
`MessagePartRenderer`.
- Guide: `agent_generation_guide.md` updated to reference `ask_question`
for the clarification step.
### Changes 🏗️
- **`copilot/tools/ask_question.py`** — New generic tool: takes
`question`, optional `options[]` and `keyword`, returns
`ClarificationNeededResponse`
- **`copilot/tools/__init__.py`** — Register `ask_question` in
`TOOL_REGISTRY`
- **`copilot/permissions.py`** — Add `ask_question` to `ToolName`
literal
- **`copilot/sdk/agent_generation_guide.md`** — Reference `ask_question`
tool in clarification step
- **`ChatMessagesContainer/helpers.ts`** — Add `tool-ask_question` to
`CUSTOM_TOOL_TYPES`
- **`MessagePartRenderer.tsx`** — Add switch case for
`tool-ask_question`
- **`AskQuestion/AskQuestion.tsx`** — Renderer reusing
`ClarificationQuestionsCard`
- **`AskQuestion/helpers.ts`** — Output parsing and animation text
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Backend format + pyright pass
- [x] Frontend lint + types pass
- [x] Pre-commit hooks pass
- [ ] Manual test: copilot uses `ask_question` and card renders visibly
(not collapsed)
## Summary
Resolves [REQ-78](https://linear.app/autogpt/issue/REQ-78): The
`placeholder_values` field on `AgentDropdownInputBlock` is misleadingly
named. In every major UI framework "placeholder" means non-binding hint
text that disappears on focus, but this field actually creates a
dropdown selector that restricts the user to only those values.
## Changes
### Core rename (`autogpt_platform/backend/backend/blocks/io.py`)
- Renamed `placeholder_values` → `options` on
`AgentDropdownInputBlock.Input`
- Added clear field description: *"If provided, renders the input as a
dropdown selector restricted to these values. Leave empty for free-text
input."*
- Updated class docstring to describe actual behavior
- Overrode `model_construct()` to remap legacy `placeholder_values` →
`options` for **backward compatibility** with existing persisted agent
JSON
### Tests (`autogpt_platform/backend/backend/blocks/test/test_block.py`)
- Updated existing tests to use canonical `options` field name
- Added 2 new backward-compat tests verifying legacy
`placeholder_values` still works through both `model_construct()` and
`Graph._generate_schema()` paths
### Documentation
- Updated
`autogpt_platform/backend/backend/copilot/sdk/agent_generation_guide.md`
— changed field name in CoPilot SDK guide
- Updated `docs/integrations/block-integrations/basic.md` — changed
field name and description in public docs
### Load tests
(`autogpt_platform/backend/load-tests/tests/api/graph-execution-test.js`)
- Removed spurious `placeholder_values: {}` from AgentInputBlock node
(this field never existed on AgentInputBlock)
- Fixed execution input to use `value` instead of `placeholder_values`
## Backward Compatibility
Existing agents with `placeholder_values` in their persisted
`input_default` JSON will continue to work — the `model_construct()`
override transparently remaps the old key to `options`. No database
migration needed since the field is stored inside a JSON blob, not as a
dedicated column.
## Testing
- All existing tests updated and passing
- 2 new backward-compat tests added
- No frontend changes needed (frontend reads `enum` from generated JSON
Schema, not the field name directly)
---------
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
## Why
PR #12625 fixed the prompt-too-long retry mechanism for most paths, but
two SDK-specific paths were still broken. The dev session `d2f7cba3`
kept accumulating synthetic "Prompt is too long" error entries on every
turn, growing the transcript from 2.5 MB → 3.2 MB, making recovery
impossible.
Root causes identified from production logs (`[T25]`, `[T28]`):
**Path 1 — AssistantMessage content check:**
When the Claude API rejects a prompt, the SDK surfaces it as
`AssistantMessage(error="invalid_request", content=[TextBlock("Prompt is
too long")])`. Our check only inspected `error_text = str(sdk_error)`
which is `"invalid_request"` — not a prompt-too-long pattern. The
content was then streamed out as `StreamText`, setting `events_yielded =
1`, which blocked retry even when the ResultMessage fired.
**Path 2 — ResultMessage success subtype:**
After the SDK auto-compacts internally (via `PreCompact` hook) and the
compacted transcript is _still_ too long, the SDK returns
`ResultMessage(subtype="success", result="Prompt is too long")`. Our
check only ran for `subtype="error"`. With `subtype="success"`, the
stream "completed normally", appended the synthetic error entry to the
transcript via `transcript_builder`, and uploaded it to GCS — causing
the transcript to grow on each failed turn.
## What
- **AssistantMessage handler**: when `sdk_error` is set, also check the
content text. `sdk_error` being non-`None` confirms this is an API error
message (not user-generated content), so content inspection is safe.
- **ResultMessage handler**: check `result` for prompt-too-long patterns
regardless of `subtype`, covering the SDK auto-compact path where
`subtype="success"` with `result="Prompt is too long"`.
## How
Two targeted one-line condition expansions in `_run_stream_attempt`,
plus two new integration tests in `retry_scenarios_test.py` that
reproduce each broken path and verify retry fires correctly.
## Changes
- `backend/copilot/sdk/service.py`: fix AssistantMessage content check +
ResultMessage subtype-independent check
- `backend/copilot/sdk/retry_scenarios_test.py`: add 2 integration tests
for the new scenarios
## Checklist
- [x] Tests added for both new scenarios (45 total, all pass)
- [x] Formatted (`poetry run format`)
- [x] No false-positive risk: AssistantMessage check gated behind
`sdk_error is not None`
- [x] Root cause verified from production pod logs
## Why
CoPilot autopilot sessions are inconsistently failing to load user
credentials (specifically GitHub OAuth). Some sessions proceed normally,
some show "provide credentials" prompts despite the user having valid
creds, and some are completely blocked.
Production logs confirmed the root cause: `RuntimeError: Task got Future
<Future pending> attached to a different loop` in the credential refresh
path, cascading into null-cache poisoning that blocks credential lookups
for 60 seconds.
## What
Three interrelated bugs in the credential system:
1. **`refresh_if_needed` always acquired Redis locks even with
`lock=False`** — The `lock` parameter only controlled the inner
credential lock, but the outer "refresh" scope lock was always acquired.
The copilot executor uses multiple worker threads with separate event
loops; the `asyncio.Lock` inside `AsyncRedisKeyedMutex` was bound to one
loop and failed on others.
2. **Stale event loop in `locks()` singleton** — Both
`IntegrationCredentialsManager` and `IntegrationCredentialsStore` cached
their `AsyncRedisKeyedMutex` without tracking which event loop created
it. When a different worker thread (with a different loop) reused the
singleton, it got the "Future attached to different loop" error.
3. **Null-cache poisoning on refresh failure** — When OAuth refresh
failed (due to the event loop error), the code fell through to cache "no
credentials found" for 60 seconds via `_null_cache`. This blocked ALL
subsequent credential lookups for that user+provider, even though the
credentials existed and could refresh fine on retry.
## How
- Split `refresh_if_needed` into `_refresh_locked` / `_refresh_unlocked`
so `lock=False` truly skips ALL Redis locking (safe for copilot's
best-effort background injection)
- Added event loop tracking to `locks()` in both
`IntegrationCredentialsManager` and `IntegrationCredentialsStore` —
recreates the mutex when the running loop changes
- Only populate `_null_cache` when the user genuinely has no
credentials; skip caching when OAuth refresh failed transiently
- Updated existing test to verify null-cache is not poisoned on refresh
failure
## Test plan
- [x] All 14 existing `integration_creds_test.py` tests pass
- [x] Updated
`test_oauth2_refresh_failure_returns_none_without_null_cache` verifies
null-cache is not populated on refresh failure
- [x] Format, lint, and typecheck pass
- [ ] Deploy to staging and verify copilot sessions consistently load
GitHub credentials
The SDK returns AssistantMessage(error="invalid_request", content=[TextBlock("Prompt is too long")])
followed by ResultMessage(subtype="success", result="Prompt is too long") when the transcript is
rejected after internal auto-compaction. Both paths bypassed the retry mechanism:
- AssistantMessage handler only checked error_text ("invalid_request"), not the content which
holds the actual error description. The content was then streamed as text, setting events_yielded=1,
which blocked retry even when ResultMessage fired.
- ResultMessage handler only triggered prompt-too-long detection for subtype="error", not
subtype="success". The stream "completed normally", stored the synthetic error entry in the
transcript, and uploaded it — causing the transcript to grow unboundedly on each failed turn.
Fixes:
1. AssistantMessage handler: when sdk_error is set (confirmed error message), also check content
text. sdk_error being set guarantees this is an API error, not user-generated content, so
content inspection is safe.
2. ResultMessage handler: check result for prompt-too-long regardless of subtype, covering the
case where the SDK auto-compacts internally but the result is still too long.
Adds integration tests for both new scenarios.
## Why
CoPilot has several context management issues that degrade long
sessions:
1. "Prompt is too long" errors crash the session instead of triggering
retry/compaction
2. Stale thinking blocks bloat transcripts, causing unnecessary
compaction every turn
3. Compression target is hardcoded regardless of model context window
size
4. Truncated tool calls (empty `{}` args from max_tokens) kill the
session instead of guiding the model to self-correct
## What
**Fix 1: Prompt-too-long retry bypass (SENTRY-1207)**
The SDK surfaces "prompt too long" via `AssistantMessage.error` and
`ResultMessage.result` — neither triggered the retry/compaction loop
(only Python exceptions did). Now both paths are intercepted and
re-raised.
**Fix 2: Strip stale thinking blocks before upload**
Thinking/redacted_thinking blocks in non-last assistant entries are
10-50K tokens each but only needed for API signature verification in the
*last* message. Stripping before upload reduces transcript size and
prevents per-turn compaction.
**Fix 3: Model-aware compression target**
`compress_context()` now computes `target_tokens` from the model's
context window (e.g. 140K for Opus 200K) instead of a hardcoded 120K
default. Larger models retain more history; smaller models compress more
aggressively.
**Fix 4: Self-correcting truncated tool calls**
When the model's response exceeds max_tokens, tool call inputs get
silently truncated to `{}`. Previously this tripped a circuit breaker
after 3 attempts. Now the MCP wrapper detects empty args and returns
guidance: "write in chunks with `cat >>`, pass via
`@@agptfile:filename`". The model can self-correct instead of the
session dying.
## How
- **service.py**: `_is_prompt_too_long` checks in both
`AssistantMessage.error` and `ResultMessage` error handlers. Circuit
breaker limit raised from 3→5.
- **transcript.py**: `strip_stale_thinking_blocks()` reverse-scans for
last assistant `message.id`, strips thinking blocks from all others.
Called in `upload_transcript()`.
- **prompt.py**: `get_compression_target(model)` computes
`context_window - 60K overhead`. `compress_context()` uses it when
`target_tokens` is None.
- **tool_adapter.py**: `_truncating` wrapper intercepts empty args on
tools with required params, returns actionable guidance instead of
failing.
## Related
- Fixes SENTRY-1207
- Sessions: `d2f7cba3` (repeated compaction), `08b807d4` (prompt too
long), `130d527c` (truncated tool calls)
- Extends #12413, consolidates #12626
## Test plan
- [x] 6 unit tests for `strip_stale_thinking_blocks`
- [x] 1 integration test for ResultMessage prompt-too-long → compaction
retry
- [x] Pyright clean (0 errors), all pre-commit hooks pass
- [ ] E2E: Load transcripts from affected sessions and verify behavior
## Why
CoPilot sessions are duplicating Linear tickets and GitHub PRs.
Investigation of 5 production sessions (March 31st) found that 3/5
created duplicate Linear issues — each with consecutive IDs at the exact
same timestamp, but only one visible in Langfuse traces.
Production gcloud logs confirm: **279 arg mismatch warnings per day**,
**37 duplicate block execution pairs**, and all LinearCreateIssueBlock
failures in pairs.
Related: SECRT-2204
## What
Replace the speculative pre-launch mechanism with the SDK's native
parallel dispatch via `readOnlyHint` tool annotations. Remove ~580 lines
of pre-launch infrastructure code.
## How
### Root cause
The pre-launch mechanism had three compounding bugs:
1. **Arg mismatch**: The SDK CLI normalises args between the
`AssistantMessage` (used for pre-launch) and the MCP `tools/call`
dispatch, causing frequent mismatches (279/day in prod)
2. **FIFO desync on denial**: Security hooks can deny tool calls,
causing the CLI to skip the MCP dispatch — but the pre-launched task
stays in the FIFO queue, misaligning all subsequent matches
3. **Cancel race**: `task.cancel()` is best-effort in asyncio — if the
HTTP call to Linear/GitHub already completed, the side effect is
irreversible
### Fix
- **Removed** `pre_launch_tool_call()`, `cancel_pending_tool_tasks()`,
`_tool_task_queues` ContextVar, all FIFO queue logic, and all 4
`cancel_pending_tool_tasks()` calls in `service.py`
- **Added** `readOnlyHint=True` annotations on 15+ read-only tools
(`find_block`, `search_docs`, `list_workspace_files`, etc.) — the SDK
CLI natively dispatches these in parallel ([ref:
anthropics/claude-code#14353](https://github.com/anthropics/claude-code/issues/14353))
- Side-effect tools (`run_block`, `bash_exec`, `create_agent`, etc.)
have no annotation → CLI runs them sequentially → no duplicate execution
risk
### Net change: -578 lines, +105 lines
### Why / What / How
**Why:** When a user asks CoPilot to build an agent with an ambiguous
goal (output format, delivery channel, data source, or trigger
unspecified), the agent generator previously made assumptions and jumped
straight into JSON generation. This produced agents that didn't match
what the user actually wanted, requiring multiple correction cycles.
**What:** Adds a "Clarifying Before Building" section to the agent
generation guide. When the goal is ambiguous, CoPilot first calls
`find_block` to discover what the platform actually supports for the
ambiguous dimension, then asks the user one concrete question grounded
in real platform options (e.g. "The platform supports Gmail, Slack, and
Google Docs — which should the agent use for delivery?"). Only after the
user answers does the full agent generation workflow proceed.
**How:** The clarification instruction is added to
`agent_generation_guide.md` — the guide loaded on-demand via
`get_agent_building_guide` when the LLM is about to build an agent. This
avoids polluting the system prompt supplement (which loads for every
CoPilot conversation, not just agent building). No dedicated tool is
needed — the LLM asks naturally in conversation text after discovering
real platform options via `find_block`.
### Changes 🏗️
- `backend/copilot/sdk/agent_generation_guide.md`: Adds "Clarifying
Before Building" section before the workflow steps. Instructs the model
to call `find_block` for the ambiguous dimension, ask the user one
grounded question, wait for the answer, then proceed to generation.
- `backend/copilot/prompting_test.py`: New test file verifying the guide
contains the clarification section and references `find_block`.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [ ] Ask CoPilot to "build an agent to send a report" (ambiguous
output) — verify it calls `find_block` for delivery options and asks one
grounded question before generating JSON
- [ ] Ask CoPilot to "build an agent to scrape prices from Amazon and
email me daily" (specific goal) — verify it skips clarification and
proceeds directly to agent generation
- [ ] Verify the clarification question lists real block options (e.g.
Gmail, Slack, Google Docs) rather than abstract options
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
### Why / What / How
<!-- Why: Why does this PR exist? What problem does it solve, or what's
broken/missing without it? -->
This PR fixes
[BUILDER-7HD](https://sentry.io/organizations/significant-gravitas/issues/7374387984/).
The issue was that: LaunchDarkly SDK fails to construct streaming URL
due to non-string `_url` from malformed `localStorage` bootstrap data.
<!-- What: What does this PR change? Summarize the changes at a high
level. -->
Removed the `bootstrap: "localStorage"` option from the LaunchDarkly
provider configuration.
<!-- How: How does it work? Describe the approach, key implementation
details, or architecture decisions. -->
This change ensures that LaunchDarkly no longer attempts to load initial
feature flag values from local storage. Flag values will now always be
fetched directly from the LaunchDarkly service, preventing potential
issues with stale local storage data.
### Changes 🏗️
<!-- List the key changes. Keep it higher level than the diff but
specific enough to highlight what's new/modified. -->
- Removed the `bootstrap: "localStorage"` option from the LaunchDarkly
provider configuration.
- LaunchDarkly will now always fetch flag values directly from its
service, bypassing local storage.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [ ] I have made a test plan
- [ ] I have tested my changes according to the test plan:
<!-- Put your test plan here: -->
- [ ] Verify that LaunchDarkly flags are loaded correctly without
issues.
- [ ] Ensure no errors related to `localStorage` or streaming URL
construction appear in the console.
<details>
<summary>Example test plan</summary>
- [ ] Create from scratch and execute an agent with at least 3 blocks
- [ ] Import an agent from file upload, and confirm it executes
correctly
- [ ] Upload agent to marketplace
- [ ] Import an agent from marketplace and confirm it executes correctly
- [ ] Edit an agent from monitor, and confirm it executes correctly
</details>
#### For configuration changes:
- [ ] `.env.default` is updated or already compatible with my changes
- [ ] `docker-compose.yml` is updated or already compatible with my
changes
- [ ] I have included a list of my configuration changes in the PR
description (under **Changes**)
<details>
<summary>Examples of configuration changes</summary>
- Changing ports
- Adding new services that need to communicate with each other
- Secrets or environment variable changes
- New or infrastructure changes such as databases
</details>
---------
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
Co-authored-by: seer-by-sentry[bot] <157164994+seer-by-sentry[bot]@users.noreply.github.com>
### Why / What / How
Why: repo guidance was split between Claude-specific `CLAUDE.md` files
and Codex-specific `AGENTS.md` files, which duplicated instruction
content and made the same repository behave differently across agents.
The repo also had Claude skills under `.claude/skills` but no
Codex-visible repo skill path.
What: this PR bridges the repo's Claude skills into Codex and normalizes
shared instruction files so `AGENTS.md` becomes the canonical source
while each `CLAUDE.md` imports its sibling `AGENTS.md`.
How: add a repo-local `.agents/skills` symlink pointing to
`../.claude/skills`; move nested `CLAUDE.md` content into sibling
`AGENTS.md` files; replace each repo `CLAUDE.md` with a one-line
`@AGENTS.md` shim so Claude and Codex read the same scoped guidance
without duplicating text. The root `CLAUDE.md` now imports the root
`AGENTS.md` rather than symlinking to it.
Note: the instruction-file normalization commit was created with
`--no-verify` because the repo's frontend pre-commit `tsc` hook
currently fails on unrelated existing errors, largely missing
`autogpt_platform/frontend/src/app/api/__generated__/*` modules.
### Changes 🏗️
- Add `.agents/skills` as a repo-local symlink to `../.claude/skills` so
Codex discovers the existing Claude repo skills.
- Add a real root `CLAUDE.md` shim that imports the canonical root
`AGENTS.md`.
- Promote nested scoped instruction content into sibling `AGENTS.md`
files under `autogpt_platform/`, `autogpt_platform/backend/`,
`autogpt_platform/frontend/`, `autogpt_platform/frontend/src/tests/`,
and `docs/`.
- Replace the corresponding nested `CLAUDE.md` files with one-line
`@AGENTS.md` shims.
- Preserve the existing scoped instruction hierarchy while making the
shared content cross-compatible between Claude and Codex.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verified `.agents/skills` resolves to `../.claude/skills`
- [x] Verified each repo `CLAUDE.md` now contains only `@AGENTS.md`
- [x] Verified the expected `AGENTS.md` files exist at the root and
nested scoped directories
- [x] Verified the branch contains only the intended agent-guidance
commits relative to `dev` and the working tree is clean
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
No runtime configuration changes are included in this PR.
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Low Risk**
> Low risk: documentation/instruction-file reshuffle plus an
`.agents/skills` pointer; no runtime code paths are modified.
>
> **Overview**
> Unifies agent guidance so **`AGENTS.md` becomes canonical** and all
corresponding `CLAUDE.md` files become 1-line shims (`@AGENTS.md`) at
the repo root, `autogpt_platform/`, backend, frontend, frontend tests,
and `docs/`.
>
> Adds `.agents/skills` pointing to `../.claude/skills` so non-Claude
agents discover the same shared skills/instructions, eliminating
duplicated/agent-specific guidance content.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
839483c3b6. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
## Summary
- **Backend**: Strip empty `error` pins from dry-run simulation outputs
that the simulator always includes (set to `""` meaning "no error").
This was causing the LLM to misinterpret successful simulations as
failures and report "INCOMPLETE" status to users
- **Backend**: Add explicit "Status: COMPLETED" to dry-run response
message to prevent LLM misinterpretation
- **Backend**: Update simulation prompt to exclude `error` from the
"MUST include" keys list, and instruct LLM to omit error unless
simulating a logical failure
- **Frontend**: Fix `isRunBlockErrorOutput()` type guard that was too
broad (`"error" in output` matched BlockOutputResponse objects, not just
ErrorResponse), causing dry-run results to be displayed as errors
- **Frontend**: Fix `parseOutput()` fallback matching to not classify
BlockOutputResponse as ErrorResponse
- **Frontend**: Filter out empty error pins from `BlockOutputCard`
display and accordion metadata output key counting
- **Frontend**: Clear stale execution results before dry-run/no-input
runs so the UI shows fresh output
- **Frontend**: Fix first-click simulate race condition by invalidating
execution details query after WebSocket subscription confirms
## Test plan
- [x] All 12 existing + 5 new dry-run tests pass (`poetry run pytest
backend/copilot/tools/test_dry_run.py -x -v`)
- [x] All 23 helpers tests pass (`poetry run pytest
backend/copilot/tools/helpers_test.py -x -v`)
- [x] All 13 run_block tests pass (`poetry run pytest
backend/copilot/tools/run_block_test.py -x -v`)
- [x] Backend linting passes (ruff check + format)
- [x] Frontend linting passes (next lint)
- [ ] Manual: trigger dry-run on a block with error output pin (e.g.
Komodo Image Generator) — should show "Simulated" status with clean
output, no misleading "error" section
- [ ] Manual: first click on Simulate button should immediately show
results (no race condition)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
## Why
CoPilot session `d2f7cba3` took **82 minutes** and cost **$20.66** for a
single user message. Root causes:
1. Redis session meta key expired after 1h, making the session invisible
to the resume endpoint — causing empty page on reload
2. Redis stream key also expired during sub-agent gaps (task_progress
events produced no chunks)
3. No intermediate persistence — session messages only saved to DB after
the entire turn completes
4. Sub-agents retried similar WebSearch queries (addressed via prompt
guidance)
## What
### Redis TTL fixes (root cause of empty session on reload)
- `publish_chunk()` now periodically refreshes **both** the session meta
key AND stream key TTL (every 60s).
- `task_progress` SDK events now emit `StreamHeartbeat` chunks, ensuring
`publish_chunk` is called even during long sub-agent gaps where no real
chunks are produced.
- Without this fix, turns exceeding the 1h `stream_ttl` lose their
"running" status and stream data, making `get_active_session()` return
False.
### Intermediate DB persistence
- Session messages flushed to DB every **30 seconds** or **10 new
messages** during the stream loop.
- Uses `asyncio.shield(upsert_chat_session())` matching the existing
`finally` block pattern.
### Orphaned message cleanup on rollback
- On stream attempt rollback, orphaned messages persisted by
intermediate flushes are now cleaned up from the DB via
`delete_messages_from_sequence`.
- Prevents stale messages from resurfacing on page reload after a failed
retry.
### Prompt guidance
- Added web search best practices to code supplement (search efficiency,
sub-agent scope separation).
### Approach: root cause fixes, not capability limits
- **No tool call caps** — artificial limits on WebSearch or total tool
calls would reduce autopilot capability without addressing why searches
were redundant.
- **Task tool remains enabled** — sub-agent delegation via Task is a
core capability. The existing `max_subtasks` concurrency guard is
sufficient.
- The real fixes (TTL refresh, persistence, prompt guidance) address the
underlying bugs and behavioral issues.
## How
### Files changed
- `stream_registry.py` — Redis meta + stream key TTL refresh in
`publish_chunk()`, module-level keepalive tracker
- `response_adapter.py` — `task_progress` SystemMessage →
StreamHeartbeat emission
- `service.py` — Intermediate DB persistence in `_run_stream_attempt`
stream loop, orphan cleanup on rollback
- `db.py` — `delete_messages_from_sequence` for rollback cleanup
- `prompting.py` — Web search best practices
### GCP log evidence
```
# Meta key expired during 82-min turn:
09:49 — GET_SESSION: active_session=False, msg_count=1 ← meta gone
10:18 — Session persisted in finally with 189 messages ← turn completed
# T13 (1h45min) same bug reproduced live:
16:20 — task_progress events still arriving, but active_session=False
# Actual cost:
Turn usage: cache_read=347916, cache_create=212472, output=12375, cost_usd=20.66
```
### Test plan
- [x] task_progress emits StreamHeartbeat
- [x] Task background blocked, foreground allowed, slot release on
completion/failure
- [x] CI green (lint, type-check, tests, e2e, CodeQL)
---------
Co-authored-by: Zamil Majdy <majdy.zamil@gmail.com>
Fixes#9175
### Changes 🏗️
The Agent Outputs panel only displayed the last execution result per
output node, discarding all prior outputs during a run.
**Root cause:** In `AgentOutputs.tsx`, the `outputs` useMemo extracted
only the last element from `nodeExecutionResults`:
```tsx
const latestResult = executionResults[executionResults.length - 1];
```
**Fix:** Changed `.map()` to `.flatMap()` over output nodes, iterating
through all `executionResults` for each node. Each execution result now
gets its own renderer lookup and metadata entry, so the panel shows
every output produced during the run.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verified TypeScript compiles without errors
- [x] Confirmed the flatMap logic correctly iterates all execution
results
- [x] Verified existing filter for null renderers is preserved
- [x] Run an agent with multiple outputs and confirm all show in the
panel
---------
Signed-off-by: majiayu000 <1835304752@qq.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
## Summary
- Adds a session-level `dry_run` flag that forces ALL tool calls
(`run_block`, `run_agent`) in a copilot/autopilot session to use dry-run
simulation mode
- Stores the flag in a typed `ChatSessionMetadata` JSON model on the
`ChatSession` DB row, accessed via `session.dry_run` property
- Adds `dry_run` to the AutoPilot block Input schema so graph builders
can create dry-run autopilot nodes
- Refactors multiple copilot tools from `**kwargs` to explicit
parameters for type safety
## Changes
- **Prisma schema**: Added `metadata` JSON column to `ChatSession` model
with migration
- **Python models**: Added `ChatSessionMetadata` model with `dry_run`
field, added `metadata` field to `ChatSessionInfo` and `ChatSession`,
updated `from_db()`, `new()`, and `create_chat_session()`
- **Session propagation**: `set_execution_context(user_id, session)`
called from `baseline/service.py` so tool handlers can read
session-level flags via `session.dry_run`
- **Tool enforcement**: `run_block` and `run_agent` check
`session.dry_run` and force `dry_run=True` when set; `run_agent` blocks
scheduling in dry-run sessions
- **AutoPilot block**: Added `dry_run` input field, passes it when
creating sessions
- **Chat API**: Added `CreateSessionRequest` model with `dry_run` field
to `POST /sessions` endpoint; added `metadata` to session responses
- **Frontend**: Updated `useChatSession.ts` to pass body to the create
session mutation
- **Tool refactoring**: Multiple copilot tools refactored from
`**kwargs` to explicit named parameters (agent_browser, manage_folders,
workspace_files, connect_integration, agent_output, bash_exec, etc.) for
better type safety
## Test plan
- [x] Unit tests for `ChatSession.new()` with dry_run parameter
- [x] Unit tests for `RunBlockTool` session dry_run override
- [x] Unit tests for `RunAgentTool` session dry_run override
- [x] Unit tests for session dry_run blocks scheduling
- [x] Existing dry_run tests still pass (12/12)
- [x] Existing permissions tests still pass
- [x] All pre-commit hooks pass (ruff, isort, pyright, tsc)
- [ ] Manual: Create autopilot session with `dry_run=True`, verify
run_block/run_agent calls use simulation
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### Why / What / How
**Why:** We need a third credential type: **system-provided but unique
per user** (managed credentials). Currently we have system credentials
(same for all users) and user credentials (user provides their own
keys). Managed credentials bridge the gap — the platform provisions them
automatically, one per user, for integrations like AgentMail where each
user needs their own pod-scoped API key.
**What:**
- Generic **managed credential provider registry** — any integration can
register a provider that auto-provisions per-user credentials
- **AgentMail** is the first consumer: creates a pod + pod-scoped API
key using the org-level API key
- Managed credentials appear in the credential dropdown like normal API
keys but with `autogpt_managed=True` — users **cannot update or delete**
them
- **Auto-provisioning** on `GET /credentials` — lazily creates managed
credentials when users browse their credential list
- **Account deletion cleanup** utility — revokes external resources
(pods, API keys) before user deletion
- **Frontend UX** — hides the delete button for managed credentials on
the integrations page
**How:**
### Backend
**New files:**
- `backend/integrations/managed_credentials.py` —
`ManagedCredentialProvider` ABC, global registry,
`ensure_managed_credentials()` (with per-user asyncio lock +
`asyncio.gather` for concurrency), `cleanup_managed_credentials()`
- `backend/integrations/managed_providers/__init__.py` —
`register_all()` called at startup
- `backend/integrations/managed_providers/agentmail.py` —
`AgentMailManagedProvider` with `provision()` (creates pod + API key via
agentmail SDK) and `deprovision()` (deletes pod)
**Modified files:**
- `credentials_store.py` — `autogpt_managed` guards on update/delete,
`has_managed_credential()` / `add_managed_credential()` helpers
- `model.py` — `autogpt_managed: bool` + `metadata: dict` on
`_BaseCredentials`
- `router.py` — calls `ensure_managed_credentials()` in list endpoints,
removed explicit `/agentmail/connect` endpoint
- `user.py` — `cleanup_user_managed_credentials()` for account deletion
- `rest_api.py` — registers managed providers at startup
- `settings.py` — `agentmail_api_key` setting
### Frontend
- Added `autogpt_managed` to `CredentialsMetaResponse` type
- Conditionally hides delete button on integrations page for managed
credentials
### Key design decisions
- **Auto-provision in API layer, not data layer** — keeps
`get_all_creds()` side-effect-free
- **Race-safe** — per-(user, provider) asyncio lock with double-check
pattern prevents duplicate pods
- **Idempotent** — AgentMail SDK `client_id` ensures pod creation is
idempotent; `add_managed_credential()` uses upsert under Redis lock
- **Error-resilient** — provisioning failures are logged but never block
credential listing
### Changes 🏗️
| File | Action | Description |
|------|--------|-------------|
| `backend/integrations/managed_credentials.py` | NEW | ABC, registry,
ensure/cleanup |
| `backend/integrations/managed_providers/__init__.py` | NEW | Registers
all providers at startup |
| `backend/integrations/managed_providers/agentmail.py` | NEW |
AgentMail provisioning/deprovisioning |
| `backend/integrations/credentials_store.py` | MODIFY | Guards +
managed credential helpers |
| `backend/data/model.py` | MODIFY | `autogpt_managed` + `metadata`
fields |
| `backend/api/features/integrations/router.py` | MODIFY |
Auto-provision on list, removed `/agentmail/connect` |
| `backend/data/user.py` | MODIFY | Account deletion cleanup |
| `backend/api/rest_api.py` | MODIFY | Provider registration at startup
|
| `backend/util/settings.py` | MODIFY | `agentmail_api_key` setting |
| `frontend/.../integrations/page.tsx` | MODIFY | Hide delete for
managed creds |
| `frontend/.../types.ts` | MODIFY | `autogpt_managed` field |
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] 23 tests pass in `router_test.py` (9 new tests for
ensure/cleanup/auto-provisioning)
- [x] `poetry run format && poetry run lint` — clean
- [x] OpenAPI schema regenerated
- [x] Manual: verify managed credential appears in AgentMail block
dropdown
- [x] Manual: verify delete button hidden for managed credentials
- [x] Manual: verify managed credential cannot be deleted via API (403)
#### For configuration changes:
- [x] `.env.default` is updated with `AGENTMAIL_API_KEY=`
---------
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
### Why / What / How
**Why:** When `AIConversationBlock` receives an empty messages list and
an empty prompt, the block blindly forwards the empty array to the
downstream LLM API, which returns a cryptic `400 Bad Request` error:
`"Invalid 'messages': empty array. Expected an array with minimum length
1."` This is confusing for users who don't understand why their agent
failed.
**What:** Add early input validation in `AIConversationBlock.run()` that
raises a clear `ValueError` when both `messages` and `prompt` are empty.
Also add three unit tests covering the validation logic.
**How:** A simple guard clause at the top of the `run` method checks `if
not input_data.messages and not input_data.prompt` before the LLM call
is made. If both are empty, a descriptive `ValueError` is raised. If
either one has content, the block proceeds normally.
### Changes
- `autogpt_platform/backend/backend/blocks/llm.py`: Add validation guard
in `AIConversationBlock.run()` to reject empty messages + empty prompt
before calling the LLM
- `autogpt_platform/backend/backend/blocks/test/test_llm.py`: Add
`TestAIConversationBlockValidation` with three tests:
- `test_empty_messages_and_empty_prompt_raises_error` — validates the
guard clause
- `test_empty_messages_with_prompt_succeeds` — ensures prompt-only usage
still works
- `test_nonempty_messages_with_empty_prompt_succeeds` — ensures
messages-only usage still works
### Checklist
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Lint passes (`ruff check`)
- [x] Formatting passes (`ruff format`)
- [x] New unit tests validate the empty-input guard and the happy paths
Closes#11875
---------
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
## Summary
`extract_openai_tool_calls()` in `llm.py` crashes with `IndexError` when
the LLM provider returns a response with an empty `choices` list.
### Changes 🏗️
- Added a guard check `if not response.choices: return None` before
accessing `response.choices[0]`
- This is consistent with the function's existing pattern of returning
`None` when no tool calls are found
### Bug Details
When an LLM provider returns a response with an empty choices list
(e.g., due to content filtering, rate limiting, or API errors),
`response.choices[0]` raises `IndexError`. This can crash the entire
agent execution pipeline.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- Verified that the function returns `None` when `response.choices` is
empty
- Verified existing behavior is unchanged when `response.choices` is
non-empty
---------
Co-authored-by: goingforstudying-ctrl <forgithubuse@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
## Summary
- Adds `ExecutionMode` enum with `BUILT_IN` (default built-in tool-call
loop) and `EXTENDED_THINKING` (delegates to Claude Agent SDK for richer
reasoning)
- Extracts shared `tool_call_loop` into `backend/util/tool_call_loop.py`
— reusable by both OrchestratorBlock agent mode and copilot baseline
- Refactors copilot baseline to use the shared `tool_call_loop` with
callback-driven iteration
## ExecutionMode enum
`ExecutionMode` (`backend/blocks/orchestrator.py`) controls how
OrchestratorBlock executes tool calls:
- **`BUILT_IN`** — Default mode. Runs the built-in tool-call loop
(supports all LLM providers).
- **`EXTENDED_THINKING`** — Delegates to the Claude Agent SDK for
extended thinking and multi-step planning. Requires Anthropic-compatible
providers (`anthropic` / `open_router`) and direct API credentials
(subscription mode not supported). Validates both provider and model
name at runtime.
## Shared tool_call_loop
`backend/util/tool_call_loop.py` provides a generic, provider-agnostic
conversation loop:
1. Call LLM with tools → 2. Extract tool calls → 3. Execute tools → 4.
Update conversation → 5. Repeat
Callers provide three callbacks:
- `llm_call`: wraps any LLM provider (OpenAI streaming, Anthropic,
llm.llm_call, etc.)
- `execute_tool`: wraps any tool execution (TOOL_REGISTRY, graph block
execution, etc.)
- `update_conversation`: formats messages for the specific protocol
## OrchestratorBlock EXTENDED_THINKING mode
- `_create_graph_mcp_server()` converts graph-connected blocks to MCP
tools
- `_execute_tools_sdk_mode()` runs `ClaudeSDKClient` with those MCP
tools
- Agent mode refactored to use shared `tool_call_loop`
## Copilot baseline refactored
- Streaming callbacks buffer `Stream*` events during loop execution
- Events are drained after `tool_call_loop` returns
- Same conversation logic, less code duplication
## SDK environment builder extraction
- `build_sdk_env()` extracted to `backend/copilot/sdk/env.py` for reuse
by both copilot SDK service and OrchestratorBlock
## Provider validation
EXTENDED_THINKING mode validates `provider in ('anthropic',
'open_router')` and `model_name.startswith('claude')` because the Claude
Agent SDK requires an Anthropic API key or OpenRouter key. Subscription
mode is not supported — it uses the platform's internal credit system
which doesn't provide raw API keys needed by the SDK. The validation
raises a clear `ValueError` if an unsupported provider or model is used.
## PR Dependencies
This PR builds on #12511 (Claude SDK client). It can be reviewed
independently — #12511 only adds the SDK client module which this PR
imports. If #12511 merges first, this PR will have no conflicts.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] All pre-commit hooks pass (typecheck, lint, format)
- [x] Existing OrchestratorBlock tests still pass
- [x] Copilot baseline behavior unchanged (same stream events, same tool
execution)
- [x] Manual: OrchestratorBlock with execution_mode=EXTENDED_THINKING +
downstream blocks → SDK calls tools
- [x] Agent mode regression test (non-SDK path works as before)
- [x] SDK mode error handling (invalid provider raises ValueError)
### Why / What / How
**Why:** When a user or LLM supplies a malformed recipient string (e.g.
a bare username, a JSON blob, or an empty value) to `GmailSendBlock`,
`GmailCreateDraftBlock`, or any reply block, the Gmail API returns an
opaque `HttpError 400: "Invalid To header"`. This surfaces as a
`BlockUnknownError` with no actionable guidance, making it impossible
for the LLM to self-correct. (Fixes#11954)
**What:** Adds a lightweight `validate_email_recipients()` function that
checks every recipient against a simplified RFC 5322 pattern
(`local@domain.tld`) and raises a clear `ValueError` listing all invalid
entries before any API call is made.
**How:** The validation is called in two shared code paths —
`create_mime_message()` (used by send and draft blocks) and
`_build_reply_message()` (used by reply blocks) — so all Gmail blocks
that compose outgoing email benefit from it with zero per-block changes.
The regex is intentionally permissive (any `x@y.z` passes) to avoid
false positives on unusual but valid addresses.
### Changes 🏗️
- Added `validate_email_recipients()` helper in `gmail.py` with a
compiled regex
- Hooked validation into `create_mime_message()` for `to`, `cc`, and
`bcc` fields
- Hooked validation into `_build_reply_message()` for reply/draft-reply
blocks
- Added `TestValidateEmailRecipients` test class covering valid,
invalid, mixed, empty, JSON-string, and field-name scenarios
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verified `validate_email_recipients` correctly accepts valid
emails (`user@example.com`, `a@b.com`, `test@sub.domain.co`)
- [x] Verified it rejects malformed entries (bare names, missing domain
dot, empty strings, JSON strings)
- [x] Verified error messages include the field name and all invalid
entries
- [x] Verified empty recipient lists pass without error
- [x] Confirmed `gmail.py` and test file parse correctly (AST check)
---------
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
## Why
The OrchestratorBlock fails with `Tool names must be unique` when
multiple nodes use the same block type (e.g., two "Web Search" blocks
connected as tools). The Anthropic API rejects the request because
duplicate tool names are sent.
## What
- Detect duplicate tool names after building tool signatures
- Append `_1`, `_2`, etc. suffixes to disambiguate
- Enrich descriptions of duplicate tools with their hardcoded default
values so the LLM can distinguish between them
- Clean up internal `_hardcoded_defaults` metadata before sending to API
- Exclude sensitive/credential fields from default value descriptions
## How
- After `_create_tool_node_signatures` builds all tool functions, count
name occurrences
- For duplicates: rename with suffix and append `[Pre-configured:
key=value]` to description using the node's `input_default` (excluding
linked fields that the LLM provides)
- Added defensive `isinstance(defaults, dict)` check for compatibility
with test mocks
- Suffix collision avoidance: skips candidates that collide with
existing tool names
- Long tool names truncated to fit within 64-character API limit
- 47 unit tests covering: basic dedup, description enrichment, unique
names unchanged, no metadata leaks, single tool, triple duplicates,
linked field exclusion, mixed unique/duplicate scenarios, sensitive
field exclusion, long name truncation, suffix collision, malformed
tools, missing description, empty list, 10-tool all-same-name, multiple
distinct groups, large default truncation, suffix collision cascade,
parameter preservation, boundary name lengths, nested dict/list
defaults, null defaults, customized name priority, required fields
## Test plan
- [x] All 47 tests in `test_orchestrator_tool_dedup.py` pass
- [x] All 11 existing orchestrator unit tests pass (dict, dynamic
fields, responses API)
- [x] Pre-commit hooks pass (ruff, black, isort, pyright)
- [ ] Manual test: connect two same-type blocks to an orchestrator and
verify the LLM call succeeds
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### Why / What / How
Remove extraneous whitespace in README.md:
- "Workflow Management" description: extra spaces between "block" and
"performs"
- "Agent Interaction" description: extra spaces between "user-friendly"
and "interface"
---------
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
## Why
The copilot chat had no indication of how long the AI spent "thinking"
on a response. Users couldn't tell if a long wait was normal or
something was stuck. Additionally, the thinking duration was lost on
page reload since it was only tracked client-side.
## What
- **Live elapsed timer**: Shows elapsed time ("23s", "1m 5s") in the
ThinkingIndicator while the AI is processing (appears after 20s to avoid
spam on quick responses)
- **Frozen "Thought for Xm Ys"**: Displays the final thinking duration
in TurnStatsBar after the response completes
- **Persisted duration**: Saves `durationMs` on the last assistant
message in the DB so the timer survives page reloads
## How
**Backend:**
- Added `durationMs Int?` column to `ChatMessage` (Prisma migration)
- `mark_session_completed` in `stream_registry.py` computes wall-clock
duration from Redis session `created_at` and saves it via
`DatabaseManager.set_turn_duration()`
- Invalidates Redis session cache after writing so GET returns fresh
data
**Frontend:**
- `useElapsedTimer` hook tracks client-side elapsed seconds during
streaming
- `ThinkingIndicator` shows only the elapsed time (no phrases) after
20s, with `font-mono text-sm` styling
- `TurnStatsBar` displays "Thought for Xs" after completion, preferring
live `elapsedSeconds` and falling back to persisted `durationMs`
- `convertChatSessionToUiMessages` extracts `duration_ms` from
historical messages into a `Map<string, number>` threaded through to
`ChatMessagesContainer`
## Test plan
- [ ] Send a message in copilot — verify ThinkingIndicator shows elapsed
time after 20s
- [ ] After response completes — verify "Thought for Xs" appears below
the response
- [ ] Refresh the page — verify "Thought for Xs" still appears
(persisted from DB)
- [ ] Check older conversations — they should NOT show timer (no
historical data)
- [ ] Verify no Zod/SSE validation errors in browser console
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Fixed `_resolve_discriminated_credentials()` in `helpers.py` to handle
URL/host-based credential discrimination (used by
`SendAuthenticatedWebRequestBlock`)
- Previously, only provider-based discrimination (with
`discriminator_mapping`) was handled; URL-based discrimination (with
`discriminator` set but no `discriminator_mapping`) was silently skipped
- This caused host-scoped credentials to either match the wrong host or
fail to match at all when the CoPilot called `run_block` for
authenticated HTTP requests
- Added 14 targeted tests covering discriminator resolution, host
matching, credential resolution integration, and RunBlockTool end-to-end
flows
## Root Cause
`_resolve_discriminated_credentials()` checked `if
field_info.discriminator and field_info.discriminator_mapping:` which
excluded host-scoped credentials where `discriminator="url"` but
`discriminator_mapping=None`. The URL from `input_data` was never added
to `discriminator_values`, so `_credential_is_for_host()` received empty
`discriminator_values` and returned `True` for **any** host-scoped
credential regardless of URL match.
## Fix
When `discriminator` is set without `discriminator_mapping`, the URL
value from `input_data` is now copied into `discriminator_values` on a
shallow copy of the field info (to avoid mutating the cached schema).
This enables `_credential_is_for_host()` to properly match the
credential's host against the target URL.
## Test plan
- [x] `TestResolveDiscriminatedCredentials` - 4 tests verifying URL
discriminator populates values, handles missing URL, doesn't mutate
original, preserves provider/type
- [x] `TestFindMatchingHostScopedCredential` - 5 tests verifying
correct/wrong host matching, wildcard hosts, multiple credential
selection
- [x] `TestResolveBlockCredentials` - 3 integration tests verifying full
credential resolution with matching/wrong/missing hosts
- [x] `TestRunBlockToolAuthenticatedHttp` - 2 end-to-end tests verifying
SetupRequirementsResponse when creds missing and BlockDetailsResponse
when creds matched
- [x] All 28 existing + new tests pass
- [x] Ruff lint, isort, Black formatting, pyright typecheck all pass
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Why
Sentry alert
[AUTOGPT-SERVER-8C8](https://significant-gravitas.sentry.io/issues/7367978095/)
— `AIConditionBlock` failing in prod with:
```
Invalid 'max_output_tokens': integer below minimum value.
Expected a value >= 16, but got 10 instead.
```
Two problems:
1. `max_tokens=10` is below OpenAI's new minimum of 16
2. The `except Exception` handler was calling `logger.error()` which
triggered Sentry for what are known block errors, AND silently
defaulting to `result=False` — making the block appear to succeed with
an incorrect answer
## What
- Bump `max_tokens` from 10 to 16 (fixes the root cause)
- Remove the `try/except` entirely — the executor already handles
exceptions correctly (`ValueError` = known/no Sentry, everything else =
unknown/Sentry). The old handler was just swallowing errors and
producing wrong results.
## Test plan
- [x] Existing `AIConditionBlock` tests pass (block only expects
"true"/"false", 16 tokens is plenty)
- [x] No more silent `result=False` on errors
- [x] No more spurious Sentry alerts from `logger.error()`
Fixes AUTOGPT-SERVER-8C8
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### Why / What / How
**Why:** Agents working in worktrees lack guidance on two of the most
common workflows: properly opening PRs (using the repo template,
validating test coverage, triggering the review bot) and bootstrapping
the repo from scratch with a worktree-based layout. Without these
skills, agents either skip steps (no test plan, wrong template) or
require manual hand-holding for setup.
**What:** Adds two new Claude Code skills under `.claude/skills/`:
- `/open-pr` — A structured PR creation workflow that enforces the
canonical `.github/PULL_REQUEST_TEMPLATE.md`, validates test coverage
for existing and new behaviors, supports a configurable base branch, and
integrates the `/review` bot workflow for agents without local testing
capability. Cross-references `/pr-test`, `/pr-review`, and `/pr-address`
for the full PR lifecycle.
- `/setup-repo` — An interactive repo bootstrapping skill that creates a
worktree-based layout (main + reviews + N numbered work branches).
Handles .env file provisioning with graceful fallbacks (.env.default,
.env.example), copies branchlet config, installs dependencies, and is
fully idempotent (safe to re-run).
**How:** Markdown-based SKILL.md files following the existing skill
conventions. Both skills use proper bash patterns (seq-based loops
instead of brace expansion with variables, existence checks before
branch/worktree creation, error reporting on install failures).
`/open-pr` delegates to AskUserQuestion-style prompts for base branch
selection. `/setup-repo` uses AskUserQuestion for interactive branch
count and base branch selection.
### Changes 🏗️
- Added `.claude/skills/open-pr/SKILL.md` — PR creation workflow with:
- Pre-flight checks (committed, pushed, formatted)
- Test coverage validation (existing behavior not broken, new behavior
covered)
- Canonical PR template enforcement (read and fill verbatim, no
pre-checked boxes)
- Configurable base branch (defaults to dev)
- Review bot workflow (`/review` comment + 30min wait) for agents
without local testing
- Related skills table linking `/pr-test`, `/pr-review`, `/pr-address`
- Added `.claude/skills/setup-repo/SKILL.md` — Repo bootstrap workflow
with:
- Interactive setup (branch count: 4/8/16/custom, base branch selection)
- Idempotent branch creation (skips existing branches with info message)
- Idempotent worktree creation (skips existing directories)
- .env provisioning with fallback chain (.env → .env.default →
.env.example → warning)
- Branchlet config propagation
- Dependency installation with success/failure reporting per worktree
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verified SKILL.md frontmatter follows existing skill conventions
- [x] Verified trigger conditions match expected user intents
- [x] Verified cross-references to existing skills are accurate
- [x] Verified PR template section matches
`.github/PULL_REQUEST_TEMPLATE.md`
- [x] Verified bash snippets use correct patterns (seq, show-ref, quoted
vars)
- [x] Pre-commit hooks pass on all commits
- [x] Addressed all CodeRabbit, Sentry, and Cursor review comments
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Low Risk**
> Low risk documentation-only change: adds new markdown skills without
modifying runtime code. Main risk is workflow guidance drift (e.g.,
`.env`/worktree steps) if it diverges from actual repo conventions.
>
> **Overview**
> Adds two new Claude Code skills under `.claude/skills/` to standardize
common developer workflows.
>
> `/open-pr` documents a PR creation flow that enforces using
`.github/PULL_REQUEST_TEMPLATE.md` verbatim, calls out required test
coverage, and describes how to trigger/poll the `/review` bot when local
testing isn’t available.
>
> `/setup-repo` documents an idempotent, interactive bootstrap for a
multi-worktree layout (creates `reviews` and `branch1..N`, provisions
`.env` files with `.env.default`/`.env.example` fallbacks, copies
`.branchlet.json`, and installs dependencies), complementing the
existing `/worktree` skill.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
80dbeb1596. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
## Why
AutoPilot users hit `invalid_request_error` ("thinking or
redacted_thinking blocks in the latest assistant message cannot be
modified") when sessions get long enough to trigger transcript
compaction. The Anthropic API requires thinking blocks in the last
assistant message to be byte-for-byte identical to the original response
— our compaction was flattening them to plain text, destroying the
cryptographic signatures.
Reported in Discord `#breakage` by John Ababseh with session
`31d3f08a-cb94-45eb-9fce-56b3f0287ef4`.
## What
- **`compact_transcript`** now splits the transcript into a compressible
prefix and a preserved tail (last assistant entry + trailing entries).
Only the prefix is compressed; the tail is re-appended verbatim,
preserving thinking blocks exactly.
- **`_flatten_assistant_content`** now silently drops `thinking` and
`redacted_thinking` blocks instead of creating `[__thinking__]`
placeholders — they carry no useful context for compression summaries.
- **`response_adapter`** explicitly handles `ThinkingBlock` (skip
gracefully instead of silently falling through the isinstance chain).
- **`_format_sdk_content_blocks`** now passes through raw dict blocks
(e.g. `redacted_thinking` that the SDK may not have a typed class for)
verbatim to the transcript.
## How
The key insight is the Anthropic API's asymmetric constraint:
- **Last assistant message**: thinking/redacted_thinking blocks must be
preserved byte-for-byte
- **Older assistant messages**: thinking blocks can be removed entirely
`compact_transcript` uses `_find_last_assistant_entry()` to split the
JSONL into two parts:
1. **Prefix** (everything before the last assistant): flattened and
compressed normally
2. **Tail** (last assistant + any trailing user message): preserved
verbatim and re-chained via `_rechain_tail()` to maintain the
`parentUuid` chain
This ensures the API always sees the original thinking blocks in the
last assistant message while still achieving meaningful compression on
older turns.
## Test plan
- [x] 25 new tests across `thinking_blocks_test.py` (TDD: written before
implementation)
- [x] `_find_last_assistant_entry` splits correctly at last assistant,
handles edges (no assistant, index 0, trailing user)
- [x] `_rechain_tail` patches parentUuid chain, handles empty tail
- [x] `_flatten_assistant_content` strips thinking/redacted_thinking
blocks, handles mixed content
- [x] `compact_transcript` preserves last assistant's thinking blocks
- [x] `compact_transcript` strips thinking from older assistant messages
- [x] Edge cases: trailing user message, single assistant, no thinking
blocks
- [x] `response_adapter` handles ThinkingBlock without crash
- [x] `_format_sdk_content_blocks` preserves thinking block format and
raw dict blocks
- [x] All existing copilot SDK tests pass
- [x] Pre-commit hooks (lint, format, typecheck) all pass
## Summary
Upgrade the frontend **Docker image** from **Node.js v21** (EOL since
June 2024) to **Node.js v22 LTS** (supported through April 2027).
> **Scope:** This only affects the **Dockerfile** used for local
development (`docker compose`) and CI. It does **not** affect Vercel
(which manages its own Node.js runtime) or Kubernetes (the frontend Helm
chart was removed in Dec 2025 — the frontend is deployed exclusively via
Vercel).
## Why
- Node v21.7.3 has a **known TransformStream race condition bug**
causing `TypeError: controller[kState].transformAlgorithm is not a
function` — this is
[BUILDER-3KF](https://significant-gravitas.sentry.io/issues/BUILDER-3KF)
with **567,000+ Sentry events**
- The error is entirely in Node.js internals
(`node:internal/webstreams/transformstream`), zero first-party code
- Node 21 is **not an LTS release** and has been EOL since June 2024
- `package.json` already declares `"engines": { "node": "22.x" }` — the
Dockerfile was inconsistent
- Node 22.x LTS (v22.22.1) fixes the TransformStream bug
- Next.js 15.4.x requires Node 18.18+, so Node 22 is fully compatible
## Changes
- `autogpt_platform/frontend/Dockerfile`: `node:21-alpine` →
`node:22.22-alpine3.23` (both `base` and `prod` stages)
## Test plan
- [ ] Verify frontend Docker image builds successfully via `docker
compose`
- [ ] Verify frontend starts and serves pages correctly in local Docker
environment
- [ ] Monitor Sentry for BUILDER-3KF — should drop to zero for
Docker-based runs
## Why
Multiple Sentry issues paging on-call in prod:
1. **AUTOGPT-SERVER-8BP**: `ConversionError: Failed to convert
anthropic/claude-sonnet-4-6 to <enum 'LlmModel'>` — the copilot passes
OpenRouter-style provider-prefixed model names
(`anthropic/claude-sonnet-4-6`) to blocks, but the `LlmModel` enum only
recognizes the bare model ID (`claude-sonnet-4-6`).
2. **BUILDER-7GF**: `Error invoking postEvent: Method not found` —
Sentry SDK internal error on Chrome Mobile Android, not a platform bug.
3. **XMLParserBlock**: `BlockUnknownError raised by XMLParserBlock with
message: Error in input xml syntax` — user sent bad XML but the block
raised `SyntaxError`, which gets wrapped as `BlockUnknownError`
(unexpected) instead of `BlockExecutionError` (expected).
4. **AUTOGPT-SERVER-8BS**: `Virus scanning failed for Screenshot
2026-03-26 091900.png: range() arg 3 must not be zero` — empty (0-byte)
file upload causes `range(0, 0, 0)` in the virus scanner chunking loop,
and the failure is logged at `error` level which pages on-call.
5. **AUTOGPT-SERVER-8BT**: `ValueError: <Token var=<ContextVar
name='current_context'>> was created in a different Context` —
OpenTelemetry `context.detach()` fails when the SDK streaming async
generator is garbage-collected in a different context than where it was
created (client disconnect mid-stream).
6. **AUTOGPT-SERVER-8BW**: `RuntimeError: Attempted to exit cancel scope
in a different task than it was entered in` — anyio's
`TaskGroup.__aexit__` detects cancel scope entered in one task but
exited in another when `GeneratorExit` interrupts the SDK cleanup during
client disconnect.
7. **Workspace UniqueViolationError**: `UniqueViolationError: Unique
constraint failed on (workspaceId, path)` — race condition during
concurrent file uploads handled by `WorkspaceManager._persist_db_record`
retry logic, but Sentry still captures the exception at the raise site.
8. **Library UniqueViolationError**: `UniqueViolationError` on
`LibraryAgent (userId, agentGraphId, agentGraphVersion)` — race
conditions in `add_graph_to_library` and `create_library_agent` caused
crashes or silent data loss.
9. **Graph version collision**: `UniqueViolationError` on `AgentGraph
(id, version)` — copilot re-saving an agent at an existing version
collides with the primary key.
## What
### Backend: `LlmModel._missing_()` for provider-prefixed model names
- Adds `_missing_` classmethod to `LlmModel` enum that strips the
provider prefix (e.g., `anthropic/`) when direct lookup fails
- Self-contained in the enum — no changes to the generic type conversion
system
### Frontend: Filter Sentry SDK noise
- Adds `postEvent: Method not found` to `ignoreErrors` — a known Sentry
SDK issue on certain mobile browsers
### Backend: XMLParserBlock — raise ValueError instead of SyntaxError
- Changed `_validate_tokens()` to raise `ValueError` instead of
`SyntaxError`
- Changed the `except SyntaxError` handler in `run()` to re-raise as
`ValueError`
- This ensures `Block.execute()` wraps XML parsing failures as
`BlockExecutionError` (expected/user-caused) instead of
`BlockUnknownError` (unexpected/alerts Sentry)
### Backend: Virus scanner — handle empty files + reduce alert noise
- Added early return for empty (0-byte) files in `scan_file()` to avoid
`range() arg 3 must not be zero` when `chunk_size` is 0
- Added `max(1, len(content))` guard on `chunk_size` as defense-in-depth
- Downgraded `scan_content_safe` failure log from `error` to `warning`
so single-file scan failures don't page on-call via Sentry
### Backend: Suppress SDK client cleanup errors on SSE disconnect
- Replaced `async with ClaudeSDKClient` in `_run_stream_attempt` with
manual `__aenter__`/`__aexit__` wrapped in new
`_safe_close_sdk_client()` helper
- `_safe_close_sdk_client()` catches `ValueError` (OTEL context token
mismatch) and `RuntimeError` (anyio cancel scope in wrong task) during
`__aexit__` and logs at `debug` level — these are expected when SSE
client disconnects mid-stream
- Added `_is_sdk_disconnect_error()` helper for defense-in-depth at the
outer `except BaseException` handler in `stream_chat_completion_sdk`
- Both Sentry errors (8BT and 8BW) are now suppressed without affecting
normal cleanup flow
### Backend: Filter workspace UniqueViolationError from Sentry alerts
- Added `before_send` filter in `_before_send()` to drop
`UniqueViolationError` events where the message contains `workspaceId`
and `path`
- The error is already handled by `WorkspaceManager._persist_db_record`
retry logic — it must propagate for the retry logic to work, so the fix
is at the Sentry filter level rather than catching/suppressing at source
### Backend: Library agent race condition fixes
- **`add_graph_to_library`**: Replaced check-then-create pattern with
create-then-catch-`UniqueViolationError`-then-update. On collision,
updates the existing row (restoring soft-deleted/archived agents)
instead of crashing.
- **`create_library_agent`**: Replaced `create` with `upsert` on the
`(userId, agentGraphId, agentGraphVersion)` composite unique constraint,
so concurrent adds restore soft-deleted entries instead of throwing.
### Backend: Graph version auto-increment on collision
- `__create_graph` now checks if the `(id, version)` already exists
before `create_many`, and auto-increments the version to `max_existing +
1` to avoid `UniqueViolationError` when the copilot re-saves an agent.
### Backend: Workspace `get_or_create_workspace` upsert
- Changed from find-then-create to `upsert` to atomically handle
concurrent workspace creation.
## Test plan
- [x] `LlmModel("anthropic/claude-sonnet-4-6")` resolves correctly
- [x] `LlmModel("claude-sonnet-4-6")` still works (no regression)
- [x] `LlmModel("invalid/nonexistent-model")` still raises `ValueError`
- [x] XMLParserBlock: unclosed tags, extra closing tags, empty XML all
raise `ValueError`
- [x] XMLParserBlock: `SyntaxError` from gravitasml library is caught
and re-raised as `ValueError`
- [x] Virus scanner: empty file (0 bytes) returns clean without hitting
ClamAV
- [x] Virus scanner: single-byte file scans normally (regression test)
- [x] Virus scanner: `scan_content_safe` logs at WARNING not ERROR on
failure
- [x] SDK disconnect: `_is_sdk_disconnect_error` correctly identifies
cancel scope and context var errors
- [x] SDK disconnect: `_is_sdk_disconnect_error` rejects unrelated
errors
- [x] SDK disconnect: `_safe_close_sdk_client` suppresses ValueError,
RuntimeError, and unexpected exceptions
- [x] SDK disconnect: `_safe_close_sdk_client` calls `__aexit__` on
clean exit
- [x] Library: `add_graph_to_library` creates new agent on first call
- [x] Library: `add_graph_to_library` updates existing on
UniqueViolationError
- [x] Library: `create_library_agent` uses upsert to handle concurrent
adds
- [x] All existing workspace overwrite tests still pass
- [x] All tests passing (existing + 4 XML syntax + 3 virus scanner + 10
SDK disconnect + library tests)
## Summary
- When users hit their daily CoPilot token limit, they can now spend
credits ($2.00 default) to reset it and continue working
- Adds a dialog prompt when rate limit error occurs, offering the
credit-based reset option
- Adds a "Reset daily limit" button in the usage limits panel when the
daily limit is reached
- Backend: new `POST /api/chat/usage/reset` endpoint,
`reset_daily_usage()` Redis helper, `rate_limit_reset_cost` config
- Frontend: `RateLimitResetDialog` component, updated
`UsagePanelContent` with reset button, `useCopilotStream` exposes rate
limit state
- **NEW: Resetting the daily limit also reduces weekly usage by the
daily limit amount**, effectively granting 1 extra day's worth of weekly
capacity (e.g., daily_limit=10000 → weekly usage reduced by 10000,
clamped to 0)
## Context
Users have been confused about having credits available but being
blocked by rate limits (REQ-63, REQ-61). This provides a short-term
solution allowing users to spend credits to bypass their daily limit.
The weekly usage reduction ensures that a paid daily reset doesn't just
move the bottleneck to the weekly limit — users get genuine additional
capacity for the day they paid to unlock.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Hit daily rate limit → dialog appears with reset option
- [x] Click "Reset for $2.00" → credits charged, daily counter reset,
dialog closes
- [x] Usage panel shows "Reset daily limit" button when at 100% daily
usage
- [x] When `rate_limit_reset_cost=0` (disabled), rate limit shows toast
instead of dialog
- [x] Insufficient credits → error toast shown
- [x] Verify existing rate limit tests pass
- [x] Unit tests: weekly counter reduced by daily_limit on reset
- [x] Unit tests: weekly counter clamped to 0 when usage < daily_limit
- [x] Unit tests: no weekly reduction when daily_token_limit=0
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
(new config fields `rate_limit_reset_cost` and `max_daily_resets` have
defaults in code)
- [x] `docker-compose.yml` is updated or already compatible with my
changes (no Docker changes needed)
## Why
Admins need visibility into per-user CoPilot rate limit usage and the
ability to reset a user's counters when needed (e.g., after a false
positive or for debugging). Additionally, the global rate limits were
hardcoded deploy-time constants with no way to adjust without
redeploying.
## What
- Admin endpoints to **check** a user's current rate limit usage and
**reset** their daily/weekly counters to zero
- Global rate limits are now **LaunchDarkly-configurable** via
`copilot-daily-token-limit` and `copilot-weekly-token-limit` flags,
falling back to existing `ChatConfig` values
- Frontend admin page at `/admin/rate-limits` with user lookup, usage
visualization, and reset capability
- Chat routes updated to source global limits from LD flags
## How
- **Backend**: Added `reset_user_usage()` to `rate_limit.py` that
deletes Redis usage keys. New admin routes in
`rate_limit_admin_routes.py` (GET `/api/copilot/admin/rate_limit` and
POST `/api/copilot/admin/rate_limit/reset`). Added
`COPILOT_DAILY_TOKEN_LIMIT` and `COPILOT_WEEKLY_TOKEN_LIMIT` to the
`Flag` enum. Chat routes use `_get_global_rate_limits()` helper that
checks LD first.
- **Frontend**: New `/admin/rate-limits` page with `RateLimitManager`
(user lookup) and `RateLimitDisplay` (usage bars + reset button). Added
`getUserRateLimit` and `resetUserRateLimit` to `BackendAPI` client.
## Test plan
- [x] Backend: 4 tests covering get, reset, redis failure, and
admin-only access
- [ ] Manual: Look up a user's rate limits in the admin UI
- [ ] Manual: Reset a user's usage counters
- [ ] Manual: Verify LD flag overrides are respected for global limits
Replaces "Sort my bookmarks into categories" with "Summarize my unread
emails" in the Organize suggestion category. CoPilot has no access to
browser bookmarks or local files, so the original prompt was misleading.
---
Co-authored-by: Toran Bruce Richards (@Torantulino)
<Torantulino@users.noreply.github.com>
## Why
`AgentInputBlock` has a `placeholder_values` field whose
`generate_schema()` converts it into a JSON schema `enum`. The frontend
renders any field with `enum` as a dropdown/select. This means
AI-generated agents that populate `placeholder_values` with example
values (e.g. URLs) on regular `AgentInputBlock` nodes end up with
dropdowns instead of free-text inputs — users can't type custom values.
Only `AgentDropdownInputBlock` should produce dropdown behavior.
## What
- Removed `placeholder_values` field from `AgentInputBlock.Input`
- Moved the `enum` generation logic to
`AgentDropdownInputBlock.Input.generate_schema()`
- Cleaned up test data for non-dropdown input blocks
- Updated copilot agent generation guide to stop suggesting
`placeholder_values` for `AgentInputBlock`
## How
The base `AgentInputBlock.Input.generate_schema()` no longer converts
`placeholder_values` → `enum`. Only `AgentDropdownInputBlock.Input`
defines `placeholder_values` and overrides `generate_schema()` to
produce the `enum`.
**Backward compatibility**: Existing agents with `placeholder_values` on
`AgentInputBlock` nodes load fine — `model_construct()` silently ignores
extra fields not defined on the model. Those inputs will now render as
text fields (desired behavior).
## Test plan
- [x] `poetry run pytest backend/blocks/test/test_block.py -xvs` — all
block tests pass
- [x] `poetry run format && poetry run lint` — clean
- [ ] Import an agent JSON with `placeholder_values` on an
`AgentInputBlock` — verify it loads and renders as text input
- [ ] Create an agent with `AgentDropdownInputBlock` — verify dropdown
still works
Requested by @itsababseh
Users can copy assistant output messages but not their own prompts. This
adds the same copy button to user messages — appears on hover,
right-aligned, using the existing `CopyButton` component.
## Why
Users write long prompts and need to copy them to reuse or share.
Currently requires manual text selection. ChatGPT shows copy on hover
for user messages — this matches that pattern.
## What
- Added `CopyButton` to user prompt messages in
`ChatMessagesContainer.tsx`
- Shows on hover (`group-hover:opacity-100`), positioned right-aligned
below the message
- Reuses the existing `CopyButton` and `MessageActions` components —
zero new code
## How
One file changed, 11 lines added:
1. Import `MessageActions` and `CopyButton`
2. Render them after user `MessageContent`, gated on `message.role ===
"user"` and having text parts
---
Co-authored-by: itsababseh (@itsababseh)
<36419647+itsababseh@users.noreply.github.com>
Fix broken UI when selecting nodes with array fields (list[str],
list[Enum]) in the builder. The select/input inside array items was
squeezed by the Remove button instead of taking full width.
<img width="2559" height="1077" alt="Screenshot 2026-03-26 at 10 23
34 AM"
src="https://github.com/user-attachments/assets/2ffc28a2-8d6c-428c-897c-021b1575723c"
/>
### Changes 🏗️
- **ArrayFieldItemTemplate**: Changed layout from horizontal flex-row to
vertical flex-col so the input takes full width and Remove button sits
below aligned left, with tighter spacing between them
- **Storybook config**: Added `renderers/**` glob to
`.storybook/main.ts` so renderer stories are discoverable
- **FormRenderer stories**: Added comprehensive Storybook stories
covering all backend field types (string, int, float, bool, enum,
date/time, list[str], list[int], list[Enum], list[bool], nested objects,
Optional, anyOf unions, oneOf discriminated unions, multi-select, list
of objects, and a kitchen sink). Includes exact Twitter GetUserBlock
schema for realistic oneOf + multi-select testing.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verified array field items render with full-width input and Remove
button below in Storybook
- [x] Verified list[Enum] select dropdown takes full width
- [x] Verified list[str] text input takes full width
- [x] Verified all FormRenderer stories render without errors in
Storybook
- [x] Verified multi-select and oneOf discriminated union stories match
real backend schemas
### Why / What / How
**Why:** When voice recording is active in the CoPilot chat input, the
recording UI (waveform + timer) overlays on top of the placeholder/hint
text, creating a visually broken appearance. Reported by a user via
SECRT-2163.
**What:** Hide the textarea placeholder text while voice recording is
active so it doesn't bleed through the `RecordingIndicator` overlay.
**How:** When `isRecording` is true, the placeholder is set to an empty
string. The existing `RecordingIndicator` overlay (waveform animation +
elapsed time) then displays cleanly without the hint text showing
underneath.
### Changes 🏗️
- Clear the `PromptInputTextarea` placeholder to `""` when voice
recording is active, preventing it from rendering behind the
`RecordingIndicator` overlay
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Open CoPilot chat at /copilot
- [x] Click the microphone button or press Space to start voice
recording
- [x] Verify the placeholder text ("Type your message..." / "What else
can I help with?") is hidden during recording
- [x] Verify the RecordingIndicator (waveform + timer) displays cleanly
without overlapping text
- [x] Stop recording and verify placeholder text reappears
- [x] Verify "Transcribing..." placeholder shows during transcription
## Summary
- Added `validate_sink_input_existence` method to `AgentValidator` to
ensure all sink names in links and input defaults reference valid input
schema fields in the corresponding block
- Added comprehensive tests covering valid/invalid sink names, nested
inputs, and default key handling
- Updated `ReadDiscordMessagesBlock` description to clarify it reads new
messages and triggers on new posts
- Removed leftover test function file
## Test plan
- [ ] Run `pytest` on `validator_test.py` to verify all sink input
validation cases pass
- [ ] Verify existing agent validation flow is unaffected
- [ ] Confirm `ReadDiscordMessagesBlock` description update is accurate
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
## Summary
- Adds `visibilitychange`-based sleep/wake detection to the copilot chat
— when the page becomes visible after >30s hidden, automatically refetch
the session and either resume an active stream or hydrate completed
messages
- Blocks chat input during re-sync (`isSyncing` state) to prevent users
from accidentally sending a message that overwrites the agent's
completed work
- Replaces `PulseLoader` with a spinning `CircleNotch` icon on sidebar
session names for background streaming sessions (closer to ChatGPT's UX)
## How it works
1. When the page goes hidden, we record a timestamp
2. When the page becomes visible, we check elapsed time
3. If >30s elapsed (indicating sleep or long background), we refetch the
session from the API
4. If backend still has `active_stream=true` → remove stale assistant
message and resume SSE
5. If backend is done → the refetch triggers React Query invalidation
which hydrates the completed messages
6. Chat input stays disabled (`isSyncing=true`) until re-sync completes
## Test plan
- [ ] Open copilot, start a long-running agent task
- [ ] Close laptop lid / lock screen for >30 seconds
- [ ] Wake device — verify chat shows the agent's completed response (or
resumes streaming)
- [ ] Verify chat input is temporarily disabled during re-sync, then
re-enables
- [ ] Verify sidebar shows spinning icon (not pulse loader) for
background sessions
- [ ] Verify no duplicate messages appear after wake
- [ ] Verify normal streaming (no sleep) still works as expected
Resolves: [SECRT-2159](https://linear.app/autogpt/issue/SECRT-2159)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
<img width="700" height="575" alt="Screenshot 2026-03-23 at 21 40 07"
src="https://github.com/user-attachments/assets/f6138c63-dd5e-4bde-a2e4-7434d0d3ec72"
/>
Re-applies #12452 which was reverted as collateral in #12485 (invite
system revert).
Replaces the flat list of suggestion pills in the CoPilot empty session
with themed prompt categories (Learn, Create, Automate, Organize), each
shown as a popover with contextual prompts.
- **Backend**: Adds `suggested_prompts` as a themed `dict[str,
list[str]]` keyed by category. Updates Tally extraction LLM prompt to
generate prompts per theme, and the `/suggested-prompts` API to return
grouped themes. Legacy `list[str]` rows are preserved under a
`"General"` key for backward compatibility.
- **Frontend**: Replaces inline pill buttons with a `SuggestionThemes`
popover component. Each theme button (with icon) opens a dropdown of 5
relevant prompts. Falls back to hardcoded defaults when the API has no
personalized prompts. Normalizes partial API responses by padding
missing themes with defaults. Legacy `"General"` prompts are distributed
round-robin across themes.
### Changes 🏗️
- `backend/data/understanding.py`: `suggested_prompts` field added as
`dict[str, list[str]]`; legacy list rows preserved under `"General"` key
via `_json_to_themed_prompts`
- `backend/data/tally.py`: LLM prompt updated to generate themed
prompts; validation now per-theme with blank-string rejection
- `backend/api/features/chat/routes.py`: New `SuggestedTheme` model;
endpoint returns `themes[]`
- `frontend/copilot/components/EmptySession/EmptySession.tsx`: Uses
generated API hooks for suggested prompts
- `frontend/copilot/components/EmptySession/helpers.ts`:
`DEFAULT_THEMES` replaces `DEFAULT_QUICK_ACTIONS`; `getSuggestionThemes`
normalizes partial API responses
-
`frontend/copilot/components/EmptySession/components/SuggestionThemes/`:
New popover component with theme icons and loading states
### Checklist 📋
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verify themed suggestion buttons render on CoPilot empty session
- [x] Click each theme button and confirm popover opens with prompts
- [x] Click a prompt and confirm it sends the message
- [x] Verify fallback to default themes when API returns no custom
prompts
- [x] Verify legacy users' personalized prompts are preserved and
visible
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
Two backend fixes for CoPilot stability:
1. **Steer model away from bash_exec for SDK tool-result files** — When
the SDK returns tool results as file paths, the copilot model was
attempting to use `bash_exec` to read them instead of treating the
content directly. Added system prompt guidance to prevent this.
2. **Guard against missing 'name' in execution input_data** —
`GraphExecution.from_db()` assumed all INPUT/OUTPUT block node
executions have a `name` field in `input_data`. This crashes with
`KeyError: 'name'` when non-standard blocks (e.g., OrchestratorBlock)
produce node executions without this field. Added `"name" in
exec.input_data` guards.
## Why
- The bash_exec issue causes copilot to fail when processing SDK tool
outputs
- The KeyError crashes the `update_graph_execution_stats` endpoint,
causing graph executions to appear stuck (retries 35+ times, never
completes)
## How
- Added system prompt instruction to treat tool result file contents
directly
- Added `"name" in exec.input_data` guard in both input extraction (line
340) and output extraction (line 365) in `execution.py`
### Changes
- `backend/copilot/sdk/service.py` — system prompt guidance
- `backend/data/execution.py` — KeyError guard for missing `name` field
### Checklist 📋
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan
#### Test plan:
- [x] OrchestratorBlock graph execution no longer gets stuck
- [x] Standard Agent Input/Output blocks still work correctly
- [x] Copilot SDK tool results are processed without bash_exec
## Why
Admins reviewing marketplace submissions currently approve blindly —
they can see raw metadata in the admin table but cannot see what the
listing actually looks like (images, video, branding, layout). This
risks approving inappropriate content. With full-scale production
approaching, this is critical.
Additionally, when a creator un-publishes an agent, users who already
added it to their library lose access — breaking their workflows.
Product decided on a "you added it, you keep it" model.
## What
- **Admin preview page** at `/admin/marketplace/preview/[id]` — renders
the listing exactly as it would appear on the public marketplace
- **Add to Library** for admins to test-run pending agents before
approving
- **Library membership grants graph access** — if you added an agent to
your library, you keep access even if it's un-published or rejected
- **Preview button** on every submission row in the admin marketplace
table
- **Cross-reference comments** on original functions to prevent
SECRT-2162-style regressions
## How
### Backend
**Admin preview (`store/db.py`):**
- `get_store_agent_details_as_admin()` queries `StoreListingVersion`
directly, bypassing the APPROVED-only `StoreAgent` DB view
- Validates `CreatorProfile` FK integrity, reads all fields including
`recommendedScheduleCron`
**Admin add-to-library (`library/_add_to_library.py`):**
- Extracted shared logic into `resolve_graph_for_library()` +
`add_graph_to_library()` — eliminates duplication between public and
admin paths
- Admin path uses `get_graph_as_admin()` to bypass marketplace status
checks
- Handles concurrent double-click race via `UniqueViolationError` catch
**Library membership grants graph access (`data/graph.py`):**
- `get_graph()` now falls back to `LibraryAgent` lookup if ownership and
marketplace checks fail
- Only for authenticated users with non-deleted, non-archived library
records
- `validate_graph_execution_permissions()` updated to match — library
membership grants execution access too
**New endpoints (`store_admin_routes.py`):**
- `GET /admin/submissions/{id}/preview` — returns `StoreAgentDetails`
- `POST /admin/submissions/{id}/add-to-library` — creates `LibraryAgent`
via admin path
### Frontend
- Preview page reuses `AgentInfo` + `AgentImages` with admin banner
- Shows instructions, recommended schedule, and slug
- "Add to My Library" button wired to admin endpoint
- Preview button added to `ExpandableRow` (header + version history)
- Categories column uncommented in version history table
### Testing (19 tests)
**Graph access control (9 in `graph_test.py`):** Owner access,
marketplace access, library member access (unpublished),
deleted/archived/anonymous denied, null FK denied, efficiency checks
**Admin bypass (5 in `store_admin_routes_test.py`):** Preview uses
StoreListingVersion not StoreAgent, admin path uses get_graph_as_admin,
regular path uses get_graph, library member can view in builder
**Security (3):** Non-admin 403 on preview, non-admin 403 on
add-to-library, nonexistent 404
**SECRT-2162 regression (2):** Admin access to pending agent, export
with sub-graphs
### Checklist
- [x] Changes clearly listed
- [x] Test plan made
- [x] 19 backend tests pass
- [x] Frontend lints and types clean
## Test plan
- [x] Navigate to `/admin/marketplace`, click Preview on a PENDING
submission
- [x] Verify images, video, description, categories, instructions,
schedule render correctly
- [x] Click "Add to My Library", verify agent appears in library and
opens in builder
- [x] Verify non-admin users get 403
- [x] Verify un-publishing doesn't break access for users who already
added it
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **High Risk**
> Adds new admin-only endpoints that bypass marketplace
approval/ownership checks and changes `get_graph`/execution
authorization to grant access via library membership, which impacts
security-sensitive access control paths.
>
> **Overview**
> Adds **admin preview + review workflow support** for marketplace
submissions: new admin routes to `GET /admin/submissions/{id}/preview`
(querying `StoreListingVersion` directly) and `POST
/admin/submissions/{id}/add-to-library` (admin bypass to pull pending
graphs into an admin’s library).
>
> Refactors library add-from-store logic into shared helpers
(`resolve_graph_for_library`, `add_graph_to_library`) and introduces an
admin variant `add_store_agent_to_library_as_admin`, including restore
of archived/deleted entries and dedup/race handling.
>
> Changes core graph access rules: `get_graph()` now falls back to
**library membership** (non-deleted/non-archived, version-specific) when
ownership and marketplace approval don’t apply, and
`validate_graph_execution_permissions()` is updated accordingly.
Frontend adds a preview link and a dedicated admin preview page with
“Add to My Library”; tests expand significantly to lock in the new
bypass and access-control behavior.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
a362415d12. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### Why / What / How
**Why:** GitHub's code scanning detected a HIGH severity security
vulnerability in `/autogpt_platform/backend/backend/util/json.py:172`.
The error handler in `sanitize_json()` was logging sensitive data
(potentially including secrets, API keys, credentials) as clear text
when serialization fails.
**What:** This PR removes the logging of actual data content from the
error handler while preserving useful debugging metadata (error type,
error message, and data type).
**How:** Removed the `"Data preview: %s"` format parameter and the
corresponding `truncate(str(data), 100)` argument from the
logger.error() call. The error handler now logs only safe metadata that
helps debugging without exposing sensitive information.
### Changes 🏗️
- **Security Fix**: Modified `sanitize_json()` function in
`backend/util/json.py`
- Removed logging of data content (`truncate(str(data), 100)`) from the
error handler
- Retained logging of error type (`type(e).__name__`)
- Retained logging of truncated error message (`truncate(str(e), 200)`)
- Retained logging of data type (`type(data).__name__`)
- Error handler still provides useful debugging information without
exposing secrets
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verified the code passes type checking (`poetry run pyright
backend/util/json.py`)
- [x] Verified the code passes linting (`poetry run ruff check
backend/util/json.py`)
- [x] Verified all pre-commit hooks pass
- [x] Reviewed the diff to ensure only the sensitive data logging was
removed
- [x] Confirmed that useful debugging information (error type, error
message, data type) is still logged
#### For configuration changes:
- N/A - No configuration changes required
## Summary
- **pr-address skill**: Add explicit rule against empty commits for CI
re-triggers, and strengthen push-immediately guidance with rationale
- **Platform CLAUDE.md**: Add "split PRs by concern" guideline under
Creating Pull Requests
### Changes
- Updated `.claude/skills/pr-address/SKILL.md`
- Updated `autogpt_platform/CLAUDE.md`
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan
#### Test plan:
- [x] Documentation-only changes — no functional tests needed
- [x] Verified markdown renders correctly
### Changes 🏗️
- When the AutoPilot block executes a copilot session via
`collect_copilot_response`, it calls `stream_chat_completion_sdk`
directly, bypassing the copilot executor and stream registry. This means
the frontend sees no `active_stream` on the session and cannot connect
via SSE — users see a frozen chat with no updates until the turn fully
completes.
- Fix: register a `stream_registry` session in
`collect_copilot_response` and publish each chunk to Redis as events are
consumed. This allows the frontend to detect `active_stream=true` and
connect via the SSE reconnect endpoint for live streaming updates during
AutoPilot execution.
- Error handling is graceful — if stream registry fails, AutoPilot still
works normally, just without real-time frontend updates.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Trigger an AutoPilot block execution that creates a new chat
session
- [x] Verify the new session appears in the sidebar with streaming
indicator
- [x] Click on the session while AutoPilot is still executing — verify
SSE connects and messages stream in real-time
- [x] Verify that after AutoPilot completes, the session shows as
complete (no active_stream)
- [x] Test reconnection: disconnect and reconnect while AutoPilot is
running — verify stream resumes (found and fixed GeneratorExit bug that
caused stuck sessions)
- [x] E2E: 10 stream events published to Redis (StreamStart,
3×ToolInput, 3×ToolOutput, TextStart, TextEnd, StreamFinish)
- [x] E2E: Redis xadd latency 0.2–3.4ms per chunk
- [x] E2E: Chat sessions registered in Redis (confirmed via redis-cli)
## Summary
- Allow `/tmp` as a valid writable directory in E2B sandbox file tools
(`write_file`, `read_file`, `edit_file`, `glob`, `grep`)
- The E2B sandbox is already fully isolated, so restricting writes to
only `/home/user` was unnecessarily limiting — scripts and tools
commonly use `/tmp` for temporary files
- Extract `is_within_allowed_dirs()` helper in `context.py` to
centralize the allowed-directory check for both path resolution and
symlink escape detection
## Changes
- `context.py`: Add `E2B_ALLOWED_DIRS` tuple and `E2B_ALLOWED_DIRS_STR`,
introduce `is_within_allowed_dirs()`, update `resolve_sandbox_path()` to
use it
- `e2b_file_tools.py`: Update `_check_sandbox_symlink_escape()` to use
`is_within_allowed_dirs()`, update tool descriptions
- Tests: Add coverage for `/tmp` paths in both `context_test.py` and
`e2b_file_tools_test.py`
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] All 59 existing + new tests pass (`poetry run pytest
backend/copilot/context_test.py
backend/copilot/sdk/e2b_file_tools_test.py`)
- [x] `poetry run format` and `poetry run lint` pass clean
- [x] Verify `/tmp` write works in live E2B sandbox
- [x] E2E: Write file to /tmp/test.py in E2B sandbox via copilot
- [x] E2E: Execute script from /tmp — output "Hello, World!"
- [x] E2E: E2B sandbox lifecycle (create, use, pause) works correctly
## Summary
- Filter SDK-provisioned default credentials from credentials API list
endpoints
- Reuse `CredentialsMetaResponse` model from internal router in external
API (removes duplicate `CredentialSummary`)
- Add `is_sdk_default()` helper for identifying platform-provisioned
credentials
- Add `provider_matches()` to credential store for consistent provider
filtering
- Add tests for credential filtering behavior
### Changes
- `backend/data/model.py` — add `is_sdk_default()` helper
- `backend/api/features/integrations/router.py` — filter SDK defaults
from list endpoints
- `backend/api/external/v1/integrations.py` — reuse
`CredentialsMetaResponse`, filter SDK defaults
- `backend/integrations/credentials_store.py` — add `provider_matches()`
- `backend/sdk/registry.py` — update credential registration
- `backend/api/features/integrations/router_test.py` — new tests
- `backend/api/features/integrations/conftest.py` — test fixtures
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan
#### Test plan:
- [x] Unit tests for credential filtering (`router_test.py`)
- [x] Verify SDK default credentials excluded from API responses
- [x] Verify user-created credentials still returned normally
## Why
Agent generation and building needs a way to test-run agents without
requiring real credentials or producing side effects. Currently, every
execution hits real APIs, consumes credits, and requires valid
credentials — making it impossible to debug or validate agent graphs
during the build phase without real consequences.
## Summary
Adds a `dry_run` execution mode to the copilot's `run_block` and
`run_agent` tools. When `dry_run=True`, every block execution is
simulated by an LLM instead of calling the real service — no real API
calls, no credentials consumed, no side effects.
Inspired by
[Significant-Gravitas/agent-simulator](https://github.com/Significant-Gravitas/agent-simulator).
### How it works
- **`backend/executor/simulator.py`** (new): `simulate_block()` builds a
prompt from the block's name, description, input/output schemas, and
actual input values, then calls `gpt-4o-mini` via the existing
OpenRouter client with JSON mode. Retries up to 5 times on JSON parse
failures. Missing output pins are filled with `None` (or `""` for the
`error` pin). Long inputs (>20k chars) are truncated before sending to
the LLM.
- **`ExecutionContext`**: Added `dry_run: bool = False` field; threaded
through `add_graph_execution()` so graph-level dry runs propagate to
every block execution.
- **`execute_block()` helper**: When `dry_run=True`, the function
short-circuits before any credential injection or credit checks, calls
`simulate_block()`, and returns a `[DRY RUN]`-prefixed
`BlockOutputResponse`.
- **`RunBlockTool`**: New `dry_run` boolean parameter.
- **`RunAgentTool`**: New `dry_run` boolean parameter; passes
`ExecutionContext(dry_run=True)` to graph execution.
### Tests
11 tests in `backend/copilot/tools/test_dry_run.py`:
- Correct output tuples from LLM response
- JSON retry logic (3 total calls when first 2 fail)
- All-retries-exhausted yields `SIMULATOR ERROR`
- Missing output pins filled with `None`/`""`
- No-client case
- Input truncation at 20k chars
- `execute_block(dry_run=True)` skips real `block.execute()`
- Response format: `[DRY RUN]` message, `success=True`
- `dry_run=False` unchanged (real path)
- `RunBlockTool` parameter presence
- `dry_run` kwarg forwarding
## Test plan
- [x] Run `pytest backend/copilot/tools/test_dry_run.py -v` — all 11
pass
- [x] Call `run_block` with `dry_run=true` in copilot; verify no real
API calls occur and output contains `[DRY RUN]`
- [x] Call `run_agent` with `dry_run=true`; verify execution is created
with `dry_run=True` in context
- [x] E2E: Simulate button (flask icon) present in builder alongside
play button
- [x] E2E: Simulated run labeled with "(Simulated)" suffix and badge in
Library
- [x] E2E: No credits consumed during dry-run
## Summary
**Critical CI fix** — litellm was compromised in a supply chain attack
(versions 1.82.7/1.82.8 contained infostealer malware) and PyPI
subsequently yanked many litellm versions including the 1.7x range that
stagehand 0.5.x depended on. This breaks `poetry lock` in CI for all
PRs.
- Bump `stagehand` from `^0.5.1` to `^3.4.0` — Stagehand v3 is a
Stainless-generated HTTP API client that **no longer depends on
litellm**, completely removing litellm from our dependency tree
- Migrate stagehand blocks to use `AsyncStagehand` + session-based API
(`sessions.start`, `session.navigate/act/observe/extract`)
- Net reduction of ~430 lines in `poetry.lock` from dropping litellm and
its transitive dependencies
## Why
All CI pipelines are blocked because `poetry lock` fails to resolve
yanked litellm versions that stagehand 0.5.x required.
## Test plan
- [x] CI passes (poetry lock resolves, backend tests green)
- [ ] Verify stagehand blocks still function with the new session-based
API
## Summary
- Implements **infrastructure-level parallel tool execution** for
CoPilot: all tools called in a single LLM turn now execute concurrently
with zero changes to individual tool implementations or LLM prompts.
- Adds `pre_launch_tool_call()` to `tool_adapter.py`: when an
`AssistantMessage` with `ToolUseBlock`s arrives, all tools are
immediately fired as `asyncio.Task`s before the SDK dispatches MCP
handlers. Each MCP handler then awaits its pre-launched task instead of
executing fresh.
- Adds a `_tool_task_queues` `ContextVar` (initialized per-session in
`set_execution_context()`) so concurrent sessions never share task
queues.
- DRY refactor: extracts `prepare_block_for_execution()`,
`check_hitl_review()`, and `BlockPreparation` dataclass into
`helpers.py` so the execution pipeline is reusable.
- 10 unit tests for the parallel pre-launch infrastructure (queue
enqueue/dequeue, MCP prefix stripping, fallback path, `CancelledError`
handling, multi-same-tool FIFO ordering).
## Root cause
The Claude Agent SDK CLI sends MCP tool calls as sequential
request-response pairs: it waits for each `control_response` before
issuing the next `mcp_message`. Even though Python dispatches handlers
with `start_soon`, the CLI never issues call B until call A's response
is sent — blocks always ran sequentially. The pre-launch pattern fixes
this at the infrastructure level by starting all tasks before the SDK
even dispatches the first handler.
## Test plan
- [x] `poetry run pytest backend/copilot/sdk/tool_adapter_test.py` — 27
tests pass (10 new parallel infra tests)
- [x] `poetry run pytest backend/copilot/tools/helpers_test.py` — 20
tests pass
- [x] `poetry run pytest backend/copilot/tools/run_block_test.py
backend/copilot/tools/test_run_block_details.py` — all pass
- [x] Manually test in CoPilot: ask the agent to run two blocks
simultaneously — verify both start executing before either completes
- [x] E2E: Both GetCurrentTimeBlock and CalculatorBlock executed
concurrently (time=09:35:42, 42×7=294)
- [x] E2E: Pre-launch mechanism active — two run_block events at same
timestamp (3ms apart)
- [x] E2E: Arg-mismatch fallback tested — system correctly cancels and
falls back to direct execution
## Summary
- Renames `SmartDecisionMakerBlock` to `OrchestratorBlock` across the
entire codebase
- The block supports iteration/agent mode and general tool
orchestration, so "Smart Decision Maker" no longer accurately describes
its capabilities
- Block UUID (`3b191d9f-356f-482d-8238-ba04b6d18381`) remains unchanged
— fully backward compatible with existing graphs
## Changes
- Renamed block class, constants, file names, test files, docs, and
frontend enum
- Updated copilot agent generator (helpers, validator, fixer) references
- Updated agent generation guide documentation
- No functional changes — pure rename refactor
### For code changes
- [x] I have clearly listed my changes in the PR description
- [x] I have made corresponding changes to the documentation
- [x] My changes do not generate new warnings or errors
- [x] New and existing unit tests pass locally with my changes
## Test plan
- [x] All pre-commit hooks pass (typecheck, lint, format)
- [x] Existing graphs with this block continue to load and execute (same
UUID)
- [x] Agent mode / iteration mode works as before
- [x] Copilot agent generator correctly references the renamed block
Requested by @majdyz
Follow-up to #12513. Anthropic/OpenAI 401, 403, and 429 errors are
user-caused (bad API keys, forbidden, rate limits) and should not hit
Sentry as exceptions.
### Changes
**Changes in `blocks/llm.py`:**
- Anthropic `APIError` handler (line ~950): check `status_code` — use
`logger.warning()` for 401/403/429, keep `logger.error()` for server
errors
- Generic `Exception` handler in LLM block `run()` (line ~1467): same
pattern — `logger.warning()` for user-caused status codes,
`logger.exception()` for everything else
- Extracted `USER_ERROR_STATUS_CODES = (401, 403, 429)` module-level
constant
- Added `break` to short-circuit retry loop for user-caused errors
- Removed double-logging from inner Anthropic handler
**Changes in `blocks/test/test_llm.py`:**
- Added 8 regression tests covering 401/403/429 fast-exit and 500 retry
behavior
**Sentry issues addressed:**
- AUTOGPT-SERVER-8B6, 8B7, 8B8 — `[LLM-Block] Anthropic API error: Error
code: 401 - invalid x-api-key`
- Any OpenAI 401/403/429 errors hitting the generic exception handler
Part of SECRT-2166
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan
#### Test plan:
- [x] Unit tests for 401/403/429 Anthropic errors → warning log, no
retry
- [x] Unit tests for 500 Anthropic errors → error log, retry
- [x] Unit tests for 401/403/429 OpenAI errors → warning log, no retry
- [x] Unit tests for 500 OpenAI errors → error log, retry
- [x] Verified USER_ERROR_STATUS_CODES constant is used consistently
- [x] Verified no double-logging in Anthropic handler path
---
Co-authored-by: Zamil Majdy (@majdyz) <zamil.majdy@agpt.co>
---------
Co-authored-by: Zamil Majdy (@majdyz) <zamil.majdy@agpt.co>
## Summary
- Adds `CopilotPermissions` model (`copilot/permissions.py`) — a
capability filter that restricts which tools and blocks the
AutoPilot/Copilot may use during a single execution
- Exposes 4 new `advanced=True` fields on `AutoPilotBlock`: `tools`,
`tools_exclude`, `blocks`, `blocks_exclude`
- Threads permissions through the full execution path: `AutoPilotBlock`
→ `collect_copilot_response` → `stream_chat_completion_sdk` →
`run_block`
- Implements recursion inheritance via contextvar: sub-agent executions
can only be *more* restrictive than their parent
## Design
**Tool filtering** (`tools` + `tools_exclude`):
- `tools_exclude=True` (default): `tools` is a **blacklist** — listed
tools denied, all others allowed. Empty list = allow all.
- `tools_exclude=False`: `tools` is a **whitelist** — only listed tools
are allowed.
- Users specify short names (`run_block`, `web_fetch`, `Read`, `Task`,
…) — mapped to full SDK format internally.
- Validated eagerly at block-run time with a clear error listing valid
names.
**Block filtering** (`blocks` + `blocks_exclude`):
- Same semantics as tool filtering, applied inside `run_block` via
contextvar.
- Each entry can be a full UUID, an 8-char partial UUID (first segment),
or a case-insensitive block name.
- Validated against the live block registry; invalid identifiers surface
a helpful error before the session is created.
**Recursion inheritance**:
- `_inherited_permissions` contextvar stores the parent execution's
permissions.
- On each `AutoPilotBlock.run()`, the child's permissions are merged
with the parent via `merged_with_parent()` — effective allowed sets are
intersected (tools) and the parent chain is kept for block checks.
- Sub-agents can never expand what the parent allowed.
## Test plan
- [x] 68 new unit tests in `copilot/permissions_test.py` and
`blocks/autopilot_permissions_test.py`
- [x] Block identifier matching: full UUID, partial UUID, name,
case-insensitivity
- [x] Tool allow/deny list semantics including edge cases (empty list,
unknown tool)
- [x] Parent/child merging and recursion ceiling correctness
- [x] `validate_tool_names` / `validate_block_identifiers` with mock
block registry
- [x] `apply_tool_permissions` SDK tool-list integration
- [x] `AutoPilotBlock.run()` — invalid tool/block yields error before
session creation
- [x] `AutoPilotBlock.run()` — valid permissions forwarded to
`execute_copilot`
- [x] Existing `AutoPilotBlock` block tests still pass (2/2)
- [x] All hooks pass (pyright, ruff, black, isort)
- [x] E2E: CoPilot chat works end-to-end with E2B sandbox (12s stream)
- [x] E2E: Permission fields render in Builder UI (Tools combobox,
exclude toggles)
- [x] E2E: Agent with restricted permissions (whitelist web_fetch only)
executes correctly
- [x] E2E: Permission values preserved through API round-trip
## Why
Admins cannot download submitted-but-not-yet-approved agents from
`/admin/marketplace`. Clicking "Download" fails silently with a Server
Components render error. This blocks admins from reviewing agents that
companies have submitted.
## What
Remove the redundant ownership/marketplace check from
`get_graph_as_admin()` that was silently tightened in PR #11323 (Nov
2025). Add regression tests for both the admin download path and the
non-admin marketplace access control.
## How
**Root cause:** In PR #11323, Reinier refactored an inline
`StoreListingVersion` query (which had no status filter) into a call to
`is_graph_published_in_marketplace()` (which requires `submissionStatus:
APPROVED`). This was collateral cleanup — his PR focused on sub-agent
execution permissions — but it broke admin download of pending agents.
**Fix:** Remove the ownership/marketplace check from
`get_graph_as_admin()`, keeping only the null guard. This is safe
because `get_graph_as_admin` is only callable through admin-protected
routes (`requires_admin_user` at router level).
**Tests added:**
- `test_admin_can_access_pending_agent_not_owned` — admin can access a
graph they don't own that isn't APPROVED
- `test_admin_download_pending_agent_with_subagents` — admin export
includes sub-graphs
- `test_get_graph_non_owner_approved_marketplace_agent` — protects PR
#11323: non-owners CAN access APPROVED agents
- `test_get_graph_non_owner_pending_marketplace_agent_denied` — protects
PR #11323: non-owners CANNOT access PENDING agents
### Checklist
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] 4 regression tests pass locally
- [x] Admin can download pending agents (verified via unit test)
- [x] Non-admin marketplace access control preserved
## Test plan
- [ ] Verify admin can download a submitted-but-not-approved agent from
`/admin/marketplace`
- [ ] Verify non-admin users still cannot access admin endpoints
- [ ] Verify the download succeeds without console errors
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Changes access-control behavior for admin graph retrieval; risk is
mitigated by route-level admin auth but misuse of `get_graph_as_admin()`
outside admin-protected routes would expose non-approved graphs.
>
> **Overview**
> Admins can now download/review **submitted-but-not-approved**
marketplace agents: `get_graph_as_admin()` no longer enforces ownership
or *marketplace APPROVED* checks, only returning `None` when the graph
doesn’t exist.
>
> Adds regression tests covering the admin download/export path
(including sub-graphs) and confirming non-admin behavior is unchanged:
non-owners can fetch **APPROVED** marketplace graphs but cannot access
**pending** ones.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
a6d2d69ae4. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Adds a two-layer circuit breaker to prevent AutoPilot from looping
infinitely when tool calls fail with empty parameters
- **Tool-level**: After 3 consecutive identical failures per tool,
returns a hard-stop message instructing the model to output content as
text instead of retrying
- **Stream-level**: After 6 consecutive empty tool calls (`input: {}`),
aborts the stream entirely with a user-visible error and retry button
## Background
In session `c5548b48`, the model completed all research successfully but
then spent 51+ minutes in an infinite loop trying to write output —
every tool call was sent with `input: {}` (likely due to context
saturation preventing argument serialization). 21+ identical failing
tool calls with no circuit breaker.
## Changes
- `tool_adapter.py`: Added `_check_circuit_breaker`,
`_record_tool_failure`, `_clear_tool_failures` functions with a
`ContextVar`-based tracker. Integrated into both `create_tool_handler`
(BaseTool) and the `_truncating` wrapper (all tools).
- `service.py`: Added empty-tool-call detection in the main stream loop
that counts consecutive `AssistantMessage`s with empty
`ToolUseBlock.input` and aborts after the limit.
- `test_circuit_breaker.py`: 7 unit tests covering threshold behavior,
per-args tracking, reset on success, and uninitialized tracker safety.
## Test plan
- [x] Unit tests pass (`pytest
backend/copilot/sdk/test_circuit_breaker.py` — 8/8 passing)
- [x] Pre-commit hooks pass (Ruff, Black, isort, typecheck all pass)
- [x] E2E: CoPilot tool calls work normally (GetCurrentTimeBlock
returned 09:16:39 UTC)
- [x] E2E: Circuit breaker pass-through verified (successful calls don't
trigger breaker)
- [x] E2E: Circuit breaker code integrated into tool_adapter truncating
wrapper
## Summary
- Add "Critical Requirements" section making screenshots at every step,
PR comment posting, state verification, negative tests, and full
evidence reports non-negotiable
- Add "State Manipulation for Realistic Testing" section with Redis CLI,
DB query, and API before/after patterns
- Strengthen fix mode to require before/after screenshot pairs, rebuild
only affected services, and commit after each fix
- Expand test report format to include API evidence and screenshot
evidence columns
- Bump version to 2.0.0
## Test plan
- [x] Run `/pr-test` on an existing PR and verify it follows the new
critical requirements
- [x] Verify screenshots are posted to PR comment
- [x] Verify fix mode produces before/after screenshot pairs
Bumps the development-dependencies group with 2 updates in the
/autogpt_platform/autogpt_libs directory:
[pytest-cov](https://github.com/pytest-dev/pytest-cov) and
[ruff](https://github.com/astral-sh/ruff).
Updates `pytest-cov` from 7.0.0 to 7.1.0
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/pytest-dev/pytest-cov/blob/master/CHANGELOG.rst">pytest-cov's
changelog</a>.</em></p>
<blockquote>
<h2>7.1.0 (2026-03-21)</h2>
<ul>
<li>
<p>Fixed total coverage computation to always be consistent, regardless
of reporting settings.
Previously some reports could produce different total counts, and
consequently can make --cov-fail-under behave different depending on
reporting options.
See <code>[#641](https://github.com/pytest-dev/pytest-cov/issues/641)
<https://github.com/pytest-dev/pytest-cov/issues/641></code>_.</p>
</li>
<li>
<p>Improve handling of ResourceWarning from sqlite3.</p>
<p>The plugin adds warning filter for sqlite3
<code>ResourceWarning</code> unclosed database (since 6.2.0).
It checks if there is already existing plugin for this message by
comparing filter regular expression.
When filter is specified on command line the message is escaped and does
not match an expected message.
A check for an escaped regular expression is added to handle this
case.</p>
<p>With this fix one can suppress <code>ResourceWarning</code> from
sqlite3 from command line::</p>
<p>pytest -W "ignore:unclosed database in <sqlite3.Connection
object at:ResourceWarning" ...</p>
</li>
<li>
<p>Various improvements to documentation.
Contributed by Art Pelling in
<code>[#718](https://github.com/pytest-dev/pytest-cov/issues/718)
<https://github.com/pytest-dev/pytest-cov/pull/718></code>_ and
"vivodi" in
<code>[#738](https://github.com/pytest-dev/pytest-cov/issues/738)
<https://github.com/pytest-dev/pytest-cov/pull/738></code><em>.
Also closed
<code>[#736](https://github.com/pytest-dev/pytest-cov/issues/736)
<https://github.com/pytest-dev/pytest-cov/issues/736></code></em>.</p>
</li>
<li>
<p>Fixed some assertions in tests.
Contributed by in Markéta Machová in
<code>[#722](https://github.com/pytest-dev/pytest-cov/issues/722)
<https://github.com/pytest-dev/pytest-cov/pull/722></code>_.</p>
</li>
<li>
<p>Removed unnecessary coverage configuration copying (meant as a backup
because reporting commands had configuration side-effects before
coverage 5.0).</p>
</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="66c8a526b1"><code>66c8a52</code></a>
Bump version: 7.0.0 → 7.1.0</li>
<li><a
href="f707662478"><code>f707662</code></a>
Make the examples use pypy 3.11.</li>
<li><a
href="6049a78478"><code>6049a78</code></a>
Make context test use the old ctracer (seems the new sysmon tracer
behaves di...</li>
<li><a
href="8ebf20bbbc"><code>8ebf20b</code></a>
Update changelog.</li>
<li><a
href="861d30e60d"><code>861d30e</code></a>
Remove the backup context manager - shouldn't be needed since coverage
5.0, ...</li>
<li><a
href="fd4c956014"><code>fd4c956</code></a>
Pass the precision on the nulled total (seems that there's some caching
goion...</li>
<li><a
href="78c9c4ecb0"><code>78c9c4e</code></a>
Only run the 3.9 on older deps.</li>
<li><a
href="4849a922e8"><code>4849a92</code></a>
Punctuation.</li>
<li><a
href="197c35e2f3"><code>197c35e</code></a>
Update changelog and hopefully I don't forget to publish release again
:))</li>
<li><a
href="14dc1c92d4"><code>14dc1c9</code></a>
Update examples to use 3.11 and make the adhoc layout example look a bit
more...</li>
<li>Additional commits viewable in <a
href="https://github.com/pytest-dev/pytest-cov/compare/v7.0.0...v7.1.0">compare
view</a></li>
</ul>
</details>
<br />
Updates `ruff` from 0.15.0 to 0.15.7
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/ruff/releases">ruff's
releases</a>.</em></p>
<blockquote>
<h2>0.15.7</h2>
<h2>Release Notes</h2>
<p>Released on 2026-03-19.</p>
<h3>Preview features</h3>
<ul>
<li>Display output severity in preview (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23845">#23845</a>)</li>
<li>Don't show <code>noqa</code> hover for non-Python documents (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24040">#24040</a>)</li>
</ul>
<h3>Rule changes</h3>
<ul>
<li>[<code>pycodestyle</code>] Recognize <code>pyrefly:</code> as a
pragma comment (<code>E501</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24019">#24019</a>)</li>
</ul>
<h3>Server</h3>
<ul>
<li>Don't return code actions for non-Python documents (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23905">#23905</a>)</li>
</ul>
<h3>Documentation</h3>
<ul>
<li>Add company AI policy to contributing guide (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24021">#24021</a>)</li>
<li>Document editor features for Markdown code formatting (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23924">#23924</a>)</li>
<li>[<code>pylint</code>] Improve phrasing (<code>PLC0208</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24033">#24033</a>)</li>
</ul>
<h3>Other changes</h3>
<ul>
<li>Use PEP 639 license information (<a
href="https://redirect.github.com/astral-sh/ruff/pull/19661">#19661</a>)</li>
</ul>
<h3>Contributors</h3>
<ul>
<li><a
href="https://github.com/tmimmanuel"><code>@tmimmanuel</code></a></li>
<li><a
href="https://github.com/DimitriPapadopoulos"><code>@DimitriPapadopoulos</code></a></li>
<li><a
href="https://github.com/amyreese"><code>@amyreese</code></a></li>
<li><a href="https://github.com/statxc"><code>@statxc</code></a></li>
<li><a href="https://github.com/dylwil3"><code>@dylwil3</code></a></li>
<li><a
href="https://github.com/hunterhogan"><code>@hunterhogan</code></a></li>
<li><a
href="https://github.com/renovate"><code>@renovate</code></a></li>
</ul>
<h2>Install ruff 0.15.7</h2>
<h3>Install prebuilt binaries via shell script</h3>
<pre lang="sh"><code>curl --proto '=https' --tlsv1.2 -LsSf
https://releases.astral.sh/github/ruff/releases/download/0.15.7/ruff-installer.sh
| sh
</code></pre>
<h3>Install prebuilt binaries via powershell script</h3>
<pre lang="sh"><code>powershell -ExecutionPolicy Bypass -c "irm
https://releases.astral.sh/github/ruff/releases/download/0.15.7/ruff-installer.ps1
| iex"
</tr></table>
</code></pre>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md">ruff's
changelog</a>.</em></p>
<blockquote>
<h2>0.15.7</h2>
<p>Released on 2026-03-19.</p>
<h3>Preview features</h3>
<ul>
<li>Display output severity in preview (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23845">#23845</a>)</li>
<li>Don't show <code>noqa</code> hover for non-Python documents (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24040">#24040</a>)</li>
</ul>
<h3>Rule changes</h3>
<ul>
<li>[<code>pycodestyle</code>] Recognize <code>pyrefly:</code> as a
pragma comment (<code>E501</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24019">#24019</a>)</li>
</ul>
<h3>Server</h3>
<ul>
<li>Don't return code actions for non-Python documents (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23905">#23905</a>)</li>
</ul>
<h3>Documentation</h3>
<ul>
<li>Add company AI policy to contributing guide (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24021">#24021</a>)</li>
<li>Document editor features for Markdown code formatting (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23924">#23924</a>)</li>
<li>[<code>pylint</code>] Improve phrasing (<code>PLC0208</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24033">#24033</a>)</li>
</ul>
<h3>Other changes</h3>
<ul>
<li>Use PEP 639 license information (<a
href="https://redirect.github.com/astral-sh/ruff/pull/19661">#19661</a>)</li>
</ul>
<h3>Contributors</h3>
<ul>
<li><a
href="https://github.com/tmimmanuel"><code>@tmimmanuel</code></a></li>
<li><a
href="https://github.com/DimitriPapadopoulos"><code>@DimitriPapadopoulos</code></a></li>
<li><a
href="https://github.com/amyreese"><code>@amyreese</code></a></li>
<li><a href="https://github.com/statxc"><code>@statxc</code></a></li>
<li><a href="https://github.com/dylwil3"><code>@dylwil3</code></a></li>
<li><a
href="https://github.com/hunterhogan"><code>@hunterhogan</code></a></li>
<li><a
href="https://github.com/renovate"><code>@renovate</code></a></li>
</ul>
<h2>0.15.6</h2>
<p>Released on 2026-03-12.</p>
<h3>Preview features</h3>
<ul>
<li>Add support for <code>lazy</code> import parsing (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23755">#23755</a>)</li>
<li>Add support for star-unpacking of comprehensions (PEP 798) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23788">#23788</a>)</li>
<li>Reject semantic syntax errors for lazy imports (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23757">#23757</a>)</li>
<li>Drop a few rules from the preview default set (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23879">#23879</a>)</li>
<li>[<code>airflow</code>] Flag <code>Variable.get()</code> calls
outside of task execution context (<code>AIR003</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23584">#23584</a>)</li>
<li>[<code>airflow</code>] Flag runtime-varying values in DAG/task
constructor arguments (<code>AIR304</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23631">#23631</a>)</li>
<li>[<code>flake8-bugbear</code>] Implement
<code>delattr-with-constant</code> (<code>B043</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23737">#23737</a>)</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="0ef39de46c"><code>0ef39de</code></a>
Bump 0.15.7 (<a
href="https://redirect.github.com/astral-sh/ruff/issues/24049">#24049</a>)</li>
<li><a
href="beb543b5c6"><code>beb543b</code></a>
[ty] ecosystem-analyzer: Fail on newly panicking projects (<a
href="https://redirect.github.com/astral-sh/ruff/issues/24043">#24043</a>)</li>
<li><a
href="378fe73092"><code>378fe73</code></a>
Don't show noqa hover for non-Python documents (<a
href="https://redirect.github.com/astral-sh/ruff/issues/24040">#24040</a>)</li>
<li><a
href="b5665bd18e"><code>b5665bd</code></a>
[<code>pylint</code>] Improve phrasing (<code>PLC0208</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/issues/24033">#24033</a>)</li>
<li><a
href="6e20f22190"><code>6e20f22</code></a>
test: migrate <code>show_settings</code> and <code>version</code> tests
to use <code>CliTest</code> (<a
href="https://redirect.github.com/astral-sh/ruff/issues/23702">#23702</a>)</li>
<li><a
href="f99b284c1f"><code>f99b284</code></a>
Drain file watcher events during test setup (<a
href="https://redirect.github.com/astral-sh/ruff/issues/24030">#24030</a>)</li>
<li><a
href="744c996c35"><code>744c996</code></a>
[ty] Filter out unsatisfiable inference attempts during generic call
narrowin...</li>
<li><a
href="16160958bd"><code>1616095</code></a>
[ty] Avoid inferring intersection types for call arguments (<a
href="https://redirect.github.com/astral-sh/ruff/issues/23933">#23933</a>)</li>
<li><a
href="7f275f431b"><code>7f275f4</code></a>
[ty] Pin mypy_primer in <code>setup_primer_project.py</code> (<a
href="https://redirect.github.com/astral-sh/ruff/issues/24020">#24020</a>)</li>
<li><a
href="7255e362e4"><code>7255e36</code></a>
[<code>pycodestyle</code>] Recognize <code>pyrefly:</code> as a pragma
comment (<code>E501</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/issues/24019">#24019</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/astral-sh/ruff/compare/0.15.0...0.15.7">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore <dependency name> major version` will close this
group update PR and stop Dependabot creating any more for the specific
dependency's major version (unless you unignore this specific
dependency's major version or upgrade to it yourself)
- `@dependabot ignore <dependency name> minor version` will close this
group update PR and stop Dependabot creating any more for the specific
dependency's minor version (unless you unignore this specific
dependency's minor version or upgrade to it yourself)
- `@dependabot ignore <dependency name>` will close this group update PR
and stop Dependabot creating any more for the specific dependency
(unless you unignore this specific dependency or upgrade to it yourself)
- `@dependabot unignore <dependency name>` will remove all of the ignore
conditions of the specified dependency
- `@dependabot unignore <dependency name> <ignore condition>` will
remove the ignore condition of the specified dependency and ignore
conditions
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Requested by @majdyz
### Why / What / How
**Why:** PR descriptions currently explain the *what* and *how* but not
the *why*. Without motivation context, reviewers can't judge whether an
approach fits the problem. Nick flagged this in standup: "The PR
descriptions you use are explaining the what not the why."
**What:** Adds a consistent Why / What / How structure to PR
descriptions across the entire workflow — template, CLAUDE.md guidance,
and all PR-related skills (`/pr-review`, `/pr-test`, `/pr-address`).
**How:**
- **`.github/PULL_REQUEST_TEMPLATE.md`**: Replaced the old vague
`Changes` heading with a single `Why / What / How` section with guiding
comments
- **`autogpt_platform/CLAUDE.md`**: Added bullet under "Creating Pull
Requests" requiring the Why/What/How structure
- **`.claude/skills/pr-review/SKILL.md`**: Added "Read the PR
description" step before reading the diff, and "Description quality" to
the review checklist
- **`.claude/skills/pr-test/SKILL.md`**: Updated Step 1 to read the PR
description and understand Why/What/How before testing
- **`.claude/skills/pr-address/SKILL.md`**: Added "Read the PR
description" step before fetching comments
## Test plan
- [x] All five files reviewed for correct formatting and consistency
---
Co-authored-by: Zamil Majdy (@majdyz) <zamil.majdy@agpt.co>
## Summary
- Add `DB_STATEMENT_CACHE_SIZE` env var support for Prisma query engine
- Wires through as `statement_cache_size` URL parameter to control the
LRU prepared statement cache per connection in the Rust binary engine
## Why
Live investigation on dev pods showed the Prisma Rust engine growing
from 34MB to 932MB over ~1hr due to unbounded query plan cache. Despite
`pgbouncer=true` in the DATABASE_URL (which should disable caching), the
engine still caches.
This gives explicit control: setting `DB_STATEMENT_CACHE_SIZE=0`
disables the cache entirely.
## Live data (dev)
```
Fresh pod: Python=693MB, Engine=34MB, Total=727MB
Bloated: Python=2.1GB, Engine=932MB, Total=3GB
```
## Infra companion PR
[AutoGPT_cloud_infrastructure#299](https://github.com/Significant-Gravitas/AutoGPT_cloud_infrastructure/pull/299)
sets `DB_STATEMENT_CACHE_SIZE=0` along with `PYTHONMALLOC=malloc` and
memory limit changes.
## Test plan
- [ ] Deploy to dev and monitor Prisma engine memory over 1hr
- [ ] Verify queries still work correctly with cache disabled
- [ ] Compare engine RSS on fresh vs aged pods
## Summary
- Update /pr-test skill to consistently show screenshots inline to the
user with explanations
- Post PR comments with inline images and per-screenshot descriptions
(not just local file paths)
- Simplify GitHub Git API upload flow for screenshot hosting
## Changes
- Step 5: Take screenshots at every significant test step (aim for 1+
per scenario)
- Step 6 (new): Show every screenshot to the user via Read tool with 2-3
sentence explanations
- Step 7: Post PR comment with inline images, summary table, and
per-screenshot context
## Test plan
- [x] Tested end-to-end on PR #12512 — screenshots uploaded and rendered
correctly in PR comment
## Summary
- Replaces the arch-conditional chromium install (ARM64 vs AMD64) with a
single approach: always use the distro-packaged `chromium` and set
`AGENT_BROWSER_EXECUTABLE_PATH=/usr/bin/chromium`
- Removes `agent-browser install` entirely (it downloads Chrome for
Testing, which has no ARM64 binary)
- Removes the `entrypoint.sh` wrapper script that was setting the env
var at runtime
- Updates `autogpt_platform/db/docker/docker-compose.yml`: removes
`external: true` from the network declarations so the Supabase stack can
be brought up standalone (needed for the Docker integration tests in the
test plan below — without this, `docker compose up` fails unless the
platform stack is already running); also sets
`GOTRUE_MAILER_AUTOCONFIRM: true` for local dev convenience (no SMTP
setup required on first run — this compose file is not used in
production)
- Updates `autogpt_platform/docker-compose.platform.yml`: mounts the
`workspace` volume so agent-browser results (screenshots, snapshots) are
accessible from other services; without this the copilot workspace write
fails in Docker
## Verification
Tested via Docker build on arm64 (Apple Silicon):
```
=== Testing agent-browser with system chromium ===
✓ Example Domain
https://example.com/
=== SUCCESS: agent-browser launched with system chromium ===
```
agent-browser navigated to example.com in ~1.5s using system chromium
(v146 from Debian trixie).
## Test plan
- [x] Docker build test on arm64: `agent-browser open
https://example.com` succeeds with system chromium
- [x] Verify amd64 Docker build still works (CI)
## Summary
- Enable one-click import of workflows from other platforms (n8n,
Make.com, Zapier, etc.) into AutoGPT via CoPilot
- **No backend endpoint** — import is entirely client-side: the dialog
reads the file or fetches the n8n template URL, uploads the JSON to the
workspace via `uploadFileDirect`, stores the file reference in
`sessionStorage`, and redirects to CoPilot with `autosubmit=true`
- CoPilot receives the workflow JSON as a proper file attachment and
uses the existing agent-generator pipeline to convert it
- Library dialog redesigned: 2 tabs — "AutoGPT agent" (upload exported
agent JSON) and "Another platform" (file upload + optional n8n URL)
## How it works
1. User uploads a workflow JSON (or pastes an n8n template URL)
2. Frontend fetches/reads the JSON and uploads it to the user's
workspace via the existing file upload API
3. User is redirected to `/copilot?source=import&autosubmit=true`
4. CoPilot picks up the file from `sessionStorage` and sends it as a
`FileUIPart` attachment with a prompt to recreate the workflow as an
AutoGPT agent
## Test plan
- [x] Manual test: import a real n8n workflow JSON via the dialog
- [x] Manual test: paste an n8n template URL and verify it fetches +
converts
- [x] Manual test: import Make.com / Zapier workflow export JSON
- [x] Repeated imports don't cause 409 conflicts (filenames use
`crypto.randomUUID()`)
- [x] E2E: Import dialog has 2 tabs (AutoGPT agent + Another platform)
- [x] E2E: n8n quick-start template buttons present
- [x] E2E: n8n URL input enables Import button on valid URL
- [x] E2E: Workspace upload API returns file_id
Requested by @majdyz
User-caused errors (no payment method, webhook agent invocation, missing
credentials, bad API keys) were hitting Sentry via `logger.exception()`
in the `ValueError` handler, creating noise that obscures real bugs.
Additionally, a frontend crash on the copilot page (BUILDER-71J) needed
fixing.
**Changes:**
**Backend — rest_api.py**
- Set `log_error=False` for the `ValueError` exception handler (line
278), consistent with how `FolderValidationError` and `NotFoundError`
are already handled. User-caused 400 errors no longer trigger
`logger.exception()` → Sentry.
**Backend — executor/manager.py**
- Downgrade `ExecutionManager` input validation skip errors from `error`
to `warning` level. Missing credentials is expected user behavior, not
an internal error.
**Backend — blocks/llm.py**
- Sanitize unpaired surrogates in LLM prompt content before sending to
provider APIs. Prevents `UnicodeEncodeError: surrogates not allowed`
when httpx encodes the JSON body (AUTOGPT-SERVER-8AX).
**Frontend — package.json**
- Upgrade `ai` SDK from `6.0.59` to `6.0.134` to fix BUILDER-71J
(`TypeError: undefined is not an object (evaluating
'this.activeResponse.state')` on /copilot page). This is a known issue
in the Vercel AI SDK fixed in later patch versions.
**Sentry issues addressed:**
- `No payment method found` (ValueError → 400)
- `This agent is triggered by an external event (webhook)` (ValueError →
400)
- `Node input updated with non-existent credentials` (ValueError → 400)
- `[ExecutionManager] Skip execution, input validation error: missing
input {credentials}`
- `UnicodeEncodeError: surrogates not allowed` (AUTOGPT-SERVER-8AX)
- `TypeError: activeResponse.state` (BUILDER-71J)
Resolves SECRT-2166
---
Co-authored-by: Zamil Majdy (@majdyz) <zamil.majdy@agpt.co>
---------
Co-authored-by: Zamil Majdy (@majdyz) <zamil.majdy@agpt.co>
## Summary
Reduce CoPilot per-turn token overhead by systematically trimming tool
descriptions, parameter schemas, and system prompt content. All 35 MCP
tool schemas are passed on every SDK call — this PR reduces their size.
### Strategy
1. **Tool descriptions**: Trimmed verbose multi-sentence explanations to
concise single-sentence summaries while preserving meaning
2. **Parameter schemas**: Shortened parameter descriptions to essential
info, removed some `default` values (handled in code)
3. **System prompt**: Condensed `_SHARED_TOOL_NOTES` and storage
supplement template in `prompting.py`
4. **Cross-tool references**: Removed duplicate workflow hints (e.g.
"call find_block before run_block" appeared in BOTH tools — kept only in
the dependent tool). Critical cross-tool references retained (e.g.
`continue_run_block` in `run_block`, `fix_agent_graph` in
`validate_agent`, `get_doc_page` in `search_docs`, `web_fetch`
preference in `browser_navigate`)
### Token Impact
| Metric | Before | After | Reduction |
|--------|--------|-------|-----------|
| System Prompt | ~865 tokens | ~497 tokens | 43% |
| Tool Schemas | ~9,744 tokens | ~6,470 tokens | 34% |
| **Grand Total** | **~10,609 tokens** | **~6,967 tokens** | **34%** |
Saves **~3,642 tokens per conversation turn**.
### Key Decisions
- **Mostly description changes**: Tool logic, parameters, and types
unchanged. However, some schema-level `default` fields were removed
(e.g. `save` in `customize_agent`) — these are machine-readable
metadata, not just prose, and may affect LLM behavior.
- **Quality preserved**: All descriptions still convey what the tool
does and essential usage patterns
- **Cross-references trimmed carefully**: Kept prerequisite hints in the
dependent tool (run_block mentions find_block) but removed the reverse
(find_block no longer mentions run_block). Critical cross-tool guidance
retained where removal would degrade model behavior.
- **`run_time` description fixed**: Added missing supported values
(today, last 30 days, ISO datetime) per review feedback
### Future Optimization
The SDK passes all 35 tools on every call. The MCP protocol's
`list_tools()` handler supports dynamic tool registration — a follow-up
PR could implement lazy tool loading (register core tools + a discovery
meta-tool) to further reduce per-turn token cost.
### Changes
- Trimmed descriptions across 25 tool files
- Condensed `_SHARED_TOOL_NOTES` and `_build_storage_supplement` in
`prompting.py`
- Fixed `run_time` schema description in `agent_output.py`
### Checklist
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] All 273 copilot tests pass locally
- [x] All 35 tools load and produce valid schemas
- [x] Before/after token dumps compared
- [x] Formatting passes (`poetry run format`)
- [x] CI green
## Summary
- Adds `/pr-test` skill for automated E2E testing of PRs using docker
compose, agent-browser, and API calls
- Covers full environment setup (copy .env, configure copilot auth,
ARM64 Docker fix)
- Includes browser UI testing, direct API testing, screenshot capture,
and test report generation
- Has `--fix` mode for auto-fixing bugs found during testing (similar to
`/pr-address`)
- **Screenshot uploads use GitHub Git API** (blobs → tree → commit →
ref) — no local git operations, safe for worktrees
- **Subscription mode improvements:**
- Extract subscription auth logic to `sdk/subscription.py` — uses SDK's
bundled CLI binary instead of requiring `npm install -g
@anthropic-ai/claude-code`
- Auto-provision `~/.claude/.credentials.json` from
`CLAUDE_CODE_OAUTH_TOKEN` env var on container startup — no `claude
login` needed in Docker
- Add `scripts/refresh_claude_token.sh` — cross-platform helper
(macOS/Linux/Windows) to extract OAuth tokens from host and update
`backend/.env`
## Test plan
- [x] Validated skill on multiple PRs (#12482, #12483, #12499, #12500,
#12501, #12440, #12472) — all test scenarios passed
- [x] Confirmed screenshot upload via GitHub Git API renders correctly
on all 7 PRs
- [x] Verified subscription mode E2E in Docker:
`refresh_claude_token.sh` → `docker compose up` → copilot chat responds
correctly with no API keys (pure OAuth subscription)
- [x] Verified auto-provisioning of credentials file inside container
from `CLAUDE_CODE_OAUTH_TOKEN` env var
- [x] Confirmed bundled CLI detection
(`claude_agent_sdk._bundled/claude`) works without system-installed
`claude`
- [x] `poetry run pytest backend/copilot/sdk/service_test.py` — 24/24
tests pass
## Summary
Fixes 5 high-priority Sentry alerts from production:
- **AUTOGPT-SERVER-8AM**: Fix `TypeError: TypedDict does not support
instance and class checks` — `_value_satisfies_type` in `type.py` now
handles TypedDict classes that don't support `isinstance()` checks
- **AUTOGPT-SERVER-8AN**: Fix `ValueError: No payment method found`
triggering Sentry error — catch the expected ValueError in the
auto-top-up endpoint and return HTTP 422 instead
- **BUILDER-7F5**: Fix `Upload failed (409): File already exists` — add
`overwrite` query param to workspace upload endpoint and set it to
`true` from the frontend direct-upload
- **BUILDER-7F0**: Fix `LaTeX-incompatible input` KaTeX warnings
flooding Sentry — set `strict: false` on rehype-katex plugin to suppress
warnings for unrecognized Unicode characters
- **AUTOGPT-SERVER-89N**: Fix `Tool execution with manager failed:
validation error for dict[str,list[any]]` — make RPC return type
validation resilient (log warning instead of crash) and downgrade
SmartDecisionMaker tool execution errors to warnings
## Test plan
- [ ] Verify TypedDict type coercion works for
GithubMultiFileCommitBlock inputs
- [ ] Verify auto-top-up without payment method returns 422, not 500
- [ ] Verify file re-upload in copilot succeeds (overwrites instead of
409)
- [ ] Verify LaTeX rendering with Unicode characters doesn't produce
console warnings
- [ ] Verify SmartDecisionMaker tool execution failures are logged at
warning level
Requested by @kcze
When a user closes the OAuth sign-in popup without completing
authentication, the 'Waiting on sign-in process' modal was stuck open
with no way to dismiss it, forcing a page refresh.
Two bugs caused this:
1. `oauth-popup.ts` had no detection for the popup being closed by the
user. The promise would hang until the 5-minute timeout.
2. The modal's cancel button aborted a disconnected `AbortController`
instead of the actual OAuth flow's abort function, so clicking
cancel/close did nothing.
### Changes
- Add `popup.closed` polling (500ms) in `openOAuthPopup()` that rejects
the promise when the user closes the auth window
- Add reject-on-abort so the cancel button properly terminates the flow
- Replace the disconnected `oAuthPopupController` with a direct
`cancelOAuthFlow()` function that calls the real abort ref
- Handle popup-closed and user-canceled as silent cancellations (no
error toast)
### Testing
Tested manually ✅
- [x] Start OAuth flow → close popup window → modal dismisses
automatically ✅
- [x] Start OAuth flow → click cancel on modal → popup closes, modal
dismisses ✅
- [x] Complete OAuth flow normally → works as before ✅
Resolves SECRT-2054
---
Co-authored-by: Krzysztof Czerwinski (@kcze)
<krzysztof.czerwinski@agpt.co>
---------
Co-authored-by: Krzysztof Czerwinski <kpczerwinski@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
### Before
<img width="500" height="501" alt="Screenshot 2026-03-20 at 21 50 31"
src="https://github.com/user-attachments/assets/6154cffb-6772-4c3d-a703-527c8ca0daff"
/>
### After
<img width="500" height="581" alt="Screenshot 2026-03-20 at 21 33 12"
src="https://github.com/user-attachments/assets/2f9bd69d-30c5-4d06-ad1e-ed76b184afe5"
/>
### Other minor fixes
- minor spacing adjustments in creator/search pages when empty and
between sections
### Summary
- Increase StoreCard height from 25rem to 26.5rem to prevent content
overflow
- Replace manual tooltip-based title truncation with `OverflowText`
component in StoreCard
- Adjust carousel indicator positioning and hide it on md+ when exactly
3 featured agents are shown
## Test plan
- [x] Verify marketplace cards display without text overflow
- [x] Verify featured section carousel indicators behave correctly
- [x] Check responsive behavior at common breakpoints
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
<img width="1487" height="670" alt="Screenshot 2026-03-20 at 00 52 58"
src="https://github.com/user-attachments/assets/f09de2a0-3c5b-4bce-b6f4-8a853f6792cf"
/>
- Move the download button from inline next to "Add to library" to a
separate line below it
- Add contextual text: "Want to use this agent locally? Download here"
- Style the "Download here" as a violet ghost button link with the
download icon
## Test plan
- [ ] Visit a marketplace agent page
- [ ] Verify "Add to library" button renders in its row
- [ ] Verify "Want to use this agent locally? Download here" appears
below it
- [ ] Click "Download here" and confirm the agent downloads correctly
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduces `line-clamp` from 3 to 2 on the marketplace `StoreCard`
description to prevent text from overlapping with the
absolutely-positioned run count and +Add button at the bottom of the
card.
Resolves SECRT-2156.
---
Co-authored-by: Abhimanyu Yadav (@Abhi1992002)
<122007096+Abhi1992002@users.noreply.github.com>
## Summary
- Fixes SmartDecisionMakerBlock conversation management to work with
OpenAI's Responses API, which was introduced in #12099 (commit 1240f38)
- The migration to `responses.create` updated the outbound LLM call but
missed the conversation history serialization — the `raw_response` is
now the entire `Response` object (not a `ChatCompletionMessage`), and
tool calls/results use `function_call` / `function_call_output` types
instead of role-based messages
- This caused a 400 error on the second LLM call in agent mode:
`"Invalid value: ''. Supported values are: 'assistant', 'system',
'developer', and 'user'."`
### Changes
**`smart_decision_maker.py`** — 6 functions updated:
| Function | Fix |
|---|---|
| `_convert_raw_response_to_dict` | Detects Responses API `Response`
objects, extracts output items as a list |
| `_get_tool_requests` | Recognizes `type: "function_call"` items |
| `_get_tool_responses` | Recognizes `type: "function_call_output"`
items |
| `_create_tool_response` | New `responses_api` kwarg produces
`function_call_output` format |
| `_update_conversation` | Handles list return from
`_convert_raw_response_to_dict` |
| Non-agent mode path | Same list handling for traditional execution |
**`test_smart_decision_maker_responses_api.py`** — 61 tests covering:
- Every branch of all 6 affected helper functions
- Chat Completions, Anthropic, and Responses API formats
- End-to-end agent mode and traditional mode conversation validity
## Test plan
- [x] 61 new unit tests all pass
- [x] 11 existing SmartDecisionMakerBlock tests still pass (no
regressions)
- [x] All pre-commit hooks pass (ruff, black, isort, pyright)
- [ ] CI integration tests
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Updates core LLM invocation and agent conversation/tool-call
bookkeeping to match OpenAI’s Responses API, which can affect tool
execution loops and prompt serialization across providers. Risk is
mitigated by extensive new unit tests, but regressions could surface in
production agent-mode flows or token/usage accounting.
>
> **Overview**
> **Migrates OpenAI calls from Chat Completions to the Responses API
end-to-end**, including tool schema conversion, output parsing,
reasoning/text extraction, and updated token usage fields in
`LLMResponse`.
>
> **Fixes SmartDecisionMakerBlock conversation/tool handling for
Responses API** by treating `raw_response` as a Response object
(splitting it into `output` items for replay), recognizing
`function_call`/`function_call_output` entries, and emitting tool
outputs in the correct Responses format to prevent invalid follow-up
prompts.
>
> Also adjusts prompt compaction/token estimation to understand
Responses API tool items, changes
`get_execution_outputs_by_node_exec_id` to return list-valued
`CompletedBlockOutput`, removes `gpt-3.5-turbo` from model/cost/docs
lists, and adds focused unit tests plus a lightweight `conftest.py` to
run these tests without the full server stack.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
ff292efd3d. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Otto <otto@agpt.co>
Co-authored-by: Krzysztof Czerwinski <kpczerwinski@gmail.com>
Requested by @majdyz
Adds TDD (test-driven development) guidance to CLAUDE.md files so Claude
Code follows a test-first workflow when fixing bugs or adding features.
**Changes:**
- **Parent `CLAUDE.md`**: Cross-cutting TDD workflow — write a failing
`xfail` test, implement the fix, remove the marker
- **Backend `CLAUDE.md`**: Concrete pytest example with
`@pytest.mark.xfail` pattern
- **Frontend `CLAUDE.md`**: Note about using Playwright `.fixme`
annotation for bug-fix tests
The workflow is: write a failing test first → confirm it fails for the
right reason → implement → confirm it passes. This ensures every bug fix
is covered by a test that would have caught the regression.
---
Co-authored-by: Zamil Majdy (@majdyz) <zamil.majdy@agpt.co>
Reverts Significant-Gravitas/AutoGPT#12099
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Reverts the OpenAI integration in `llm_call` from the Responses API
back to `chat.completions`, which can change tool-calling, JSON-mode
behavior, and token accounting across core AI blocks. The change is
localized but touches the primary LLM execution path and associated
tests/docs.
>
> **Overview**
> Reverts the OpenAI path in `backend/blocks/llm.py` from the Responses
API back to `chat.completions`, including updating JSON-mode
(`response_format`), tool handling, and usage extraction to match the
Chat Completions response shape.
>
> Removes the now-unused `backend/util/openai_responses.py` helpers and
their unit tests, updates LLM tests to mock `chat.completions.create`,
and adds `gpt-3.5-turbo` to the supported model list, cost config, and
LLM docs.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
7d6226d10e. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
## Summary
Reverts the invite system PRs due to security gaps identified during
review:
- The move from Supabase-native `allowed_users` gating to
application-level gating allows orphaned Supabase auth accounts (valid
JWT without a platform `User`)
- The auth middleware never verifies `User` existence, so orphaned users
get 500s instead of clean 403s
- OAuth/Google SSO signup completely bypasses the invite gate
- The DB trigger that atomically created `User` + `Profile` on signup
was dropped in favor of a client-initiated API call, introducing a
failure window
### Reverted PRs
- Reverts #12347 — Foundation: InvitedUser model, invite-gated signup,
admin UI
- Reverts #12374 — Tally enrichment: personalized prompts from form
submissions
- Reverts #12451 — Pre-check: POST /auth/check-invite endpoint
- Reverts #12452 (collateral) — Themed prompt categories /
SuggestionThemes UI. This PR built on top of #12374's
`suggested_prompts` backend field and `/chat/suggested-prompts`
endpoint, so it cannot remain without #12374. The copilot empty session
falls back to hardcoded default prompts.
### Migration
Includes a new migration (`20260319120000_revert_invite_system`) that:
- Drops the `InvitedUser` table and its enums (`InvitedUserStatus`,
`TallyComputationStatus`)
- Restores the `add_user_and_profile_to_platform()` trigger on
`auth.users`
- Backfills `User` + `Profile` rows for any auth accounts created during
the invite-gate window
### What's NOT reverted
- The `generate_username()` function (never dropped, still used by
backfill migration)
- The old `add_user_to_platform()` function (superseded by
`add_user_and_profile_to_platform()`)
- PR #12471 (admin UX improvements) — was never merged, no action needed
## Test plan
- [x] Verify migration: `InvitedUser` table dropped, enums dropped,
trigger restored
- [x] Verify backfill: no orphaned auth users, no users without Profile
- [x] Verify existing users can still log in (email + OAuth)
- [x] Verify CoPilot chat page loads with default prompts
- [ ] Verify new user signup creates `User` + `Profile` via the restored
trigger
- [ ] Verify admin `/admin/users` page loads without crashing
- [ ] Run backend tests: `poetry run test`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
## Summary
<img width="400" height="339" alt="Screenshot 2026-03-19 at 22 53 23"
src="https://github.com/user-attachments/assets/2fa76b8f-424d-4764-90ac-b7a331f5f610"
/>
<img width="600" height="595" alt="Screenshot 2026-03-19 at 22 53 31"
src="https://github.com/user-attachments/assets/23f51cc7-b01e-4d83-97ba-2c43683877db"
/>
<img width="800" height="523" alt="Screenshot 2026-03-19 at 22 53 36"
src="https://github.com/user-attachments/assets/1e447b9a-1cca-428c-bccd-1730f1670b8e"
/>
Now that we have the `Give feedback` button on the Navigation bar,
collpase some of the links below `1280px` so there is more space and
they don't collide with each other...
- Collapse navbar link text to icon-only below 1280px (`xl` breakpoint)
to prevent crowding
- Wallet button shows only the wallet icon below 1280px instead of "Earn
credits" text
- Feedback button shows only the chat icon below 1280px instead of "Give
Feedback" text
- Added `whitespace-nowrap` to feedback button to prevent wrapping
## Changes
- `NavbarLink.tsx`: `lg:block` → `xl:block` for link text
- `Wallet.tsx`: `md:hidden`/`md:inline-block` →
`xl:hidden`/`xl:inline-block`
- `FeedbackButton.tsx`: wrap text in `hidden xl:inline` span, add
`whitespace-nowrap`
## Test plan
- [ ] Resize browser between 1024px–1280px and verify navbar shows only
icons
- [ ] At 1280px+ verify full text labels appear for links, wallet, and
feedback
- [ ] Verify mobile navbar still works correctly below `md` breakpoint
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
https://github.com/user-attachments/assets/13da6d36-5f35-429b-a6cf-e18316bb8709
Replaces the flat list of suggestion pills in the CoPilot empty session
with themed prompt categories (Learn, Create, Automate, Organize), each
shown as a popover with contextual prompts.
- **Backend**: Changes `suggested_prompts` from a flat `list[str]` to a
themed `dict[str, list[str]]` keyed by category. Updates Tally
extraction LLM prompt to generate prompts per theme, and the
`/suggested-prompts` API to return grouped themes. Legacy `list[str]`
rows are preserved under a `"General"` key for backward compatibility.
- **Frontend**: Replaces inline pill buttons with a `SuggestionThemes`
popover component. Each theme button (with icon) opens a dropdown of 5
relevant prompts. Falls back to hardcoded defaults when the API has no
personalized prompts. Normalizes partial API responses by padding
missing themes with defaults. Legacy `"General"` prompts are distributed
round-robin across themes so existing users keep their personalized
suggestions.
### Changes 🏗️
- `backend/data/understanding.py`: `suggested_prompts` field changed
from `list[str]` to `dict[str, list[str]]`; legacy list rows preserved
under `"General"` key; list items validated as strings
- `backend/data/tally.py`: LLM prompt updated to generate themed
prompts; validation now per-theme with blank-string rejection
- `backend/api/features/chat/routes.py`: New `SuggestedTheme` model;
endpoint returns `themes[]`
- `frontend/copilot/components/EmptySession/EmptySession.tsx`: Uses
generated API types directly (no cast)
- `frontend/copilot/components/EmptySession/helpers.ts`:
`DEFAULT_THEMES` replaces `DEFAULT_QUICK_ACTIONS`; `getSuggestionThemes`
normalizes partial API responses and distributes legacy `"General"`
prompts across themes
-
`frontend/copilot/components/EmptySession/components/SuggestionThemes/`:
New popover component with theme icons and loading states
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verify themed suggestion buttons render on CoPilot empty session
- [x] Click each theme button and confirm popover opens with prompts
- [x] Click a prompt and confirm it sends the message
- [x] Verify fallback to default themes when API returns no custom
prompts
- [x] Verify legacy users' personalized prompts are preserved and
visible
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
Improves the `pr-address` skill with two fixes:
- **Full comment thread loading**: Adds `--paginate` to the inline
comments fetch and explicit instructions to reconstruct threads using
`in_reply_to_id`, reading root-to-last-reply before acting. Previously,
only the opening comment was visible — missing reviewer replies led to
wrong fixes.
- **Backtick-safe PR descriptions**: Adds instructions to write the PR
body to a temp file via `<<'PREOF'` heredoc before passing to `gh pr
edit/create`. Inlining the body directly causes backticks to be
shell-escaped, breaking markdown rendering.
## Test plan
- [ ] Run `/pr-address` on a PR with multi-reply inline comment threads
— verify the last reply is what gets acted on
- [ ] Update a PR description containing backticks — verify they render
correctly in GitHub
When OpenAI credentials are unavailable (fork PRs, dev envs without API
keys), both builder block search and store agent functionality break:
1. **Block search returns wrong results.** `unified_hybrid_search` falls
back to a zero vector when embedding generation fails. With ~200 blocks
in `UnifiedContentEmbedding`, the zero-vector semantic scores are
garbage, and lexical matching on short block names is too weak — "Store
Value" doesn't appear in the top results for query "Store Value".
2. **Store submission approval fails entirely.**
`review_store_submission` calls `ensure_embedding()` inside a
transaction. When it throws, the entire transaction rolls back — no
store submissions get approved, the `StoreAgent` materialized view stays
empty, and all marketplace e2e tests fail.
3. **Store search returns nothing.** Even when store data exists,
`hybrid_search` queries `UnifiedContentEmbedding` which has no store
agent rows (backfill failed). It succeeds with zero results rather than
throwing, so the existing exception-based fallback never triggers.
### Changes 🏗️
- Replace `unified_hybrid_search` with in-memory text search in
`_hybrid_search_blocks` (-> `_text_search_blocks`). All ~200 blocks are
already loaded in memory, and `_score_primary_fields` provides correct
deterministic text relevance scoring against block name, description,
and input schema field descriptions — the same rich text the embedding
pipeline uses. CamelCase block names are split via `split_camelcase()`
to match the tokenization from PR #12400.
- Make embedding generation in `review_store_submission` best-effort:
catch failures and log a warning instead of rolling back the approval
transaction. The backfill scheduler retries later when credentials
become available.
- Fall through to direct DB search when `hybrid_search` returns empty
results (not just when it throws). The fallback uses ad-hoc
`to_tsvector`/`plainto_tsquery` with `ts_rank_cd` ranking on
`StoreAgent` view fields, restoring the search quality of the original
pre-hybrid implementation (stemming, stop-word removal, relevance
ranking).
- Fix Playwright artifact upload in end-to-end test CI
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `build.spec.ts`: 8/8 pass locally (was 0/7 before fix)
- [x] All 79 e2e tests pass in CI (was 15 failures before fix)
---
Co-authored-by: Reinier van der Leer (@Pwuts)
---------
Co-authored-by: Reinier van der Leer <pwuts@agpt.co>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After the legacy builder was removed in #12082, the TallyPopup component
still showed a "Tutorial" button (bottom-right, next to "Give Feedback")
that navigated to `/build?resetTutorial=true`. Nothing handles that
param anymore, so clicking it did nothing.
This removes the dead button and its associated state/handler from
TallyPopup and useTallyPopup. The working tutorial (Shepherd.js
chalkboard icon in CustomControls) is unaffected.
**Changes:**
- `TallyPopup.tsx`: Remove Tutorial button JSX, unused imports
(`usePathname`, `useSearchParams`), and `isNewBuilder` check
- `useTallyPopup.ts`: Remove `showTutorial` state, `handleResetTutorial`
handler, unused `useRouter` import
Resolves SECRT-2109
---
Co-authored-by: Reinier van der Leer (@Pwuts) <pwuts@agpt.co>
Co-authored-by: Reinier van der Leer (@Pwuts) <pwuts@agpt.co>
### Before
<img width="600" height="489" alt="Screenshot 2026-03-18 at 19 24 41"
src="https://github.com/user-attachments/assets/bb8dc0fa-04cd-4f32-8125-2d7930b4acde"
/>
Formatted headings in user messages would look massive
### After
<img width="600" height="549" alt="Screenshot 2026-03-18 at 19 24 33"
src="https://github.com/user-attachments/assets/51230232-c914-42dd-821f-3b067b80bab4"
/>
Markdown headings (`# H1` through `###### H6`) and setext-style headings
(`====`) in user chat messages rendered at their full HTML heading size,
which looked disproportionately large in the chat bubble context.
### Changes 🏗️
- Added Tailwind CSS overrides on the user message `MessageContent`
wrapper to cap all heading elements (h1-h6) at `text-lg font-semibold`
- Only affects user messages in copilot chat (via `group-[.is-user]`
selector); assistant messages are unchanged
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [ ] Send a user message containing `# Heading 1` through `######
Heading 6` and verify they all render at constrained size
- [ ] Send a message with `====` separator pattern and verify it doesn't
render as a mega H1
- [ ] Verify assistant messages with headings still render normally
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- When a user has connected GitHub, `GH_TOKEN` is automatically injected
into the Claude Agent SDK subprocess environment so `gh` CLI commands
work without any manual auth step
- When GitHub is **not** connected, the copilot can call a new
`connect_integration(provider="github")` MCP tool, which surfaces the
same credential setup card used by regular GitHub blocks — the user
connects inline without leaving the chat
- After connecting, the copilot is instructed to retry the operation
automatically
## Changes
**Backend**
- `sdk/service.py`: `_get_github_token_for_user()` fetches OAuth2 or API
key credentials and injects `GH_TOKEN` + `GITHUB_TOKEN` into `sdk_env`
before the SDK subprocess starts (per-request, thread-safe via
`ClaudeAgentOptions.env`)
- `tools/connect_integration.py`: new `ConnectIntegrationTool` MCP tool
— returns `SetupRequirementsResponse` for a given provider (`github` for
now); extensible via `_PROVIDER_INFO` dict
- `tools/__init__.py`: registers `connect_integration` in
`TOOL_REGISTRY`
- `prompting.py`: adds GitHub CLI / `connect_integration` guidance to
`_SHARED_TOOL_NOTES`
**Frontend**
- `ConnectIntegrationTool/ConnectIntegrationTool.tsx`: thin wrapper
around the existing `SetupRequirementsCard` with a tailored retry
instruction
- `MessagePartRenderer.tsx`: dispatches `tool-connect_integration` to
the new component
## Test plan
- [ ] User with GitHub credentials: `gh pr list` works without any auth
step in copilot
- [ ] User without GitHub credentials: copilot calls
`connect_integration`, card renders with GitHub credential input, after
connecting copilot retries and `gh` works
- [ ] `GH_TOKEN` is NOT leaked across users (injected via
`ClaudeAgentOptions.env`, not `os.environ`)
- [ ] `connect_integration` with unknown provider returns a graceful
error message
## Summary
- On **amd64**: keep `agent-browser install` (Chrome for Testing —
pinned version tested with Playwright) + restore runtime libs
- On **arm64**: install system `chromium` package (Chrome for Testing
has no ARM64 binary) + skip `agent-browser install`
- An entrypoint script sets
`AGENT_BROWSER_EXECUTABLE_PATH=/usr/bin/chromium` at container startup
on arm64 (detected via presence of `/usr/bin/chromium`); on amd64 the
var is left unset so agent-browser uses Chrome for Testing as before
**Why not system chromium on amd64?** `agent-browser install` downloads
a specific Chrome for Testing version pinned to the Playwright version
in use. Using whatever Debian ships on amd64 could cause protocol
compatibility issues.
Introduced by #12301 (cc @Significant-Gravitas/zamil-majdy)
## Test plan
- [ ] `docker compose up --build` succeeds on ARM64 (Apple Silicon)
- [ ] `docker compose up --build` succeeds on x86_64
- [ ] Copilot browser tools (`browser_navigate`, `browser_act`,
`browser_screenshot`) work in a Copilot session on both architectures
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
Requested by @Swiftyos
The invite gate check in `get_or_activate_user()` runs after Supabase
creates the auth user, resulting in orphaned auth accounts with no
platform access when a non-invited user signs up. Users could create a
Supabase account but had no `User`, `Profile`, or `Onboarding` records —
they could log in but access nothing.
### Changes 🏗️
**Backend** (`v1.py`, `invited_user.py`):
- Add public `POST /api/auth/check-invite` endpoint (no auth required —
this is a pre-signup check)
- Add `check_invite_eligibility()` helper in the data layer
- Returns `{allowed: true}` when `enable_invite_gate` is disabled
- Extracted `is_internal_email()` helper to deduplicate `@agpt.co`
bypass logic (was duplicated between route and `get_or_activate_user`)
- Checks `InvitedUser` table for `INVITED` status
- Added IP-based Redis rate limiting (10 req/60 s per IP, fails open if
Redis unavailable, returns HTTP 429 when exceeded)
- Fixed Redis pipeline atomicity: `incr` + `expire` now sent in a single
pipeline round-trip, preventing a TTL-less key if `expire` had
previously failed after `incr`
- Fixed incorrect `await` on `pipe.incr()` / `pipe.expire()` — redis-py
async pipeline queue methods are synchronous; only `execute()` is
awaitable. The erroneous `await` was silently swallowed by the `except`
block, making the rate limiter completely non-functional
**Frontend** (`signup/actions.ts`):
- Call the generated `postV1CheckIfAnEmailIsAllowedToSignUp` client
(replacing raw `fetch`) before `supabase.auth.signUp()`
- `ApiError` (non-OK HTTP responses) logs a Sentry warning with the HTTP
status; network/other errors capture a Sentry exception
- If not allowed, return `not_allowed` error (existing
`EmailNotAllowedModal` handles this)
- Graceful fallback: if the pre-check fails (backend unreachable), falls
through to the existing flow — `get_or_activate_user()` remains as
defense-in-depth
**Tests** (`v1_test.py`, `invited_user_test.py`):
- 5 route-level tests covering: gate disabled → allowed, `@agpt.co`
bypass, eligible email, ineligible email, rate-limit exceeded
- Rate-limit test mock updated to use pipeline interface
(`pipeline().execute()` returns `[count, True]`)
- Existing `invited_user_test.py` updated to cover
`check_invite_eligibility` branches
**Not changed:**
- Google OAuth flow — already gated by OAuth provider settings
- `get_or_activate_user()` — stays as backend safety net
- All admin invite CRUD routes — unchanged
### Test plan
1. Email/password signup with invited email → signup proceeds normally
2. Email/password signup with non-invited email → `EmailNotAllowedModal`
shown, no Supabase user created
3. `enable_invite_gate=false` → all emails allowed
4. Backend unreachable during pre-check → falls through to existing flow
5. Same IP exceeds 10 requests/60 s → HTTP 429 returned
---
Co-authored-by: Craig Swift (@Swiftyos) <craigswift13@gmail.com>
---------
Co-authored-by: Craig Swift (@Swiftyos) <craigswift13@gmail.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
## Summary
- Adds merge conflict detection as step 2 of the polling loop (between
CI check and comment check), including handling of the transient
`"UNKNOWN"` state
- Adds a "Resolving merge conflicts" section with step-by-step
instructions using 3-way merge (no force push needed since PRs are
squash-merged)
- Validates all three git conflict markers before staging to prevent
committing broken code
- Fixes `args` → `argument-hint` in skill frontmatter
## Test plan
- [ ] Verify skill renders correctly in Claude Code
## Summary
When the Claude SDK returns a prompt-too-long error (e.g. transcript +
query exceeds the model's context window), the streaming loop now
retries with escalating fallbacks instead of failing immediately:
1. **Attempt 1**: Use the transcript as-is (normal path)
2. **Attempt 2**: Compact the transcript via LLM summarization
(`compact_transcript`) and retry
3. **Attempt 3**: Drop the transcript entirely and fall back to
DB-reconstructed context (`_build_query_message`)
If all 3 attempts fail, a `StreamError(code="prompt_too_long")` is
yielded to the frontend.
### Key changes
**`service.py`**
- Add `_is_prompt_too_long(err)` — pattern-matches SDK exceptions for
prompt-length errors (`prompt is too long`, `prompt_too_long`,
`context_length_exceeded`, `request too large`)
- Wrap `async with ClaudeSDKClient` in a 3-attempt retry `for` loop with
compaction/fallback logic
- Move `current_message`, `_build_query_message`, and
`_prepare_file_attachments` before the retry loop (computed once,
reused)
- Skip transcript upload in `finally` when `transcript_caused_error`
(avoids persisting a broken/empty transcript)
- Reset `stream_completed` between retry iterations
- Document outer-scope variable contract in `_run_stream_attempt`
closure (which variables are reassigned between retries vs read-only)
**`transcript.py`**
- Add `compact_transcript(content, log_prefix, model)` — converts JSONL
→ messages → `compress_context` (LLM summarization with truncation
fallback) → JSONL
- Add helpers: `_flatten_assistant_content`,
`_flatten_tool_result_content`, `_transcript_to_messages`,
`_messages_to_transcript`, `_run_compression`
- Returns `None` when compaction fails or transcript is already within
budget (signals caller to fall through to DB fallback)
- Truncation fallback wrapped in 30s timeout to prevent unbounded CPU
time on large transcripts
- Accepts `model` parameter to avoid creating a new `ChatConfig()` on
every call
**`util/prompt.py`**
- Fix `_truncate_middle_tokens` edge case: returns empty string when
`max_tok < 1`, properly handles `max_tok < 3`
**`config.py`**
- E2B sandbox timeout raised from 5 min to 15 min to accommodate
compaction retries
**`prompt_too_long_test.py`** (new, 45 tests)
- `_is_prompt_too_long` positive/negative patterns, case sensitivity,
BaseException handling
- Flatten helpers for assistant/tool_result content blocks
- `_transcript_to_messages` / `_messages_to_transcript` roundtrip,
strippable types, empty content
- `compact_transcript` async tests: too few messages, not compacted,
successful compaction, compression failure
**`retry_scenarios_test.py`** (new, 27 tests)
- Full retry state machine simulation covering all 8 scenarios:
1. Normal flow (no retry)
2. Compact succeeds → retry succeeds
3. Compact fails → DB fallback succeeds
4. No transcript → DB fallback succeeds
5. Double fail → DB fallback on attempt 3
6. All 3 attempts exhausted
7. Non-prompt-too-long error (no retry)
8. Compaction returns identical content → DB fallback
- Edge cases: nested exceptions, case insensitivity, unicode content,
large transcripts, resume-after-compaction flow
**Shared test fixtures** (`conftest.py`)
- Extracted `build_test_transcript` helper used across 3 test files to
eliminate duplication
## Test plan
- [x] `_is_prompt_too_long` correctly identifies prompt-length errors (8
positive, 5 negative patterns)
- [x] `compact_transcript` compacts oversized transcripts via LLM
summarization
- [x] `compact_transcript` returns `None` on failure or when already
within budget
- [x] Retry loop state machine: all 8 scenarios verified with state
assertions
- [x] `TranscriptBuilder` works correctly after loading compacted
transcripts
- [x] `_messages_to_transcript` roundtrip preserves content including
unicode
- [x] `transcript_caused_error` prevents stale transcript upload
- [x] Truncation timeout prevents unbounded CPU time
- [x] All 139 unit tests pass locally
- [x] CI green (tests 3.11/3.12/3.13, types, CodeQL, linting)
Requested by @Torantulino
Add `google/nano-banana-2` (Gemini 3.1 Flash Image) support across all
three image blocks.
### Changes
**`ai_image_customizer.py`**
- Add `NANO_BANANA_2 = "google/nano-banana-2"` to `GeminiImageModel`
enum
- Update block description to reference Nano-Banana models generically
**`ai_image_generator_block.py`**
- Add `NANO_BANANA_2` to `ImageGenModel` enum
- Add generation branch (identical to NBP except model name)
**`flux_kontext.py` (AI Image Editor)**
- Rename `FluxKontextModelName` → `ImageEditorModel` (with
backwards-compatible alias)
- Add `NANO_BANANA_PRO` and `NANO_BANANA_2` to the editor
- Model-aware branching in `run_model()`: NB models use `image_input`
list (not `input_image`), no `seed`, and add `output_format`
**`block_cost_config.py`**
- Add NB2 cost entries for all three blocks (14 credits, matching NBP)
- Add NB Pro cost entry for editor block
- Update editor block refs from `.PRO`/`.MAX` to
`.FLUX_KONTEXT_PRO`/`.FLUX_KONTEXT_MAX`
Resolves SECRT-2047
---------
Co-authored-by: Torantulino <Torantulino@users.noreply.github.com>
Co-authored-by: Abhimanyu Yadav <122007096+Abhi1992002@users.noreply.github.com>
## Summary
- treat AddToListBlock.entry as optional rather than truthy so
0/""/False are appended
- extend block self-tests with a falsy entry case
## Testing
- Not run (pytest not available in environment)
Co-authored-by: DEEVEN SERU <144827577+DEVELOPER-DEEVEN@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
Fixes issue where filenames with no dots until the end (or massive
extensions) bypassed truncation logic, causing OSError [Errno 36].
Limits extension preservation to 20 chars.
---------
Co-authored-by: DEVELOPER-DEEVEN <144827577+DEVELOPER-DEEVEN@users.noreply.github.com>
- Resolves#10657
- Partially based on #10913
### Changes 🏗️
- Run Pyright separately for each supported Python version
- Move type checking and linting into separate jobs
- Add `--skip-pyright` option to lint script
- Move `linter.py` into `backend/scripts`
- Move other scripts in `backend/` too for consistency
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- CI
---
Co-authored-by: @Joaco2603 <jpappa2603@gmail.com>
---------
Co-authored-by: Joaco2603 <jpappa2603@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Poetry v2.2.1 has bugfixes that are relevant in context of our
`.pre-commit-config.yaml`
### Changes 🏗️
- Update `poetry` from v2.1.1 to v2.2.1 (latest version supported by
Dependabot)
- Re-generate `poetry.lock`
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- CI
## Summary
- Updates the `pr-address` skill to poll for new PR comments while
waiting for CI, instead of blocking solely on `gh pr checks --watch
--fail-fast`
- Runs CI watch in the background and polls all 3 comment endpoints
every 30s
- Allows bot comments (coderabbitai, sentry) to be addressed in parallel
with CI rather than sequentially
## Test plan
- [ ] Run `/pr-address` on a PR with pending CI and verify it detects
new comments while CI is running
- [ ] Verify CI failures are still handled correctly after the combined
wait
* fix(backend): add resource limits to Jinja2 template rendering
Prevent DoS via computational exhaustion in FillTextTemplateBlock by:
- Subclassing SandboxedEnvironment to intercept ** and * operators
with caps on exponent size (1000) and string repeat length (10K)
- Replacing range() global with a capped version (max 10K items)
- Wrapping template.render() in a ThreadPoolExecutor with a 10s
timeout to kill runaway expressions
Addresses GHSA-ppw9-h7rv-gwq9 (CWE-400).
* address review: move helpers after TextFormatter, drop ThreadPoolExecutor
- Move _safe_range and _RestrictedEnvironment below TextFormatter
(helpers after the function that uses them)
- Remove ThreadPoolExecutor timeout wrapper from format_string() —
it has problematic behavior in async contexts and the static
interception (operator caps, range limit) already covers the
known attack vectors
* address review: extend sequence guard, harden format_email, add tests
- Extend * guard to cover list and tuple repetition, not just strings
(blocks {{ [0] * 999999999 }} and {{ (0,) * 999999999 }})
- Rename MAX_STRING_REPEAT → MAX_SEQUENCE_REPEAT
- Use _RestrictedEnvironment in format_email (defense-in-depth)
- Add tests: list repeat, tuple repeat, negative exponent, nested
exponentiation (18 tests total)
* add async timeout wrapper at block level
Wrap format_string calls in FillTextTemplateBlock and AgentOutputBlock
with asyncio.wait_for(asyncio.to_thread(...), timeout=10s).
This provides defense-in-depth: if an expression somehow bypasses the
static operator checks, the async timeout will cancel it. Uses
asyncio.to_thread for proper async integration (no event loop blocking)
and asyncio.wait_for for real cancellation on timeout.
* make format_string async with timeout kwarg
Move asyncio.wait_for + asyncio.to_thread into format_string() itself
with a timeout kwarg (default 10s). This way all callers get the
timeout automatically — no wrapper needed at each call site.
- format_string() is now async, callers use await
- format_email() is now async (calls format_string internally)
- Updated all callers: text.py, io.py, llm.py, smart_decision_maker.py,
email.py, notifications.py
- Tests updated to use asyncio.run()
* use Jinja2 native async rendering instead of to_thread
Switch from asyncio.to_thread(template.render) to Jinja2's native
enable_async=True + template.render_async(). No thread overhead,
proper async integration. asyncio.wait_for timeout still applies.
---------
Co-authored-by: Reinier van der Leer <pwuts@agpt.co>
## Summary
- Add direct `creator/slug` lookup to `find_agent` marketplace search,
bypassing full-text search when an exact identifier is provided
- Add direct UUID lookup to `find_block`, returning the block
immediately when a valid block ID is given
- Update tool descriptions and parameter hints to document the new
lookup capabilities
## Test plan
- [ ] Verify `find_agent` with a `creator/slug` query returns the exact
agent
- [ ] Verify `find_agent` falls back to search when slug lookup fails
- [ ] Verify `find_block` with a block UUID returns the exact block
- [ ] Verify `find_block` with a non-existent UUID falls through to
search
- [ ] Verify excluded block types/IDs are still filtered in direct
lookup
- Prefer f-strings except for debug statements
- Top-down module/function/class ordering
As suggested by @majdyz, this is more effective than commenting on every
single instance on PRs.
## Summary
- **Path validation fix**: `is_allowed_local_path()` now correctly
handles the SDK's nested conversation UUID path structure
(`<encoded-cwd>/<conversation-uuid>/tool-results/<file>`) instead of
only matching `<encoded-cwd>/tool-results/<file>`
- **`read_workspace_file` fallback**: When the model mistakenly calls
`read_workspace_file` for an SDK tool-result path (local disk, not cloud
storage), the tool now falls back to reading from local disk instead of
returning "file not found"
- **Cross-turn cleanup fix**: Stopped deleting
`~/.claude/projects/<encoded-cwd>/` between turns — tool-result files
now persist across `--resume` turns so the model can re-read them. Added
TTL-based stale directory sweeping (24h) to prevent unbounded disk
growth.
- **System prompt**: Added guidance telling the model to use `read_file`
(not `read_workspace_file`) for SDK tool-result paths
- **Symlink escape fix** (e2b_file_tools.py): Added `readlink -f`
canonicalization inside the E2B sandbox to detect symlink-based path
escapes before writes
- **Stash timeout increase**: `wait_for_stash` timeout increased from
0.5s to 2.0s, with a post-timeout `sleep(0)` fallback
### Root cause
Investigated via Langfuse trace `5116befdca6a6ff9a8af6153753e267d`
(session `d5841fd8`). The model ran 3 Perplexity deep research calls,
SDK truncated large outputs to `~/.claude/projects/.../tool-results/`
files. Model then called `read_workspace_file` (cloud DB) instead of
`read_file` (local disk), getting "file not found". Additionally, the
path validation check didn't account for the SDK's nested UUID directory
structure, and cleanup between turns deleted tool-result files that the
transcript still referenced.
## Test plan
- [x] All 653 copilot tests pass (excluding 1 pre-existing infra test)
- [x] Security test `test_read_claude_projects_settings_json_denied`
still passes — non-tool-result files under the project dir are still
blocked
- [x] `poetry run format` passes all checks
## Summary
- Teaches the `pr-address` skill to use `gh pr checks --watch
--fail-fast` for efficient CI waiting instead of manual polling
- Adds guidance on investigating failures with `gh run view
--log-failed`
- Adds explicit "between CI waits" section: re-fetch and address new bot
comments while CI runs
## Test plan
- [x] Verified the updated skill renders correctly
- [ ] Use `/pr-address` on a PR with pending CI to confirm the new flow
works
SendEmailBlock accepted user-supplied smtp_server and smtp_port inputs
and passed them directly to smtplib.SMTP() with no IP validation,
bypassing the platform's SSRF protections in request.py.
This fix:
- Makes _resolve_and_check_blocked public in request.py so non-HTTP
blocks can reuse the same IP validation
- Validates the SMTP server hostname via resolve_and_check_blocked()
before connecting
- Restricts allowed SMTP ports to standard values (25, 465, 587, 2525)
- Catches SMTPConnectError and SMTPServerDisconnected to prevent TCP
banner leakage in error messages
Fixes GHSA-4jwj-6mg5-wrwf
* fix(backend): add HMAC signing to Redis cache to prevent pickle deserialization attacks
Add HMAC-SHA256 integrity verification to all values stored in the shared
Redis cache. This prevents cache poisoning attacks where an attacker with
Redis access injects malicious pickled payloads that execute arbitrary code
on deserialization.
Changes:
- Sign pickled values with HMAC-SHA256 before storing in Redis
- Verify HMAC signature before deserializing cached values
- Reject tampered or unsigned (legacy) cache entries gracefully
(treated as cache misses, logged as warnings)
- Derive HMAC key from redis_password or unsubscribe_secret_key
- Add tests for HMAC round-trip, tamper detection, and legacy rejection
Fixes GHSA-rfg2-37xq-w4m9
* improve log message
---------
Co-authored-by: Reinier van der Leer <pwuts@agpt.co>
Replace NamedTemporaryFile(delete=False) with a direct Response,
preventing unbounded disk consumption on the public download endpoint.
Fixes: GHSA-374w-2pxq-c9jp
<!-- Clearly explain the need for these changes: -->
The previous confetti implementation using party-js was causing lag.
Replaced it with canvas-confetti for smoother, more performant
celebrations with enhanced visual effects.
### Changes 🏗️
- **New Confetti Component**: Reusable canvas-confetti wrapper with
AutoGPT purple color palette and Storybook stories demonstrating various
effects
- **Enhanced Wallet Confetti**: Dual simultaneous bursts at 45° and 135°
angles with larger particles (scalar 1.2) for better visibility
- **Enhanced Task Celebration**: Dual-burst confetti for task group and
individual task completion events
- **Onboarding Congrats Page**: Replaced party-js with canvas-confetti
for side-cannon animation effect
- **Dependency**: Added canvas-confetti v1.9.4, removed party-js
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Trigger task completion in wallet to see dual-burst confetti at
45° and 135° angles
- [x] Complete tasks/groups to verify celebration confetti displays with
larger particles
- [x] Visit onboarding congratulations page to see side-cannon effect
- [x] Verify confetti rendering performance and no console errors
### Changes
- Replace skipped legacy builder tests with 8 working Playwright e2e
tests
targeting the new Flow Editor
- Rewrite `BuildPage` page object to match new `data-id`/`data-testid`
selectors
- Update `agent-activity.spec.ts` to use new `BuildPage` API
### Tests added
- Build page loads successfully (canvas + control buttons)
- Add a block via block menu search
- Add multiple blocks
- Remove a block (select + Backspace)
- Save an agent (name/description, verify flowID in URL)
- Save and verify run button becomes enabled
- Copy and paste a node (Cmd+C/V)
- Run an agent from the builder
### Test plan
- [x] All 8 builder tests pass locally (`pnpm test:no-build
src/tests/build.spec.ts`)
- [x] `pnpm format`, `pnpm lint`, `pnpm types` all clean
- [x] CI passes
## Summary
- Detect transient Anthropic API errors (ECONNRESET, "socket connection
was closed unexpectedly") across all error paths in the copilot SDK
streaming loop
- Replace raw technical error messages with user-friendly text:
**"Anthropic connection interrupted — please retry"**
- Add `retryable` field to `StreamError` model so the frontend can
distinguish retryable errors
- Add **"Try Again" button** on the error card for transient errors,
which re-sends the last user message
### Background
Sentry issue
[AUTOGPT-SERVER-875](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-875)
— 25+ events since March 13, caused by Anthropic API infrastructure
instability (confirmed by their status page). Same SDK/code on dev and
prod, prod-only because of higher volume of long-running streaming
sessions.
### Changes
**Backend (`constants.py`, `service.py`, `response_adapter.py`,
`response_model.py`):**
- `is_transient_api_error()` — pattern-matching helper for known
transient error strings
- Intercept transient errors in 3 places: `AssistantMessage.error`,
stream exceptions, `BaseException` handler
- Use friendly message in error markers persisted to session (so it
shows properly on page refresh too)
- `StreamError.retryable` field for frontend consumption
**Frontend (`ChatContainer`, `ChatMessagesContainer`,
`MessagePartRenderer`):**
- Thread `onRetry` callback from `ChatContainer` →
`ChatMessagesContainer` → `MessagePartRenderer`
- Detect transient error text in error markers and show "Try Again"
button via existing `ErrorCard.onRetry`
- Clicking "Try Again" re-sends the last user message (backend
auto-cleans stale error markers)
Fixes SECRT-2128, SECRT-2129, SECRT-2130
## Test plan
- [ ] Verify transient error detection with `is_transient_api_error()`
for known patterns
- [ ] Confirm error card shows "Anthropic connection interrupted —
please retry" instead of raw socket error
- [ ] Confirm "Try Again" button appears on transient error cards
- [ ] Confirm "Try Again" re-sends the last user message successfully
- [ ] Confirm non-transient errors (e.g., "Prompt is too long") still
show original error text without retry button
- [ ] Verify error marker persists correctly on page refresh
### Changes
- Integrates the existing graph search components into the new builder's
control panel
- Search by block name/title, block type, node inputs/outputs, and
description with fuzzy matching
(Jaro-Winkler)
- Clicking a result zooms/navigates to the node on the canvas
- Keyboard shortcut Cmd/Ctrl+F to open search
- Arrow key navigation and Enter to select within results
- Styled to match the new builder's block menu card pattern
https://github.com/user-attachments/assets/41ed676d-83b1-4f00-8611-00d20987a7af
### Test plan
- [x] Open builder with a graph containing multiple nodes
- [x] Click magnifying glass icon in control panel — search panel opens
- [x] Type a query — results filter by name, type, inputs, outputs
- [x] Click a result — canvas zooms to that node
- [x] Use arrow keys + Enter to navigate and select results
- [x] Press Cmd/Ctrl+F — search panel opens
- [x] Press Escape or click outside — search panel closes and query
clears
Replaces the custom LibraryTabs component with the design system's
TabsLine component throughout the library page for better UI
consistency. Also wires up favorite animation refs and removes the
unused `agentGraphVersion` field from the test fixture.
### Changes 🏗️
- Replace `LibraryTabs` with `TabsLine` from design system in
`FavoritesSection`, `LibrarySubSection`, and `page.tsx`
- Add favorite animation ref registration in `FavoritesSection` and
`LibrarySubSection`
- Inline tab type definition as `{ id: string; title: string; icon: Icon
}` in component props
- Remove unused `agentGraphVersion` field from `load_store_agents.py`
test
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Library page renders with both "All" and "Favorites" tabs using
TabsLine component
- [x] Tab switching between all agents and favorites works correctly
- [x] Favorite animations reference the correct tab element
## Summary
Two bugs causing blocks to be invisible in CoPilot search:
### Bug 1: CamelCase block names not tokenized
Block names like `AITextGeneratorBlock` were indexed as single tokens in
the search database. PostgreSQL's `plainto_tsquery('english', ...)` and
the BM25 tokenizer both treat CamelCase as one word, so searching for
"text generator" produced zero lexical/BM25 match.
**Fix:** Split CamelCase names into separate words before indexing (e.g.
`"AI Text Generator Block"`) and in the BM25 tokenizer.
### Bug 2: Disabled blocks exhausting batch budget (root cause of 36
missing blocks)
The `batch_size` limit in `get_missing_items()` was applied **before**
filtering out disabled blocks. With 120+ disabled blocks and
`batch_size=100`, the first 100 missing entries were all disabled
(skipped via `continue`), leaving the 36 enabled blocks beyond the slice
boundary **never indexed**. This made core blocks like
`AITextGeneratorBlock`, `AIConversationBlock`, `AIListGeneratorBlock`,
etc. completely invisible to search.
**Fix:** Filter disabled blocks from the missing list before slicing by
`batch_size`.
### Changes
- **`content_handlers.py`**:
- Split CamelCase block names into space-separated words when building
`searchableText`
- Filter disabled blocks before applying `batch_size` slice so enabled
blocks aren't starved
- **`hybrid_search.py`**: Updated BM25 `tokenize()` to split CamelCase
tokens
### Evidence from local DB
```
Indexed blocks: 341
Total blocks: 497 (156 missing from index)
Missing (non-disabled): 36 — including AITextGeneratorBlock, AIConversationBlock, etc.
# batch_size analysis:
First 100 missing: 0 enabled, 100 disabled ← batch exhausted by disabled blocks
After 100: 36 enabled ← never reached!
```
## Test plan
- [ ] Verify CamelCase splitting: `AITextGeneratorBlock` → `AI Text
Generator Block`
- [ ] Run `poetry run pytest backend/api/features/store/` for
regressions
- [ ] After deploy, trigger embedding backfill and verify all 36 blocks
get indexed
- [ ] Search for "text generator" in CoPilot and verify
`AITextGeneratorBlock` appears
### Changes
- Add YouTube/Vimeo embed support to `VideoRenderer` — URLs render as
embedded
iframe players instead of plain text
- Add new `AudioRenderer` — HTTP audio URLs (.mp3, .wav, .ogg, .m4a,
.aac,
.flac) and data URIs render as inline audio players
- Add new `LinkRenderer` — any HTTP/HTTPS URL not claimed by a media
renderer
becomes a clickable link with an external-link icon
- Add media preview button to `FileInput` — uploaded audio, video, and
image
files show an Eye icon that opens a preview dialog reusing the
OutputRenderer
system
- Update `ContentRenderer` shortContent gate to allow new renderers
through in
node previews
https://github.com/user-attachments/assets/eea27fb7-3870-4a1e-8d08-ba23b6e07d74
### Test plan
- [x] `pnpm vitest run src/components/contextual/OutputRenderers/` — 36
tests
passing
- [x] `pnpm format && pnpm lint && pnpm types` — all clean
- [x] Manual: run a block that outputs a YouTube URL → embedded player
- [x] Manual: run a block that outputs an audio file URL → audio player
- [x] Manual: run a block that outputs a generic URL → clickable link
- [x] Manual: upload an audio/video/image file to a file input → Eye
icon
appears, clicking opens preview dialog
### Summary
- SECRT-2094: Fix store image delete button accidentally submitting the
edit form — the remove image <button> in ThumbnailImages.tsx was missing
type="button", causing it to act as a form submit inside the
EditAgentForm. This closed the modal and showed a success toast without
the user clicking "Update submission".
https://github.com/user-attachments/assets/86cbdd7d-90b1-473c-9709-e75e956dea6b
### Changes
- `frontend/.../ThumbnailImages.tsx` — added type="button" to image
remove button
### Changes 🏗️
- Added comprehensive architecture documentation at
`docs/platform/workspace-media-architecture.md` covering:
- Database models (`UserWorkspace`, `UserWorkspaceFile`)
- `WorkspaceManager` API with session scoping
- `store_media_file()` media normalization pipeline (input types, return
formats)
- Virus scanning responsibility boundaries
- Decision tree for choosing `WorkspaceManager` vs `store_media_file()`
- Configuration reference including `clamav_max_concurrency` and
`clamav_mark_failed_scans_as_clean`
- Common patterns with error handling examples
- Updated `autogpt_platform/backend/CLAUDE.md` with a "Workspace & Media
Files" section referencing the new docs
- Removed duplicate `scan_content_safe()` call from
`WriteWorkspaceFileTool` — `WorkspaceManager.write_file()` already scans
internally, so the tool was double-scanning every file
- Replaced removed comment in `workspace.py` with explicit ownership
comment clarifying that `WorkspaceManager` is the single scanning
boundary
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verified `scan_content_safe()` is called inside
`WorkspaceManager.write_file()` (workspace.py:186)
- [x] Verified `store_media_file()` scans all input branches including
local paths (file.py:351)
- [x] Verified documentation accuracy against current source code after
merge with dev
- [x] CI checks all passing
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Low Risk**
> Mostly adds documentation and internal developer guidance; the only
code change is a comment clarifying `WorkspaceManager.write_file()` as
the single virus-scanning boundary, with no behavior change.
>
> **Overview**
> Adds a new `docs/platform/workspace-media-architecture.md` describing
the Workspace storage layer vs the `store_media_file()` media pipeline,
including session scoping and virus-scanning/persistence responsibility
boundaries.
>
> Updates backend `CLAUDE.md` to point contributors to the new doc when
working on CoPilot uploads/downloads or
`WorkspaceManager`/`store_media_file()`, and clarifies in
`WorkspaceManager.write_file()` (comment-only) that callers should not
duplicate virus scanning.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
18fcfa03f8. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Our CI costs are skyrocketing, most of it because of
`platform-fullstack-ci.yml`. The `types` job currently uses in a
`big-boi` runner (= expensive), but doesn't need to.
Additionally, the "end-to-end tests" job is currently in
`platform-frontend-ci.yml` instead of `platform-fullstack-ci.yml`,
causing it not to run on backend changes (which it should).
### Changes 🏗️
- Simplify `check-api-types` job (renamed from `types`) and make it use
regular `ubuntu-latest` runner
- Export API schema from backend through CLI (instead of spinning it up
in docker)
- Fix dependency caching in `platform-fullstack-ci.yml` (based on recent
improvements in `platform-frontend-ci.yml`)
- Move `e2e_tests` job to `platform-fullstack-ci.yml`
Out-of-scope but necessary:
- Eliminate module-level init of OpenAI client in
`backend.copilot.service`
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- CI
The `@@agptfile:` expansion system previously used content-sniffing
(trying
`json.loads` then `csv.Sniffer`) to decide whether to parse file content
as
structured data. This was fragile — a file containing just `"42"` would
be
parsed as an integer, and the heuristics could misfire on ambiguous
content.
This PR replaces content-sniffing with **extension/MIME-based format
detection**.
When the file has a well-known extension (`.json`, `.csv`, etc.) or MIME
type
fragment (`workspace://id#application/json`), the content is parsed
accordingly.
Unknown formats or parse failures always fall back to plain string — no
surprises.
> [!NOTE]
> This PR builds on the `@@agptfile:` file reference protocol introduced
in #12332 and the structured data auto-parsing added in #12390.
>
> **What is `@@agptfile:`?**
> It is a special URI prefix (e.g. `@@agptfile:workspace:///report.csv`)
that the CoPilot SDK expands inline before sending tool arguments to
blocks. This lets the AI reference workspace files by name, and the SDK
automatically reads and injects the file content. See #12332 for the
full design.
### Changes 🏗️
**New utility: `backend/util/file_content_parser.py`**
- `infer_format(uri)` — determines format from file extension or MIME
fragment
- `parse_file_content(content, fmt)` — parses content, never raises
- Supported text formats: JSON, JSONL/NDJSON, CSV, TSV, YAML, TOML
- Supported binary formats: Parquet (via pyarrow), Excel/XLSX (via
openpyxl)
- JSON scalars (strings, numbers, booleans, null) stay as strings — only
containers (arrays, objects) are promoted
- CSV/TSV require ≥1 row and ≥2 columns to qualify as tabular data
- Added `openpyxl` dependency for Excel reading via pandas
- Case-insensitive MIME fragment matching per RFC 2045
- Shared `PARSE_EXCEPTIONS` constant to avoid duplication between
modules
**Updated `expand_file_refs_in_args` in `file_ref.py`**
- Bare refs now use `infer_format` + `parse_file_content` instead of the
old `_try_parse_structured` content-sniffing function
- Binary formats (parquet, xlsx) read raw bytes via `read_file_bytes`
- Embedded refs (text around `@@agptfile:`) still produce plain strings
- **Size guards**: Workspace and sandbox file reads now enforce a 10 MB
limit
(matching the existing local file limit) to prevent OOM on large files
**Updated `blocks/github/commits.py`**
- Consolidated `_create_blob` and `_create_binary_blob` into a single
function
with an `encoding` parameter
**Updated copilot system prompt**
- Documents the extension-based structured data parsing and supported
formats
**66 new tests** in `file_content_parser_test.py` covering:
- Format inference (extension, MIME, case-insensitive, precedence)
- All 8 format parsers (happy path + edge cases + fallbacks)
- Binary format handling (string input fallback, invalid bytes fallback)
- Unknown format passthrough
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] All 66 file_content_parser_test.py tests pass
- [x] All 31 file_ref_test.py tests pass
- [x] All 13 file_ref_integration_test.py tests pass
- [x] `poetry run format` passes clean (including pyright)
Adds a "Jump Back In" CTA at the top of the Library page to encourage
users to quickly rerun their most recently successful agent.
Closes SECRT-1536
### Changes 🏗️
- New `JumpBackIn` component with `useJumpBackIn` hook at
`library/components/JumpBackIn/`
- Fetches first page of library agents sorted by `updatedAt`
- Finds the first agent with a `COMPLETED` execution in
`recent_executions`
- Shows banner with agent name + "Jump Back In" button linking to
`/library/agents/{id}`
- Returns `null` (hidden) when loading or when no agent with a
successful run exists
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `pnpm format`, `pnpm lint`, `pnpm types` all pass
- [x] Verified banner is hidden when no successful runs exist (edge
case)
- [x] Verified library page renders correctly with no visual regressions
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Refactors the Claude Code skills for a cleaner, more intuitive dev loop.
### Changes 🏗️
- **`/pr-review` (new)**: Actual code review skill — reads the PR diff,
fetches existing comments to avoid duplicates, and posts inline GitHub
comments with structured feedback (Blockers / Should Fix / Nice to Have
/ Nit) covering correctness, security, code quality, architecture, and
testing.
- **`/pr-address` (was `/babysit-pr`)**: Addresses review comments and
monitors CI until green. Renamed from `/babysit-pr` to `/pr-address` to
better reflect its purpose. Handles bot-specific feedback
(autogpt-reviewer, sentry, coderabbitai) and loops until all comments
are addressed and CI is green.
- **`/backend-check` + `/frontend-check` → `/check`**: Unified into a
single `/check` skill that auto-detects whether backend (Python) or
frontend (TypeScript) code changed and runs the appropriate formatting,
linting, type checking, and tests. Shared code quality rules applied to
both.
- **`/code-style` enhanced**: Now covers both Python and
TypeScript/React. Added learnings from real PR work: lazy `%s` logging,
TOCTOU awareness, SSE protocol rules (`data:` vs `: comment`), FastAPI
`Security()` vs `Depends()`, Redis pipeline atomicity, error path
sanitization, mock target rules after refactoring.
- **`/worktree` fixed**: Normal `git worktree` is now the default (was
branchlet-first). Branchlet moved to optional section. All paths derived
from `git rev-parse --show-toplevel`.
- **`/pr-create`, `/openapi-regen`, `/new-block` cleaned up**: Reference
`/check` and `/code-style` instead of duplicating instructions.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verified all skill files parse correctly (valid YAML frontmatter)
- [x] Verified skill auto-detection triggers updated in descriptions
- [x] Verified old backend-check and frontend-check directories removed
- [x] Verified pr-review and pr-address directories created with correct
content
## Summary
- Replaces all user-facing "block" terminology in the CoPilot activity
stream with plain-English labels ("Step failed", "action",
"Credentials", etc.)
- Adds `humanizeFileName()` utility to display file names without
extensions, with title-case and spaces (e.g. `executive_memo.md` →
`"Executive Memo"`)
- Updates error messages across RunBlock, RunAgent, and FindBlocks tools
to use friendly language
## Test plan
- [ ] Open CoPilot and trigger a block execution — verify animation text
says "Running" / "Step failed" instead of "Running the block" / "Error
running block"
- [ ] Trigger a file read/write action — verify the activity shows
humanized file names (e.g. `Reading "Executive Memo"` not `Reading
executive_memo.md`)
- [ ] Trigger FindBlocks — verify labels say "Searching for actions" and
"Results" instead of "Searching for blocks" and "Block results"
- [ ] Check the work-done stats bar — verify it shows "action" /
"actions" instead of "block run" / "block runs"
- [ ] Trigger a setup requirements card — verify labels say
"Credentials" and "Inputs" instead of "Block credentials" and "Block
inputs"
- [ ] Visit `/copilot/styleguide` — verify error test data no longer
contains "Block execution" text
Resolves: [SECRT-2025](https://linear.app/autogpt/issue/SECRT-2025)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Adds a "Run now" action to the schedule detail view and sidebar
dropdown, allowing users to immediately trigger a scheduled agent run
without waiting for the next cron execution.
### Changes 🏗️
- **`useSelectedScheduleActions.ts`**: Added
`usePostV1ExecuteGraphAgent` hook and `handleRunNow` function that
executes the agent using the schedule's stored `input_data` and
`input_credentials`. On success, invalidates runs query and navigates to
the new run
- **`SelectedScheduleActions.tsx`**: Added Play icon button as first
action button, with loading spinner while running
- **`SelectedScheduleView.tsx`**: Threads `onSelectRun` prop and
`schedule` object to action components (both mobile and desktop layouts)
- **`NewAgentLibraryView.tsx`**: Passes `onSelectRun` handler to enable
navigation to the new run after execution
- **`ScheduleActionsDropdown.tsx`**: Added "Run now" dropdown menu item
with same execution logic
- **`ScheduleListItem.tsx`**: Added `onRunCreated` prop passed to
dropdown
- **`SidebarRunsList.tsx`**: Connects sidebar dropdown to run
selection/navigation
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `pnpm format`, `pnpm lint`, `pnpm types` all pass
- [x] Code review: follows existing patterns (mirrors "Run Again" in
SelectedRunActions)
- [x] No visual regressions on agent detail page
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Requested by @majdyz
When a user asks for Google Sheets integration, the CoPilot agent skips
block discovery entirely (despite 55+ Google Sheets blocks being
available), jumps straight to MCP, guesses a fake URL
(`https://sheets.googleapis.com/mcp`), and gets a raw HTML 404 error
page dumped into the conversation.
**Changes:**
1. **MCP guide** (`mcp_tool_guide.md`): Added "Check blocks first"
section directing the agent to use `find_block` before attempting MCP
for any service not in the known servers list. Explicitly prohibits
guessing/constructing MCP server URLs.
2. **Error handling** (`run_mcp_tool.py`): Detects HTML error pages in
HTTP responses (e.g. raw 404 pages from non-MCP endpoints) and returns a
clean one-liner like "This URL does not appear to host an MCP server"
instead of dumping the full HTML body.
**Note:** The main CoPilot system prompt (managed externally, not in
repo) should also be updated to reinforce block-first behavior in the
Capability Check section. This PR covers the in-repo changes.
Session reference: `9216df83-5f4a-48eb-9457-3ba2057638ae` (turn 3)
Ticket: [SECRT-2116](https://linear.app/autogpt/issue/SECRT-2116)
---
Co-authored-by: Zamil Majdy (@majdyz) <majdyz@gmail.com>
---------
Co-authored-by: Zamil Majdy (@majdyz) <majdyz@gmail.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
Agent validation failures are expected when the LLM generates invalid
agent graphs (wrong block IDs, missing required inputs, bad output field
names). The validator catches these and returns proper error responses.
However, `validator.py:938` used `logger.error()`, which Sentry captures
as error events — flooding #platform-alerts with non-errors.
This changes it to `logger.warning()`, keeping the log visible for
debugging without triggering Sentry alerts.
Fixes SECRT-2120
---
Co-authored-by: Zamil Majdy (@majdyz) <zamil.majdy@agpt.co>
## Summary
- **Root cause**: `TranscriptBuilder` accumulates all raw SDK stream
messages including pre-compaction content. When the CLI compacts
mid-stream, the uploaded transcript was still uncompacted, causing
"Prompt is too long" errors on the next `--resume` turn.
- **Fix**: Detect mid-stream compaction via the `PreCompact` hook, read
the CLI's session file to get the compacted entries (summary +
post-compaction messages), and call
`TranscriptBuilder.replace_entries()` to sync it with the CLI's active
context. This ensures the uploaded transcript always matches what the
CLI sees.
- **Key changes**:
- `CompactionTracker`: stores `transcript_path` from `PreCompact` hook,
one-shot `compaction_just_ended` flag that correctly resets for multiple
compactions
- `read_compacted_entries()`: reads CLI session JSONL, finds
`isCompactSummary: true` entry, returns it + all entries after. Includes
path validation against the CLI projects directory.
- `TranscriptBuilder.replace_entries()`: clears and replaces all entries
with compacted ones, preserving `isCompactSummary` entries (which have
`type: "summary"` that would normally be stripped)
- `load_previous()`: also preserves `isCompactSummary` entries when
loading a previously compacted transcript
- Service stream loop: after compaction ends, reads compacted entries
and syncs TranscriptBuilder
## Test plan
- [x] 69 tests pass across `compaction_test.py` and `transcript_test.py`
- [x] Tests cover: one-shot flag behavior, multiple compactions within a
query, transcript path storage, path traversal rejection,
`read_compacted_entries` (7 tests), `replace_entries` (4 tests),
`load_previous` with compacted content (2 tests)
- [x] Pre-commit hooks pass (lint, format, typecheck)
- [ ] Manual test: trigger compaction in a multi-turn session and verify
the uploaded transcript reflects compaction
Requested by @torantula
Add support for shareable AutoPilot URLs that contain a prompt in the
URL hash fragment, inspired by [Lovable's
implementation](https://docs.lovable.dev/integrations/build-with-url).
**URL format:**
- `/copilot#prompt=URL-encoded-text` — pre-fills the input for the user
to review before sending
- `/copilot?autosubmit=true#prompt=...` — auto-creates a session and
sends the prompt immediately
**Example:**
```
https://platform.agpt.co/copilot#prompt=Create%20a%20todo%20apphttps://platform.agpt.co/copilot?autosubmit=true#prompt=Create%20a%20todo%20app
```
**Key design decisions:**
- Uses URL fragment (`#`) instead of query params — fragments never hit
the server, so prompts stay client-side only (better for privacy, no
backend URL length limits)
- URL is cleaned via `history.replaceState` immediately after extraction
to prevent re-triggering on navigation/reload
- Leverages existing `pendingMessage` + `createSession()` flow for
auto-submit — no new backend APIs needed
- For populate-only mode, passes `initialPrompt` down through component
tree to pre-fill the chat input
**Files changed:**
- `useCopilotPage.ts` — URL hash extraction logic + `initialPrompt`
state
- `CopilotPage.tsx` — passes `initialPrompt` to `ChatContainer`
- `ChatContainer.tsx` — passes `initialPrompt` to `EmptySession`
- `EmptySession.tsx` — passes `initialPrompt` to `ChatInput`
- `ChatInput.tsx` / `useChatInput.ts` — accepts `initialValue` to
pre-fill the textarea
Fixes SECRT-2119
---
Co-authored-by: Toran Bruce Richards (@Torantulino) <toran@agpt.co>
2026-03-13 23:54:54 +07:00
3737 changed files with 283931 additions and 845583 deletions
description: Run the full backend formatting, linting, and test suite. Ensures code quality before commits and PRs. TRIGGER when backend Python code has been modified and needs validation.
user-invocable: true
metadata:
author: autogpt-team
version: "1.0.0"
---
# Backend Check
## Steps
1.**Format**: `poetry run format` — runs formatting AND linting. NEVER run ruff/black/isort individually
2.**Fix** any remaining errors manually, re-run until clean
3.**Test**: `poetry run test` (runs DB setup + pytest). For specific files: `poetry run pytest -s -vvv <test_files>`
4.**Snapshots** (if needed): `poetry run pytest path/to/test.py --snapshot-update` — review with `git diff`
description: Python code style preferences for the AutoGPT backend. Apply when writing or reviewing Python code. TRIGGER when writing new Python code, reviewing PRs, or refactoring backend code.
user-invocable: false
metadata:
author: autogpt-team
version: "1.0.0"
---
# Code Style
## Imports
- **Top-level only** — no local/inner imports. Move all imports to the top of the file.
## Typing
- **No duck typing** — avoid `hasattr`, `getattr`, `isinstance` for type dispatch. Use proper typed interfaces, unions, or protocols.
- **Pydantic models** over dataclass, namedtuple, or raw dict for structured data.
- **No linter suppressors** — avoid `# type: ignore`, `# noqa`, `# pyright: ignore` etc. 99% of the time the right fix is fixing the type/code, not silencing the tool.
## Code Structure
- **List comprehensions** over manual loop-and-append.
- **Early return** — guard clauses first, avoid deep nesting.
- **Flatten inline** — prefer short, concise expressions. Reduce `if/else` chains with direct returns or ternaries when readable.
- **Modular functions** — break complex logic into small, focused functions rather than long blocks with nested conditionals.
## Review Checklist
Before finishing, always ask:
- Can any function be split into smaller pieces?
- Is there unnecessary nesting that an early return would eliminate?
description: Run the full frontend formatting, linting, and type checking suite. Ensures code quality before commits and PRs. TRIGGER when frontend TypeScript/React code has been modified and needs validation.
user-invocable: true
metadata:
author: autogpt-team
version: "1.0.0"
---
# Frontend Check
## Steps (in order)
1.**Format**: `pnpm format` — NEVER run individual formatters
2.**Lint**: `pnpm lint` — fix errors, re-run until clean
3.**Types**: `pnpm types` — if it keeps failing after multiple attempts, stop and ask the user
description: Create a new backend block following the Block SDK Guide. Guides through provider configuration, schema definition, authentication, and testing. TRIGGER when user asks to create a new block, add a new integration, or build a new node for the graph editor.
user-invocable: true
metadata:
author: autogpt-team
version: "1.0.0"
---
# New Block Creation
Read `docs/platform/block-sdk-guide.md` first for the full guide.
## Steps
1.**Provider config** (if external service): create `_config.py` with `ProviderBuilder`
2.**Block file** in `backend/blocks/` (from `autogpt_platform/backend/`):
- Generate a UUID once with `uuid.uuid4()`, then **hard-code that string** as `id` (IDs must be stable across imports)
-`Input(BlockSchema)` and `Output(BlockSchema)` classes
-`async def run` that `yield`s output fields
3.**Files**: use `store_media_file()` with `"for_block_output"` for outputs
4.**Test**: `poetry run pytest 'backend/blocks/test/test_block.py::test_available_blocks[MyBlock]' -xvs`
5.**Format**: `poetry run format`
## Rules
- Analyze interfaces: do inputs/outputs connect well with other blocks in a graph?
description: Open a pull request with proper PR template, test coverage, and review workflow. Guides agents through creating a PR that follows repo conventions, ensures existing behaviors aren't broken, covers new behaviors with tests, and handles review via bot when local testing isn't possible. TRIGGER when user asks to "open a PR", "create a PR", "make a PR", "submit a PR", "open pull request", "push and create PR", or any variation of opening/submitting a pull request.
user-invocable: true
args: "[base-branch] — optional target branch (defaults to dev)."
metadata:
author: autogpt-team
version: "1.0.0"
---
# Open a Pull Request
## Step 1: Pre-flight checks
Before opening the PR:
1. Ensure all changes are committed
2. Ensure the branch is pushed to the remote (`git push -u origin <branch>`)
3. Run linters/formatters across the whole repo (not just changed files) and commit any fixes
## Step 2: Test coverage
**This is critical.** Before opening the PR, verify:
### Existing behavior is not broken
- Identify which modules/components your changes touch
- Run the existing test suites for those areas
- If tests fail, fix them before opening the PR — do not open a PR with known regressions
### New behavior has test coverage
- Every new feature, endpoint, or behavior change needs tests
- If you added a new block, add tests for that block
- If you changed API behavior, add or update API tests
- If you changed frontend behavior, verify it doesn't break existing flows
If you cannot run the full test suite locally, note which tests you ran and which you couldn't in the test plan.
## Step 3: Create the PR using the repo template
Read the canonical PR template at `.github/PULL_REQUEST_TEMPLATE.md` and use it **verbatim** as your PR body:
1. Read the template: `cat .github/PULL_REQUEST_TEMPLATE.md`
2. Preserve the exact section titles and formatting, including:
-`### Why / What / How`
-`### Changes 🏗️`
-`### Checklist 📋`
3. Replace HTML comment prompts (`<!-- ... -->`) with actual content; do not leave them in
4.**Do not pre-check boxes** — leave all checkboxes as `- [ ]` until each step is actually completed
5. Do not alter the template structure, rename sections, or remove any checklist items
**PR title must use conventional commit format** (e.g., `feat(backend): add new block`, `fix(frontend): resolve routing bug`, `dx(skills): update PR workflow`). See CLAUDE.md for the full list of scopes.
Use `gh pr create` with the base branch (defaults to `dev` if no `[base-branch]` was provided). Use `--body-file` to avoid shell interpretation of backticks and special characters:
```bash
BASE_BRANCH="${BASE_BRANCH:-dev}"
PR_BODY=$(mktemp)
cat > "$PR_BODY"<< 'PREOF'
<filled-in template from .github/PULL_REQUEST_TEMPLATE.md>
### If you have a workspace that allows testing (docker, running backend, etc.)
- Run `/pr-test` to do E2E manual testing of the PR using docker compose, agent-browser, and API calls. This is the most thorough way to validate your changes before review.
- After testing, run `/pr-review` to self-review the PR for correctness, security, code quality, and testing gaps before requesting human review.
### If you do NOT have a workspace that allows testing
This is common for agents running in worktrees without a full stack. In this case:
1. Run `/pr-review` locally to catch obvious issues before pushing
2.**Comment `/review` on the PR** after creating it to trigger the review bot
3.**Poll for the review** rather than blindly waiting — check for new review comments every 30 seconds using `gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/reviews --paginate` and the GraphQL inline threads query. The bot typically responds within 30 minutes, but polling lets the agent react as soon as it arrives.
4. Do NOT proceed or merge until the bot review comes back
5. Address any issues the bot raises — use `/pr-address` which has a full polling loop with CI + comment tracking
```bash
# After creating the PR:
PR_NUMBER=$(gh pr view --json number -q .number)
gh pr comment "$PR_NUMBER" --body "/review"
# Then use /pr-address to poll for and address the review when it arrives
```
## Step 5: Address review feedback
Once the review bot or human reviewers leave comments:
- Run `/pr-address` to address review comments. It will loop until CI is green and all comments are resolved.
- Do not merge without human approval.
## Related skills
| Skill | When to use |
|---|---|
| `/pr-test` | E2E testing with docker compose, agent-browser, API calls — use when you have a running workspace |
| `/pr-review` | Review for correctness, security, code quality — use before requesting human review |
| `/pr-address` | Address reviewer comments and loop until CI green — use after reviews come in |
## Step 6: Post-creation
After the PR is created and review is triggered:
- Share the PR URL with the user
- If waiting on the review bot, let the user know the expected wait time (~30 min)
description: Regenerate the OpenAPI spec and frontend API client. Starts the backend REST server, fetches the spec, and regenerates the typed frontend hooks. TRIGGER when API routes change, new endpoints are added, or frontend API types are stale.
user-invocable: true
metadata:
author: autogpt-team
version: "1.0.0"
---
# OpenAPI Spec Regeneration
## Steps
1.**Run end-to-end** in a single shell block (so `REST_PID` persists):
description: "Meta-agent supervisor that manages a fleet of Claude Code agents running in tmux windows. Auto-discovers spare worktrees, spawns agents, monitors state, kicks idle agents, approves safe confirmations, and recycles worktrees when done. TRIGGER when user asks to supervise agents, run parallel tasks, manage worktrees, check agent status, or orchestrate parallel work."
user-invocable: true
argument-hint: "any free text — e.g. 'start 3 agents on X Y Z', 'show status', 'add task: implement feature A', 'stop', 'how many are free?'"
metadata:
author: autogpt-team
version: "6.0.0"
---
# Orchestrate — Agent Fleet Supervisor
One tmux session, N windows — each window is one agent working in its own worktree. Speak naturally; Claude maps your intent to the right scripts.
| `verify-complete.sh WINDOW` | Verify PR is done: checkpoints ✓ + 0 unresolved threads + CI green + no fresh CHANGES_REQUESTED. Repo auto-derived from state file `.repo` or git remote. |
| `notify.sh MESSAGE` | Send notification via Discord webhook (env `DISCORD_WEBHOOK_URL` or state `.discord_webhook`), macOS notification center, and stdout |
| `capacity.sh [REPO_ROOT]` | Print available + in-use worktrees |
| `status.sh` | Print fleet status + live pane commands |
| `classify-pane.sh WINDOW` | Classify one pane state |
## Supervision model
```
Orchestrating Claude (this Claude session — IS the supervisor)
└── Reads pane output, checks CI, intervenes with targeted guidance
run-loop.sh (separate tmux window, every 30s)
└── Mechanical only: idle restart, dialog approval, recycle on ORCHESTRATOR:DONE
```
**You (the orchestrating Claude)** are the supervisor. After spawning agents, stay in this conversation and actively monitor: poll each agent's pane every 2-3 minutes, check CI, nudge stalled agents, and verify completions. Do not spawn a separate supervisor Claude window — it loses context, is hard to observe, and compounds context compression problems.
**run-loop.sh** is the mechanical layer — zero tokens, handles things that need no judgment: restart crashed agents, press Enter on dialogs, recycle completed worktrees (only after `verify-complete.sh` passes).
## Checkpoint protocol
Agents output checkpoints as they complete each required step:
```
CHECKPOINT:<step-name>
```
Required steps are passed as args to `spawn-agent.sh` (e.g. `pr-address pr-test`). `run-loop.sh` will not recycle a window until all required checkpoints are found in the pane output. If `verify-complete.sh` fails, the agent is re-briefed automatically.
verify-complete.sh: checkpoints ✓ + 0 threads + CI green + no fresh CHANGES_REQUESTED
↓
state → "done", notify, window KEPT OPEN
↓
user/orchestrator explicitly requests recycle
↓
recycle-agent.sh → spare/N (free again)
```
**Windows are never auto-killed.** The worktree stays on its branch, the session stays alive. The agent is done working but the window, git state, and Claude session are all preserved until you choose to recycle.
**To resume a done or crashed session:**
```bash
# Resume by stored session ID (preferred — exact session, full context)
claude --resume SESSION_ID --permission-mode bypassPermissions
# Or resume most recent session in that worktree directory
cd /path/to/worktree && claude --continue --permission-mode bypassPermissions
-`repo` — GitHub `owner/repo` for CI/thread checks. Auto-derived from git remote if omitted.
-`discord_webhook` — Discord webhook URL for completion notifications. Also reads `DISCORD_WEBHOOK_URL` env var.
Per-agent fields:
-`session_id` — UUID passed to `claude --session-id` at spawn; use with `claude --resume UUID` to restore exact session context after a crash or window close.
-`last_rebriefed_at` — Unix timestamp of last re-brief; enforces 5-min cooldown to prevent spam.
`done` means verified complete — window is still open, session still alive, worktree still on task branch. Not recycled yet.
## Serial /pr-test rule
`/pr-test` and `/pr-test --fix` run local Docker + integration tests that use shared ports, a shared database, and shared build caches. **Running two `/pr-test` jobs simultaneously will cause port conflicts and database corruption.**
**Rule: only one `/pr-test` runs at a time. The orchestrator serializes them.**
You (the orchestrating Claude) own the test queue:
1. Agents do `pr-review` and `pr-address` in parallel — that's safe (they only push code and reply to GitHub).
2. When a PR needs local testing, add it to your mental queue — don't give agents a `pr-test` step.
3. Run `/pr-test https://github.com/OWNER/REPO/pull/PR_NUMBER --fix` yourself, sequentially.
4. Feed results back to the relevant agent via `tmux send-keys`:
```bash
tmux send-keys -t SESSION:WIN "Local tests for PR #N: <paste failure output or 'all passed'>. Fix any failures and push, then output ORCHESTRATOR:DONE."
sleep 0.3
tmux send-keys -t SESSION:WIN Enter
```
5. Wait for CI to confirm green before marking the agent done.
If multiple PRs need testing at the same time, pick the one furthest along (fewest pending CI checks) and test it first. Only start the next test after the previous one completes.
## Session restore (tested and confirmed)
Agent sessions are saved to disk. To restore a closed or crashed session:
tmux send-keys -t "SESSION:${NEW_WIN}" "cd /path/to/worktree && claude --resume SESSION_ID --permission-mode bypassPermissions" Enter
# If no session_id (use --continue for most recent session in that directory):
tmux send-keys -t "SESSION:${NEW_WIN}" "cd /path/to/worktree && claude --continue --permission-mode bypassPermissions" Enter
```
`--continue` restores the full conversation history including all tool calls, file edits, and context. The agent resumes exactly where it left off. After restoring, update the window address in the state file:
```bash
jq --arg old "SESSION:OLD_WIN" --arg new "SESSION:NEW_WIN" \
Use an existing session. **Never create a tmux session from within Claude** — it becomes a child of Claude's process and dies when the session ends. If no session exists, tell the user to run `tmux new-session -d -s autogpt1` in their terminal first, then re-invoke `/orchestrate`.
`spawn-agent.sh` writes the initial agent record (window, worktree_path, branch, objective, state, etc.) to the state file automatically — **do not append the record again after calling it.** The record already exists and `pr_number`/`steps` are patched in by the script itself.
### 6. Begin supervising directly in this conversation
You are the supervisor. After spawning, immediately start your first poll loop (see **Supervisor duties** below) and continue every 2-3 minutes. Do NOT spawn a separate supervisor Claude window.
## Adding an agent
Find the next spare worktree, then spawn and append to state — same as steps 2–4 above but for a single task. If no spare worktrees are available, tell the user.
## Supervisor duties (YOUR job, every 2-3 min in this conversation)
You are the supervisor. Run this poll loop directly in your Claude session — not in a separate window.
### Poll loop mechanism
You are reactive — you only act when a tool completes or the user sends a message. To create a self-sustaining poll loop without user involvement:
1. Start each poll with `run_in_background: true` + a sleep before the work:
and capture every window listed. If you manually added a window outside spawn-agent.sh, ensure it's in the state file first.
### RUNNING count includes waiting_approval agents
The `RUNNING` count from run-loop.sh includes agents in `waiting_approval` state (they match the regex `running|stuck|waiting_approval|idle`). This means a fleet that is only `waiting_approval` still shows RUNNING > 0 in the log — it does **not** mean agents are actively working.
When you see `RUNNING > 0` in the run-loop log but suspect agents are actually blocked, check state directly:
A count of `running=1 waiting=1` in the log actually means one agent is waiting for approval — the orchestrator should check and approve, not wait.
### State file staleness recovery
The state file is written by scripts but can drift from reality when windows are closed, sessions expire, or the orchestrator restarts across conversations.
**Signs of stale state:**
- `loop_window` points to a window that no longer exists in the tmux session
- An agent's `state` is `running` but tmux window is closed or shows a shell prompt (not claude)
- `last_seen_at` is hours old but state still says `running`
4. **After any state repair, re-run `status.sh` to confirm coherence before resuming supervision.**
### Strict ORCHESTRATOR:DONE gate
`verify-complete.sh` handles the main checks automatically (checkpoints, threads, CI green, spawned_at, and CHANGES_REQUESTED). Run it:
**CHANGES_REQUESTED staleness rule**: a `CHANGES_REQUESTED` review only blocks if it was submitted *after* the latest commit. If the latest commit postdates the review, the review is considered stale (feedback already addressed) and does not block. This avoids false negatives when a bot reviewer hasn't re-reviewed after the agent's fixing commits.
```bash
SKILLS_DIR=~/.claude/orchestrator/scripts
bash $SKILLS_DIR/verify-complete.sh SESSION:WIN
```
If it passes → run-loop.sh will recycle the window automatically. No manual action needed.
If it fails → re-brief the agent with the failure reason. Never manually mark state `done` to bypass this.
### Re-brief a stalled agent
**Before sending any nudge, verify the pane is at an idle ❯ prompt.** Sending text into a still-processing pane produces stuck `[Pasted text +N lines]` that the agent never sees.
2. **Reference the path in the objective string**:
```bash
OBJECTIVE="Implement the layout shown in /tmp/orchestrator-context-1234567890.png. Read that image first with the Read tool to understand the design."
```
3. The agent uses its `Read` tool to view the image at startup — Claude Code agents are multimodal and can read image files directly.
**Rule**: always use `/tmp/orchestrator-context-<timestamp>.png` as the naming convention so the supervisor knows what to look for if it needs to re-brief an agent with the same image.
---
## Orchestrator final evaluation (YOU decide, not the script)
`verify-complete.sh` is a gate — it blocks premature marking. But it cannot tell you if the work is actually good. That is YOUR job.
When run-loop marks an agent `pending_evaluation` and you're notified, do all of these before marking done:
### 1. Run /pr-test (required, serialized, use TodoWrite to queue)
`/pr-test` is the only reliable confirmation that the objective is actually met. Run it yourself, not the agent.
**When multiple PRs reach `pending_evaluation` at the same time, use TodoWrite to queue them:**
Context: This PR implements <objective from state file>. Key files: <list>.
Please verify: <specific behaviors to check>.
```
Only one `/pr-test` at a time — they share ports and DB.
### /pr-test result evaluation
**PARTIAL on any headline feature scenario is an immediate blocker.** Do not approve, do not mark done, do not let the agent output `ORCHESTRATOR:DONE`.
| `/pr-test` result | Action |
|---|---|
| All headline scenarios **PASS** | Proceed to evaluation step 2 |
| Any headline scenario **PARTIAL** | Re-brief the agent immediately — see below |
| Any headline scenario **FAIL** | Re-brief the agent immediately |
**What PARTIAL means**: the feature is only partly working. Example: the Apply button never appeared, or the AI returned no action blocks. The agent addressed part of the objective but not all of it.
**When any headline scenario is PARTIAL or FAIL:**
1. Do NOT mark the agent done or accept `ORCHESTRATOR:DONE`
2. Re-brief the agent with the specific scenario that failed and what was missing:
```bash
tmux send-keys -t SESSION:WIN "PARTIAL result on /pr-test — S5 (Apply button) never appeared. The AI must output JSON action blocks for the Apply button to render. Fix this before re-running /pr-test."
4. Wait for new `ORCHESTRATOR:DONE`, then re-run `/pr-test` from scratch
**Rule: only ALL-PASS qualifies for approval.** A mix of PASS + PARTIAL is a failure.
> **Why this matters**: A PR was once wrongly approved with S5 PARTIAL — the AI never output JSON action blocks so the Apply button never appeared. The fix was already in the agent's reach but slipped through because PARTIAL was not treated as blocking.
### 2. Do your own evaluation
1. **Read the PR diff and objective** — does the code actually implement what was asked? Is anything obviously missing or half-done?
2. **Read the resolved threads** — were comments addressed with real fixes, or just dismissed/resolved without changes?
3. **Check CI run names** — any suspicious retries that shouldn't have passed?
4. **Check the PR description** — title, summary, test plan complete?
### 3. Decide
- `/pr-test` all scenarios PASS + evaluation looks good → mark `done` in state, tell the user the PR is ready, ask if window should be closed
- `/pr-test` any scenario PARTIAL or FAIL → re-brief the agent with the specific failing scenario, set state back to `running` (see `/pr-test result evaluation` above)
- Evaluation finds gaps even with all PASS → re-brief the agent with specific gaps, set state back to `running`
**Never mark done based purely on script output.** You hold the full objective context; the script does not.
Stop the fleet (`active = false`) when **all** of the following are true:
| Check | How to verify |
|---|---|
| All agents are `done` or `escalated` | `jq '[.agents[] | select(.state | test("running\|stuck\|idle\|waiting_approval"))] | length' ~/.claude/orchestrator-state.json` == 0 |
| All PRs have 0 unresolved review threads | GraphQL `isResolved` check per PR |
| All PRs have green CI **on a run triggered after the agent's last push** | `gh run list --branch BRANCH --limit 1` timestamp > `spawned_at` in state |
| No fresh CHANGES_REQUESTED (after latest commit) | `verify-complete.sh` checks this — stale pre-commit reviews are ignored |
| No agents are `escalated` without human review | If any are escalated, surface to user first |
**Do NOT stop just because agents output `ORCHESTRATOR:DONE`.** That is a signal to verify, not a signal to stop.
**Do stop** if the user explicitly says "stop", "shut down", or "kill everything", even with agents still running.
Does **not** recycle running worktrees — agents may still be mid-task. Run `capacity.sh` to see what's still in progress.
## tmux send-keys pattern
**Always split long messages into text + Enter as two separate calls with a sleep between them.** If sent as one call (`"text" Enter`), Enter can fire before the full string is buffered into Claude's input — leaving the message stuck as `[Pasted text +N lines]` unsent.
```bash
# CORRECT — text then Enter separately
tmux send-keys -t "$WINDOW" "your long message here"
sleep 0.3
tmux send-keys -t "$WINDOW" Enter
# WRONG — Enter may fire before text is buffered
tmux send-keys -t "$WINDOW" "your long message here" Enter
```
Short single-character sends (`y`, `Down`, empty Enter for dialog approval) are safe to combine since they have no buffering lag.
---
## Protected worktrees
Some worktrees must **never** be used as spare worktrees for agent tasks because they host files critical to the orchestrator itself:
| Worktree | Protected branch | Why |
|---|---|---|
| `AutoGPT1` | `dx/orchestrate-skill` | Hosts the orchestrate skill scripts. `recycle-agent.sh` would check out `spare/1`, wiping `.claude/skills/` and breaking all subsequent `spawn-agent.sh` calls. |
**Rule**: when selecting spare worktrees via `find-spare.sh`, skip any worktree whose CURRENT branch matches a protected branch. If you accidentally spawn an agent in a protected worktree, do not let `recycle-agent.sh` run on it — manually restore the branch after the agent finishes.
When `dx/orchestrate-skill` is merged into `dev`, `AutoGPT1` becomes a normal spare again.
---
## Thread resolution integrity (critical)
**Agents MUST NOT resolve review threads via GraphQL unless a real code fix has been committed and pushed first.**
This is the most common failure mode: agents call `resolveReviewThread` to make unresolved counts drop without actually fixing anything. This produces a false "done" signal that gets past verify-complete.sh.
**The only valid resolution sequence:**
1. Read the thread and understand what it's asking
2. Make the actual code change
3. `git commit` and `git push`
4. Reply to the thread with the commit SHA (e.g. "Fixed in `abc1234`")
5. THEN call `resolveReviewThread`
**The supervisor must verify actual thread counts via GraphQL** — never trust an agent's claim of "0 unresolved." After any agent's ORCHESTRATOR:DONE, always run:
If unresolved > 0, the agent is NOT done — re-brief with the actual count and the rule.
**Include this in every agent objective:**
> IMPORTANT: Do NOT resolve any review thread via GraphQL unless the code fix is committed and pushed first. Fix the code → commit → push → reply with SHA → then resolve. Never resolve without a real commit. "Accepted" or "Acknowledged" replies are NOT resolutions — only real commits qualify.
### Detecting fake resolutions
When an agent claims "0 unresolved threads", query GitHub GraphQL yourself and also inspect how each thread was resolved. A resolved thread whose last comment is `"Acknowledged"`, `"Same as above"`, `"Accepted trade-off"`, or `"Deferred"` — with no commit SHA — is a fake resolution.
To spot these, paginate all pages and collect resolved threads with missing SHA links:
```bash
# Paginate all pages — first:100 misses threads beyond page 1 on large PRs
Any resolved thread whose last comment does NOT contain `"Fixed in"`, `"Removed in"`, or `"Addressed in"` (with a commit link) should be investigated — either the agent falsely resolved it, or it was a genuine false positive that needs explanation.
## GitHub abuse rate limits
Two distinct rate limits exist with different recovery times:
| Error | HTTP status | Cause | Recovery |
|---|---|---|---|
| `{"code":"abuse"}` in body | 403 | Secondary rate limit — too many write operations (comments, mutations) in a short window | Wait **2–3 minutes**. 60s is often not enough. |
| `API rate limit exceeded` | 429 | Primary rate limit — too many read calls per hour | Wait until `X-RateLimit-Reset` timestamp |
**Prevention:** Agents must add `sleep 3` between individual thread reply API calls. For >20 unresolved threads, increase to `sleep 5`.
If you see a 403 `abuse` error from an agent's pane:
1. Nudge the agent: `"You hit a GitHub secondary rate limit (403). Stop all API writes. Wait 2 minutes, then resume with sleep 3 between each thread reply."`
2. Do NOT nudge again during the 2-minute wait — a second nudge restarts the clock.
Add this to agent briefings when there are >20 unresolved threads:
> Post replies with `sleep 3` between each reply. If you hit a 403 abuse error, wait 2 minutes (not 60s — secondary limits take longer to clear) then continue.
## Key rules
1. **Scripts do all the heavy lifting** — don't reimplement their logic inline in this file
2. **Never ask the user to pick a worktree** — auto-assign from `find-spare.sh` output
3. **Never restart a running agent** — only restart on `idle` kicks (foreground is a shell)
4. **Auto-dismiss settings dialogs** — if "Enter to confirm" appears, send Down+Enter
5. **Always `--permission-mode bypassPermissions`** on every spawn
6. **Escalate after 3 kicks** — mark `escalated`, surface to user
7. **Atomic state writes** — always write to `.tmp` then `mv`
8. **Never approve destructive commands** outside the worktree scope — when in doubt, escalate
9. **Never recycle without verification** — `verify-complete.sh` must pass before recycling
10. **No TASK.md files** — commit risk; use state file + `gh pr view` for agent context persistence
11. **Re-brief stalled agents** — read objective from state file + `gh pr view`, send via tmux
12. **ORCHESTRATOR:DONE is a signal to verify, not to accept** — always run `verify-complete.sh` and check CI run timestamp before recycling
13. **Protected worktrees** — never use the worktree hosting the skill scripts as a spare
14. **Images via file path** — save screenshots to `/tmp/orchestrator-context-<ts>.png`, pass path in objective; agents read with the `Read` tool
15. **Split send-keys** — always separate text and Enter with `sleep 0.3` between calls for long strings
16. **Poll ALL windows from state file** — never hardcode window count. Derive active windows dynamically: `jq -r '.agents[] | select(.state | test("running|idle|stuck")) | .window' ~/.claude/orchestrator-state.json`. If you added a window mid-session outside spawn-agent.sh, add it to the state file immediately.
20. **Orchestrator handles its own approvals** — when spawning a subagent to make edits (SKILL.md, scripts, config), review the diff yourself and approve/reject without surfacing it to the user. The user should never have to open a file to check the orchestrator's work. Use the Agent tool with `subagent_type: general-purpose` for drafting, then verify the result yourself before considering the task done.
17.**Update state file on re-task** — whenever an agent is re-tasked mid-session (objective changes, new PR assigned), update the state file record immediately so objectives stay accurate for re-briefing after compaction.
18.**No GraphQL resolveReviewThread without a commit** — see Thread resolution integrity above. This is rule #1 for pr-address work.
19.**Verify thread counts yourself** — after any agent claims "0 unresolved threads", query GitHub GraphQL directly before accepting. Never trust the agent's self-report.
# Wait up to 60s for claude to be fully interactive and idle (❯ visible, no spinner).
if ! _wait_idle "$WINDOW" 60;then
echo"[spawn-agent] WARNING: timed out waiting for idle ❯ prompt on $WINDOW — sending objective anyway" >&2
fi
# Send the task. Split text and Enter — if combined, Enter can fire before the string
# is fully buffered, leaving the message stuck as "[Pasted text +N lines]" unsent.
tmux send-keys -t "$WINDOW""${OBJECTIVE} Output each completed step as CHECKPOINT:<step-name>. When ALL steps are done, output ORCHESTRATOR:DONE on its own line."
sleep 0.3
tmux send-keys -t "$WINDOW" Enter
# Only output the window address — nothing else (callers parse this)
description: Address PR review comments and loop until CI green and all comments resolved. TRIGGER when user asks to address comments, fix PR feedback, respond to reviewers, or babysit/monitor a PR.
user-invocable: true
argument-hint: "[PR number or URL] — if omitted, finds PR for current branch."
metadata:
author: autogpt-team
version: "1.0.0"
---
# PR Address
## Find the PR
```bash
gh pr list --head $(git branch --show-current) --repo Significant-Gravitas/AutoGPT
gh pr view {N}
```
## Read the PR description
Understand the **Why / What / How** before addressing comments — you need context to make good fixes:
```bash
gh pr view {N} --json body --jq '.body'
```
> If GraphQL is rate-limited, `gh pr view` fails. See [GitHub rate limits](#github-rate-limits) for REST fallbacks.
> ⚠️ **WARNING — PAGINATE ALL PAGES BEFORE ADDRESSING ANYTHING**
>
> `reviewThreads(first: 100)` returns at most 100 threads per page AND returns threads **oldest-first**. On a PR with many review cycles (e.g. 373 threads), the oldest 100–200 threads are from past cycles and are **all already resolved**. Filtering client-side with `select(.isResolved == false)` on page 1 therefore yields **0 results** — even though pages 2–4 contain many unresolved threads from recent review cycles.
>
> **This is the most common failure mode:** agent fetches page 1, sees 0 unresolved after filtering, stops pagination, reports "done" — while hundreds of unresolved threads sit on later pages.
>
> One observed PR had 142 total threads: page 1 returned 0 unresolved (all old/resolved), while pages 2–3 had 111 unresolved. Another with 373 threads across 4 pages also had page 1 entirely resolved.
>
> **The rule: ALWAYS paginate to `hasNextPage == false` regardless of the per-page unresolved count. Never stop early because a page returns 0 unresolved.**
**Step 1 — Fetch total count and sanity-check the newest threads:**
```bash
# Get total count and the newest 100 threads (last: 100 returns newest-first)
If `total > 100`, you have multiple pages — you **must** paginate all of them regardless of what `newest_unresolved` shows. The `last: 100` check is a sanity signal only; the full loop below is mandatory.
**Step 2 — Collect all unresolved thread IDs across all pages:**
```bash
# Accumulate all unresolved threads — loop until hasNextPage == false
**Step 3 — Address every thread in `ALL_THREADS`, then resolve.**
Only after this loop completes (all pages fetched, count confirmed) should you begin making fixes.
> **Why reverse?** GraphQL returns threads oldest-first and exposes no `orderBy` option. A PR with 373 threads has ~4 pages; threads from the latest review cycle land on the last pages. Processing in reverse ensures the newest, most blocking comments are addressed first — the earlier pages mostly contain outdated threads from prior cycles.
**Filter to unresolved threads only** — skip any thread where `isResolved: true`. `comments(last: 1)` returns the most recent comment in the thread — act on that; it reflects the reviewer's final ask. Use the thread `id` (Relay global ID) to track threads across polls.
> If GraphQL is rate-limited, see [GitHub rate limits](#github-rate-limits) for the REST fallback (flat comment list — no thread grouping or `isResolved`).
### 2. Top-level reviews — REST (MUST paginate)
```bash
gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/reviews --paginate
```
> **Already REST — unaffected by GraphQL rate limits or outages. Continue polling reviews normally even when GraphQL is exhausted.**
**CRITICAL — always `--paginate`.** Reviews default to 30 per page. PRs can have 80–170+ reviews (mostly empty resolution events). Without pagination you miss reviews past position 30 — including `autogpt-reviewer`'s structured review which is typically posted after several CI runs and sits well beyond the first page.
Two things to extract:
- **Overall state**: look for `CHANGES_REQUESTED` or `APPROVED` reviews.
- **Actionable feedback**: non-empty bodies only. Empty-body reviews are thread-resolution events — they indicate progress but have no feedback to act on.
**Where each reviewer posts:**
-`autogpt-reviewer` — posts detailed structured reviews ("Blockers", "Should Fix", "Nice to Have") as **top-level reviews**. Not present on every PR. Address ALL items.
-`sentry[bot]` — posts bug predictions as **inline threads**. Fix real bugs, explain false positives.
-`coderabbitai[bot]` — posts summaries as **top-level reviews** AND actionable items as **inline threads**. Address actionable items.
- Human reviewers — can post in any source. Address ALL non-empty feedback.
### 3. PR conversation comments — REST
```bash
gh api repos/Significant-Gravitas/AutoGPT/issues/{N}/comments --paginate
```
> **Already REST — unaffected by GraphQL rate limits.**
Mostly contains: bot summaries (`coderabbitai[bot]`), CI/conflict detection (`github-actions[bot]`), and author status updates. Scan for non-empty messages from non-bot human reviewers that aren't the PR author — those are the ones that need a response.
## For each unaddressed comment
**CRITICAL: The only valid sequence is fix → commit → push → reply → resolve. Never resolve a thread without a real code commit.**
Resolving a thread via `resolveReviewThread` without an actual fix is the most common failure mode — it makes unresolved counts drop without any real change, producing a false "done" signal. If the issue was genuinely a false positive (no code change needed), reply explaining why and then resolve. Otherwise:
Address comments **one at a time**: fix → commit → push → inline reply → resolve.
1. Read the referenced code, make the fix (or reply explaining why it's not needed)
2. Commit and push the fix
3. Reply **inline** (not as a new top-level comment) referencing the fixing commit — this is what resolves the conversation for bot reviewers (coderabbitai, sentry):
Use a **markdown commit link** so GitHub renders it as a clickable reference. Always get the full SHA with `git rev-parse HEAD`**after** committing — never copy a SHA from a previous commit or hardcode one:
```bash
FULL_SHA=$(git rev-parse HEAD)
gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/comments/{ID}/replies \
-f body="🤖 Fixed in [${FULL_SHA:0:9}](https://github.com/Significant-Gravitas/AutoGPT/commit/${FULL_SHA}): <description>"
```
| Comment type | How to reply |
|---|---|
| Inline review (`pulls/{N}/comments`) | `gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/comments/{ID}/replies -f body="🤖 Fixed in [abc1234](https://github.com/Significant-Gravitas/AutoGPT/commit/FULL_SHA): <description>"` |
| Conversation (`issues/{N}/comments`) | `gh api repos/Significant-Gravitas/AutoGPT/issues/{N}/comments -f body="🤖 Fixed in [abc1234](https://github.com/Significant-Gravitas/AutoGPT/commit/FULL_SHA): <description>"` |
### What counts as a valid resolution
Only two situations justify calling `resolveReviewThread`:
1.**Real code fix**: you changed the code, committed + pushed, and replied with the SHA. The commit diff must actually address the concern — not just touch the same file.
2.**Genuine false positive**: the reviewer's concern does not apply to this code, and you can give a specific technical reason (e.g. "Not applicable — `sdk_cwd` is pre-validated by `_make_sdk_cwd()` which applies normpath + prefix assertion before reaching this point").
**Anti-patterns that look resolved but aren't — never do these:**
-`"Accepted, tracked as follow-up"` — a deferral, not a fix. The concern is still open. Do not resolve.
-`"Acknowledged"` or `"Same as above"` — these are acknowledgements, not fixes. Do not resolve.
-`"Fixed in abc1234"` where `abc1234` is a commit that doesn't actually change the flagged line/logic — dishonest. Verify `git show abc1234 -- path/to/file` changes the right thing before posting.
- Resolving without replying — the reviewer never sees what happened.
When in doubt: if a code change is needed, make it. A deferred issue means the thread stays open until the follow-up PR is merged.
## Codecov coverage
Codecov patch target is **80%** on changed lines. Checks are **informational** (not blocking) but should be green.
### Running coverage locally
**Backend** (from `autogpt_platform/backend/`):
```bash
poetry run pytest -s -vv --cov=backend --cov-branch --cov-report term-missing
2. For each uncovered file — extract inline logic to `helpers.ts`/`helpers.py` and test those (highest ROI). Colocate tests as `*_test.py` (backend) or `__tests__/*.test.ts` (frontend).
3. Run coverage locally to verify, commit, push.
## Format and commit
After fixing, format the changed code:
- **Backend** (from `autogpt_platform/backend/`): `poetry run format`
Never manually edit files in `src/app/api/__generated__/`.
Then commit and **push immediately** — never batch commits without pushing. Each fix should be visible on GitHub right away so CI can start and reviewers can see progress.
**Never push empty commits** (`git commit --allow-empty`) to re-trigger CI or bot checks. When a check fails, investigate the root cause (unchecked PR checklist, unaddressed review comments, code issues) and fix those directly. Empty commits add noise to git history.
For backend commits in worktrees: `poetry run git commit` (pre-commit hooks).
## Coverage
Codecov enforces patch coverage on new/changed lines — new code you write must be tested. Before pushing, verify you haven't left new lines uncovered:
```bash
cd autogpt_platform/backend
poetry run pytest --cov=. --cov-report=term-missing {path/to/changed/module}
```
Look for lines marked `miss` — those are uncovered. Add tests for any new code you wrote as part of addressing comments.
**Rules:**
- New code you add should have tests
- Don't remove existing tests when fixing comments
- If a reviewer asks you to delete code, also delete its tests, but verify coverage hasn't dropped on remaining lines
## The loop
```text
address comments → format → commit → push
→ wait for CI (while addressing new comments) → fix failures → push
→ re-check comments after CI settles
→ repeat until: all comments addressed AND CI green AND no new comments arriving
```
### Polling for CI + new comments
After pushing, poll for **both** CI status and new comments in a single loop. Do not use `gh pr checks --watch` — it blocks the tool and prevents reacting to new comments while CI is running.
> **Note:** `gh pr checks --watch --fail-fast` is tempting but it blocks the entire Bash tool call, meaning the agent cannot check for or address new comments until CI fully completes. Always poll manually instead.
Parse the results: if every check has `bucket` of `"pass"` or `"skipping"`, CI is green. If any has `"fail"`, CI has failed. Otherwise CI is still pending.
If the result is `"CONFLICTING"`, the PR has a merge conflict — see "Resolving merge conflicts" below. If `"UNKNOWN"`, GitHub is still computing mergeability — wait and re-check next poll.
3. Check for new/changed comments (all three sources):
**Inline threads** — re-run the GraphQL query from "Fetch comments". For each unresolved thread, record `{thread_id, last_comment_databaseId}` as your baseline. On each poll, action is needed if:
- A new thread `id` appears that wasn't in the baseline (new thread), OR
- An existing thread's `last_comment_databaseId` has changed (new reply on existing thread)
**Conversation comments:**
```bash
gh api repos/Significant-Gravitas/AutoGPT/issues/{N}/comments --paginate
```
Compare total count and newest `id` against baseline. Filter to non-empty, non-bot, non-author-update messages.
**Top-level reviews:**
```bash
gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/reviews --paginate
```
Watch for new non-empty reviews (`CHANGES_REQUESTED` or `COMMENTED` with body). Compare total count and newest `id` against baseline.
4. **React in this precedence order (first match wins):**
| Mergeability is `UNKNOWN` | GitHub is still computing mergeability. Sleep 30 seconds, then restart polling from the top. |
| New comments detected | Address them (fix → commit → push → reply). After pushing, re-fetch all comments to update your baseline, then restart this polling loop from the top (new commits invalidate CI status). |
| CI failed (bucket == "fail") | Get failed check links: `gh pr checks {N} --repo Significant-Gravitas/AutoGPT --json bucket,link --jq '.[] \| select(.bucket == "fail") \| .link'`. Extract run ID from link (format: `.../actions/runs/<run-id>/job/...`), read logs with `gh run view <run-id> --repo Significant-Gravitas/AutoGPT --log-failed`. Fix → commit → push → restart polling. |
| CI green + no new comments | **Do not exit immediately.** Bots (coderabbitai, sentry) often post reviews shortly after CI settles. Continue polling for **2 more cycles (60s)** after CI goes green. Only exit after 2 consecutive green+quiet polls. |
| CI pending + no new comments | Sleep 30 seconds, then poll again. |
**The loop ends when:** CI fully green + all comments addressed + **2 consecutive polls with no new comments after CI settled.**
git remote -v # find the remote pointing to Significant-Gravitas/AutoGPT (typically 'upstream' in forks, 'origin' for direct contributors)
```
2. Pull the latest base branch with a 3-way merge:
```bash
git pull {base-remote} {base-branch} --no-rebase
```
3. Resolve conflicting files, then verify no conflict markers remain:
```bash
if grep -R -n -E '^(<<<<<<<|=======|>>>>>>>)' <conflicted-files>; then
echo "Unresolved conflict markers found — resolve before proceeding."
exit 1
fi
```
4. Stage and push:
```bash
git add <conflicted-files>
git commit -m "Resolve merge conflicts with {base-branch}"
git push
```
5. Restart the polling loop from the top — new commits reset CI status.
## GitHub rate limits
Three distinct rate limits exist — they have different causes, error shapes, and recovery times:
| Error | HTTP code | Cause | Recovery |
|---|---|---|---|
| `{"code":"abuse"}` | 403 | Secondary rate limit — too many write operations (comments, mutations) in a short window | Wait **2–3 minutes**. 60s is often not enough. |
| `{"message":"API rate limit exceeded"}` | 429 | Primary REST rate limit — 5000 calls/hr per user | Wait until `X-RateLimit-Reset` header timestamp |
| `GraphQL: API rate limit already exceeded for user ID ...` | 403 on stderr, `gh` exits 1 | **GraphQL-specific** per-user limit — distinct from REST's 5000/hr and from the abuse secondary limit. Trips faster than REST because point costs per query. | Wait until the GraphQL window resets (typically ~1 hour from the first call in the window). REST still works — use fallbacks below. |
**Prevention:** Add `sleep 3` between individual thread reply API calls. When posting >20 replies, increase to `sleep 5`.
### Detection
The `gh` CLI surfaces the GraphQL limit on stderr with the exact string `GraphQL: API rate limit already exceeded for user ID <id>` and exits 1 — any `gh api graphql ...` **or** `gh pr view ...` call fails. Check current quota and reset time via the REST endpoint that reports GraphQL quota (this call is REST and still works whether GraphQL is rate-limited OR fully down):
gh api rate_limit --jq '.resources.graphql.reset' | xargs -I{} date -r {}
```
Retry when `remaining > 0`. If you need to proceed sooner, sleep 2–5 min and probe again — the limit is per user, not per machine, so other concurrent agents under the same token also consume it.
### What keeps working
When GraphQL is unavailable (rate-limited or outage):
- **Keeps working (REST):** top-level reviews fetch, conversation comments fetch, all inline-comment replies, CI status (`gh pr checks`), and the `gh api rate_limit` probe.
- **Degraded:** inline thread list — fall back to flat `/pulls/{N}/comments` REST, which drops thread grouping, `isResolved`, and Relay thread IDs. You still get comment bodies and the `databaseId` as `id`, enough to read and reply.
- **Blocked:** `gh pr view`, the `resolveReviewThread` mutation, and any new `gh api graphql` queries — wait for the quota to reset.
### Fall back to REST
**PR metadata reads** — `gh pr view` uses GraphQL under the hood; use the REST pulls endpoint instead, which returns the full PR object:
```bash
gh api repos/Significant-Gravitas/AutoGPT/pulls/{N} --jq '.body' # == --json body
gh api repos/Significant-Gravitas/AutoGPT/pulls/{N} --jq '.base.ref' # == --json baseRefName
gh api repos/Significant-Gravitas/AutoGPT/pulls/{N} --jq '.mergeable' # == --json mergeable
```
Note: REST `mergeable` returns `true|false|null`; GraphQL returns `MERGEABLE|CONFLICTING|UNKNOWN`. The `null` case maps to `UNKNOWN` — treat it the same (still computing; poll again).
**Inline comments (flat list)** — no thread grouping or `isResolved`, but enough to read and reply:
```bash
gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/comments --paginate \
Use this degraded mode to make progress on the fix → reply loop, then return to GraphQL for `resolveReviewThread` once the rate limit resets.
**Replies** — already REST-native (`/pulls/{N}/comments/{ID}/replies`); no change needed, use the same command as the main flow.
**`resolveReviewThread`** — **no REST equivalent**; GitHub does not expose a REST endpoint for thread resolution. Queue the thread IDs needing resolution, wait for the GraphQL limit to reset, then run the resolve mutations in a batch (with `sleep 3` between calls, per the secondary-limit guidance).
### Recovery from secondary rate limit (403 abuse)
4. If 403 persists after 2 min, wait another 2 min before retrying
Never batch all replies in a tight loop — always space them out.
## Parallel thread resolution
When a PR has more than 10 unresolved threads, addressing one commit per thread is slow. Use this strategy instead:
### Group by file, batch per commit
1. Sort `ALL_THREADS` by `path` — threads in the same file can share a single commit.
2. Fix all threads in one file → `git commit` → `git push` → reply to **all** those threads with the same SHA → resolve them all.
3. Move to the next file group and repeat.
This reduces N commits to (number of files touched), which is usually 3–5 instead of 15–30.
### Posting replies concurrently (for large batches)
For truly independent thread groups (different files, no shared logic), you can post replies in parallel using background subshells — but always space out API writes:
```bash
# Post replies to a batch of threads concurrently, 3s apart
(
sleep 3
gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/comments/{ID1}/replies \
-f body="🤖 Fixed in [${FULL_SHA:0:9}](https://github.com/Significant-Gravitas/AutoGPT/commit/${FULL_SHA}): ..."
) &
(
sleep 6
gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/comments/{ID2}/replies \
-f body="🤖 Fixed in [${FULL_SHA:0:9}](https://github.com/Significant-Gravitas/AutoGPT/commit/${FULL_SHA}): ..."
) &
wait # wait for all background replies before resolving
```
Then resolve sequentially (GraphQL mutations):
```bash
for THREAD_ID in "$THREAD1" "$THREAD2" "$THREAD3"; do
**Always sleep 3s between individual API writes** — GitHub's secondary rate limit (403) triggers on bursts of >20 writes. Increase to `sleep 5` when posting more than 20 replies in a batch.
## Resolving threads via GraphQL
Use `resolveReviewThread` **only after** the commit is pushed and the reply is posted:
**Never call this mutation before committing the fix.** The orchestrator will verify actual unresolved counts via GraphQL after you output `ORCHESTRATOR:DONE` — false resolutions will be caught and you will be re-briefed.
> `resolveReviewThread` is GraphQL-only — no REST equivalent. If GraphQL is rate-limited, see [GitHub rate limits](#github-rate-limits) for the queue-and-retry flow.
### Verify actual count before outputting ORCHESTRATOR:DONE
Before claiming "0 unresolved threads", always query GitHub directly — don't rely on your own bookkeeping. Paginate all pages — a single `first: 100` query misses threads beyond page 1:
description: Create a pull request for the current branch. TRIGGER when user asks to create a PR, open a pull request, push changes for review, or submit work for merging.
user-invocable: true
metadata:
author: autogpt-team
version: "1.0.0"
---
# Create Pull Request
## Steps
1.**Check for existing PR**: `gh pr view --json url -q .url 2>/dev/null` — if a PR already exists, output its URL and stop
description: Alternate /pr-review and /pr-address on a PR until the PR is truly mergeable — no new review findings, zero unresolved inline threads, zero unaddressed top-level reviews or issue comments, all CI checks green, and two consecutive quiet polls after CI settles. Use when the user wants a PR polished to merge-ready without setting a fixed number of rounds.
user-invocable: true
argument-hint: "[PR number or URL] — if omitted, finds PR for current branch."
metadata:
author: autogpt-team
version: "1.0.0"
---
# PR Polish
**Goal.** Drive a PR to merge-ready by alternating `/pr-review` and `/pr-address` until **all** of the following hold:
1. The most recent `/pr-review` produces **zero new findings** (no new inline comments, no new top-level reviews with a non-empty body).
2. Every inline review thread reachable via GraphQL reports `isResolved: true`.
3. Every non-bot, non-author top-level review has been acknowledged (replied-to) OR resolved via a thread it spawned.
4. Every non-bot, non-author issue comment has been acknowledged (replied-to).
5. Every CI check is `conclusion: "success"` or `"skipped"` / `"neutral"` — none `"failure"` or still pending.
6.**Two consecutive post-CI polls** (≥60s apart) stay clean — no new threads, no new non-empty reviews, no new issue comments. Bots (coderabbitai, sentry, autogpt-reviewer) frequently post late after CI settles; a single green snapshot is not sufficient.
**Do not stop at a fixed number of rounds.** If round N introduces new comments, round N+1 is required. Cap at `_MAX_ROUNDS = 10` as a safety valve, but expect 2–5 in practice.
## TodoWrite
Before starting, write two todos so the user can see the loop progression:
-`Round {current}: /pr-review + /pr-address on PR #{N}` — current iteration.
-`Final polish polling: 2 consecutive clean polls, CI green, 0 unresolved` — runs after the last non-empty review round.
Update the `current` round counter at the start of each iteration; mark `completed` only when the round's address step finishes (all new threads addressed + resolved).
## Find the PR
```bash
ARG_PR="${ARG:-}"
# Normalize URL → numeric ID if the skill arg is a pull-request URL.
# Issue comments — count + latest id per non-bot, non-author comment.
# Bots are filtered by User.type == "Bot" (GitHub sets this for app/bot
# accounts like coderabbitai, github-actions, sentry-io). The author is
# filtered by comparing login to the PR author — export it so jq can see it.
AUTHOR=$(gh api "repos/Significant-Gravitas/AutoGPT/pulls/${PR}" --jq '.user.login')
gh api "repos/Significant-Gravitas/AutoGPT/issues/${PR}/comments" --paginate \
--jq --arg author "$AUTHOR"\
'[.[] | select(.user.type != "Bot" and .user.login != $author)
| {id, user: .user.login, created_at}]'\
> /tmp/baseline_issue_comments.json
```
### Diffing after a review
After `/pr-review` runs, any of these counting as "new findings" means another address round is needed:
- New inline thread `id` not in the baseline.
- An existing thread whose latest comment `databaseId` is higher than the baseline's (new reply on an old thread).
- A new top-level review `id` with a non-empty body.
- A new issue comment `id` from a non-bot, non-author user.
If any of the four buckets is non-empty → not done; invoke `/pr-address` and loop.
## Polish polling
Once `/pr-review` produces zero new findings, do **not** exit yet. Bots (coderabbitai, sentry, autogpt-reviewer) commonly post late reviews after CI settles — 30–90 seconds after the final push. Poll at 60-second intervals:
baseline = snapshot_state(PR) # reset — the address loop just dealt with these,
# otherwise they stay "new" relative to the old baseline forever
clean_polls = 0
continue
# 3. Mergeability gate
mergeable = gh api repos/.../pulls/${PR} --jq '.mergeable'
if mergeable == false (CONFLICTING):
resolve_conflicts(PR) # see pr-address skill
clean_polls = 0
continue
if mergeable is null (UNKNOWN):
sleep 60; continue
clean_polls += 1
sleep 60
```
Only after `clean_polls == 2` do you report `ORCHESTRATOR:DONE`.
### Why 2 clean polls, not 1
A single green snapshot can be misleading — the final CI check often completes ~30s before a bot posts its delayed review. One quiet cycle does not prove the PR is stable; two consecutive cycles with no new threads, reviews, or issue comments arriving gives high confidence nothing else is incoming.
### Why checking every source each poll
`/pr-address` polling inside a single round already re-checks its own comments, but `/pr-polish` sits a level above and must also catch:
- New top-level reviews (autogpt-reviewer sometimes posts structured feedback only after several CI green cycles).
- Issue comments from human reviewers (not caught by inline thread polling).
- Sentry bug predictions that land on new line numbers post-push.
- Merge conflicts introduced by a race between your push and a merge to `dev`.
## Invocation pattern
Delegate to existing skills with the `Skill` tool; do not re-implement the review or address logic inline. This keeps the polish loop focused on orchestration and lets the child skills evolve independently.
```python
Skill(skill="pr-review",args=pr_url)
Skill(skill="pr-address",args=pr_url)
```
After each child invocation, re-query GitHub state directly — never trust a summary for the stop condition. The orchestrator's `ORCHESTRATOR:DONE` is verified against actual GraphQL / REST responses per the rules in `pr-address`'s "Verify actual count before outputting ORCHESTRATOR:DONE" section.
### **Auto-continue: do NOT end your response between child skills**
`/pr-polish` is a single orchestration task — one invocation drives the PR all the way to merge-ready. When a child `Skill()` call returns control to you:
- Do NOT summarize and stop.
- Do NOT wait for user confirmation to continue.
- Immediately, in the same response, perform the next loop step: state diff → decide next action → next `Skill()` call or polling sleep.
The child skill returning is a **loop iteration boundary**, not a conversation turn boundary. You are expected to keep going until one of the exit conditions in the opening section is met (2 consecutive clean polls, `_MAX_ROUNDS` hit, or an unrecoverable error).
If the user needs to approve a risky action mid-loop (e.g., a force-push or a destructive git operation), pause there — but not at the routine "round N finished, round N+1 needed" boundary. Those are silent transitions.
## GitHub rate limits
This skill issues many GraphQL calls (one review-thread query per outer iteration plus per-poll queries inside polish polling). Expect the GraphQL budget to be tight on large PRs. When `gh api rate_limit --jq .resources.graphql.remaining` drops below ~200, back off:
- Fall back to REST for reads (flat `/pulls/{N}/comments`, `/pulls/{N}/reviews`, `/issues/{N}/comments`) per the `pr-address` skill's GraphQL-fallback section.
- Queue thread resolutions (GraphQL-only) until the budget resets; keep making progress on fixes + REST replies meanwhile.
-`sleep 5` between any batch of ≥20 writes to avoid secondary rate limits.
## Safety valves
-`_MAX_ROUNDS = 10` — if review+address rounds exceed this, stop and escalate to the user with a summary of what's still unresolved. A PR that cannot converge in 10 rounds has systemic issues that need human judgment.
- After each commit, run `poetry run format` / `pnpm format && pnpm lint && pnpm types` per the target codebase's conventions. A failing format check is CI `failure` that will never self-resolve.
- Every `/pr-review` round checks for **duplicate** concerns first (via `pr-review`'s own "Fetch existing review comments" step) so the loop does not re-post the same finding that a prior round already resolved.
## Reporting
When the skill finishes (either via two clean polls or hitting `_MAX_ROUNDS`), produce a compact summary:
description: Address all open PR review comments systematically. Fetches comments, addresses each one, reacts +1/-1, and replies when clarification is needed. Keeps iterating until all comments are addressed and CI is green. TRIGGER when user shares a PR URL, asks to address review comments, fix PR feedback, or respond to reviewer comments.
description: Review a PR for correctness, security, code quality, and testing issues. TRIGGER when user asks to review a PR, check PR quality, or give feedback on a PR.
user-invocable: true
args: "[PR number or URL] — if omitted, finds PR for current branch."
8.**Re-fetch comments** immediately — address any new unreacted ones before waiting on CI
9.**Stay productive while CI runs** — don't idle. In priority order:
- Run any pending local tests (`poetry run pytest`, e2e, etc.) and fix failures
- Address any remaining comments
- Only poll `gh pr checks {N}` as the last resort when there's truly nothing left to do
10.**If CI fails** — fix, go back to step 6
11.**Re-fetch comments again** after CI is green — address anything that appeared while CI was running
12.**Done** only when: all comments reacted AND CI is green.
```bash
gh pr list --head $(git branch --show-current) --repo Significant-Gravitas/AutoGPT
gh pr view {N}
```
## CRITICAL: Do Not Stop
## Read the PR description
**Loop is: address → format → commit → push → re-check comments → run local tests → wait CI → re-check comments → repeat.**
Before reading code, understand the **why**, **what**, and **how** from the PR description:
Never idle. If CI is running and you have nothing to address, run local tests. Waiting on CI is the last resort.
```bash
gh pr view {N} --json body --jq '.body'
```
## Rules
Every PR should have a Why / What / How structure. If any of these are missing, note it as feedback.
- One todo per comment
- For inline review comments: reply on existing threads. For PR conversation comments: post a new issue comment (API doesn't support threaded replies)
- React to every comment: +1 addressed, -1 disagreed (with explanation)
## Read the diff
```bash
gh pr diff {N}
```
## Fetch existing review comments
Before posting anything, fetch existing inline comments to avoid duplicates:
```bash
gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/comments --paginate
gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/reviews
```
## What to check
**Description quality:** Does the PR description cover Why (motivation/problem), What (summary of changes), and How (approach/implementation details)? If any are missing, request them — you can't judge the approach without understanding the problem and intent.
**Security:** input validation at boundaries, no injection (command, XSS, SQL), secrets not logged, file paths sanitized (`os.path.basename()` in error messages).
**Code quality:** apply rules from backend/frontend CLAUDE.md files.
**Architecture:** DRY, single responsibility, modular functions. `Security()` vs `Depends()` for FastAPI auth. `data:` for SSE events, `: comment` for heartbeats. `transaction=True` for Redis pipelines.
**Testing:** edge cases covered, colocated `*_test.py` (backend) / `__tests__/` (frontend), mocks target where symbol is **used** not defined, `AsyncMock` for async.
## Output format
Every comment **must** be prefixed with `🤖` and a criticality badge:
| Tier | Badge | Meaning |
|---|---|---|
| Blocker | `🔴 **Blocker**` | Must fix before merge |
| Should Fix | `🟠 **Should Fix**` | Important improvement |
| Nice to Have | `🟡 **Nice to Have**` | Minor suggestion |
| Nit | `🔵 **Nit**` | Style / wording |
Example: `🤖 🔴 **Blocker**: Missing error handling for X — suggest wrapping in try/except.`
## Post inline comments
For each finding, post an inline comment on the PR (do not just write a local report):
```bash
# Get the latest commit SHA for the PR
COMMIT_SHA=$(gh api repos/Significant-Gravitas/AutoGPT/pulls/{N} --jq '.head.sha')
# Post an inline comment on a specific file/line
gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/comments \
description: Initialize a worktree-based repo layout for parallel development. Creates a main worktree, a reviews worktree for PR reviews, and N numbered work branches. Handles .env creation, dependency installation, and branchlet config. TRIGGER when user asks to set up the repo from scratch, initialize worktrees, bootstrap their dev environment, "setup repo", "setup worktrees", "initialize dev environment", "set up branches", or when a freshly cloned repo has no sibling worktrees.
user-invocable: true
args: "No arguments — interactive setup via prompts."
metadata:
author: autogpt-team
version: "1.0.0"
---
# Repository Setup
This skill sets up a worktree-based development layout from a freshly cloned repo. It creates:
- A **main** worktree (the primary checkout)
- A **reviews** worktree (for PR reviews)
- **N work branches** (branch1..branchN) for parallel development
## Step 1: Identify the repo
Determine the repo root and parent directory:
```bash
ROOT=$(git rev-parse --show-toplevel)
REPO_NAME=$(basename "$ROOT")
PARENT=$(dirname "$ROOT")
```
Detect if the repo is already inside a worktree layout by counting sibling worktrees (not just checking the directory name, which could be anything):
```bash
# Count worktrees that are siblings (live under $PARENT but aren't $ROOT itself)
**Do NOT assume .env files exist.** For each worktree (including main if needed):
1. Check if `.env` exists in the source worktree for each path
2. If `.env` exists, copy it
3. If only `.env.default` or `.env.example` exists, copy that as `.env`
4. If neither exists, warn the user and list which env files are missing
Env file locations to check (same as the `/worktree` skill — keep these in sync):
-`autogpt_platform/.env`
-`autogpt_platform/backend/.env`
-`autogpt_platform/frontend/.env`
> **Note:** This env copying logic intentionally mirrors the `/worktree` skill's approach. If you update the path list or fallback logic here, update `/worktree` as well.
```bash
SOURCE="$ROOT"
WORKTREES="reviews"
for i in $(seq 1"$COUNT");doWORKTREES="$WORKTREES branch$i";done
FOUND_ANY_ENV=0
for wt in $WORKTREES;do
TARGET="$PARENT/$wt"
for envpath in autogpt_platform autogpt_platform/backend autogpt_platform/frontend;do
description: Set up a new git worktree for parallel development. Creates the worktree, copies .env files, installs dependencies, generates Prisma client, and optionally starts the app (with port conflict resolution) or runs tests. TRIGGER when user asks to set up a worktree, work on a branch in isolation, or needs a separate environment for a branch or PR.
user-invocable: true
metadata:
author: autogpt-team
version: "1.0.0"
---
# Worktree Setup
## Preferred: Use Branchlet
The repo has a `.branchlet.json` config — it handles env file copying, dependency installation, and Prisma generation automatically.
description: Set up a new git worktree for parallel development. Creates the worktree, copies .env files, installs dependencies, and generates Prisma client. TRIGGER when user asks to set up a worktree, work on a branch in isolation, or needs a separate environment for a branch or PR.
user-invocable: true
args: "[name] — optional worktree name (e.g., 'AutoGPT7'). If omitted, uses next available AutoGPT<N>."
metadata:
author: autogpt-team
version: "3.0.0"
---
# Worktree Setup
## Create the worktree
Derive paths from the git toplevel. If a name is provided as argument, use it. Otherwise, check `git worktree list` and pick the next `AutoGPT<N>`.
```bash
ROOT=$(git rev-parse --show-toplevel)
PARENT=$(dirname "$ROOT")
# From an existing branch
git worktree add "$PARENT/<NAME>" <branch-name>
# From a new branch off dev
git worktree add -b <new-branch> "$PARENT/<NAME>" dev
```
## Copy environment files
Copy `.env` from the root worktree. Falls back to `.env.default` if `.env` doesn't exist.
```bash
ROOT=$(git rev-parse --show-toplevel)
TARGET="$(dirname "$ROOT")/<NAME>"
for envpath in autogpt_platform/backend autogpt_platform/frontend autogpt_platform;do
description: "Analyze the current branch diff against dev, plan integration tests for changed frontend pages/components, and write them. TRIGGER when user asks to write frontend tests, add test coverage, or 'write tests for my changes'."
user-invocable: true
args: "[base branch] — defaults to dev. Optionally pass a specific base branch to diff against."
metadata:
author: autogpt-team
version: "1.0.0"
---
# Write Frontend Tests
Analyze the current branch's frontend changes, plan integration tests, and write them.
## References
Before writing any tests, read the testing rules and conventions:
git diff "$BASE_BRANCH"...HEAD -- src/ | head -500
```
## Step 2: Categorize changes and find test targets
For each changed file, determine:
1.**Is it a page?** (`page.tsx`) — these are the primary test targets
2.**Is it a hook?** (`use*.ts`) — test via the page/component that uses it; avoid direct `renderHook()` tests unless it is a shared reusable hook with standalone business logic
3.**Is it a component?** (`.tsx` in `components/`) — test via the parent page unless it's complex enough to warrant isolation
4.**Is it a helper?** (`helpers.ts`, `utils.ts`) — unit test directly if pure logic
**Priority order:**
1. Pages with new/changed data fetching or user interactions
2. Components with complex internal logic (modals, forms, wizards)
3. Shared hooks with standalone business logic when UI-level coverage is impractical
set +e # Ignore non-zero exit codes and continue execution
echo "Running the following command: poetry run agbenchmark --maintain --mock"
poetry run agbenchmark --maintain --mock
EXIT_CODE=$?
set -e # Stop ignoring non-zero exit codes
# Check if the exit code was 5, and if so, exit with 0 instead
if [ $EXIT_CODE -eq 5 ]; then
echo "regression_tests.json is empty."
fi
echo "Running the following command: poetry run agbenchmark --mock"
poetry run agbenchmark --mock
echo "Running the following command: poetry run agbenchmark --mock --category=data"
poetry run agbenchmark --mock --category=data
echo "Running the following command: poetry run agbenchmark --mock --category=coding"
poetry run agbenchmark --mock --category=coding
# echo "Running the following command: poetry run agbenchmark --test=WriteFile"
# poetry run agbenchmark --test=WriteFile
cd ../benchmark
poetry install
echo "Adding the BUILD_SKILL_TREE environment variable. This will attempt to add new elements in the skill tree. If new elements are added, the CI fails because they should have been pushed"
gh issue comment $PR_NUMBER --body "You changed AutoGPT's behaviour on ${{ runner.os }}. The cassettes have been updated and will be merged to the submodule when this Pull Request gets merged."
This guide provides context for Codex when updating the **autogpt_platform** folder.
This guide provides context for coding agents when updating the **autogpt_platform** folder.
## Directory overview
@@ -30,7 +30,7 @@ See `/frontend/CONTRIBUTING.md` for complete patterns. Quick reference:
- Regenerate with `pnpm generate:api`
- Pattern: `use{Method}{Version}{OperationName}`
4.**Styling**: Tailwind CSS only, use design tokens, Phosphor Icons only
5.**Testing**: Add Storybook stories for new components, Playwright for E2E
5.**Testing**: Integration tests (Vitest + RTL + MSW) are the default (~90%, page-level). Playwright for E2E critical flows. Storybook for design system components. See `autogpt_platform/frontend/TESTING.md`
6.**Code conventions**: Function declarations (not arrow functions) for components/handlers
- Component props should be `interface Props { ... }` (not exported) unless the interface needs to be used outside the component
@@ -47,7 +47,9 @@ See `/frontend/CONTRIBUTING.md` for complete patterns. Quick reference:
## Testing
- Backend: `poetry run test` (runs pytest with a docker based postgres + prisma).
- Frontend: `pnpm test` or `pnpm test-ui` for Playwright tests. See `docs/content/platform/contributing/tests.md` for tips.
@@ -83,13 +83,13 @@ The AutoGPT frontend is where users interact with our powerful AI automation pla
**Agent Builder:** For those who want to customize, our intuitive, low-code interface allows you to design and configure your own AI agents.
**Workflow Management:** Build, modify, and optimize your automation workflows with ease. You build your agent by connecting blocks, where each block performs a single action.
**Workflow Management:** Build, modify, and optimize your automation workflows with ease. You build your agent by connecting blocks, where each block performs a single action.
**Deployment Controls:** Manage the lifecycle of your agents, from testing to production.
**Ready-to-Use Agents:** Don't want to build? Simply select from our library of pre-configured agents and put them to work immediately.
**Agent Interaction:** Whether you've built your own or are using pre-configured agents, easily run and interact with them through our user-friendly interface.
**Agent Interaction:** Whether you've built your own or are using pre-configured agents, easily run and interact with them through our user-friendly interface.
**Monitoring and Analytics:** Keep track of your agents' performance and gain insights to continually improve your automation processes.
1.`.env.default` files provide base configuration (tracked in git)
2.`.env` files provide user-specific overrides (gitignored)
3. Docker Compose `environment:` sections provide service-specific overrides
4. Shell environment variables have highest precedence
#### Key Points
- All services use hardcoded defaults in docker-compose files (no `${VARIABLE}` substitutions)
- The `env_file` directive loads variables INTO containers at runtime
- Backend/Frontend services use YAML anchors for consistent configuration
- Supabase services (`db/docker/docker-compose.yml`) follow the same pattern
### Branching Strategy
- **`dev`** is the main development branch. All PRs should target `dev`.
- **`master`** is the production branch. Only used for production releases.
### Creating Pull Requests
- Create the PR against the `dev` branch of the repository.
- **Split PRs by concern** — each PR should have a single clear purpose. For example, "usage tracking" and "credit charging" should be separate PRs even if related. Combining multiple concerns makes it harder for reviewers to understand what belongs to what.
- Ensure the branch name is descriptive (e.g., `feature/add-new-block`)
- Use conventional commit messages (see below)
- **Structure the PR description with Why / What / How** — Why: the motivation (what problem it solves, what's broken/missing without it); What: high-level summary of changes; How: approach, key implementation details, or architecture decisions. Reviewers need all three to judge whether the approach fits the problem.
- Fill out the .github/PULL_REQUEST_TEMPLATE.md template as the PR description
- Always use `--body-file` to pass PR body — avoids shell interpretation of backticks and special characters:
```bash
PR_BODY=$(mktemp)
cat > "$PR_BODY" << 'PREOF'
## Summary
- use `backticks` freely here
PREOF
gh pr create --title "..." --body-file "$PR_BODY" --base dev
rm "$PR_BODY"
```
- Run the github pre-commit hooks to ensure code quality.
### Test-Driven Development (TDD)
When fixing a bug or adding a feature, follow a test-first approach:
1. **Write a failing test first** — create a test that reproduces the bug or validates the new behavior, marked with `@pytest.mark.xfail` (backend) or `.fixme` (Playwright). Run it to confirm it fails for the right reason.
2. **Implement the fix/feature** — write the minimal code to make the test pass.
3. **Remove the xfail marker** — once the test passes, remove the `xfail`/`.fixme` annotation and run the full test suite to confirm nothing else broke.
This ensures every change is covered by a test and that the test actually validates the intended behavior.
### Reviewing/Revising Pull Requests
Use `/pr-review` to review a PR or `/pr-address` to address comments.
When fetching comments manually:
- `gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/reviews --paginate` — top-level reviews
- `gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/comments --paginate` — inline review comments (always paginate to avoid missing comments beyond page 1)
- `gh api repos/Significant-Gravitas/AutoGPT/issues/{N}/comments` — PR conversation comments
### Conventional Commits
Use this format for commit messages and Pull Request titles:
**Conventional Commit Types:**
- `feat`: Introduces a new feature to the codebase
- `fix`: Patches a bug in the codebase
- `refactor`: Code change that neither fixes a bug nor adds a feature; also applies to removing features
- `ci`: Changes to CI configuration
- `docs`: Documentation-only changes
- `dx`: Improvements to the developer experience
**Recommended Base Scopes:**
- `platform`: Changes affecting both frontend and backend
- `frontend`
- `backend`
- `infra`
- `blocks`: Modifications/additions of individual blocks
**Subscope Examples:**
- `backend/executor`
- `backend/db`
- `frontend/builder` (includes changes to the block UI component)
- `infra/prod`
Use these scopes and subscopes for clarity and consistency in commit messages.
1.`.env.default` files provide base configuration (tracked in git)
2.`.env` files provide user-specific overrides (gitignored)
3. Docker Compose `environment:` sections provide service-specific overrides
4. Shell environment variables have highest precedence
#### Key Points
- All services use hardcoded defaults in docker-compose files (no `${VARIABLE}` substitutions)
- The `env_file` directive loads variables INTO containers at runtime
- Backend/Frontend services use YAML anchors for consistent configuration
- Supabase services (`db/docker/docker-compose.yml`) follow the same pattern
### Branching Strategy
- **`dev`** is the main development branch. All PRs should target `dev`.
- **`master`** is the production branch. Only used for production releases.
### Creating Pull Requests
- Create the PR against the `dev` branch of the repository.
- Ensure the branch name is descriptive (e.g., `feature/add-new-block`)
- Use conventional commit messages (see below)
- Fill out the .github/PULL_REQUEST_TEMPLATE.md template as the PR description
- Run the github pre-commit hooks to ensure code quality.
### Reviewing/Revising Pull Requests
- When the user runs /pr-comments or tries to fetch them, also run gh api /repos/Significant-Gravitas/AutoGPT/pulls/[issuenum]/reviews to get the reviews
- Use gh api /repos/Significant-Gravitas/AutoGPT/pulls/[issuenum]/reviews/[review_id]/comments to get the review contents
- Use gh api /repos/Significant-Gravitas/AutoGPT/issues/9924/comments to get the pr specific comments
### Conventional Commits
Use this format for commit messages and Pull Request titles:
**Conventional Commit Types:**
-`feat`: Introduces a new feature to the codebase
-`fix`: Patches a bug in the codebase
-`refactor`: Code change that neither fixes a bug nor adds a feature; also applies to removing features
-`ci`: Changes to CI configuration
-`docs`: Documentation-only changes
-`dx`: Improvements to the developer experience
**Recommended Base Scopes:**
-`platform`: Changes affecting both frontend and backend
-`frontend`
-`backend`
-`infra`
-`blocks`: Modifications/additions of individual blocks
**Subscope Examples:**
-`backend/executor`
-`backend/db`
-`frontend/builder` (includes changes to the block UI component)
-`infra/prod`
Use these scopes and subscopes for clarity and consistency in commit messages.
This file provides guidance to coding agents when working with the backend.
## Essential Commands
To run something with Python package dependencies you MUST use `poetry run ...`.
```bash
# Install dependencies
poetry install
# Run database migrations
poetry run prisma migrate dev
# Start all services (database, redis, rabbitmq, clamav)
docker compose up -d
# Run the backend as a whole
poetry run app
# Run tests
poetry run test
# Run specific test
poetry run pytest path/to/test_file.py::test_function_name
# Run block tests (tests that validate all blocks work correctly)
poetry run pytest backend/blocks/test/test_block.py -xvs
# Run tests for a specific block (e.g., GetCurrentTimeBlock)
poetry run pytest 'backend/blocks/test/test_block.py::test_available_blocks[GetCurrentTimeBlock]' -xvs
# Lint and format
# prefer format if you want to just "fix" it and only get the errors that can't be autofixed
poetry run format # Black + isort
poetry run lint # ruff
```
More details can be found in @TESTING.md
### Creating/Updating Snapshots
When you first write a test or when the expected output changes:
```bash
poetry run pytest path/to/test.py --snapshot-update
```
⚠️ **Important**: Always review snapshot changes before committing! Use `git diff` to verify the changes are expected.
## Architecture
- **API Layer**: FastAPI with REST and WebSocket endpoints
- **Database**: PostgreSQL with Prisma ORM, includes pgvector for embeddings
- **Queue System**: RabbitMQ for async task processing
- **Execution Engine**: Separate executor service processes agent workflows
- **Authentication**: JWT-based with Supabase integration
- **Security**: Cache protection middleware prevents sensitive data caching in browsers/proxies
## Code Style
- **Top-level imports only** — no local/inner imports (lazy imports only for heavy optional deps like `openpyxl`)
- **Absolute imports** — use `from backend.module import ...` for cross-package imports. Single-dot relative (`from .sibling import ...`) is acceptable for sibling modules within the same package (e.g., blocks). Avoid double-dot relative imports (`from ..parent import ...`) — use the absolute path instead
- **No duck typing** — no `hasattr`/`getattr`/`isinstance` for type dispatch; use typed interfaces/unions/protocols
- **Pydantic models** over dataclass/namedtuple/dict for structured data
- **No linter suppressors** — no `# type: ignore`, `# noqa`, `# pyright: ignore`; fix the type/code
- **List comprehensions** over manual loop-and-append
- **Early return** — guard clauses first, avoid deep nesting
- **f-strings vs printf syntax in log statements** — Use `%s` for deferred interpolation in `debug` statements, f-strings elsewhere for readability: `logger.debug("Processing %s items", count)`, `logger.info(f"Processing {count} items")`
- **Sanitize error paths** — `os.path.basename()` in error messages to avoid leaking directory structure
- **TOCTOU awareness** — avoid check-then-act patterns for file access and credit charging
- **`Security()` vs `Depends()`** — use `Security()` for auth deps to get proper OpenAPI security spec
- **Redis pipelines** — `transaction=True` for atomicity on multi-step operations
- **`max(0, value)` guards** — for computed values that should never be negative
- **SSE protocol** — `data:` lines for frontend-parsed events (must match Zod schema), `: comment` lines for heartbeats/status
- **File length** — keep files under ~300 lines; if a file grows beyond this, split by responsibility (e.g. extract helpers, models, or a sub-module into a new file). Never keep appending to a long file.
- **Function length** — keep functions under ~40 lines; extract named helpers when a function grows longer. Long functions are a sign of mixed concerns, not complexity.
- **Top-down ordering** — define the main/public function or class first, then the helpers it uses below. A reader should encounter high-level logic before implementation details.
## Testing Approach
- Uses pytest with snapshot testing for API responses
- Test files are colocated with source files (`*_test.py`)
- Mock at boundaries — mock where the symbol is **used**, not where it's **defined**
- After refactoring, update mock targets to match new module paths
- Use `AsyncMock` for async functions (`from unittest.mock import AsyncMock`)
### Test-Driven Development (TDD)
When fixing a bug or adding a feature, write the test **before** the implementation:
```python
# 1. Write a failing test marked xfail
@pytest.mark.xfail(reason="Bug #1234: widget crashes on empty input")
deftest_widget_handles_empty_input():
result=widget.process("")
assertresult==Widget.EMPTY_RESULT
# 2. Run it — confirm it fails (XFAIL)
# poetry run pytest path/to/test.py::test_widget_handles_empty_input -xvs
# 3. Implement the fix
# 4. Remove xfail, run again — confirm it passes
deftest_widget_handles_empty_input():
result=widget.process("")
assertresult==Widget.EMPTY_RESULT
```
This catches regressions and proves the fix actually works. **Every bug fix should include a test that would have caught it.**
## Database Schema
Key models (defined in `schema.prisma`):
-`User`: Authentication and profile data
-`AgentGraph`: Workflow definitions with version control
-`AgentGraphExecution`: Execution history and results
-`AgentNode`: Individual nodes in a workflow
-`StoreListing`: Marketplace listings for sharing agents
Follow the comprehensive [Block SDK Guide](@../../docs/platform/block-sdk-guide.md) which covers:
- Provider configuration with `ProviderBuilder`
- Block schema definition
- Authentication (API keys, OAuth, webhooks)
- Testing and validation
- File organization
Quick steps:
1. Create new file in `backend/blocks/`
2. Configure provider using `ProviderBuilder` in `_config.py`
3. Inherit from `Block` base class
4. Define input/output schemas using `BlockSchema`
5. Implement async `run` method
6. Generate unique block ID using `uuid.uuid4()`
7. Test with `poetry run pytest backend/blocks/test/test_block.py`
Note: when making many new blocks analyze the interfaces for each of these blocks and picture if they would go well together in a graph-based editor or would they struggle to connect productively?
ex: do the inputs and outputs tie well together?
If you get any pushback or hit complex block conditions check the new_blocks guide in the docs.
#### Handling files in blocks with `store_media_file()`
When blocks need to work with files (images, videos, documents), use `store_media_file()` from `backend.util.file`. The `return_format` parameter determines what you get back:
| Format | Use When | Returns |
|--------|----------|---------|
| `"for_local_processing"` | Processing with local tools (ffmpeg, MoviePy, PIL) | Local file path (e.g., `"image.png"`) |
| `"for_external_api"` | Sending content to external APIs (Replicate, OpenAI) | Data URI (e.g., `"data:image/png;base64,..."`) |
| `"for_block_output"` | Returning output from your block | Smart: `workspace://` in CoPilot, data URI in graphs |
**Examples:**
```python
# INPUT: Need to process file locally with ffmpeg
local_path=awaitstore_media_file(
file=input_data.video,
execution_context=execution_context,
return_format="for_local_processing",
)
# local_path = "video.mp4" - use with Path/ffmpeg/etc
# INPUT: Need to send to external API like Replicate
image_b64=awaitstore_media_file(
file=input_data.image,
execution_context=execution_context,
return_format="for_external_api",
)
# image_b64 = "data:image/png;base64,iVBORw0..." - send to API
# OUTPUT: Returning result from block
result_url=awaitstore_media_file(
file=generated_image_url,
execution_context=execution_context,
return_format="for_block_output",
)
yield"image_url",result_url
# In CoPilot: result_url = "workspace://abc123"
# In graphs: result_url = "data:image/png;base64,..."
```
**Key points:**
-`for_block_output` is the ONLY format that auto-adapts to execution context
- Always use `for_block_output` for block outputs unless you have a specific reason not to
- Never hardcode workspace checks - let `for_block_output` handle it
### Modifying the API
1. Update route in `backend/api/features/`
2. Add/update Pydantic models in same directory
3. Write tests alongside the route file
4. Run `poetry run test` to verify
## Workspace & Media Files
**Read [Workspace & Media Architecture](../../docs/platform/workspace-media-architecture.md) when:**
- Working on CoPilot file upload/download features
- Building blocks that handle `MediaFileType` inputs/outputs
- Modifying `WorkspaceManager` or `store_media_file()`
- Debugging file persistence or virus scanning issues
Covers: `WorkspaceManager` (persistent storage with session scoping), `store_media_file()` (media normalization pipeline), and responsibility boundaries for virus scanning and persistence.
## Security Implementation
### Cache Protection Middleware
- Located in `backend/api/middleware/security.py`
- Default behavior: Disables caching for ALL endpoints with `Cache-Control: no-store, no-cache, must-revalidate, private`
- Uses an allow list approach - only explicitly permitted paths can be cached
- Cacheable paths include: static assets (`static/*`, `_next/static/*`), health checks, public store pages, documentation
- Prevents sensitive data (auth tokens, API keys, user data) from being cached by browsers/proxies
- To allow caching for a new endpoint, add it to `CACHEABLE_PATHS` in the middleware
- Applied to both main API server and external API applications
Follow the comprehensive [Block SDK Guide](@../../docs/content/platform/block-sdk-guide.md) which covers:
- Provider configuration with `ProviderBuilder`
- Block schema definition
- Authentication (API keys, OAuth, webhooks)
- Testing and validation
- File organization
Quick steps:
1. Create new file in `backend/blocks/`
2. Configure provider using `ProviderBuilder` in `_config.py`
3. Inherit from `Block` base class
4. Define input/output schemas using `BlockSchema`
5. Implement async `run` method
6. Generate unique block ID using `uuid.uuid4()`
7. Test with `poetry run pytest backend/blocks/test/test_block.py`
Note: when making many new blocks analyze the interfaces for each of these blocks and picture if they would go well together in a graph-based editor or would they struggle to connect productively?
ex: do the inputs and outputs tie well together?
If you get any pushback or hit complex block conditions check the new_blocks guide in the docs.
#### Handling files in blocks with `store_media_file()`
When blocks need to work with files (images, videos, documents), use `store_media_file()` from `backend.util.file`. The `return_format` parameter determines what you get back:
| Format | Use When | Returns |
|--------|----------|---------|
| `"for_local_processing"` | Processing with local tools (ffmpeg, MoviePy, PIL) | Local file path (e.g., `"image.png"`) |
| `"for_external_api"` | Sending content to external APIs (Replicate, OpenAI) | Data URI (e.g., `"data:image/png;base64,..."`) |
| `"for_block_output"` | Returning output from your block | Smart: `workspace://` in CoPilot, data URI in graphs |
**Examples:**
```python
# INPUT: Need to process file locally with ffmpeg
local_path=awaitstore_media_file(
file=input_data.video,
execution_context=execution_context,
return_format="for_local_processing",
)
# local_path = "video.mp4" - use with Path/ffmpeg/etc
# INPUT: Need to send to external API like Replicate
image_b64=awaitstore_media_file(
file=input_data.image,
execution_context=execution_context,
return_format="for_external_api",
)
# image_b64 = "data:image/png;base64,iVBORw0..." - send to API
# OUTPUT: Returning result from block
result_url=awaitstore_media_file(
file=generated_image_url,
execution_context=execution_context,
return_format="for_block_output",
)
yield"image_url",result_url
# In CoPilot: result_url = "workspace://abc123"
# In graphs: result_url = "data:image/png;base64,..."
```
**Key points:**
-`for_block_output` is the ONLY format that auto-adapts to execution context
- Always use `for_block_output` for block outputs unless you have a specific reason not to
- Never hardcode workspace checks - let `for_block_output` handle it
### Modifying the API
1. Update route in `backend/api/features/`
2. Add/update Pydantic models in same directory
3. Write tests alongside the route file
4. Run `poetry run test` to verify
## Security Implementation
### Cache Protection Middleware
- Located in `backend/api/middleware/security.py`
- Default behavior: Disables caching for ALL endpoints with `Cache-Control: no-store, no-cache, must-revalidate, private`
- Uses an allow list approach - only explicitly permitted paths can be cached
- Cacheable paths include: static assets (`static/*`, `_next/static/*`), health checks, public store pages, documentation
- Prevents sensitive data (auth tokens, API keys, user data) from being cached by browsers/proxies
- To allow caching for a new endpoint, add it to `CACHEABLE_PATHS` in the middleware
- Applied to both main API server and external API applications
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.