The cancel endpoint runs in the AgentServer process while the asyncio
auto-approve task lives in the CoPilotExecutor process — separate memory.
The in-process dict cancel from the previous commit was a no-op across
processes.
- cancel_auto_approve now SETs a Redis key with TTL as the primary cancel
signal, plus best-effort in-process task.cancel() for single-worker.
- _run_auto_approve checks the Redis key before firing. If set, skips.
- Tests stub get_redis_async with a fake to avoid real Redis connections.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Blocker fix: the server-side auto-approve timer fired even when the user
was editing steps via Modify, potentially building an agent against a plan
the user had explicitly chosen to change.
- backend: change _auto_approve_tasks set → _pending_auto_approvals dict
keyed by session_id. Add cancel_auto_approve(session_id) that looks up
and cancels the pending asyncio task.
- backend: new POST /sessions/{id}/cancel-auto-approve endpoint in
chat/routes.py, following the existing cancel_session_task pattern.
- frontend: handleModify() now fires postV2CancelAutoApproveTask
(generated hook) as a best-effort cancel before entering edit mode.
- helpers.tsx: import DecompositionStepModel from generated API types
instead of hand-rolling the interface. TaskDecompositionOutput stays
hand-rolled (runtime shape differs from generated type for created_at).
- Add session_id to TaskDecompositionOutput so the cancel call has it.
- Default step.status to "pending" where the generated type is optional.
- 2 new tests: cancel_auto_approve cancels pending task + returns false
for unknown session.
- Regenerate openapi.json with the new endpoint.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The created_at field was added to TaskDecompositionResponse a few commits
back but openapi.json was never regenerated, so the check-api-types CI
job (which re-exports the schema and asserts no diff) was failing.
Re-exported via poetry run export-api-schema and formatted with prettier.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pre-existing formatting issue inherited from the dev merge — black wants
one blank line between TestUsdToMicrodollars and TestMaskEmail, not two.
This is unrelated to the decomposition feature but blocks CI lint.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If the user reopened the tab between 60s and 90s after a decomposition
was created, the lazy initializer for ``secondsLeft`` would return 0
(server-stamped deadline already elapsed). The auto-approve useEffect
fires whenever ``secondsLeft === 0``, so it would silently send the
"Approved" message on mount with no user interaction — even if the user
came back specifically to click Modify.
Track in a ref whether the lazy init returned 0 because the deadline
had already passed (vs. 0 because the timer counted down from a
positive value), and skip the auto-approve in that case. The server's
own fallback timer (running 30s longer than the client) handles the
"user never returns" path, so the client doesn't need to silently fire
on mount. The user can still click Approve or Modify manually; the
server will inject its own approval at 90s if neither happens.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The auto-approve task was firing a duplicate "Approved" message after the
agent had already been built manually. The predicate compared
ChatMessage.sequence against a baseline, but _save_session_to_db assigns
sequences in the DB without writing them back to the in-memory message
objects, and cache_chat_session writes those (sequence=None) objects to
Redis. So the predicate's loaded-from-cache view had None sequences for
freshly-appended messages, treated them as 0, and missed the user's
"Approved" entirely — leaving the timer to fire after the build had
already completed and re-injecting "Approved" for a duplicate turn.
Fix: capture len(session.messages) at schedule time and check for any
user-role message at index >= baseline. Indices are monotonic and require
no DB-side sequence bookkeeping.
Adds a regression test that constructs a session with sequence=None on
the user message, asserting the predicate detects it.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After the build plan box appears, the assistant continues streaming a
short summary text. Clicking Approve or Modify in that 1-2s window failed
because the chat session is locked to the in-flight turn — sending a new
user message gets rejected.
- ChatMessagesContainer now forwards isCurrentlyStreaming through
renderSegments → MessagePartRenderer → DecomposeGoalTool.
- DecomposeGoalTool computes actionsEnabled = showActions && !streaming
and uses it to (a) disable the Approve, Modify, and timer buttons and
(b) gate the auto-approve effect so the timer can hit 0 mid-stream
without firing — the effect re-runs and approves once streaming ends.
- The countdown ring keeps ticking during streaming so it stays in sync
with the server-side timer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## Why
The platform cost tracking system had several gaps that made the admin
dashboard less accurate and harder to reason about:
**Q: Do we have per-model granularity on the provider page?**
The `model` column was stored in `PlatformCostLog` but the SQL
aggregation grouped only by `(provider, tracking_type)`, so all models
for a given provider collapsed into one row. Now grouped by `(provider,
tracking_type, model)` — each model gets its own row.
**Q: Why does Anthropic show `per_run` for OrchestratorBlock?**
Bug: `OrchestratorBlock._call_llm()` was building `NodeExecutionStats`
with only `input_token_count` and `output_token_count` — it dropped
`resp.provider_cost` entirely. For OpenRouter calls this silently
discarded the `cost_usd`. For the SDK (autopilot) path,
`ResultMessage.total_cost_usd` was never read. When `provider_cost` is
None and token counts are 0 (e.g. SDK error path), `resolve_tracking`
falls through to `per_run`. Fixed by propagating all cost/cache fields.
**Q: Why can't we get `cost_usd` for Anthropic direct API calls?**
The Anthropic Messages API does not return a dollar amount — only token
counts. OpenRouter returns cost via response headers, so it uses
`cost_usd` directly. The Claude Agent SDK *does* compute
`total_cost_usd` internally, so SDK-mode OrchestratorBlock runs now get
`cost_usd` tracking. For direct Anthropic LLM blocks the estimate uses
per-token rates (see cache section below).
**Q: What about labeling by source (autopilot vs block)?**
Already tracked: `block_name` stores `copilot:SDK`, `copilot:Baseline`,
or the actual block name. Visible in the raw logs table. Not added to
the provider group-by (would explode row count); use the logs table
filter instead.
**Q: Is there double-counting between `tokens`, `per_run`, and
`cost_usd`?**
No. `resolve_tracking()` uses a strict preference hierarchy — exactly
one tracking type per execution: `cost_usd` > `tokens` > provider
heuristics > `per_run`. A single execution produces exactly one
`PlatformCostLog` row.
**Q: Should we track Anthropic prompt cache tokens (PR #12725)?**
Yes — PR #12725 adds `cache_control` markers to Anthropic API calls,
which causes the API to return `cache_read_input_tokens` and
`cache_creation_input_tokens` alongside regular `input_tokens`. These
have different billing rates:
- Cache reads: **10%** of base input rate (much cheaper)
- Cache writes: **125%** of base input rate (slightly more expensive,
one-time)
- Uncached input: **100%** of base rate
Without tracking them separately, a flat-rate estimate on
`total_input_tokens` would be wrong in both directions.
## What
- **Per-model provider table**: SQL now groups by `(provider,
tracking_type, model)`. `ProviderCostSummary` and the frontend
`ProviderTable` show a model column.
- **Cache token columns**: New `cacheReadTokens` and
`cacheCreationTokens` columns in `PlatformCostLog` with matching
migration.
- **LLM block cache tracking**: `LLMResponse` captures
`cache_read_input_tokens` / `cache_creation_input_tokens` from Anthropic
responses. `NodeExecutionStats` gains `cache_read_token_count` /
`cache_creation_token_count`. Both propagate to `PlatformCostEntry` and
the DB.
- **Copilot path**: `token_tracking.persist_and_record_usage` now writes
cache tokens as dedicated `PlatformCostEntry` fields (was
metadata-only).
- **OrchestratorBlock bug fix**: `_call_llm()` now includes
`resp.provider_cost`, `resp.cache_read_tokens`,
`resp.cache_creation_tokens` in the stats merge. SDK path captures
`ResultMessage.total_cost_usd` as `provider_cost`.
- **Accurate cost estimation**: `estimateCostForRow` uses
token-type-specific rates for `tokens` rows (uncached=100%, reads=10%,
writes=125% of configured base rate).
## How
`resolve_tracking` priority is unchanged. For Anthropic LLM blocks the
tracking type remains `tokens` (Anthropic API returns no dollar amount).
For OrchestratorBlock in SDK/autopilot mode it now correctly uses
`cost_usd` because the Claude Agent SDK computes and returns
`total_cost_usd`. For OpenRouter through OrchestratorBlock it now
correctly uses `cost_usd` (was silently dropped before).
## Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `ProviderCostSummary` SQL updated
- [x] Cache token fields present in `PlatformCostEntry` and
`PlatformCostLogCreateInput`
- [x] Prisma client regenerated — all type checks pass
- [x] Frontend `helpers.test.ts` updated for new `rateKey` format
- [x] Pre-commit hooks pass (Black, Ruff, isort, tsc, Prisma generate)
Reopening a session was restarting the client countdown from a fresh 60s,
even though the server had been counting the whole time. Now the timer
reflects real elapsed time so the user sees the actual remaining seconds
(or 0, which auto-approves immediately).
- backend: stamp UTC created_at on TaskDecompositionResponse via a default
factory. The timestamp is set when the tool returns and persisted in the
message content JSON, so it survives DB round-trips.
- frontend: lazy-init secondsLeft from (auto_approve_seconds -
(Date.now() - created_at)), clamped to [0, total]. Older messages
without created_at fall back to a fresh full countdown (existing
behaviour).
- Test: assert created_at is stamped within the duration of _execute().
Note: openapi.json regen is skipped in this commit because the existing
REST server is in use; the frontend reads tool output as opaque JSON via
custom helpers, so the regen is not required for the feature to work.
Regen later for completeness.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
### Why / What / How
**Why:** The `ask_question` copilot tool previously only accepted a
single question per invocation. When the LLM needs to ask multiple
clarifying questions simultaneously, it either crams them into one text
field (requiring users to format numbered answers manually) or makes
multiple sequential tool calls (slow and disruptive UX).
**What:** Replace the single `question`/`options`/`keyword` parameters
with a `questions` array parameter so the LLM can ask multiple questions
in one tool call, each rendered as its own input box.
**How:** Simplified the tool to accept only `questions` (array of
question objects). Each item has `question` (required), `options`, and
`keyword`. The frontend `ClarificationQuestionsCard` already supports
rendering multiple questions — no frontend changes needed.
### Changes 🏗️
- `backend/copilot/tools/ask_question.py`: Replaced dual
question/questions schema with single `questions` array. Extracted
parsing into module-level `_parse_questions` and `_parse_one` helpers.
Follows backend code style: early returns, list comprehensions, top-down
ordering, functions under 40 lines.
- `backend/copilot/tools/ask_question_test.py`: Rewritten with 18
focused tests covering happy paths, keyword handling, options filtering,
and invalid input handling.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Run `poetry run pytest backend/copilot/tools/ask_question_test.py`
— all tests pass
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The decompose_goal countdown was purely client-side: if the user closed the
tab before the timer ran out, the agent never got built. Add a server-side
timer that fires the same approval message even when no client is connected.
- backend/copilot/model.py: add append_message_if helper that appends a
message inside the session lock only if a predicate is satisfied. Used
by the auto-approve task to no-op when the user has already acted.
- backend/copilot/tools/decompose_goal.py: when the tool returns, schedule
a fire-and-forget asyncio task (same _background_tasks pattern as
agent_browser.py) that sleeps 90s, re-checks the session, and if no user
message has appeared since, appends "Approved. Please build the agent."
and enqueues a new copilot turn. Stays in process; restart-resilience
is a documented follow-up.
- backend/copilot/tools/models.py: expose auto_approve_seconds on
TaskDecompositionResponse so the frontend countdown is sourced from the
backend instead of a hard-coded constant.
- frontend DecomposeGoal.tsx: seed secondsLeft from output.auto_approve_seconds
with a 60s fallback for older sessions.
- Regenerate openapi.json with the new field.
- Tests: 9 new unit tests covering the predicate, the auto-approve flow
(idle / user-acted / errors swallowed) and _schedule_auto_approve.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Import get_subscription_price_id in v1.py
- get_subscription_status now calls stripe.Price.retrieve for PRO/BUSINESS
tiers to return actual unit_amount instead of hardcoded zeros
- UI will now show correct monthly costs when LD price IDs are configured
- Fix Button import from __legacy__ to design system in SubscriptionTierSection
- Update subscription status tests to mock the new Stripe price lookup
Drop the dual question/questions schema in favor of a single
`questions` array parameter. This removes ~175 lines of complexity
(the _execute_single path, duplicate params, precedence logic).
Restructured per backend code style rules:
- Top-down ordering: public _execute first, helpers below
- Early return with guard clauses, no deep nesting
- List comprehensions via walrus operator in _parse_questions
- Helpers extracted as module-level functions (not methods)
- Functions under 40 lines each
The frontend ClarificationQuestionsCard already renders arrays of
any length — no UI changes needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add isinstance narrowing in test_execute_multiple_questions_ignores_single_params
to fix Pyright type-check CI failure (reportAttributeAccessIssue).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests that access `result.questions` without first narrowing the type
from `ToolResponseBase` to `ClarificationNeededResponse` cause Pyright
type-check failures. Added `assert isinstance(result,
ClarificationNeededResponse)` before accessing `.questions` in 4 tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove setInitialPrompt() from handleModify() — the inline editor is the
sole editing UX; pre-filling the chat input simultaneously creates a
conflicting interface where chat-input submission loses inline edits
- Add useEffect to reset isEditing when showActions goes false (new message
arrives while editing), preventing users from being stuck in edit mode with
no way to submit
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The ToolName Literal must stay in sync with TOOL_REGISTRY keys. Adds
'decompose_goal' to the platform tools section to fix CI test failures.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The API schema was missing DecompositionStepModel and TaskDecompositionResponse
after the merge. Regenerated with export-api-schema and formatted with prettier.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Merge upstream dev changes (Graphiti memory responses) alongside the
TaskDecompositionResponse added in this PR.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Prevent simultaneous pending + error state when output-error has null payload:
isPending is now false when isError is true
- Filter out steps with empty descriptions before building the approval
message, preventing malformed input from reaching the LLM
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add showActions to the auto-approve useEffect dependency array and
condition. This prevents the approval from firing after isLastMessage
becomes false (e.g. when a new message arrives just as the timer
expires), closing the race condition flagged by Sentry.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add TaskDecompositionResponse to ToolResponseUnion for OpenAPI codegen
- Remove LLM-controllable require_approval param (hardcoded to True)
- Validate each step is a dict before calling .get()
- Validate step descriptions are non-empty
- Validate action values against allowlist, coerce unknown to DEFAULT_ACTION
- Align MAX_STEPS=8 with agent_generation_guide.md (was 10)
- Add DEFAULT_ACTION constant; use enum in schema
- Add model_validator to sync step_count with len(steps)
- Fix handleModify: pre-fill chat input via setInitialPrompt instead of sending a dangling message

- Add approvedRef guard on handleModify to prevent double-clicks
- Fix eslint-disable: rewrite auto-approve effect without dependency suppression
- Fix hardcoded light-mode colors (bg-white, border-slate-200, text-zinc-800) → semantic tokens
- Fix error card: render ToolErrorCard whenever isError=true, not only when output is present
- Fix hint text: only show approve hint when requires_approval=true
- Remove dead `action` prop from StepItem
- Add aria-label to all StepStatusIcon states
- Tighten parseOutput type guards (Array.isArray check, no false positives)
- Rename isOperating → isPending for clarity
- Add backend unit tests for DecomposeGoalTool (16 cases)
- Add frontend unit tests for helpers.tsx (20 cases)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace <input type="text"> with <textarea> for step descriptions
- Use ref callback to set height from scrollHeight on every render so
long descriptions wrap to multiple lines by default without interaction
- Bump countdown ring container from 20px to 24px and text from 9px to
11px for better legibility
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace static Approve/Modify buttons with a 99s countdown timer that
auto-approves when it expires
- Timer ring animates inline within "Starting in [N]s" text using SVG
strokeDasharray; hover on the text swaps it to "Start now" via Tailwind
named groups (group/label)
- Clicking Modify stops the timer, enters editable mode where steps can be
renamed, deleted, or inserted between existing steps
- In edit mode only Approve is shown; timer and Modify are hidden
- showActions gated on isLastMessage (server-derived) so the timer never
re-appears when returning to a session with prior messages
- Forward isLastMessage through ChatMessagesContainer → MessagePartRenderer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When navigating back to a cached session, appliedActionKeys was reset to empty
but messages were preserved. This caused previously applied actions to reappear
as unapplied in the UI, allowing them to be re-applied and creating duplicate
undo entries. Clearing messages unconditionally on navigation ensures the
displayed action buttons always reflect the actual applied state.
- Restore top-level `required: ["question"]` in schema for LLM tool-
calling compatibility; validation handles the questions-only path
- Fix keyword null bug: `item.get("keyword")` returning None now
correctly falls back to `question-{idx}` instead of producing "None"
- Filter empty-string options in _build_question (`str(o).strip()`)
to avoid artifacts like "Email, , Slack"
- Revert session type hint to `ChatSession` to match base class contract
- Add tests for null keyword and empty-string options filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove top-level `required: ["question"]` from schema so the
`questions`-only calling convention is valid for schema-compliant LLMs
- Move logger assignment below all imports (PEP 8 / isort)
- Remove duplicated option filtering in `_execute_single`; let
`_build_question` own that responsibility
- Fix `session` type hint to `ChatSession | None` to match the guard
- Add test for `questions` as non-list type (falls back to single path)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix falsy option filtering: use `if o is not None` instead of `if o`
so valid values like "0" are preserved
- Improve multi-question `message` field: join all questions with ";"
instead of only using the first question's text
- Add logging warnings for skipped invalid items in multi-question path
instead of silently dropping them
- Simplify schema: use `"required": ["question"]` instead of empty
required + anyOf (more LLM-friendly)
- Add missing test cases: session=None, single-item questions array,
duplicate keywords, falsy option values
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ask_question tool previously only accepted a single question per
invocation, forcing the LLM to cram multiple queries into one text box
or make multiple sequential tool calls. This adds a `questions` parameter
(list of question objects) so multiple input fields render at once.
Backward-compatible: the existing `question`/`options`/`keyword` params
still work. When `questions` (plural) is provided, they take precedence.
The frontend ClarificationQuestionsCard already supports rendering
multiple questions — no frontend changes needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests for GET/POST /credits/subscription covering:
- GET returns current tier (PRO, FREE default when None)
- POST FREE skips Stripe when payment disabled
- POST PRO sets tier directly for beta users (payment disabled)
- POST paid tier rejects missing success_url/cancel_url with 422
- POST paid tier creates Stripe Checkout Session and returns URL
- POST FREE with payment enabled cancels active Stripe subscription
- Remove useCallback from changeTier (not needed per project guidelines)
- Block self-service tier changes for ENTERPRISE users (admin-managed)
- Preserve current tier on unrecognized Stripe price_id instead of
defaulting to FREE (prevents accidental downgrades during price migration)
Tests for:
- Unknown/mismatched Stripe price_id defaults to FREE (not early return)
- None from LaunchDarkly price flags defaults to FREE
- BUSINESS tier mapping
- StripeError during cancel_stripe_subscription is logged, not raised
When sync_subscription_from_stripe encounters an unrecognized price_id
(e.g. LD flags unconfigured or price changed), it no longer returns early
leaving the user on a stale tier. Instead it defaults to FREE and logs a
warning, keeping the DB state consistent with Stripe's subscription status.
Also guard against None pro_price/biz_price from LaunchDarkly before
comparison to avoid silent mismatches.
EditAgentTool and RunAgentTool call useCopilotChatActions() which throws
if no provider is in the tree. Wrap the panel content with
CopilotChatActionsProvider wired to sendRawMessage so tool components
can send retry prompts without crashing.
The user message was saved to DB before the <user_context> prefix was added
to session.messages. Subsequent upsert_chat_session calls only append new
messages (slicing by existing_message_count), so the prefixed content was
never written to the DB. On page reload or --resume, the unprefixed version
was loaded, losing personalisation.
Fix: add update_message_content_by_sequence to db.py and call it after
injecting the prefix in both sdk/service.py and baseline/service.py.
Add customer.subscription.created to the sync handler so user tier is
upgraded immediately when the subscription is first created (not just on
subsequent updates/deletions).