- Add "do NOT redirect to the Builder for credential setup" guardrail to
run_agent description, making it symmetric with create_agent/edit_agent
- Scrub error message text from race-path warning logs; log only node IDs
and field names to avoid leaking credential IDs/provider details
- Add code comment explaining the None-vs-filtering trade-off in
_build_setup_requirements_from_validation_error
- Add E2E tests for structural-error fallback on both run and schedule
paths (verifies ErrorResponse returned, not setup_requirements card)
Previously this case was covered only accidentally by the
CRED_ERR_INVALID_PREFIX prefix rule. Adding an explicit exact-match entry
makes the intent clear and prevents breakage if the prefix constant ever
changes.
- Add credential-routing guardrail to create_agent and edit_agent tool
descriptions so it's always visible to the LLM (not just in the guide)
- Change issubclass to identity check for GraphValidationError in
_handle_call_method_response to avoid fragile subclass dispatch
- Add new_callable=AsyncMock for consistency in execution race test
- Move orjson and RemoteCallError imports to module level in service_test
- Add Literal type annotation to _CREDENTIAL_ERROR_MARKERS match mode
Two related issues on the GraphValidationError-over-RPC path:
1. RemoteCallError.extras was typed as Optional[dict[str, Any]], which
lost all validation guarantees at the wire-format boundary. Today
the field only carries node_errors (a known-JSON-safe nested dict),
but Any left the door open for a future exception to stuff a
non-serializable payload in — blowing up inside FastAPI's JSON
encoder at handler time instead of at test time. Introduce a typed
RemoteCallExtras Pydantic model with an explicit
node_errors: Optional[dict[str, dict[str, str]]] field and update
both the server-side packer and client-side unpacker to use it.
Non-serializable sneak-ins now fail at model validation instead of
inside JSONResponse.
2. The previous tests only exercised the client-side deserialisation
with a mocked wire payload — there was no symmetric test that the
server handler actually packs node_errors into extras. If someone
accidentally dropped the isinstance(exc, GraphValidationError)
branch in _handle_internal_http_error, the client tests would still
pass because they forged the payload directly. Add
test_graph_validation_error_server_handler_packs_node_errors that
invokes the server handler with a real GraphValidationError and
validates the resulting JSONResponse body, plus
test_graph_validation_error_round_trips_through_handlers that wires
the server handler output through the client handler end to end —
either side drifting now fails this test.
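The typed extras model described above can be sketched as follows; `RemoteCallExtras` and `node_errors` match the names in the text, but the field default and the surrounding packer/unpacker wiring are assumptions:

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class RemoteCallExtras(BaseModel):
    # Structured exception attributes carried over RPC beyond exc.args.
    # The typed field means a non-JSON-safe payload fails at model
    # validation time rather than inside FastAPI's JSON encoder.
    node_errors: Optional[dict[str, dict[str, str]]] = None


# A well-formed payload validates cleanly...
extras = RemoteCallExtras(node_errors={"node-1": {"credentials": "missing"}})

# ...while a non-serializable sneak-in is rejected up front.
rejected = False
try:
    RemoteCallExtras(node_errors={"node-1": {"credentials": object()}})
except ValidationError:
    rejected = True
```

The payoff is exactly the failure-mode shift the commit describes: a bad value blows up at validation (and therefore at test time), not inside `JSONResponse` at handler time.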
is_credential_validation_error_message was matched against hand-typed
strings that had to be kept in sync with the four raise sites inside
_validate_node_input_credentials by convention only. Adding a new
credential error string would have silently regressed the copilot's
credential-race fallback to a plain text error, with nothing to catch
the drift at test time.
Extract the error message templates to module-level constants
(CRED_ERR_REQUIRED, CRED_ERR_INVALID_PREFIX, CRED_ERR_NOT_AVAILABLE_PREFIX,
CRED_ERR_UNKNOWN_PREFIX, CRED_ERR_INVALID_TYPE_MISMATCH) used from
both the raise sites and the matcher, with a _CREDENTIAL_ERROR_MARKERS
table that classifies each as exact-match or prefix-match. Add a
parity test that asserts each constant-emitted message is recognised
by the public matcher (including typical runtime suffixes from the
f-strings), plus a case-insensitive check and a negative test. Now
adding a new credential error means adding a constant, and the
test_credential_error_markers_cover_all_raise_sites test fails if the
matcher drifts.
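A minimal sketch of the marker-table pattern; the constant values and the subset of markers shown here are illustrative assumptions, not the real templates:

```python
from typing import Literal

# Hypothetical template values; the real constants live next to the raise sites.
CRED_ERR_REQUIRED = "Credentials are required for this node"
CRED_ERR_INVALID_PREFIX = "Invalid credentials: "
CRED_ERR_NOT_AVAILABLE_PREFIX = "Credentials not available: "

# Each entry classifies one template as exact-match or prefix-match,
# so the matcher and the raise sites share a single source of truth.
_CREDENTIAL_ERROR_MARKERS: list[tuple[str, Literal["exact", "prefix"]]] = [
    (CRED_ERR_REQUIRED, "exact"),
    (CRED_ERR_INVALID_PREFIX, "prefix"),
    (CRED_ERR_NOT_AVAILABLE_PREFIX, "prefix"),
]


def is_credential_validation_error_message(message: str) -> bool:
    msg = message.strip().lower()
    for marker, mode in _CREDENTIAL_ERROR_MARKERS:
        m = marker.lower()
        if mode == "exact" and msg == m:
            return True
        if mode == "prefix" and msg.startswith(m):
            return True
    return False
```

Adding a new credential error then means adding one constant plus one marker row; a parity test over the table catches a raise site that emits a message the matcher no longer recognises.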
Both _run_agent and _schedule_agent silently swallowed the
GraphValidationError that triggers the credential-race fallback and
returned the inline setup card with no log. That left the race invisible
to oncall — every recovered request looked identical to a cold-cache
first attempt, and credential-drift rates could not be monitored.
Add a logger.warning at both race catch sites that captures user_id,
graph_id, and the raw node_errors so SRE can track how often the
prereq check drifts from the executor/scheduler re-validation. Keeps
the user-facing behaviour unchanged — still returns the inline card —
but makes the race observable.
When GraphValidationError contains a mix of credential and structural errors
(e.g., missing inputs), the previous `any()` check would surface the credentials
setup card and silently discard the structural errors. The user would fix
credentials, re-run, and only then see the structural error — a confusing
two-step failure.
Change the guard from `any(...)` to `all(...)` so the credentials setup card
is only shown when every error in node_errors is credential-related. Mixed
errors fall through to the plain ErrorResponse path.
Adds a regression test for the mixed-error case.
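The guard change can be illustrated with a small stand-in classifier; the `node_errors` shape mirrors the text, while `_is_credential_error` is a hypothetical placeholder for the real marker matcher:

```python
def _is_credential_error(message: str) -> bool:
    # Hypothetical classifier standing in for the real matcher.
    return message.lower().startswith("credentials")


def should_show_setup_card(node_errors: dict[str, dict[str, str]]) -> bool:
    messages = [m for fields in node_errors.values() for m in fields.values()]
    # all(): show the setup card only when *every* error is credential-related.
    # The previous any() surfaced the card even when structural errors were
    # mixed in, silently discarding them.
    return bool(messages) and all(_is_credential_error(m) for m in messages)


cred_only = {"node-1": {"credentials": "Credentials are required"}}
mixed = {
    "node-1": {"credentials": "Credentials are required"},
    "node-2": {"name": "Missing required input: name"},
}
```

With `all(...)`, the `mixed` case falls through to the plain `ErrorResponse` path, so the user sees the structural error on the first attempt.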
- _build_setup_requirements_from_validation_error now passes None to
  build_missing_credentials_from_graph so all credential fields show up
  as missing_credentials in the race scenario (previously matched
  credentials became invalid between the prereq check and the executor call).
- Add test_run_agent_execution_credential_race_returns_setup_card to
mirror the existing schedule-path race test for the run path.
- Add credential-setup guardrail text to create_agent and edit_agent
tool descriptions so the LLM doesn't call them for credential setup.
Address CodeRabbit feedback on the new GraphValidationError round-trip
tests: the file already has a ``_SupportsGetReturn`` Protocol pattern
and coding guidelines forbid ``# type: ignore`` suppressors.
Add ``_SupportsHandleCallMethodResponse`` Protocol alongside the
existing one, cast the test client to it, and drop the two
``# type: ignore[attr-defined]`` suppressions added for the new tests.
Pre-existing suppressions on older tests are left untouched (separate
concern, out of scope for this PR).
``GraphValidationError`` carries a structured ``node_errors`` mapping
in addition to its top-level message, but RPC serialisation only
copied ``exc.args`` — so by the time the exception reached the
copilot client (for the schedule path, which goes via
``get_scheduler_client``), ``node_errors`` was ``{}``.
That broke the credential-race fallback introduced in the earlier
commits in this PR: the helper saw an empty ``node_errors``,
concluded it wasn't a credential error, and fell back to the generic
``ErrorResponse`` — exactly the symptom this PR is trying to fix.
Fix:
- Add an optional ``extras`` dict to ``RemoteCallError`` for
exception types that carry structured attributes beyond ``args``.
- In ``_handle_internal_http_error``, when the exception is a
``GraphValidationError``, pack ``node_errors`` into ``extras``.
- In ``_handle_call_method_response``, when reconstructing a
``GraphValidationError``, read ``extras.node_errors`` and pass it
to the constructor.
- Add two tests for the round-trip (with and without extras), the
latter to ensure backwards compatibility with old server responses.
Address review feedback on the credential-race setup card:
- Thread `graph_credentials` through
`_build_setup_requirements_from_validation_error` so
`missing_credentials` only lists fields the user hasn't connected
yet. Previously it was computed with `None`, so the inline card
showed every connected credential as missing during a race —
the opposite of the UX fix.
- Promote the credential-error-string matcher to a public helper
`backend.executor.utils.is_credential_validation_error_message` and
reuse it from both the dry-run path in
`_construct_starting_node_execution_input` and from the copilot
tool, so adding a new credential error string only requires touching
one file.
- Make the setup-card message action-neutral ("try again" instead of
"try scheduling again") — the helper is shared by the run and
schedule paths and the previous wording misled run-path users.
- Extend tests: cover the shared helper, add a test that verifies
connected credentials are filtered out of `missing_credentials`,
and assert the message is path-neutral.
Per review feedback: spamming the "do NOT use this tool to set up
credentials" warning across `create_agent` / `edit_agent` / `run_agent`
descriptions duplicates guidance that belongs in one place.
Move the "use run_agent / connect_integration, never the Builder" rule
into the existing **Credentials** key rule in `agent_generation_guide.md`
(loaded on demand by `get_agent_building_guide`), and revert the three
tool descriptions to their previous wording.
Black would reformat this file (extra blank line between two test
classes), which fails the `lint` CI job for any PR branched from the
current dev. Tiny drive-by fix to keep this PR's CI green.
The credential gate in `_check_prerequisites` only fires before the
scheduler/executor call. If credentials are deleted (or otherwise drift)
between the prereq check and the actual call, the scheduler raises a
generic `GraphValidationError` and the user sees a plain error string —
in the worst case the LLM falls back to `create_agent`/`edit_agent`,
which return an `AgentSavedResponse` linking the user to the Builder.
Catch `GraphValidationError` in `_run_agent` and `_schedule_agent`,
detect credential-flavoured node errors, and rebuild the inline
`SetupRequirementsResponse` so the credential setup card always appears
without the user leaving the chat.
Also updates the `run_agent` / `create_agent` / `edit_agent` tool
descriptions to explicitly tell the LLM to use `run_agent` (or
`connect_integration`) for credential setup — never the Builder.
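A condensed sketch of the catch-and-rebuild flow; every class and helper here is a simplified stand-in for the real copilot types, not their actual definitions:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


class GraphValidationError(Exception):
    def __init__(self, message: str, node_errors: Optional[dict] = None):
        super().__init__(message)
        self.node_errors = node_errors or {}


@dataclass
class SetupRequirementsResponse:
    missing_credentials: list[str] = field(default_factory=list)


@dataclass
class ErrorResponse:
    message: str


def _is_credential_error(msg: str) -> bool:
    return msg.lower().startswith("credentials")  # placeholder matcher


def run_agent(execute: Callable[[], object]) -> object:
    try:
        return execute()
    except GraphValidationError as exc:
        msgs = [m for fields in exc.node_errors.values() for m in fields.values()]
        if msgs and all(_is_credential_error(m) for m in msgs):
            # Credential race: rebuild the inline setup card in chat
            # instead of surfacing a plain error string.
            return SetupRequirementsResponse(missing_credentials=sorted(exc.node_errors))
        return ErrorResponse(str(exc))
```

The same wrapper shape applies to both `_run_agent` and `_schedule_agent`: the user stays in the chat and gets the setup card rather than being bounced to the Builder.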
## Why
The platform cost tracking system had several gaps that made the admin
dashboard less accurate and harder to reason about:
**Q: Do we have per-model granularity on the provider page?**
The `model` column was stored in `PlatformCostLog` but the SQL
aggregation grouped only by `(provider, tracking_type)`, so all models
for a given provider collapsed into one row. Now grouped by `(provider,
tracking_type, model)` — each model gets its own row.
**Q: Why does Anthropic show `per_run` for OrchestratorBlock?**
Bug: `OrchestratorBlock._call_llm()` was building `NodeExecutionStats`
with only `input_token_count` and `output_token_count` — it dropped
`resp.provider_cost` entirely. For OpenRouter calls this silently
discarded the `cost_usd`. For the SDK (autopilot) path,
`ResultMessage.total_cost_usd` was never read. When `provider_cost` is
None and token counts are 0 (e.g. SDK error path), `resolve_tracking`
falls through to `per_run`. Fixed by propagating all cost/cache fields.
**Q: Why can't we get `cost_usd` for Anthropic direct API calls?**
The Anthropic Messages API does not return a dollar amount — only token
counts. OpenRouter returns cost via response headers, so it uses
`cost_usd` directly. The Claude Agent SDK *does* compute
`total_cost_usd` internally, so SDK-mode OrchestratorBlock runs now get
`cost_usd` tracking. For direct Anthropic LLM blocks the estimate uses
per-token rates (see cache section below).
**Q: What about labeling by source (autopilot vs block)?**
Already tracked: `block_name` stores `copilot:SDK`, `copilot:Baseline`,
or the actual block name. Visible in the raw logs table. Not added to
the provider group-by (would explode row count); use the logs table
filter instead.
**Q: Is there double-counting between `tokens`, `per_run`, and
`cost_usd`?**
No. `resolve_tracking()` uses a strict preference hierarchy — exactly
one tracking type per execution: `cost_usd` > `tokens` > provider
heuristics > `per_run`. A single execution produces exactly one
`PlatformCostLog` row.
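The hierarchy can be sketched as a simple resolver; the signature is an assumption and the provider-heuristics step is elided:

```python
from typing import Optional


def resolve_tracking(
    cost_usd: Optional[float], input_tokens: int, output_tokens: int
) -> str:
    # Strict preference order, so each execution yields exactly one
    # tracking type (and one PlatformCostLog row).
    if cost_usd is not None:
        return "cost_usd"
    if input_tokens or output_tokens:
        return "tokens"
    # e.g. the SDK error path: no provider cost, zero token counts.
    return "per_run"
```

This also explains the `per_run` symptom in the OrchestratorBlock bug above: with `provider_cost` dropped and token counts at zero, the resolver had nothing left but the fallback.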
**Q: Should we track Anthropic prompt cache tokens (PR #12725)?**
Yes — PR #12725 adds `cache_control` markers to Anthropic API calls,
which causes the API to return `cache_read_input_tokens` and
`cache_creation_input_tokens` alongside regular `input_tokens`. These
have different billing rates:
- Cache reads: **10%** of base input rate (much cheaper)
- Cache writes: **125%** of base input rate (slightly more expensive,
one-time)
- Uncached input: **100%** of base rate
Without tracking them separately, a flat-rate estimate on
`total_input_tokens` would be wrong in both directions.
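With those rates, a token-type-aware estimate looks roughly like this; the function name and rate arguments are illustrative (the real `estimateCostForRow` is frontend code):

```python
def estimate_token_cost(
    total_input: int,
    cache_read: int,
    cache_creation: int,
    output: int,
    input_rate: float,   # $ per input token at the base rate
    output_rate: float,  # $ per output token
) -> float:
    uncached = total_input - cache_read - cache_creation
    return (
        uncached * input_rate                  # uncached input: 100% of base
        + cache_read * input_rate * 0.10       # cache reads: 10% of base
        + cache_creation * input_rate * 1.25   # cache writes: 125% of base
        + output * output_rate
    )
```

For a heavily cached request (say 1,000 input tokens of which 800 are cache reads and 100 are cache writes), the cache-aware estimate is well under half of what a flat rate on `total_input_tokens` would report.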
## What
- **Per-model provider table**: SQL now groups by `(provider,
tracking_type, model)`. `ProviderCostSummary` and the frontend
`ProviderTable` show a model column.
- **Cache token columns**: New `cacheReadTokens` and
`cacheCreationTokens` columns in `PlatformCostLog` with matching
migration.
- **LLM block cache tracking**: `LLMResponse` captures
`cache_read_input_tokens` / `cache_creation_input_tokens` from Anthropic
responses. `NodeExecutionStats` gains `cache_read_token_count` /
`cache_creation_token_count`. Both propagate to `PlatformCostEntry` and
the DB.
- **Copilot path**: `token_tracking.persist_and_record_usage` now writes
cache tokens as dedicated `PlatformCostEntry` fields (was
metadata-only).
- **OrchestratorBlock bug fix**: `_call_llm()` now includes
`resp.provider_cost`, `resp.cache_read_tokens`,
`resp.cache_creation_tokens` in the stats merge. SDK path captures
`ResultMessage.total_cost_usd` as `provider_cost`.
- **Accurate cost estimation**: `estimateCostForRow` uses
token-type-specific rates for `tokens` rows (uncached=100%, reads=10%,
writes=125% of configured base rate).
## How
`resolve_tracking` priority is unchanged. For Anthropic LLM blocks the
tracking type remains `tokens` (Anthropic API returns no dollar amount).
For OrchestratorBlock in SDK/autopilot mode it now correctly uses
`cost_usd` because the Claude Agent SDK computes and returns
`total_cost_usd`. For OpenRouter through OrchestratorBlock it now
correctly uses `cost_usd` (was silently dropped before).
## Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `ProviderCostSummary` SQL updated
- [x] Cache token fields present in `PlatformCostEntry` and
`PlatformCostLogCreateInput`
- [x] Prisma client regenerated — all type checks pass
- [x] Frontend `helpers.test.ts` updated for new `rateKey` format
- [x] Pre-commit hooks pass (Black, Ruff, isort, tsc, Prisma generate)
### Why / What / How
**Why:** The `ask_question` copilot tool previously only accepted a
single question per invocation. When the LLM needs to ask multiple
clarifying questions simultaneously, it either crams them into one text
field (requiring users to format numbered answers manually) or makes
multiple sequential tool calls (slow and disruptive UX).
**What:** Replace the single `question`/`options`/`keyword` parameters
with a `questions` array parameter so the LLM can ask multiple questions
in one tool call, each rendered as its own input box.
**How:** Simplified the tool to accept only `questions` (array of
question objects). Each item has `question` (required), `options`, and
`keyword`. The frontend `ClarificationQuestionsCard` already supports
rendering multiple questions — no frontend changes needed.
### Changes 🏗️
- `backend/copilot/tools/ask_question.py`: Replaced dual
question/questions schema with single `questions` array. Extracted
parsing into module-level `_parse_questions` and `_parse_one` helpers.
Follows backend code style: early returns, list comprehensions, top-down
ordering, functions under 40 lines.
- `backend/copilot/tools/ask_question_test.py`: Rewritten with 18
focused tests covering happy paths, keyword handling, options filtering,
and invalid input handling.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Run `poetry run pytest backend/copilot/tools/ask_question_test.py`
— all tests pass
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Import get_subscription_price_id in v1.py
- get_subscription_status now calls stripe.Price.retrieve for PRO/BUSINESS
tiers to return actual unit_amount instead of hardcoded zeros
- UI will now show correct monthly costs when LD price IDs are configured
- Fix Button import from __legacy__ to design system in SubscriptionTierSection
- Update subscription status tests to mock the new Stripe price lookup
Drop the dual question/questions schema in favor of a single
`questions` array parameter. This removes ~175 lines of complexity
(the _execute_single path, duplicate params, precedence logic).
Restructured per backend code style rules:
- Top-down ordering: public _execute first, helpers below
- Early return with guard clauses, no deep nesting
- List comprehensions via walrus operator in _parse_questions
- Helpers extracted as module-level functions (not methods)
- Functions under 40 lines each
The frontend ClarificationQuestionsCard already renders arrays of
any length — no UI changes needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add isinstance narrowing in test_execute_multiple_questions_ignores_single_params
to fix Pyright type-check CI failure (reportAttributeAccessIssue).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests that access `result.questions` without first narrowing the type
from `ToolResponseBase` to `ClarificationNeededResponse` cause Pyright
type-check failures. Added `assert isinstance(result,
ClarificationNeededResponse)` before accessing `.questions` in 4 tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When navigating back to a cached session, appliedActionKeys was reset to empty
but messages were preserved. This caused previously applied actions to reappear
as unapplied in the UI, allowing them to be re-applied and creating duplicate
undo entries. Clearing messages unconditionally on navigation ensures the
displayed action buttons always reflect the actual applied state.
- Restore top-level `required: ["question"]` in schema for LLM tool-
calling compatibility; validation handles the questions-only path
- Fix keyword null bug: `item.get("keyword")` returning None now
correctly falls back to `question-{idx}` instead of producing "None"
- Filter empty-string options in _build_question (`str(o).strip()`)
to avoid artifacts like "Email, , Slack"
- Revert session type hint to `ChatSession` to match base class contract
- Add tests for null keyword and empty-string options filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove top-level `required: ["question"]` from schema so the
`questions`-only calling convention is valid for schema-compliant LLMs
- Move logger assignment below all imports (PEP 8 / isort)
- Remove duplicated option filtering in `_execute_single`; let
`_build_question` own that responsibility
- Fix `session` type hint to `ChatSession | None` to match the guard
- Add test for `questions` as non-list type (falls back to single path)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix falsy option filtering: use `if o is not None` instead of `if o`
so valid values like "0" are preserved
- Improve multi-question `message` field: join all questions with ";"
instead of only using the first question's text
- Add logging warnings for skipped invalid items in multi-question path
instead of silently dropping them
- Simplify schema: use `"required": ["question"]` instead of empty
required + anyOf (more LLM-friendly)
- Add missing test cases: session=None, single-item questions array,
duplicate keywords, falsy option values
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ask_question tool previously only accepted a single question per
invocation, forcing the LLM to cram multiple queries into one text box
or make multiple sequential tool calls. This adds a `questions` parameter
(list of question objects) so multiple input fields render at once.
Backward-compatible: the existing `question`/`options`/`keyword` params
still work. When `questions` (plural) is provided, they take precedence.
The frontend ClarificationQuestionsCard already supports rendering
multiple questions — no frontend changes needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests for GET/POST /credits/subscription covering:
- GET returns current tier (PRO, FREE default when None)
- POST FREE skips Stripe when payment disabled
- POST PRO sets tier directly for beta users (payment disabled)
- POST paid tier rejects missing success_url/cancel_url with 422
- POST paid tier creates Stripe Checkout Session and returns URL
- POST FREE with payment enabled cancels active Stripe subscription
- Remove useCallback from changeTier (not needed per project guidelines)
- Block self-service tier changes for ENTERPRISE users (admin-managed)
- Preserve current tier on unrecognized Stripe price_id instead of
defaulting to FREE (prevents accidental downgrades during price migration)
Tests for:
- Unknown/mismatched Stripe price_id defaults to FREE (not early return)
- None from LaunchDarkly price flags defaults to FREE
- BUSINESS tier mapping
- StripeError during cancel_stripe_subscription is logged, not raised
When sync_subscription_from_stripe encounters an unrecognized price_id
(e.g. LD flags unconfigured or price changed), it no longer returns early
leaving the user on a stale tier. Instead it defaults to FREE and logs a
warning, keeping the DB state consistent with Stripe's subscription status.
Also guard against None pro_price/biz_price from LaunchDarkly before
comparison to avoid silent mismatches.
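A sketch of the price-to-tier mapping with the default-to-FREE behaviour; the function name, tier strings, and flag parameters are assumptions:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def map_price_id_to_tier(
    price_id: Optional[str],
    pro_price: Optional[str],
    biz_price: Optional[str],
) -> str:
    # Guard against None LaunchDarkly flags so a None price_id can never
    # silently match an unconfigured flag.
    if pro_price is not None and price_id == pro_price:
        return "PRO"
    if biz_price is not None and price_id == biz_price:
        return "BUSINESS"
    # Unrecognized price: default to FREE (with a warning) rather than
    # returning early and leaving the user on a stale tier.
    logger.warning("Unrecognized Stripe price_id %s; defaulting to FREE", price_id)
    return "FREE"
```

The key design choice is that the fallthrough writes a definite tier: DB state stays consistent with Stripe even when the LD flags are unconfigured or a price migration is in flight.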
EditAgentTool and RunAgentTool call useCopilotChatActions() which throws
if no provider is in the tree. Wrap the panel content with
CopilotChatActionsProvider wired to sendRawMessage so tool components
can send retry prompts without crashing.
The user message was saved to DB before the <user_context> prefix was added
to session.messages. Subsequent upsert_chat_session calls only append new
messages (slicing by existing_message_count), so the prefixed content was
never written to the DB. On page reload or --resume, the unprefixed version
was loaded, losing personalisation.
Fix: add update_message_content_by_sequence to db.py and call it after
injecting the prefix in both sdk/service.py and baseline/service.py.
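The persistence gap can be reproduced with an in-memory stand-in for the DB; the function names mirror the text, but the storage model is deliberately simplified to a list of strings:

```python
def upsert_chat_session(db_messages: list[str], session_messages: list[str]) -> None:
    # Append-only: only messages beyond the existing count are written,
    # so in-place edits to already-saved messages never reach the DB.
    db_messages.extend(session_messages[len(db_messages):])


def update_message_content_by_sequence(
    db_messages: list[str], seq: int, content: str
) -> None:
    db_messages[seq] = content


db: list[str] = []
session = ["What is the weather?"]
upsert_chat_session(db, session)      # saved before the prefix exists

session[0] = "<user_context>profile…</user_context>\nWhat is the weather?"
upsert_chat_session(db, session)      # no-op: message counts are equal

# The fix: explicitly rewrite the already-persisted message.
update_message_content_by_sequence(db, 0, session[0])
```

Without the final call, a reload (or `--resume`) would hydrate the session from `db` and get the unprefixed message back, losing the personalisation.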
Add customer.subscription.created to the sync handler so user tier is
upgraded immediately when the subscription is first created (not just on
subsequent updates/deletions).
dev branch already creates SubscriptionTier enum and subscriptionTier column in
20260326200000_add_rate_limit_tier. Remove duplicate DDL from our migration and
only add SUBSCRIPTION to CreditTransactionType using IF NOT EXISTS guard.
Add cancel_stripe_subscription() which lists and cancels all active Stripe
subscriptions for the customer, preventing continued billing after downgrade.
Call it from update_subscription_tier() when tier == FREE and payment is
enabled. Add two unit tests covering active and empty subscription scenarios.
Add tests for the /logs/export endpoint (success, truncated, filters, auth) and
fix missing import of get_platform_cost_logs_for_export in platform_cost_test.py.