Commit Graph

8368 Commits

Author SHA1 Message Date
majdyz
e92ecbbb7c fix(backend): address review comments on SDK upgrade PR
- Make strip_forbidden_betas_from_body non-mutating (returns shallow
  copy instead of modifying caller's dict in-place)
- Add os.access(X_OK) validation for claude_agent_cli_path to reject
  non-executable paths at config load time
- Replace hardcoded /v1 path dedup with generic urlparse-based logic
  that handles any API version prefix in the target URL
2026-04-12 10:08:29 +00:00
majdyz
7f782d4676 ci(backend): add test to validate CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS env var
Adds a new test that spawns the CLI with
CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1 (without the compat proxy) and
checks whether the context-management-2025-06-27 beta header is
stripped. If this test passes in CI, the proxy can be removed entirely
in favour of the simpler env var approach.
2026-04-12 09:33:11 +00:00
majdyz
05477f2daa fix(copilot): address third CodeRabbit review cycle on proxy
- Preserve multi-valued response headers (e.g. Set-Cookie) by using
  clean_response_headers -> CIMultiDict instead of dict(headers)
- Use sock_connect + sock_read timeouts instead of total so long-lived
  SSE streaming responses aren't killed after 600s
- Log the configured bind_host instead of hardcoded 127.0.0.1
2026-04-12 07:43:12 +00:00
majdyz
cc3bac13c5 fix(copilot): address second CodeRabbit review cycle
- Fix docstring: default is True, not False (config_test.py)
- Redact exception message from stream-error log for consistency
  with upstream-error log (openrouter_compat_proxy.py)
2026-04-12 07:31:45 +00:00
majdyz
85f64de4cf fix(copilot): address review feedback on compat proxy PR
- Redact upstream URL from proxy error log to prevent leaking internal
  hostnames (openrouter_compat_proxy.py line 431)
- Remove type: ignore suppressors from cli_openrouter_compat_test.py,
  using cast instead for the untyped SDK import
- Fix validator precedence: replace field_validator with model_validator
  so explicit ChatConfig(claude_agent_use_compat_proxy=False) is not
  overridden by the unprefixed CLAUDE_AGENT_USE_COMPAT_PROXY env var
- Add regression test for explicit-kwarg precedence
2026-04-12 07:23:41 +00:00
majdyz
349daf48f7 test(copilot/config): flip default-compat-proxy test for dev preview
Dev-preview flips ``claude_agent_use_compat_proxy`` default to True
so the bundled CLI in claude-agent-sdk 0.1.58 works out of the box.
Update the no-env-var test accordingly so rebasing the upstream
config test on this branch doesn't fail.
2026-04-11 11:05:04 +00:00
majdyz
d702bcfae2 test(copilot/sdk-compat): skip bare-CLI reproduction when proxy enabled
When `claude_agent_use_compat_proxy=True` the operator has explicitly
opted into the workaround. The bare-CLI reproduction stops being a
useful signal in that mode — what matters is the *upstream* (post-
proxy) staying clean, which is covered by
`test_cli_via_compat_proxy_emits_clean_requests_to_upstream`.

Skip the bare test in that case so the dev-preview branch (0.1.58 +
proxy on) goes fully green instead of having an intentional-but-loud
failure on every CI run.

When the proxy is disabled (the default on the standalone proxy and
plumbing PRs), the bare test continues to run unchanged and serves
as the regression detector for the bundled CLI version.
2026-04-11 11:05:04 +00:00
majdyz
2af87616de test(copilot/sdk-compat): add proxy-routed reproduction variant
Adds `test_cli_via_compat_proxy_emits_clean_requests_to_upstream` so
the compat proxy has a real end-to-end regression guard: spawn the
bundled CLI through the proxy against a fake upstream, capture what
the upstream sees, assert it's clean.

The bare reproduction test
(`test_cli_does_not_send_openrouter_incompatible_features`) keeps
its original semantics — proves the bundled CLI is or isn't broken
upstream — so we still get a clean bisect signal when changing the
SDK pin.

Together the two tests give:

* bare CLI clean → bare test passes; proxy test passes (no-op).
* bare CLI broken → bare test fails (intentional bisect signal);
  proxy test passes if and only if the proxy successfully strips
  the forbidden patterns.

Which means on this dev preview branch (SDK 0.1.58 with proxy on),
CI catches both:

* the regression actually exists (bare test fails — that's the
  reproduction the user asked for), and
* the proxy actually fixes it (proxy test passes — that's the
  workaround validation).
2026-04-11 11:05:04 +00:00
majdyz
5cf60587ef chore(deps): bump claude-agent-sdk to 0.1.58 with compat proxy enabled
Dev preview PR — combines the cli_path plumbing (#12741), the
in-process compat proxy (#12745), and the SDK bump in one branch so
we can dogfood the full upgrade end-to-end.

Changes:

* `claude-agent-sdk` -> 0.1.58 (bundled CLI 2.1.97).  Gets us all the
  blocked features:
    - `exclude_dynamic_sections` cross-user prompt cache hits
      (0.1.57) — directly amplifies #12725
    - `AssistantMessage.usage` per-turn token tracking (0.1.49) —
      cost attribution
    - `task_budget` (0.1.51) — per-task cost ceiling
    - `get_context_usage()` (0.1.52) — context window monitoring
    - MCP large-tool-result truncation fix (0.1.55)
    - MCP HTTP/SSE buffer leak fix (CLI 2.1.97) — known production
      memory creep
    - 429 retry exponential-backoff fix (CLI 2.1.97) — production
      rate-limit recovery
    - `--resume` cache miss regression fix (CLI 2.1.90)
    - SDK session quadratic-write fix (CLI 2.1.90)

* `ChatConfig.claude_agent_use_compat_proxy` default flipped from
  `False` -> `True`. The bundled CLI in 0.1.55+ injects the
  `context-management-2025-06-27` beta header which OpenRouter
  rejects (anthropics/claude-agent-sdk-python#789). The proxy strips
  it transparently. Disable explicitly only if you've pinned to a
  CLI version in `_KNOWN_GOOD_BUNDLED_CLI_VERSIONS_DIRECT`.

* `sdk_compat_test.py` pin assertion split into two known-good sets:
    - `_KNOWN_GOOD_BUNDLED_CLI_VERSIONS_DIRECT` — works without the
      proxy ({"2.1.63", "2.1.70"})
    - `_KNOWN_GOOD_BUNDLED_CLI_VERSIONS_VIA_PROXY` — works only with
      the compat proxy enabled ({"2.1.97"})
  The test now requires `claude_agent_use_compat_proxy=True` for
  proxy-only versions, so disabling the proxy on a fresh checkout
  with this PR fails fast with a clear error.

Operational rollout (when ready to ship beyond dev preview):

1. Merge #12741 (plumbing + reproduction test)
2. Merge #12745 (proxy module — opt-in default off)
3. Merge this PR (bumps SDK + flips default to on)
4. Watch production for the existing reproduction test running
   continuously as a regression guard
5. If anything goes wrong: revert this PR (proxy becomes opt-in
   again, SDK back to whichever version is in the previous merge)

Dev preview usage: deploy this branch with no env-var changes —
the proxy is on by default. The reproduction test will continue
to pass against the bundled CLI 2.1.97 when (and only when) the
proxy successfully strips the forbidden patterns.
2026-04-11 11:05:03 +00:00
majdyz
428ed39a1a fix(copilot/sdk-proxy): abort transport on mid-stream upstream error
The previous fix set a ``stream_error`` flag and returned the
prepared ``StreamResponse`` without calling ``write_eof()``,
assuming aiohttp would leave the body dangling. It doesn't:
aiohttp's handler dispatcher finalises any returned
``StreamResponse`` on the way out (writing the chunked terminator /
content-length / EOF), so a regression test with a real mid-stream
failure still saw the client get a clean 200 body.

Correct fix: on the stream-error path, abort the underlying
transport directly via ``request.transport.abort()`` and then
re-raise the original stream error out of the handler. Aborting
drops the TCP socket mid-response so the client's parser surfaces
a ``ClientPayloadError`` / ``ServerDisconnectedError`` and the
caller sees the truncation as a real transport failure.

Also rewrote the regression test to use a raw
``asyncio.start_server`` TCP handler that sends a chunked response
header plus one partial chunk and then hard-closes the socket
(``transport.abort()``) — this is the one failure mode that
reliably propagates through aiohttp's ``iter_any()`` as a
``ClientError`` for the proxy to detect.  Verified locally: the
test now fails with the expected ``ClientPayloadError`` on the
client side instead of silently returning 200.
2026-04-11 11:04:50 +00:00
majdyz
8742c5e5b9 fix(copilot/sdk-proxy): treat empty sdk_env ANTHROPIC_BASE_URL as opt-out
Claude Code subscription mode intentionally sets
``sdk_env['ANTHROPIC_BASE_URL'] = ""`` to disable any base-URL
override and keep the CLI talking to Anthropic directly. The
previous ``or``-chained lookup evaluated the empty string as falsy
and fell through to ``os.environ.get("ANTHROPIC_BASE_URL")`` and
then to ``OPENROUTER_BASE_URL``, silently starting the compat proxy
for a session that had explicitly opted out — which breaks
subscription auth.

Use a presence check on ``sdk_env`` instead: if the key is present
with an empty value it's a hard "no-proxy" signal, so skip the
OpenRouter fallback even when ``openrouter_active`` is True. The
process-env fallback and the OpenRouter fallback still cover the
original cases (no sdk_env override, OpenRouter is the routing
provider for this session).

Flagged by sentry review on #12745 (thread 3067906804).
2026-04-11 10:48:04 +00:00
majdyz
370499c8dc fix(copilot/sdk-proxy): don't signal clean EOF on mid-stream error
When an ``aiohttp.ClientError`` fires mid-stream the previous code
logged it and then called ``downstream.write_eof()``, which tells
the downstream client "stream complete" on top of a partial,
truncated body. Clients then silently consumed the corrupt response
as if it were a clean success.

Track the stream error in a local variable and, when it's set, skip
the ``write_eof`` call and ``force_close`` the downstream response
so aiohttp drops the connection mid-body. The client's parser then
raises a ``ClientPayloadError`` / ``ServerDisconnectedError`` and
the failure is surfaced instead of silently producing garbage.

Added a regression test that spins up an upstream which calls
``force_close`` mid-response; the proxy must propagate the failure
to the client (exception on ``resp.read()``), never return a clean
body.

Flagged by sentry review on #12745 (thread 3067897364).
2026-04-11 10:41:40 +00:00
majdyz
cd9924f03e fix(copilot/sdk-proxy): address CodeRabbit follow-ups
Three follow-up findings from CodeRabbit's second-pass review:

* The forbidden-pattern scanner in ``cli_openrouter_compat_test``
  relied on a substring match against the prettified form
  `"type": "tool_reference"` (with a space). The CLI is free to
  emit compact JSON like `{"type":"tool_reference"}` which would
  slip past the scanner and false-pass the reproduction test.
  Replaced the substring check with a JSON walker that catches any
  dict with `type == "tool_reference"` regardless of serialisation,
  with a whitespace-tolerant regex fallback for malformed bodies.
  Added two regression tests (compact form, malformed fallback).

* The timeout path in ``_run_cli_against_fake_server`` called
  ``proc.kill()`` and returned immediately, leaving an unreaped
  subprocess until event-loop shutdown. Reap it with a 5-second
  bounded ``proc.communicate()`` wait after the kill.

* ``test_proxy_returns_502_on_upstream_failure`` swallowed
  ``aiohttp.ClientError`` / ``asyncio.TimeoutError`` on the outer
  ``client.post``. That outer call talks to the *proxy* on
  localhost — not the dead upstream — so any exception there
  indicates a proxy crash and must fail the test, not be caught.
  Removed the except block and bumped the client timeout to 10s to
  give the proxy room to return its 502. Also asserts the response
  body contains the generic "upstream error" text so a regression
  that replaces the 502 with a different status is caught.
2026-04-11 10:30:31 +00:00
majdyz
5d6cf91642 fix(copilot): handle bool default in compat-proxy env validator
The ``get_claude_agent_use_compat_proxy`` validator added in the
previous commit used ``if v is None`` to decide when to fall back to
the unprefixed env var. But unlike ``claude_agent_cli_path`` (which
defaults to ``None``), this field has ``default=False``. Pydantic-
settings passes the default bool into a ``mode="before"`` validator
when no explicit value is provided, so the ``is None`` branch never
fired and the unprefixed ``CLAUDE_AGENT_USE_COMPAT_PROXY`` env var
was silently ignored.

Switch to checking the raw process env directly: if the prefixed
``CHAT_CLAUDE_AGENT_USE_COMPAT_PROXY`` is set we trust Pydantic's
parsed value (which preserves any explicit ``false``), otherwise we
return the unprefixed env var's raw string so Pydantic's usual
truthy/falsy coercion handles it.

Added a new ``TestClaudeAgentUseCompatProxyEnvFallback`` class
covering both env-var names, the prefixed-wins-over-unprefixed
precedence (including the ``CHAT_...=false`` + unprefixed ``=true``
case), and the default. Also added the mirror tests for
``claude_agent_cli_path`` and included the new env var names in the
``_ENV_VARS_TO_CLEAR`` fixture so existing tests don't leak.

Flagged by sentry review on #12745 (thread 3067888297).
2026-04-11 10:26:17 +00:00
majdyz
0554a0ae35 fix(copilot/sdk-proxy): address PR review — RFC 7230 hop-by-hop,
timeouts, cancellation, provider gating

Addresses all seven review threads on #12745 (coderabbit + sentry)
in a single commit because they overlap in the same file cluster:

config.py
---------
* ``claude_agent_use_compat_proxy`` gains a ``field_validator`` that
  reads the unprefixed ``CLAUDE_AGENT_USE_COMPAT_PROXY`` in addition
  to the Pydantic-prefixed ``CHAT_`` form, matching the same dual-name
  pattern already used by ``api_key`` / ``base_url`` /
  ``claude_agent_cli_path`` and keeping parity with the docstring and
  the PR description. Without this the operator-facing env var was
  silently ignored because of ``env_prefix = "CHAT_"``.

openrouter_compat_proxy.py
--------------------------
* ``_HOP_BY_HOP_HEADERS`` now includes the canonical ``trailer``
  (singular per RFC 7230 §4.4) alongside the plural ``trailers``;
  ``clean_request_headers`` additionally drops every header whose
  name is listed in the incoming ``Connection`` field value (§6.1
  extension hop-by-hop headers), case-insensitively — previously
  extension hop-by-hop headers could leak upstream.

* ``strip_tool_reference_blocks`` now *removes* dict-valued
  ``tool_reference`` children from their parent dict instead of
  rewriting them to ``null``; the stated "strip anywhere" semantics
  were broken on nested dict assignments and still produced
  schema-invalid payloads upstream. Genuine ``None`` children on
  non-dict values are still preserved.

* ``_handle`` upstream-call error handler now catches
  ``asyncio.TimeoutError`` alongside ``aiohttp.ClientError`` —
  ``aiohttp.ClientTimeout`` raises ``asyncio.TimeoutError`` (not
  ``aiohttp.ClientError``), so hung upstreams used to escape as a
  generic 500 instead of the documented 502.

* Streaming-response handler no longer suppresses
  ``asyncio.CancelledError``. It's now split into its own except
  branch, releases the upstream body, and re-raises so cooperative
  task cancellation works as intended (cancellation while mid-stream
  was previously being caught alongside ``ClientError`` and silently
  swallowed, leading to hung request handlers on client disconnects
  / shutdowns).

* ``start()`` wraps the ``runner.setup() / site.start()`` sequence in
  try/except that tears down both the client session and the
  (partially-initialised) runner on any exception, so failed startups
  never leak resources. The attributes are only published to the
  instance after the full chain succeeds.

service.py
----------
* The compat-proxy startup is now gated on there actually being an
  Anthropic-compatible upstream to forward to. Previously the code
  fell back to ``OPENROUTER_BASE_URL`` unconditionally, which would
  silently re-route direct-Anthropic / Claude Code subscription
  sessions through OpenRouter and break auth. The new gate is:
  explicit ``ANTHROPIC_BASE_URL`` in ``sdk_env`` or the process env,
  OR ``ChatConfig.openrouter_active`` (OpenRouter is configured as
  the session's routing provider). When neither holds we log a
  warning and skip proxy startup — the feature is opt-in and named
  "OpenRouter compatibility", so no-oping direct-Anthropic sessions
  is the safe default. The success log line also drops the upstream
  URL to match the taint-analysis guidance already applied to
  ``openrouter_compat_proxy.start``.

Tests
-----
* Added regression tests for the dict-valued tool_reference fix, the
  Connection-listed header stripping (with case-insensitive matching),
  and an end-to-end 502-on-upstream-timeout test (fake upstream that
  sleeps longer than the proxy's request timeout). The hop-by-hop
  completeness test now also pins ``trailer`` / ``trailers``.
2026-04-11 10:18:50 +00:00
majdyz
fed728e546 fix(copilot/sdk-proxy): drop upstream from log message entirely
Previous fix logged the parsed netloc instead of the full URL, but
CodeQL's `py/clear-text-logging-sensitive-data` taint analysis still
traces the value through `urlparse(target_base_url).netloc` and
flags the log call. Address by dropping the upstream component from
the log entirely — only the local bind port is logged. The upstream
endpoint is discoverable from `ChatConfig` and exposed via the
`target_base_url` property for callers that need it.
2026-04-11 10:13:49 +00:00
majdyz
93f27ffdf6 fix(copilot/sdk-proxy): address CodeQL findings + isort drift
CodeQL flagged two issues in the new compat proxy:

1. `py/clear-text-logging-sensitive-data` (high) — logging
   `self._target_base_url` could leak credentials if a future caller
   passed a URL containing them. Switched to logging only the host
   component (and the local 127.0.0.1 port) so even an
   accidentally-credentialled base URL stays out of logs.

2. `py/stack-trace-exposure` (medium) — returning the upstream
   exception text in the 502 response body could leak internal
   hostnames or stack frames to the client. Changed to a generic
   "upstream error" string; the detailed exception is still logged
   server-side.

Also fixes an isort sorting drift in the test file (private
underscore-prefixed names must sort before public names — local
isort accepted the order, CI's isort did not).
2026-04-11 10:13:48 +00:00
majdyz
0f00972efc feat(copilot): in-process OpenRouter compat proxy for newer Claude SDK
The Claude Code CLI in any `claude-agent-sdk` version above 0.1.47
sends the `context-management-2025-06-27` beta header / body field
that OpenRouter rejects with HTTP 400. This blocks us from upgrading
to take features we want (`exclude_dynamic_sections` cross-user prompt
caching in 0.1.57, `AssistantMessage.usage` per-turn token tracking
in 0.1.49, the MCP large-tool-result truncation fix in 0.1.55, etc).
Tracked upstream at anthropics/claude-agent-sdk-python#789, no fix
released yet.

This commit adds an in-process HTTP middleware that lets the latest
SDK / CLI talk to OpenRouter unchanged. The proxy:

* listens on `127.0.0.1:RANDOM_PORT`,
* receives every CLI request that would normally go to
  `ANTHROPIC_BASE_URL`,
* strips `tool_reference` content blocks (the original 0.1.46+
  regression — defensive, in case the CLI 2.1.70 proxy detection
  ever regresses) and `context-management-2025-06-27` from both the
  request body's `betas` array and the `anthropic-beta` header,
* forwards the cleaned request upstream and streams the response
  back unchanged.

Wired via `ChatConfig.claude_agent_use_compat_proxy` (default
`False`, opt-in). When the flag is on, the SDK service starts a
proxy per session, injects its local URL into the spawned CLI
subprocess `env` as `ANTHROPIC_BASE_URL`, and tears it down in the
session's `finally` block.

The proxy is intentionally orthogonal to the existing
`claude_agent_cli_path` override:

* `cli_path`  picks **which** CLI binary we run.
* compat proxy rewrites **whatever the chosen binary sends**.

Both can be combined or used independently.

Tests cover:

* the pure stripping helpers (`strip_tool_reference_blocks`,
  `strip_forbidden_betas_from_body`,
  `strip_forbidden_anthropic_beta_header`,
  `clean_request_body_bytes`, `clean_request_headers`) including
  edge cases like empty input, non-JSON bodies, and the
  hop-by-hop header set,
* end-to-end behaviour against a fake upstream server: stripping
  the `tool_reference` block in nested `tool_result.content`,
  rewriting the `anthropic-beta` header,
  removing the forbidden token from the body `betas` array,
  passing through clean requests unchanged, and returning a clear
  502 on upstream failure (no infinite hang).
2026-04-11 10:13:48 +00:00
majdyz
a6e306d28a fix(copilot): accept unprefixed CLAUDE_AGENT_CLI_PATH in config
The new `claude_agent_cli_path` field inherited the `CHAT_` Pydantic
prefix from `ChatConfig`, so the documented `CLAUDE_AGENT_CLI_PATH`
env var was silently ignored — operators following the PR description
or the field docstring would set the unprefixed form and the config
would fall back to the bundled CLI.

Add a `field_validator` that reads `CHAT_CLAUDE_AGENT_CLI_PATH` first
and falls back to the unprefixed `CLAUDE_AGENT_CLI_PATH`, matching the
same pattern already used by `api_key` and `base_url`. The test helper
`_resolve_cli_path` in `cli_openrouter_compat_test.py` mirrors the
same two-name lookup so the reproduction test picks up the override
regardless of which form is set, and a new test covers the prefixed
variant explicitly.

Flagged by sentry review on #12741 (thread IDs 3067725580 and
3067768817) as two instances of the same bug.
2026-04-11 10:11:47 +00:00
majdyz
d6f0fcb052 test(copilot/sdk-compat): unit-test the forbidden-pattern scanner
Add direct unit tests for `_scan_request_for_forbidden_patterns` and
`_resolve_cli_path` so the helper logic stays exercised even on CI
runs where the slow end-to-end CLI subprocess test can't capture a
request (sandboxed runner, missing CLI binary, etc).

Brings codecov/patch coverage above the 80% gate. No production
code changes — tests only.
2026-04-11 07:57:04 +00:00
majdyz
feb247d56e chore(backend): drop stray blank line in platform_cost_test.py
Same pre-existing dev-branch lint issue from PR #12739 — black would
reformat this file (extra blank line between two test classes), which
fails the `lint` CI job for any PR branched from current dev.
2026-04-11 07:10:55 +00:00
majdyz
fdb3590693 chore(copilot): add SDK CLI override + OpenRouter compat regression tests
We've been pinned at `claude-agent-sdk==0.1.45` (bundled CLI 2.1.63)
since PR #12294 because every version above introduces a 400 against
OpenRouter. There are two stacked regressions today:

1. CLI 2.1.69 (= SDK 0.1.46) added a `tool_reference` content block in
   `tool_result.content` that OpenRouter's stricter Zod validation
   rejects. CLI 2.1.70 added a proxy-detection workaround but our
   subsequent attempts at 0.1.55 and 0.1.56 still failed.
2. A newer regression — the `context-management-2025-06-27` beta
   header — appears in some CLI version after 2.1.91. Tracked upstream
   at anthropics/claude-agent-sdk-python#789, still open with no fix.

This commit doesn't actually upgrade the SDK — it adds the
infrastructure we need to upgrade safely *when* upstream lands a fix
or when we identify a known-good newer CLI version via bisection:

* `ChatConfig.claude_agent_cli_path` (env: `CLAUDE_AGENT_CLI_PATH`)
  threads through to `ClaudeAgentOptions(cli_path=...)` so we can
  decouple the Python SDK API surface from the CLI binary version.
  `_prewarm_cli` in the CoPilotExecutor honours the same override.

* `test_bundled_cli_version_is_known_good_against_openrouter` pins
  the bundled CLI to a known-good set (`{"2.1.63"}` today). Any
  `claude-agent-sdk` bump that changes the bundled CLI will fail this
  test loudly with a pointer to PR #12294 and issue #789, instead of
  silently re-breaking production.

* `test_sdk_exposes_cli_path_option` is a forward-compat sentinel that
  fails fast if upstream removes the `cli_path` option we depend on
  for the override.

* `cli_openrouter_compat_test.py` is the actual reproduction test:
  spawns the bundled (or `CLAUDE_AGENT_CLI_PATH`-overridden) CLI
  against an in-process aiohttp server pretending to be the Anthropic
  Messages API, captures every request body the CLI sends, and
  asserts that none of them contain the two known forbidden patterns
  (`"type": "tool_reference"` content blocks or
  `"context-management-2025-06-27"` in body or `anthropic-beta`
  header). The fake server returns a minimal valid streamed response
  so the CLI doesn't error out before we can inspect what it sent.
  No OpenRouter API key required — the test reproduces the *mechanism*
  rather than the symptom, so it's deterministic and free to run in CI.

Workflow for verifying a candidate upgrade going forward: bump the
SDK in `pyproject.toml`, push the commit, and watch the CI run for
both tests in `sdk_compat_test.py` and `cli_openrouter_compat_test.py`.
A clean run on both means it's safe to add the new bundled CLI version
to `_KNOWN_GOOD_BUNDLED_CLI_VERSIONS` and merge.
2026-04-11 07:05:05 +00:00
Zamil Majdy
b319c26cab feat(platform/admin): per-model cost breakdown, cache token tracking, OrchestratorBlock cost fix (#12726)
## Why

The platform cost tracking system had several gaps that made the admin
dashboard less accurate and harder to reason about:

**Q: Do we have per-model granularity on the provider page?**
The `model` column was stored in `PlatformCostLog` but the SQL
aggregation grouped only by `(provider, tracking_type)`, so all models
for a given provider collapsed into one row. Now grouped by `(provider,
tracking_type, model)` — each model gets its own row.

**Q: Why does Anthropic show `per_run` for OrchestratorBlock?**
Bug: `OrchestratorBlock._call_llm()` was building `NodeExecutionStats`
with only `input_token_count` and `output_token_count` — it dropped
`resp.provider_cost` entirely. For OpenRouter calls this silently
discarded the `cost_usd`. For the SDK (autopilot) path,
`ResultMessage.total_cost_usd` was never read. When `provider_cost` is
None and token counts are 0 (e.g. SDK error path), `resolve_tracking`
falls through to `per_run`. Fixed by propagating all cost/cache fields.

**Q: Why can't we get `cost_usd` for Anthropic direct API calls?**
The Anthropic Messages API does not return a dollar amount — only token
counts. OpenRouter returns cost via response headers, so it uses
`cost_usd` directly. The Claude Agent SDK *does* compute
`total_cost_usd` internally, so SDK-mode OrchestratorBlock runs now get
`cost_usd` tracking. For direct Anthropic LLM blocks the estimate uses
per-token rates (see cache section below).

**Q: What about labeling by source (autopilot vs block)?**
Already tracked: `block_name` stores `copilot:SDK`, `copilot:Baseline`,
or the actual block name. Visible in the raw logs table. Not added to
the provider group-by (would explode row count); use the logs table
filter instead.

**Q: Is there double-counting between `tokens`, `per_run`, and
`cost_usd`?**
No. `resolve_tracking()` uses a strict preference hierarchy — exactly
one tracking type per execution: `cost_usd` > `tokens` > provider
heuristics > `per_run`. A single execution produces exactly one
`PlatformCostLog` row.

**Q: Should we track Anthropic prompt cache tokens (PR #12725)?**
Yes — PR #12725 adds `cache_control` markers to Anthropic API calls,
which causes the API to return `cache_read_input_tokens` and
`cache_creation_input_tokens` alongside regular `input_tokens`. These
have different billing rates:
- Cache reads: **10%** of base input rate (much cheaper)
- Cache writes: **125%** of base input rate (slightly more expensive,
one-time)
- Uncached input: **100%** of base rate

Without tracking them separately, a flat-rate estimate on
`total_input_tokens` would be wrong in both directions.

## What

- **Per-model provider table**: SQL now groups by `(provider,
tracking_type, model)`. `ProviderCostSummary` and the frontend
`ProviderTable` show a model column.
- **Cache token columns**: New `cacheReadTokens` and
`cacheCreationTokens` columns in `PlatformCostLog` with matching
migration.
- **LLM block cache tracking**: `LLMResponse` captures
`cache_read_input_tokens` / `cache_creation_input_tokens` from Anthropic
responses. `NodeExecutionStats` gains `cache_read_token_count` /
`cache_creation_token_count`. Both propagate to `PlatformCostEntry` and
the DB.
- **Copilot path**: `token_tracking.persist_and_record_usage` now writes
cache tokens as dedicated `PlatformCostEntry` fields (was
metadata-only).
- **OrchestratorBlock bug fix**: `_call_llm()` now includes
`resp.provider_cost`, `resp.cache_read_tokens`,
`resp.cache_creation_tokens` in the stats merge. SDK path captures
`ResultMessage.total_cost_usd` as `provider_cost`.
- **Accurate cost estimation**: `estimateCostForRow` uses
token-type-specific rates for `tokens` rows (uncached=100%, reads=10%,
writes=125% of configured base rate).

## How

`resolve_tracking` priority is unchanged. For Anthropic LLM blocks the
tracking type remains `tokens` (Anthropic API returns no dollar amount).
For OrchestratorBlock in SDK/autopilot mode it now correctly uses
`cost_usd` because the Claude Agent SDK computes and returns
`total_cost_usd`. For OpenRouter through OrchestratorBlock it now
correctly uses `cost_usd` (was silently dropped before).

## Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] `ProviderCostSummary` SQL updated
- [x] Cache token fields present in `PlatformCostEntry` and
`PlatformCostLogCreateInput`
  - [x] Prisma client regenerated — all type checks pass
  - [x] Frontend `helpers.test.ts` updated for new `rateKey` format
  - [x] Pre-commit hooks pass (Black, Ruff, isort, tsc, Prisma generate)
2026-04-10 23:14:43 +07:00
Zamil Majdy
85921f227a Merge branch 'dev' of github.com:Significant-Gravitas/AutoGPT into preview/all-active-prs 2026-04-10 22:59:30 +07:00
Zamil Majdy
5844b13fb1 feat(backend/copilot): support multiple questions in ask_question tool (#12732)
### Why / What / How

**Why:** The `ask_question` copilot tool previously only accepted a
single question per invocation. When the LLM needs to ask multiple
clarifying questions simultaneously, it either crams them into one text
field (requiring users to format numbered answers manually) or makes
multiple sequential tool calls (slow and disruptive UX).

**What:** Replace the single `question`/`options`/`keyword` parameters
with a `questions` array parameter so the LLM can ask multiple questions
in one tool call, each rendered as its own input box.

**How:** Simplified the tool to accept only `questions` (array of
question objects). Each item has `question` (required), `options`, and
`keyword`. The frontend `ClarificationQuestionsCard` already supports
rendering multiple questions — no frontend changes needed.

### Changes 🏗️

- `backend/copilot/tools/ask_question.py`: Replaced dual
question/questions schema with single `questions` array. Extracted
parsing into module-level `_parse_questions` and `_parse_one` helpers.
Follows backend code style: early returns, list comprehensions, top-down
ordering, functions under 40 lines.
- `backend/copilot/tools/ask_question_test.py`: Rewritten with 18
focused tests covering happy paths, keyword handling, options filtering,
and invalid input handling.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Run `poetry run pytest backend/copilot/tools/ask_question_test.py`
— all tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 21:54:53 +07:00
Zamil Majdy
c014e1aa35 merge(preview): merge all active PRs into preview/all-active-prs from fresh dev 2026-04-10 08:40:23 +07:00
Zamil Majdy
e59f576622 Merge remote-tracking branch 'origin/spare/13' into preview/all-active-prs 2026-04-10 08:39:34 +07:00
Zamil Majdy
c99fa32ae3 Merge remote-tracking branch 'origin/spare/3' into preview/all-active-prs 2026-04-10 08:39:34 +07:00
Zamil Majdy
b71789da50 Merge remote-tracking branch 'origin/feat/subscription-tier-billing' into preview/all-active-prs 2026-04-10 08:39:34 +07:00
Zamil Majdy
5661326e7e fix(platform): fetch real Stripe prices in subscription status endpoint
- Import get_subscription_price_id in v1.py
- get_subscription_status now calls stripe.Price.retrieve for PRO/BUSINESS
  tiers to return actual unit_amount instead of hardcoded zeros
- UI will now show correct monthly costs when LD price IDs are configured
- Fix Button import from __legacy__ to design system in SubscriptionTierSection
- Update subscription status tests to mock the new Stripe price lookup
2026-04-10 08:37:40 +07:00
Zamil Majdy
df3fe926f2 style(backend/copilot): apply Black formatting to ask_question
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 23:56:42 +00:00
Zamil Majdy
505af7e673 refactor(backend/copilot): simplify ask_question to questions-only API
Drop the dual question/questions schema in favor of a single
`questions` array parameter. This removes ~175 lines of complexity
(the _execute_single path, duplicate params, precedence logic).

Restructured per backend code style rules:
- Top-down ordering: public _execute first, helpers below
- Early return with guard clauses, no deep nesting
- List comprehensions via walrus operator in _parse_questions
- Helpers extracted as module-level functions (not methods)
- Functions under 40 lines each

The frontend ClarificationQuestionsCard already renders arrays of
any length — no UI changes needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 23:54:11 +00:00
Zamil Majdy
d896a1f9fa fix(backend/copilot): add missing isinstance assertion in test
Add isinstance narrowing in test_execute_multiple_questions_ignores_single_params
to fix Pyright type-check CI failure (reportAttributeAccessIssue).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 23:48:02 +00:00
Zamil Majdy
6aa5a808e0 fix(backend/copilot): add isinstance assertions to fix type-check CI
Tests that access `result.questions` without first narrowing the type
from `ToolResponseBase` to `ClarificationNeededResponse` cause Pyright
type-check failures. Added `assert isinstance(result,
ClarificationNeededResponse)` before accessing `.questions` in 4 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 23:40:08 +00:00
Zamil Majdy
18c88b4da0 fix(frontend/builder): always clear messages on flowID change to keep action state consistent
When navigating back to a cached session, appliedActionKeys was reset to empty
but messages were preserved. This caused previously applied actions to reappear
as unapplied in the UI, allowing them to be re-applied and creating duplicate
undo entries. Clearing messages unconditionally on navigation ensures the
displayed action buttons always reflect the actual applied state.
2026-04-10 02:03:56 +07:00
Zamil Majdy
3a5ce570e0 fix(backend/copilot): address PR review round 4
- Restore top-level `required: ["question"]` in schema for LLM tool-
  calling compatibility; validation handles the questions-only path
- Fix keyword null bug: `item.get("keyword")` returning None now
  correctly falls back to `question-{idx}` instead of producing "None"
- Filter empty-string options in _build_question (`str(o).strip()`)
  to avoid artifacts like "Email, , Slack"
- Revert session type hint to `ChatSession` to match base class contract
- Add tests for null keyword and empty-string options filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 18:56:37 +00:00
Zamil Majdy
5a3739e54d fix(backend/copilot): address PR review round 2
- Remove top-level `required: ["question"]` from schema so the
  `questions`-only calling convention is valid for schema-compliant LLMs
- Move logger assignment below all imports (PEP 8 / isort)
- Remove duplicated option filtering in `_execute_single`; let
  `_build_question` own that responsibility
- Fix `session` type hint to `ChatSession | None` to match the guard
- Add test for `questions` as non-list type (falls back to single path)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 18:43:11 +00:00
Zamil Majdy
72bc8a92df fix(frontend/builder): guard msg.parts with nullish coalescing to prevent runtime error 2026-04-10 01:41:15 +07:00
Zamil Majdy
cc29cf5e20 fix(backend/copilot): address PR review round 1
- Fix falsy option filtering: use `if o is not None` instead of `if o`
  so valid values like "0" are preserved
- Improve multi-question `message` field: join all questions with ";"
  instead of only using the first question's text
- Add logging warnings for skipped invalid items in multi-question path
  instead of silently dropping them
- Simplify schema: use `"required": ["question"]` instead of empty
  required + anyOf (more LLM-friendly)
- Add missing test cases: session=None, single-item questions array,
  duplicate keywords, falsy option values

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 18:39:55 +00:00
Zamil Majdy
a0efbbba90 feat(backend/copilot): support multiple questions in ask_question tool
The ask_question tool previously only accepted a single question per
invocation, forcing the LLM to cram multiple queries into one text box
or make multiple sequential tool calls. This adds a `questions` parameter
(list of question objects) so multiple input fields render at once.

Backward-compatible: the existing `question`/`options`/`keyword` params
still work. When `questions` (plural) is provided, they take precedence.
The frontend ClarificationQuestionsCard already supports rendering
multiple questions — no frontend changes needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 18:21:35 +00:00
Zamil Majdy
8ed959433a fix(frontend/builder): clear stale messages in retrySession so new session starts clean 2026-04-10 00:56:31 +07:00
Zamil Majdy
98f3e09580 fix(frontend/builder): reset hasSentSeedMessageRef in retrySession so seed is sent to new session 2026-04-10 00:39:10 +07:00
Zamil Majdy
9ec44dd109 test(backend): add route-level tests for subscription API endpoints
Tests for GET/POST /credits/subscription covering:
- GET returns current tier (PRO, FREE default when None)
- POST FREE skips Stripe when payment disabled
- POST PRO sets tier directly for beta users (payment disabled)
- POST paid tier rejects missing success_url/cancel_url with 422
- POST paid tier creates Stripe Checkout Session and returns URL
- POST FREE with payment enabled cancels active Stripe subscription
2026-04-10 00:19:06 +07:00
Zamil Majdy
bfb82b6246 fix(platform): address reviewer feedback on subscription endpoint
- Remove useCallback from changeTier (not needed per project guidelines)
- Block self-service tier changes for ENTERPRISE users (admin-managed)
- Preserve current tier on unrecognized Stripe price_id instead of
  defaulting to FREE (prevents accidental downgrades during price migration)
2026-04-10 00:08:54 +07:00
Zamil Majdy
63210770ce test(backend): add tests for get_subscription_price_id to improve coverage 2026-04-09 23:54:02 +07:00
Zamil Majdy
f2b8f81bb1 test(backend/copilot): add unit tests for update_message_content_by_sequence
Cover success, not-found (returns False + warning), and DB-error (returns
False + error log) paths to push patch coverage above the 80% threshold.
2026-04-09 23:52:39 +07:00
Zamil Majdy
68b51ae2d3 test(backend): add coverage for sync_subscription_from_stripe edge cases
Tests for:
- Unknown/mismatched Stripe price_id defaults to FREE (not early return)
- None from LaunchDarkly price flags defaults to FREE
- BUSINESS tier mapping
- StripeError during cancel_stripe_subscription is logged, not raised
2026-04-09 23:52:16 +07:00
Zamil Majdy
63ff214563 fix(backend): default to FREE tier on unknown Stripe price ID in webhook sync
When sync_subscription_from_stripe encounters an unrecognized price_id
(e.g. LD flags unconfigured or price changed), it no longer returns early
leaving the user on a stale tier. Instead it defaults to FREE and logs a
warning, keeping the DB state consistent with Stripe's subscription status.

Also guard against None pro_price/biz_price from LaunchDarkly before
comparison to avoid silent mismatches.
2026-04-09 23:41:51 +07:00
Zamil Majdy
9498daca31 fix(frontend/builder): wrap panel in CopilotChatActionsProvider to prevent crash
EditAgentTool and RunAgentTool call useCopilotChatActions() which throws
if no provider is in the tree. Wrap the panel content with
CopilotChatActionsProvider wired to sendRawMessage so tool components
can send retry prompts without crashing.
2026-04-09 23:41:06 +07:00
Zamil Majdy
ce0cb1e035 fix(backend/copilot): persist user-context prefix to DB in both SDK and baseline paths
The user message was saved to DB before the <user_context> prefix was added
to session.messages. Subsequent upsert_chat_session calls only append new
messages (slicing by existing_message_count), so the prefixed content was
never written to the DB. On page reload or --resume, the unprefixed version
was loaded, losing personalisation.

Fix: add update_message_content_by_sequence to db.py and call it after
injecting the prefix in both sdk/service.py and baseline/service.py.
2026-04-09 23:40:14 +07:00