From dbc3b0b31b5cd9f3c529bdd5a230bcd84725ec9f Mon Sep 17 00:00:00 2001
From: Zamil Majdy
Date: Wed, 22 Apr 2026 19:02:58 +0700
Subject: [PATCH] refactor(backend/copilot): isolate Moonshot quirks + enable
 Moonshot cache_control + track title cost
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Why

Three loose ends from the Kimi SDK-default merge (#12878):

1. Kimi-specific pricing logic (rate card, cost-override helper) lived
   inline in sdk/service.py next to unrelated SDK plumbing. Any future
   non-Anthropic vendor would have piled into the same file.
2. Moonshot's Anthropic-compat endpoint honours
   ``cache_control: {type: ephemeral}`` markers, but the baseline
   cache-marking gate (``_is_anthropic_model``) was narrow enough to
   exclude it — so Moonshot fell back to its automatic prefix cache,
   which drifts readily between turns (internal testing: 0/4 hits
   across two continuation sessions).
3. Title generation (``_update_title_async``, runs per-session) makes
   its own LLM call that the turn reconcile never sees — admin cost
   dashboard under-reports provider spend by the aggregate of those
   calls.

## What

- New ``backend/copilot/moonshot.py`` module:
  * ``is_moonshot_model(model)`` — prefix check against ``moonshotai/``
  * ``rate_card_usd(model)`` — published Moonshot rates, default
    ``(0.60, 2.80)`` per MTok, per-slug overrides slot for future SKUs
  * ``override_cost_usd(...)`` — moved from sdk/service.py, replaces
    the CLI's Sonnet-rate Kimi estimate with the real rate card
  * ``supports_cache_control(model)`` — gate for Anthropic-style cache
    markers on Moonshot routes
  * Explicit docstring: rate card is NOT canonical — authoritative cost
    comes from the OpenRouter ``/generation`` reconcile; this module
    only improves the in-turn estimate and the reconcile's lookup-fail
    fallback.
- Baseline ``_is_anthropic_model`` stays narrow (still only Anthropic —
  callers needing the ``anthropic-beta`` header use it as-is).
  New ``_supports_prompt_cache_markers`` widens the gate to Anthropic OR
  Moonshot; both call sites that emit ``cache_control`` on the system
  message and the final tool schema switch to this wider gate.
  OpenAI / Grok / Gemini still 400 on the field, so those providers keep
  the ``false`` answer.
- Title generation now captures its own cost via
  ``persist_and_record_usage`` with ``provider="open_router"`` and
  ``block_name="copilot:title"``, so the admin dashboard sees every
  dollar we spend. OR's ``usage.cost`` is read off ``usage.model_extra``
  (the pydantic container) using the same pattern as tools/web_search.
- TODO marker on the rate-card call site: after ~1 week of production
  data confirming OR ``/generation`` reconcile reliability, drop the
  rate card entirely and rely on the authoritative number for all Kimi
  turns.

## How

- Detection is prefix-based (``moonshotai/``) — every Kimi SKU today
  uses that namespace and shares pricing, so a future
  ``moonshotai/kimi-k3.0`` inherits both the rate card and the
  cache-control gate transparently without editing this file.
- ``_mark_tools_with_cache_control`` and the per-round cached system
  message builder already exist; the only change is swapping the gate
  from ``_is_anthropic_model`` to ``_supports_prompt_cache_markers``.
- Title cost capture is in-line with the existing
  ``_generate_session_title`` call
  (``extra_body={"usage":{"include":True}}`` asks OR to embed the real
  billed cost); a best-effort ``_record_title_generation_cost`` helper
  reads it off ``usage.model_extra`` and fires
  ``persist_and_record_usage`` under ``try/except`` — any cost-tracking
  failure downgrades to debug log so title generation itself never
  breaks.

## Deferred to follow-up

- Kimi reasoning renders after text on dev because Moonshot's shim
  splits each turn into separate ``AssistantMessage`` summaries (one
  text-only, one thinking-only) — the in-message hoist at
  ``response_adapter.py:193`` can't reorder across messages.
  Fix needs more design (UX trade-off between realtime streaming and
  correct ordering); investigating separately.
---
 .../backend/copilot/baseline/service.py       |  71 ++++---
 .../copilot/baseline/service_unit_test.py     |  42 +++--
 .../backend/backend/copilot/moonshot.py       | 137 ++++++++++++++
 .../backend/backend/copilot/moonshot_test.py  | 173 ++++++++++++++++++
 .../backend/backend/copilot/sdk/service.py    |  66 ++-----
 .../backend/copilot/sdk/service_test.py       |  77 --------
 .../backend/backend/copilot/service.py        |  83 ++++++++-
 7 files changed, 480 insertions(+), 169 deletions(-)
 create mode 100644 autogpt_platform/backend/backend/copilot/moonshot.py
 create mode 100644 autogpt_platform/backend/backend/copilot/moonshot_test.py

diff --git a/autogpt_platform/backend/backend/copilot/baseline/service.py b/autogpt_platform/backend/backend/copilot/baseline/service.py index beb1af3f74..7efd82a314 100644 --- a/autogpt_platform/backend/backend/copilot/baseline/service.py +++ b/autogpt_platform/backend/backend/copilot/baseline/service.py @@ -38,6 +38,7 @@ from backend.copilot.builder_context import ( from backend.copilot.config import CopilotLlmModel, CopilotMode from backend.copilot.context import get_workspace_manager, set_execution_context from backend.copilot.graphiti.config import is_enabled_for_user +from backend.copilot.moonshot import is_moonshot_model from backend.copilot.model import ( ChatMessage, ChatSession, @@ -428,25 +429,39 @@ def _emit_all( def _is_anthropic_model(model: str) -> bool: """Return True if *model* routes to Anthropic (native or via OpenRouter). - Cache-control markers on message content + the ``anthropic-beta`` header - are Anthropic-specific. OpenAI rejects the unknown ``cache_control`` - field with a 400 ("Extra inputs are not permitted") and Grok / other - providers behave similarly. 
OpenRouter strips unknown headers but - passes through ``cache_control`` on the body regardless of provider — - which would also fail when OpenRouter routes to a non-Anthropic model. - Examples that return True: - ``anthropic/claude-sonnet-4-6`` (OpenRouter route) - ``claude-3-5-sonnet-20241022`` (direct Anthropic API) - ``anthropic.claude-3-5-sonnet`` (Bedrock-style) False for ``openai/gpt-4o``, ``google/gemini-2.5-pro``, ``xai/grok-4`` - etc. + etc. Moonshot is False here too even though Moonshot's + Anthropic-compat endpoint honours ``cache_control`` — use + :func:`_supports_prompt_cache_markers` for the cache-gating decision, + which also allows Moonshot routes. This function stays scoped to + "genuinely Anthropic" so callers that need the stricter check (e.g. + ``anthropic-beta`` header emission) keep their existing semantics. """ lowered = model.lower() return "claude" in lowered or lowered.startswith("anthropic") +def _supports_prompt_cache_markers(model: str) -> bool: + """Return True when *model* accepts Anthropic-style ``cache_control``. + + Superset of :func:`_is_anthropic_model` — also allows Moonshot + (``moonshotai/*``), whose OpenRouter Anthropic-compat endpoint + honours the marker and empirically lifts cache hit rate on + continuation turns from near-zero (Moonshot's own automatic prefix + cache, which drifts readily) to the 60-95% Anthropic ballpark. + + OpenAI / Grok / Gemini still 400 on ``cache_control``, so this + function returns False for those providers — add new vendors here + only after verifying their endpoint accepts the field. + """ + return _is_anthropic_model(model) or is_moonshot_model(model) + + def _fresh_ephemeral_cache_control() -> dict[str, str]: """Return a FRESH ephemeral ``cache_control`` dict each call. @@ -567,19 +582,24 @@ async def _baseline_llm_caller( round_text = "" try: client = _get_openai_client() - # Cache markers are Anthropic-specific. 
For OpenAI/Grok/other - # providers, leaving them on would trigger a 400 ("Extra inputs - # are not permitted" on cache_control). Tools were precomputed - # in stream_chat_completion_baseline via _mark_tools_with_cache_control - # (only when the model was Anthropic), so on non-Anthropic routes - # tools ship without cache_control on the last entry too. + # Cache markers are accepted by Anthropic AND Moonshot (via OR's + # Anthropic-compat endpoint). OpenAI/Grok/Gemini 400 on the + # unknown ``cache_control`` field — tools were precomputed in + # stream_chat_completion_baseline via _mark_tools_with_cache_control + # with the same gate, so on unsupported routes tools ship + # unmarked too. # - # `extra_body` `usage.include=true` asks OpenRouter to embed the real - # generation cost into the final usage chunk — required by the - # cost-based rate limiter in routes.py. Separate from the Anthropic + # The ``anthropic-beta`` header is only emitted for genuinely + # Anthropic routes (see :func:`_is_anthropic_model`) — Moonshot + # doesn't need the beta header; sending it is a no-op but we + # keep the check strict for clarity. + # + # `extra_body` `usage.include=true` asks OpenRouter to embed the + # real generation cost into the final usage chunk — required by + # the cost-based rate limiter in routes.py. Separate from the # caching headers, always sent. - is_anthropic = _is_anthropic_model(state.model) - if is_anthropic: + supports_cache = _supports_prompt_cache_markers(state.model) + if supports_cache: # Build the cached system dict once per session and splice it in # on each round. 
The full ``messages`` list grows with every # tool call, so copying the entire list just to mutate index 0 @@ -595,7 +615,11 @@ async def _baseline_llm_caller( final_messages = [state.cached_system_message, *messages[1:]] else: final_messages = messages - extra_headers = _fresh_anthropic_caching_headers() + extra_headers = ( + _fresh_anthropic_caching_headers() + if _is_anthropic_model(state.model) + else None + ) else: final_messages = messages extra_headers = None @@ -1638,9 +1662,10 @@ async def stream_chat_completion_baseline( # _baseline_llm_caller) avoids re-copying ~43 tool dicts on every LLM # round of the tool-call loop. # - # Only apply to Anthropic routes — OpenAI/Grok/other providers would - # 400 on the unknown ``cache_control`` field inside tool definitions. - if _is_anthropic_model(active_model): + # Applies to Anthropic AND Moonshot routes — OpenAI/Grok/Gemini 400 + # on the unknown ``cache_control`` field inside tool definitions, so + # the gate stays narrow (see :func:`_supports_prompt_cache_markers`). + if _supports_prompt_cache_markers(active_model): tools = cast( list[ChatCompletionToolParam], _mark_tools_with_cache_control(tools) ) diff --git a/autogpt_platform/backend/backend/copilot/baseline/service_unit_test.py b/autogpt_platform/backend/backend/copilot/baseline/service_unit_test.py index 44c49eb732..e4da59071a 100644 --- a/autogpt_platform/backend/backend/copilot/baseline/service_unit_test.py +++ b/autogpt_platform/backend/backend/copilot/baseline/service_unit_test.py @@ -1864,12 +1864,14 @@ class TestBaselineReasoningStreaming: assert "reasoning" not in extra_body @pytest.mark.asyncio - async def test_kimi_route_sends_reasoning_but_no_cache_control(self): - """Kimi K2.6 is the default fast_model and sends ``reasoning`` via - OpenRouter's unified extension. 
It must NOT receive ``cache_control`` - markers or the ``anthropic-beta`` header — Moonshot uses its own - auto-caching and those Anthropic-only fields would either get - silently dropped or (worst case) 400 on a future provider change.""" + async def test_kimi_route_sends_reasoning_and_cache_control(self): + """Kimi K2.6 (Moonshot via OpenRouter's Anthropic-compat endpoint) + accepts ``cache_control: {type: ephemeral}`` on the system block + and the last tool — the endpoint honours the marker and lifts + cache hit rate on continuation turns from near-zero (Moonshot's + auto-caching drifts) to the Anthropic ~60-95% ballpark. The + ``anthropic-beta`` header stays off because Moonshot doesn't need + it; OpenRouter would strip the unknown header anyway.""" state = _BaselineStreamState(model="moonshotai/kimi-k2.6") mock_client = MagicMock() @@ -1901,15 +1903,29 @@ class TestBaselineReasoningStreaming: # cheap-but-still-reasoning-capable path. assert "reasoning" in extra_body assert extra_body["reasoning"]["max_tokens"] > 0 - # Anthropic-only fields stay off. - assert "extra_headers" not in call_kwargs + # No ``anthropic-beta`` header — that beta is specifically for + # native Anthropic endpoints; Moonshot's shim accepts + # ``cache_control`` without it, and sending it would be wasted + # bytes (OR strips it before forwarding to Moonshot). + assert "extra_headers" not in call_kwargs or not call_kwargs.get( + "extra_headers" + ) + # System block MUST carry ``cache_control`` so Moonshot's cache + # breakpoint is honoured. The cached system-message builder + # emits list-shape content with the marker on the first (and + # only) block — assert on that shape. 
sys_msg = call_kwargs["messages"][0] sys_content = sys_msg.get("content") - if isinstance(sys_content, list): - assert all("cache_control" not in block for block in sys_content) - tools = call_kwargs.get("tools", []) - for t in tools: - assert "cache_control" not in t + assert isinstance( + sys_content, list + ), "Cached system message should be a list-shape content block" + assert any( + "cache_control" in block for block in sys_content if isinstance(block, dict) + ), "Kimi system message should now carry cache_control markers" + # Tool-level cache marking is applied by ``stream_chat_completion_baseline`` + # (see ``_mark_tools_with_cache_control``) before tools reach + # ``_baseline_llm_caller``, so this unit test doesn't exercise + # that path — covered by the outer integration test. @pytest.mark.asyncio async def test_reasoning_only_stream_still_closes_block(self): diff --git a/autogpt_platform/backend/backend/copilot/moonshot.py b/autogpt_platform/backend/backend/copilot/moonshot.py new file mode 100644 index 0000000000..ae1fc1e9d4 --- /dev/null +++ b/autogpt_platform/backend/backend/copilot/moonshot.py @@ -0,0 +1,137 @@ +"""Moonshot-specific pricing and cache-control helpers. + +Moonshot's Kimi K2.x family is routed through OpenRouter's Anthropic-compat +shim — it speaks Anthropic's API shape but its pricing and cache behaviour +diverge from Anthropic in ways the Claude Agent SDK CLI and our baseline +cache-control gating don't handle on their own: + +* **Rate card** — NOT the canonical cost source. The authoritative number + for every OpenRouter-routed turn is the reconcile task + (:mod:`openrouter_cost`), which reads ``total_cost`` directly from + ``/api/v1/generation`` post-turn. 
This module exists purely so the + CLI's in-turn ``ResultMessage.total_cost_usd`` (which silently bills + Moonshot at Sonnet rates, ~5x the real Moonshot price because the CLI + pricing table only knows Anthropic) isn't left wildly wrong before the + reconcile fires AND so the reconcile's lookup-fail fallback records a + plausible Moonshot estimate rather than a Sonnet-rate overcharge. + Signal authority: reconcile >> this module's rate card >> CLI. + +* **Cache-control** — Anthropic and Moonshot both accept the + ``cache_control: {type: ephemeral}`` breakpoint on message blocks, but + our baseline path currently gates cache markers on an + ``anthropic/`` / ``claude`` name match because non-Anthropic providers + (OpenAI, Grok, Gemini) 400 on the unknown field. Moonshot's + Anthropic-compat endpoint silently accepts and honours the marker — + empirically boosts cache hit rate on continuation turns — but was + caught in the non-Anthropic branch of the original gate. + :func:`supports_cache_control` lets callers widen the gate to include + Moonshot without weakening the ``false`` answer for OpenAI et al. + +Detection is prefix-based (``moonshotai/``). Moonshot routes every Kimi +SKU through the same Anthropic-compat surface and currently prices them +identically, so a new ``moonshotai/kimi-k3.0`` slug transparently +inherits both the rate card and the cache-control gate without editing +this file. Per-slug overrides are in :data:`_RATE_OVERRIDES_USD_PER_MTOK` +for when Moonshot eventually splits prices. +""" + +from __future__ import annotations + +# All Moonshot slugs share these rates as of April 2026 — Moonshot prices +# every Kimi K2.x SKU at $0.60/$2.80 per million (input/output) via +# OpenRouter. Cache-read / cache-write discounts are NOT applied here: +# OpenRouter currently exposes only a single input price per Moonshot +# endpoint; the real billed amount (with cache savings) lands via the +# reconcile path. 
Keep in sync with https://platform.moonshot.ai/docs/pricing. +_DEFAULT_MOONSHOT_RATE_USD_PER_MTOK: tuple[float, float] = (0.60, 2.80) + +# Per-slug overrides for when Moonshot splits pricing across SKUs. Empty +# today — every slug matching ``moonshotai/`` falls back to +# :data:`_DEFAULT_MOONSHOT_RATE_USD_PER_MTOK`. +_RATE_OVERRIDES_USD_PER_MTOK: dict[str, tuple[float, float]] = {} + +# Vendor prefix — matches any OpenRouter slug Moonshot ships. Keep as a +# module constant so the prefix check stays in exactly one place. +_MOONSHOT_PREFIX = "moonshotai/" + + +def is_moonshot_model(model: str | None) -> bool: + """True when *model* is a Moonshot OpenRouter slug. + + Prefix match against ``moonshotai/`` covers every Kimi SKU Moonshot + ships today (``kimi-k2``, ``kimi-k2.5``, ``kimi-k2.6``, + ``kimi-k2-thinking``) plus any future SKU Moonshot publishes under + the same namespace. Used by both pricing and cache-control gating. + """ + return isinstance(model, str) and model.startswith(_MOONSHOT_PREFIX) + + +def rate_card_usd(model: str) -> tuple[float, float] | None: + """Return (input, output) $/Mtok for *model* or None if non-Moonshot. + + Looks up a per-slug override first, falling back to the shared + default for anything under ``moonshotai/``. Returns None for + non-Moonshot slugs so callers can skip the override safely. + """ + if not is_moonshot_model(model): + return None + return _RATE_OVERRIDES_USD_PER_MTOK.get(model, _DEFAULT_MOONSHOT_RATE_USD_PER_MTOK) + + +def override_cost_usd( + *, + model: str | None, + sdk_reported_usd: float, + prompt_tokens: int, + completion_tokens: int, + cache_read_tokens: int, + cache_creation_tokens: int, +) -> float: + """Recompute SDK turn cost from the Moonshot rate card. + + Not the canonical cost source — the OpenRouter ``/generation`` + reconcile (:mod:`openrouter_cost`) lands the authoritative billed + amount post-turn. This helper exists only to improve the CLI's + in-turn ``ResultMessage.total_cost_usd``: + + 1. 
So the ``cost_usd`` the client sees before the reconcile completes + isn't wildly wrong (the CLI would otherwise ship a Sonnet-rate + estimate, ~5x the real Moonshot bill). + 2. So the reconcile's own lookup-fail fallback records a plausible + Moonshot estimate rather than the CLI's Sonnet number. + + For Moonshot slugs we compute cost from the reported token counts; + for anything else (including Anthropic) we return the SDK number + unchanged — Anthropic slugs are priced accurately by the CLI. + + Cache read / creation tokens are folded into ``prompt_tokens`` at + the full input rate because Moonshot's rate card doesn't distinguish + them at the OpenRouter surface; the reconcile has the authoritative + discount accounting for turns where Moonshot's cache engaged. + """ + if model is None: + return sdk_reported_usd + rates = rate_card_usd(model) + if rates is None: + return sdk_reported_usd + input_rate, output_rate = rates + total_prompt = prompt_tokens + cache_read_tokens + cache_creation_tokens + return (total_prompt * input_rate + completion_tokens * output_rate) / 1_000_000 + + +def supports_cache_control(model: str | None) -> bool: + """True when *model* accepts Anthropic-style ``cache_control`` markers. + + The baseline path ships ``cache_control: {type: ephemeral}`` on the + system message and the final tool block to trigger Anthropic prompt + caching. Non-Anthropic providers (OpenAI, Grok, Gemini) 400 on the + unknown field — the default gate only allows Anthropic. + + Moonshot's Anthropic-compat endpoint honours the marker. Without + it Moonshot falls back to its own automatic prefix caching, which + drifts more readily between turns (internal testing saw 0/4 cache + hits across two continuation sessions). With explicit + ``cache_control`` the upstream cache hit rate rises to the same + ballpark as Anthropic's ~60-95% on continuations. 
+ """ + return is_moonshot_model(model) diff --git a/autogpt_platform/backend/backend/copilot/moonshot_test.py b/autogpt_platform/backend/backend/copilot/moonshot_test.py new file mode 100644 index 0000000000..a252be2ab8 --- /dev/null +++ b/autogpt_platform/backend/backend/copilot/moonshot_test.py @@ -0,0 +1,173 @@ +"""Unit tests for Moonshot pricing and cache-control helpers.""" + +from __future__ import annotations + +import pytest + +from backend.copilot.moonshot import ( + is_moonshot_model, + override_cost_usd, + rate_card_usd, + supports_cache_control, +) + + +class TestIsMoonshotModel: + """Prefix detection covers every Moonshot SKU without a slug list.""" + + @pytest.mark.parametrize( + "model", + [ + "moonshotai/kimi-k2.6", + "moonshotai/kimi-k2-thinking", + "moonshotai/kimi-k2.5", + "moonshotai/kimi-k2", + "moonshotai/kimi-k3.0", # Future SKU must match transparently. + ], + ) + def test_moonshot_slugs_match(self, model: str) -> None: + assert is_moonshot_model(model) is True + + @pytest.mark.parametrize( + "model", + [ + "anthropic/claude-sonnet-4.6", + "anthropic/claude-opus-4.7", + "openai/gpt-4o", + "google/gemini-2.5-flash", + "xai/grok-4", + "deepseek/deepseek-v3", + "", # Empty string — not Moonshot. + ], + ) + def test_non_moonshot_slugs_do_not_match(self, model: str) -> None: + assert is_moonshot_model(model) is False + + @pytest.mark.parametrize("model", [None, 123, ["moonshotai/kimi-k2.6"]]) + def test_non_string_returns_false(self, model) -> None: + # Type-robust: never raise on unexpected types; callers pass None. + assert is_moonshot_model(model) is False + + +class TestRateCardUsd: + """Rate card defaults to the shared Moonshot price for every SKU.""" + + def test_moonshot_default_rate(self) -> None: + assert rate_card_usd("moonshotai/kimi-k2.6") == (0.60, 2.80) + + def test_future_moonshot_sku_inherits_default(self) -> None: + # Verifies the prefix-based fallback — new SKUs don't need a code + # edit to get a reasonable rate card. 
+ assert rate_card_usd("moonshotai/kimi-k3.0") == (0.60, 2.80) + + def test_non_moonshot_returns_none(self) -> None: + assert rate_card_usd("anthropic/claude-sonnet-4.6") is None + assert rate_card_usd("openai/gpt-4o") is None + + +class TestOverrideCostUsd: + """Rate-card override replaces the CLI's Sonnet-rate estimate for + Moonshot turns; Anthropic and unknown slugs pass through unchanged.""" + + def test_moonshot_recomputes_from_rate_card(self) -> None: + """A 29.5K-prompt Kimi turn should land at ~$0.018 on the + Moonshot rate card, not the CLI's $0.09 Sonnet-rate estimate.""" + recomputed = override_cost_usd( + model="moonshotai/kimi-k2.6", + sdk_reported_usd=0.089862, # What the CLI reported (Sonnet price). + prompt_tokens=29564, + completion_tokens=78, + cache_read_tokens=0, + cache_creation_tokens=0, + ) + expected = (29564 * 0.60 + 78 * 2.80) / 1_000_000 + assert recomputed == pytest.approx(expected, rel=1e-9) + assert 0.017 < recomputed < 0.019 # Sanity against Moonshot's rate card. 
+ + def test_anthropic_passes_through(self) -> None: + """Anthropic slugs are priced accurately by the CLI already — + the override returns the SDK number unchanged.""" + assert ( + override_cost_usd( + model="anthropic/claude-sonnet-4.6", + sdk_reported_usd=0.089862, + prompt_tokens=29564, + completion_tokens=78, + cache_read_tokens=0, + cache_creation_tokens=0, + ) + == 0.089862 + ) + + def test_unknown_non_moonshot_passes_through(self) -> None: + """A non-Moonshot, non-Anthropic slug falls back to the SDK value + — best-effort rather than leaking a zero or a wrong rate card.""" + assert ( + override_cost_usd( + model="deepseek/deepseek-v3", + sdk_reported_usd=0.05, + prompt_tokens=10_000, + completion_tokens=500, + cache_read_tokens=0, + cache_creation_tokens=0, + ) + == 0.05 + ) + + def test_none_model_passes_through(self) -> None: + """Subscription mode sets model=None — return the SDK value.""" + assert ( + override_cost_usd( + model=None, + sdk_reported_usd=0.07, + prompt_tokens=100, + completion_tokens=10, + cache_read_tokens=0, + cache_creation_tokens=0, + ) + == 0.07 + ) + + def test_cache_tokens_priced_at_input_rate(self) -> None: + """OpenRouter's Moonshot endpoints don't expose a discounted + cached-input price — cache_read / cache_creation tokens are + priced at the full input rate. The reconcile path has the + authoritative discount for turns where Moonshot's cache engaged.""" + recomputed = override_cost_usd( + model="moonshotai/kimi-k2.6", + sdk_reported_usd=0.5, + prompt_tokens=1000, + completion_tokens=0, + cache_read_tokens=5000, + cache_creation_tokens=2000, + ) + expected = (1000 + 5000 + 2000) * 0.60 / 1_000_000 + assert recomputed == pytest.approx(expected, rel=1e-9) + + +class TestSupportsCacheControl: + """Gate for emitting ``cache_control: {type: ephemeral}`` on message + blocks. 
True for Moonshot (Anthropic-compat endpoint accepts it) + and False for everything else this module knows about — Anthropic + callers use their own ``_is_anthropic_model`` check which is + combined with this one into a wider gate.""" + + def test_moonshot_supports_cache_control(self) -> None: + assert supports_cache_control("moonshotai/kimi-k2.6") is True + + def test_future_moonshot_sku_supports_cache_control(self) -> None: + assert supports_cache_control("moonshotai/kimi-k3.0") is True + + @pytest.mark.parametrize( + "model", + [ + "openai/gpt-4o", + "google/gemini-2.5-flash", + "xai/grok-4", + "deepseek/deepseek-v3", + "", + None, + ], + ) + def test_non_moonshot_does_not_support_cache_control(self, model) -> None: + assert supports_cache_control(model) is False diff --git a/autogpt_platform/backend/backend/copilot/sdk/service.py b/autogpt_platform/backend/backend/copilot/sdk/service.py index d62ba2afff..6ccf495c18 100644 --- a/autogpt_platform/backend/backend/copilot/sdk/service.py +++ b/autogpt_platform/backend/backend/copilot/sdk/service.py @@ -57,6 +57,7 @@ from ..constants import ( from ..session_cleanup import prune_orphan_tool_calls from ..context import encode_cwd_for_cli, get_workspace_manager from ..graphiti.config import is_enabled_for_user +from ..moonshot import override_cost_usd as _override_cost_for_moonshot from ..model import ( ChatMessage, ChatSession, @@ -723,55 +724,6 @@ def _normalize_model_name(raw_model: str) -> str: return model.replace(".", "-") -# Per-million-token rates ($USD) for non-Anthropic OpenRouter slugs that -# the Claude Agent SDK CLI doesn't recognise. The CLI's bundled pricing -# table only knows Anthropic models — for anything else its -# ``ResultMessage.total_cost_usd`` silently falls back to Sonnet rates, -# over-billing by ~5x for cheaper models like Kimi K2.6. Values are taken -# directly from each provider's published rate card and must be kept in -# sync when prices change. 
Cache discounts are not applied — Kimi via -# OpenRouter does not currently expose a separate cached-input price. -_NON_ANTHROPIC_RATES_USD_PER_MTOK: dict[str, tuple[float, float]] = { - # vendor/model: (input_per_mtok, output_per_mtok) - "moonshotai/kimi-k2.6": (0.60, 2.80), - "moonshotai/kimi-k2-thinking": (0.60, 2.80), - "moonshotai/kimi-k2.5": (0.60, 2.80), - "moonshotai/kimi-k2": (0.60, 2.80), -} - - -def _override_cost_for_non_anthropic( - raw_model: str | None, - sdk_reported_usd: float, - prompt_tokens: int, - completion_tokens: int, - cache_read_tokens: int, - cache_creation_tokens: int, -) -> float: - """Recompute turn cost from a known rate card for non-Anthropic models. - - The Claude Agent SDK CLI's ``total_cost_usd`` is computed from a - static Anthropic pricing table baked into the binary — it doesn't - know Kimi/DeepSeek/etc rates and silently bills at Sonnet prices, - which would over-charge a Kimi-default deployment by ~5x. Mirror - the baseline path's behaviour by computing the real cost from the - token counts whenever we recognise the slug; otherwise trust the - SDK number (correct for Anthropic models, best-effort for unknown - providers). - """ - if raw_model is None: - return sdk_reported_usd - rates = _NON_ANTHROPIC_RATES_USD_PER_MTOK.get(raw_model) - if rates is None: - return sdk_reported_usd - input_rate, output_rate = rates - # Treat cache reads/creation as plain prompt tokens since OpenRouter - # does not currently report a discounted cached-input price for the - # tracked Moonshot endpoints. - total_prompt = prompt_tokens + cache_read_tokens + cache_creation_tokens - return (total_prompt * input_rate + completion_tokens * output_rate) / 1_000_000 - - def _resolve_sdk_model() -> str | None: """Resolve the model name for the Claude Agent SDK CLI. @@ -2354,11 +2306,17 @@ async def _run_stream_attempt( # Anthropic model (e.g. 
Kimi K2.6) the CLI doesn't
                     # know the real per-token price and silently falls
                     # back to Sonnet rates — over-billing the user ~5x.
-                    # Recompute from a known rate card for non-Anthropic
-                    # OpenRouter slugs so the cost row, the rate-limit
-                    # counter, and the UI cost display reflect reality.
-                    state.usage.cost_usd = _override_cost_for_non_anthropic(
-                        raw_model=getattr(state.options, "model", None),
+                    # ``override_cost_usd`` replaces the CLI number with
+                    # the Moonshot rate-card estimate when the slug
+                    # matches, otherwise returns the SDK-reported value
+                    # unchanged. Reconcile
+                    # (``record_turn_cost_from_openrouter``) overrides
+                    # both with the authoritative bill when every gen-ID
+                    # lookup succeeds; this estimate only ships as the
+                    # sync-path cost or the reconcile's lookup-fail
+                    # fallback.
+                    state.usage.cost_usd = _override_cost_for_moonshot(
+                        model=getattr(state.options, "model", None),
                         sdk_reported_usd=sdk_msg.total_cost_usd,
                         prompt_tokens=state.usage.prompt_tokens,
                         completion_tokens=state.usage.completion_tokens,
diff --git a/autogpt_platform/backend/backend/copilot/sdk/service_test.py b/autogpt_platform/backend/backend/copilot/sdk/service_test.py
index 7f53cb67b5..0e59a0e276 100644
--- a/autogpt_platform/backend/backend/copilot/sdk/service_test.py
+++ b/autogpt_platform/backend/backend/copilot/sdk/service_test.py
@@ -15,7 +15,6 @@ from .service import (
     _build_system_prompt_value,
     _is_sdk_disconnect_error,
     _normalize_model_name,
-    _override_cost_for_non_anthropic,
     _prepare_file_attachments,
     _resolve_sdk_model,
     _safe_close_sdk_client,
@@ -707,79 +706,3 @@ class TestIdleTimeoutConstant:

     def test_idle_timeout_is_10_min(self):
         assert _IDLE_TIMEOUT_SECONDS == 10 * 60
-
-
-class TestOverrideCostForNonAnthropic:
-    """Verifies that turn costs routed through OpenRouter to non-Anthropic
-    vendors use the platform's per-model rate card instead of the SDK
-    CLI's static Anthropic pricing table — which silently falls back to
-    Sonnet rates for unknown models and over-bills by ~5x."""
-
-    def test_kimi_cost_recomputed_from_rate_card(self):
-        """Kimi K2.6 @ $0.60 input / $2.80 output per MTok. 29564 prompt
-        tokens + 78 completion should land at ~$0.018, not $0.09 (Sonnet)."""
-        recomputed = _override_cost_for_non_anthropic(
-            raw_model="moonshotai/kimi-k2.6",
-            sdk_reported_usd=0.089862,  # what the SDK CLI reported (Sonnet price)
-            prompt_tokens=29564,
-            completion_tokens=78,
-            cache_read_tokens=0,
-            cache_creation_tokens=0,
-        )
-        expected = (29564 * 0.60 + 78 * 2.80) / 1_000_000
-        assert recomputed == pytest.approx(expected, rel=1e-9)
-        # Sanity-check against a hand-computed magnitude.
-        assert 0.017 < recomputed < 0.019
-
-    def test_anthropic_cost_unchanged(self):
-        """Anthropic slugs pass through the SDK-reported value since the
-        CLI's pricing table is correct for them."""
-        result = _override_cost_for_non_anthropic(
-            raw_model="anthropic/claude-sonnet-4.6",
-            sdk_reported_usd=0.089862,
-            prompt_tokens=29564,
-            completion_tokens=78,
-            cache_read_tokens=0,
-            cache_creation_tokens=0,
-        )
-        assert result == 0.089862
-
-    def test_unknown_non_anthropic_vendor_passes_through(self):
-        """A non-Anthropic slug not in the rate card falls back to the
-        SDK-reported value — best-effort rather than misleading zero."""
-        result = _override_cost_for_non_anthropic(
-            raw_model="deepseek/some-new-model",
-            sdk_reported_usd=0.05,
-            prompt_tokens=10000,
-            completion_tokens=500,
-            cache_read_tokens=0,
-            cache_creation_tokens=0,
-        )
-        assert result == 0.05
-
-    def test_none_model_passes_through(self):
-        """Subscription mode / no-model case returns the SDK value."""
-        result = _override_cost_for_non_anthropic(
-            raw_model=None,
-            sdk_reported_usd=0.07,
-            prompt_tokens=100,
-            completion_tokens=10,
-            cache_read_tokens=0,
-            cache_creation_tokens=0,
-        )
-        assert result == 0.07
-
-    def test_cache_tokens_folded_into_prompt(self):
-        """Since the Moonshot endpoints don't report discounted cached-
-        input pricing, cache_read/creation tokens are priced at the same
-        rate as regular prompt tokens."""
-        recomputed = _override_cost_for_non_anthropic(
-            raw_model="moonshotai/kimi-k2.6",
-            sdk_reported_usd=0.5,
-            prompt_tokens=1000,
-            completion_tokens=0,
-            cache_read_tokens=5000,
-            cache_creation_tokens=2000,
-        )
-        expected = (1000 + 5000 + 2000) * 0.60 / 1_000_000
-        assert recomputed == pytest.approx(expected, rel=1e-9)
diff --git a/autogpt_platform/backend/backend/copilot/service.py b/autogpt_platform/backend/backend/copilot/service.py
index b0399f87e3..447bb6b5b2 100644
--- a/autogpt_platform/backend/backend/copilot/service.py
+++ b/autogpt_platform/backend/backend/copilot/service.py
@@ -34,6 +34,7 @@ from .model import (
     update_session_title,
     upsert_chat_session,
 )
+from .token_tracking import persist_and_record_usage

 logger = logging.getLogger(__name__)

@@ -498,6 +499,13 @@ async def _generate_session_title(
 ) -> str | None:
     """Generate a concise title for a chat session based on the first message.

+    Also persists the title-generation call's cost to ``PlatformCostLog``
+    so the admin dashboard's provider totals match the real OpenRouter
+    bill. Before this, the title LLM call (a background task, one per
+    session) bypassed the main turn's reconcile entirely and silently
+    wasn't tracked — low per-call cost but 100% of sessions pay it, so
+    it adds up.
+
     Args:
         message: The first user message in the session
         user_id: User ID for OpenRouter tracing (optional)
@@ -507,8 +515,11 @@ async def _generate_session_title(
         A short title (3-6 words) or None if generation fails
     """
     try:
-        # Build extra_body for OpenRouter tracing and PostHog analytics
-        extra_body: dict[str, Any] = {}
+        # Build extra_body for OpenRouter tracing and PostHog analytics.
+        # ``usage: {"include": True}`` asks OR to embed the real billed
+        # cost into the final usage chunk — matches the baseline path's
+        # ``_OPENROUTER_INCLUDE_USAGE_COST`` pattern, same read path.
+        extra_body: dict[str, Any] = {"usage": {"include": True}}
         if user_id:
             extra_body["user"] = user_id[:128]  # OpenRouter limit
             extra_body["posthogDistinctId"] = user_id
@@ -534,6 +545,17 @@ async def _generate_session_title(
             max_tokens=20,
             extra_body=extra_body,
         )
+
+        # Best-effort cost capture — the title call is a one-shot LLM
+        # round we'd otherwise miss in admin totals. Runs inside the
+        # ``try`` block so a persist failure downgrades to the existing
+        # "title generation failed" warning path rather than raising.
+        await _record_title_generation_cost(
+            response=response,
+            user_id=user_id,
+            session_id=session_id,
+        )
+
         title = response.choices[0].message.content
         if title:
             # Clean up the title
@@ -548,6 +570,63 @@ async def _generate_session_title(
         return None


+async def _record_title_generation_cost(
+    *,
+    response: Any,
+    user_id: str | None,
+    session_id: str | None,
+) -> None:
+    """Persist the title LLM call's cost to ``PlatformCostLog``.
+
+    Title generation runs in a background task per-session — low cost
+    (~$0.0001 per title) but 100% of sessions pay it. Without this the
+    admin dashboard under-reports total provider spend by the aggregate
+    of those calls. Separate ``block_name="copilot:title"`` so the row
+    is clearly distinguishable from the turn's main ``copilot:SDK`` /
+    ``copilot:baseline`` attributions.
+
+    Best-effort: any error downgrades to a debug log so title generation
+    itself never breaks on a cost-tracking hiccup.
+    """
+    try:
+        usage = getattr(response, "usage", None)
+        if usage is None:
+            return
+        prompt_tokens = getattr(usage, "prompt_tokens", 0) or 0
+        completion_tokens = getattr(usage, "completion_tokens", 0) or 0
+        # OR piggybacks the real billed cost on ``usage.cost`` when
+        # ``extra_body={"usage":{"include":true}}`` — stashed in
+        # pydantic's ``model_extra``. Absent for non-OR routes.
+        extras = getattr(usage, "model_extra", None) or {}
+        cost_raw = extras.get("cost") if isinstance(extras, dict) else None
+        cost_usd: float | None
+        try:
+            cost_usd = float(cost_raw) if cost_raw is not None else None
+        except (TypeError, ValueError):
+            cost_usd = None
+
+        # ``persist_and_record_usage`` needs the session object to
+        # append the usage row to the session's per-turn usage list —
+        # load it lazily so the hot title path doesn't pay the DB round
+        # trip unless a cost actually needs recording.
+        session = None
+        if session_id and user_id:
+            session = await get_chat_session(session_id, user_id)
+
+        await persist_and_record_usage(
+            session=session,
+            user_id=user_id,
+            prompt_tokens=prompt_tokens,
+            completion_tokens=completion_tokens,
+            log_prefix="[title]",
+            cost_usd=cost_usd,
+            model=config.title_model,
+            provider="open_router",
+        )
+    except Exception as exc:  # noqa: BLE001
+        logger.debug("Title cost tracking skipped: %s", exc)
+
+
 async def _update_title_async(
     session_id: str, message: str, user_id: str | None = None
 ) -> None: