refactor(backend/copilot): isolate Moonshot quirks + enable Moonshot cache_control + track title cost

## Why Three loose ends from the Kimi SDK-default merge (#12878): 1. Kimi-specific pricing logic (rate card, cost-override helper) lived inline in sdk/service.py next to unrelated SDK plumbing. Any future non-Anthropic vendor would have piled into the same file. 2. Moonshot's Anthropic-compat endpoint honours ``cache_control: {type: ephemeral}`` markers, but the baseline cache-marking gate (``_is_anthropic_model``) was narrow enough to exclude it — so Moonshot fell back to its automatic prefix cache, which drifts readily between turns (internal testing: 0/4 hits across two continuation sessions). 3. Title generation (``_update_title_async``, runs per-session) makes its own LLM call that the turn reconcile never sees — admin cost dashboard under-reports provider spend by the aggregate of those calls. ## What - New ``backend/copilot/moonshot.py`` module: * ``is_moonshot_model(model)`` — prefix check against ``moonshotai/`` * ``rate_card_usd(model)`` — published Moonshot rates, default ``(0.60, 2.80)`` per MTok, per-slug overrides slot for future SKUs * ``override_cost_usd(...)`` — moved from sdk/service.py, replaces the CLI's Sonnet-rate Kimi estimate with the real rate card * ``supports_cache_control(model)`` — gate for Anthropic-style cache markers on Moonshot routes * Explicit docstring: rate card is NOT canonical — authoritative cost comes from the OpenRouter ``/generation`` reconcile; this module only improves the in-turn estimate and the reconcile's lookup-fail fallback. - Baseline ``_is_anthropic_model`` stays narrow (still only Anthropic — callers needing the ``anthropic-beta`` header use it as-is). New ``_supports_prompt_cache_markers`` widens the gate to Anthropic OR Moonshot; both call sites that emit ``cache_control`` on the system message and the final tool schema switch to this wider gate. OpenAI / Grok / Gemini still 400 on the field, so those providers keep the ``false`` answer. - Title generation now captures its own cost via ``persist_and_record_usage`` with ``provider="open_router"`` and ``block_name="copilot:title"``, so the admin dashboard sees every dollar we spend. OR's ``usage.cost`` is read off ``usage.model_extra`` (the pydantic container) using the same pattern as tools/web_search. - TODO marker on the rate-card call site: after ~1 week of production data confirming OR ``/generation`` reconcile reliability, drop the rate card entirely and rely on the authoritative number for all Kimi turns. ## How - Detection is prefix-based (``moonshotai/``) — every Kimi SKU today uses that namespace and shares pricing, so a future ``moonshotai/kimi-k3.0`` inherits both the rate card and the cache-control gate transparently without editing this file. - ``_mark_tools_with_cache_control`` and the per-round cached system message builder already exist; the only change is swapping the gate from ``_is_anthropic_model`` to ``_supports_prompt_cache_markers``. - Title cost capture is in-line with the existing ``_generate_session_title`` call (``extra_body={"usage":{"include":True}}`` asks OR to embed the real billed cost); a best-effort ``_record_title_generation_cost`` helper reads it off ``usage.model_extra`` and fires ``persist_and_record_usage`` under ``try/except`` — any cost-tracking failure downgrades to debug log so title generation itself never breaks. ## Deferred to follow-up - Kimi reasoning renders after text on dev because Moonshot's shim splits each turn into separate ``AssistantMessage`` summaries (one text-only, one thinking-only) — the in-message hoist at ``response_adapter.py:193`` can't reorder across messages. Fix needs more design (UX trade-off between realtime streaming and correct ordering); investigating separately.
2026-04-30 03:00:41 -04:00 · 2026-04-22 19:02:58 +07:00
parent b98bcf31c8
commit dbc3b0b31b
7 changed files with 480 additions and 169 deletions
--- a/autogpt_platform/backend/backend/copilot/baseline/service.py
+++ b/autogpt_platform/backend/backend/copilot/baseline/service.py
@@ -38,6 +38,7 @@ from backend.copilot.builder_context import (
 from backend.copilot.config import CopilotLlmModel, CopilotMode
 from backend.copilot.context import get_workspace_manager, set_execution_context
 from backend.copilot.graphiti.config import is_enabled_for_user
+from backend.copilot.moonshot import is_moonshot_model
 from backend.copilot.model import (
    ChatMessage,
    ChatSession,
@@ -428,25 +429,39 @@ def _emit_all(
 def _is_anthropic_model(model: str) -> bool:
    """Return True if *model* routes to Anthropic (native or via OpenRouter).

-    Cache-control markers on message content + the ``anthropic-beta`` header
-    are Anthropic-specific.  OpenAI rejects the unknown ``cache_control``
-    field with a 400 ("Extra inputs are not permitted") and Grok / other
-    providers behave similarly.  OpenRouter strips unknown headers but
-    passes through ``cache_control`` on the body regardless of provider —
-    which would also fail when OpenRouter routes to a non-Anthropic model.
-
    Examples that return True:
      - ``anthropic/claude-sonnet-4-6`` (OpenRouter route)
      - ``claude-3-5-sonnet-20241022`` (direct Anthropic API)
      - ``anthropic.claude-3-5-sonnet`` (Bedrock-style)

    False for ``openai/gpt-4o``, ``google/gemini-2.5-pro``, ``xai/grok-4``
-    etc.
+    etc.  Moonshot is False here too even though Moonshot's
+    Anthropic-compat endpoint honours ``cache_control`` — use
+    :func:`_supports_prompt_cache_markers` for the cache-gating decision,
+    which also allows Moonshot routes.  This function stays scoped to
+    "genuinely Anthropic" so callers that need the stricter check (e.g.
+    ``anthropic-beta`` header emission) keep their existing semantics.
    """
    lowered = model.lower()
    return "claude" in lowered or lowered.startswith("anthropic")


+def _supports_prompt_cache_markers(model: str) -> bool:
+    """Return True when *model* accepts Anthropic-style ``cache_control``.
+
+    Superset of :func:`_is_anthropic_model` — also allows Moonshot
+    (``moonshotai/*``), whose OpenRouter Anthropic-compat endpoint
+    honours the marker and empirically lifts cache hit rate on
+    continuation turns from near-zero (Moonshot's own automatic prefix
+    cache, which drifts readily) to the 60-95% Anthropic ballpark.
+
+    OpenAI / Grok / Gemini still 400 on ``cache_control``, so this
+    function returns False for those providers — add new vendors here
+    only after verifying their endpoint accepts the field.
+    """
+    return _is_anthropic_model(model) or is_moonshot_model(model)
+
+
 def _fresh_ephemeral_cache_control() -> dict[str, str]:
    """Return a FRESH ephemeral ``cache_control`` dict each call.

@@ -567,19 +582,24 @@ async def _baseline_llm_caller(
    round_text = ""
    try:
        client = _get_openai_client()
-        # Cache markers are Anthropic-specific.  For OpenAI/Grok/other
-        # providers, leaving them on would trigger a 400 ("Extra inputs
-        # are not permitted" on cache_control).  Tools were precomputed
-        # in stream_chat_completion_baseline via _mark_tools_with_cache_control
-        # (only when the model was Anthropic), so on non-Anthropic routes
-        # tools ship without cache_control on the last entry too.
+        # Cache markers are accepted by Anthropic AND Moonshot (via OR's
+        # Anthropic-compat endpoint).  OpenAI/Grok/Gemini 400 on the
+        # unknown ``cache_control`` field — tools were precomputed in
+        # stream_chat_completion_baseline via _mark_tools_with_cache_control
+        # with the same gate, so on unsupported routes tools ship
+        # unmarked too.
        #
-        # `extra_body` `usage.include=true` asks OpenRouter to embed the real
-        # generation cost into the final usage chunk — required by the
-        # cost-based rate limiter in routes.py.  Separate from the Anthropic
+        # The ``anthropic-beta`` header is only emitted for genuinely
+        # Anthropic routes (see :func:`_is_anthropic_model`) — Moonshot
+        # doesn't need the beta header; sending it is a no-op but we
+        # keep the check strict for clarity.
+        #
+        # `extra_body` `usage.include=true` asks OpenRouter to embed the
+        # real generation cost into the final usage chunk — required by
+        # the cost-based rate limiter in routes.py.  Separate from the
        # caching headers, always sent.
-        is_anthropic = _is_anthropic_model(state.model)
-        if is_anthropic:
+        supports_cache = _supports_prompt_cache_markers(state.model)
+        if supports_cache:
            # Build the cached system dict once per session and splice it in
            # on each round.  The full ``messages`` list grows with every
            # tool call, so copying the entire list just to mutate index 0
@@ -595,7 +615,11 @@ async def _baseline_llm_caller(
                final_messages = [state.cached_system_message, *messages[1:]]
            else:
                final_messages = messages
-            extra_headers = _fresh_anthropic_caching_headers()
+            extra_headers = (
+                _fresh_anthropic_caching_headers()
+                if _is_anthropic_model(state.model)
+                else None
+            )
        else:
            final_messages = messages
            extra_headers = None
@@ -1638,9 +1662,10 @@ async def stream_chat_completion_baseline(
    # _baseline_llm_caller) avoids re-copying ~43 tool dicts on every LLM
    # round of the tool-call loop.
    #
-    # Only apply to Anthropic routes — OpenAI/Grok/other providers would
-    # 400 on the unknown ``cache_control`` field inside tool definitions.
-    if _is_anthropic_model(active_model):
+    # Applies to Anthropic AND Moonshot routes — OpenAI/Grok/Gemini 400
+    # on the unknown ``cache_control`` field inside tool definitions, so
+    # the gate stays narrow (see :func:`_supports_prompt_cache_markers`).
+    if _supports_prompt_cache_markers(active_model):
        tools = cast(
            list[ChatCompletionToolParam], _mark_tools_with_cache_control(tools)
        )
--- a/autogpt_platform/backend/backend/copilot/baseline/service_unit_test.py
+++ b/autogpt_platform/backend/backend/copilot/baseline/service_unit_test.py
@@ -1864,12 +1864,14 @@ class TestBaselineReasoningStreaming:
        assert "reasoning" not in extra_body

    @pytest.mark.asyncio
-    async def test_kimi_route_sends_reasoning_but_no_cache_control(self):
-        """Kimi K2.6 is the default fast_model and sends ``reasoning`` via
-        OpenRouter's unified extension.  It must NOT receive ``cache_control``
-        markers or the ``anthropic-beta`` header — Moonshot uses its own
-        auto-caching and those Anthropic-only fields would either get
-        silently dropped or (worst case) 400 on a future provider change."""
+    async def test_kimi_route_sends_reasoning_and_cache_control(self):
+        """Kimi K2.6 (Moonshot via OpenRouter's Anthropic-compat endpoint)
+        accepts ``cache_control: {type: ephemeral}`` on the system block
+        and the last tool — the endpoint honours the marker and lifts
+        cache hit rate on continuation turns from near-zero (Moonshot's
+        auto-caching drifts) to the Anthropic ~60-95% ballpark.  The
+        ``anthropic-beta`` header stays off because Moonshot doesn't need
+        it; OpenRouter would strip the unknown header anyway."""
        state = _BaselineStreamState(model="moonshotai/kimi-k2.6")

        mock_client = MagicMock()
@@ -1901,15 +1903,29 @@ class TestBaselineReasoningStreaming:
        # cheap-but-still-reasoning-capable path.
        assert "reasoning" in extra_body
        assert extra_body["reasoning"]["max_tokens"] > 0
-        # Anthropic-only fields stay off.
-        assert "extra_headers" not in call_kwargs
+        # No ``anthropic-beta`` header — that beta is specifically for
+        # native Anthropic endpoints; Moonshot's shim accepts
+        # ``cache_control`` without it, and sending it would be wasted
+        # bytes (OR strips it before forwarding to Moonshot).
+        assert "extra_headers" not in call_kwargs or not call_kwargs.get(
+            "extra_headers"
+        )
+        # System block MUST carry ``cache_control`` so Moonshot's cache
+        # breakpoint is honoured.  The cached system-message builder
+        # emits list-shape content with the marker on the first (and
+        # only) block — assert on that shape.
        sys_msg = call_kwargs["messages"][0]
        sys_content = sys_msg.get("content")
-        if isinstance(sys_content, list):
-            assert all("cache_control" not in block for block in sys_content)
-        tools = call_kwargs.get("tools", [])
-        for t in tools:
-            assert "cache_control" not in t
+        assert isinstance(
+            sys_content, list
+        ), "Cached system message should be a list-shape content block"
+        assert any(
+            "cache_control" in block for block in sys_content if isinstance(block, dict)
+        ), "Kimi system message should now carry cache_control markers"
+        # Tool-level cache marking is applied by ``stream_chat_completion_baseline``
+        # (see ``_mark_tools_with_cache_control``) before tools reach
+        # ``_baseline_llm_caller``, so this unit test doesn't exercise
+        # that path — covered by the outer integration test.

    @pytest.mark.asyncio
    async def test_reasoning_only_stream_still_closes_block(self):
--- a/autogpt_platform/backend/backend/copilot/moonshot.py
+++ b/autogpt_platform/backend/backend/copilot/moonshot.py
@@ -0,0 +1,137 @@
+"""Moonshot-specific pricing and cache-control helpers.
+
+Moonshot's Kimi K2.x family is routed through OpenRouter's Anthropic-compat
+shim — it speaks Anthropic's API shape but its pricing and cache behaviour
+diverge from Anthropic in ways the Claude Agent SDK CLI and our baseline
+cache-control gating don't handle on their own:
+
+* **Rate card** — NOT the canonical cost source.  The authoritative number
+  for every OpenRouter-routed turn is the reconcile task
+  (:mod:`openrouter_cost`), which reads ``total_cost`` directly from
+  ``/api/v1/generation`` post-turn.  This module exists purely so the
+  CLI's in-turn ``ResultMessage.total_cost_usd`` (which silently bills
+  Moonshot at Sonnet rates, ~5x the real Moonshot price because the CLI
+  pricing table only knows Anthropic) isn't left wildly wrong before the
+  reconcile fires AND so the reconcile's lookup-fail fallback records a
+  plausible Moonshot estimate rather than a Sonnet-rate overcharge.
+  Signal authority: reconcile >> this module's rate card >> CLI.
+
+* **Cache-control** — Anthropic and Moonshot both accept the
+  ``cache_control: {type: ephemeral}`` breakpoint on message blocks, but
+  our baseline path currently gates cache markers on an
+  ``anthropic/`` / ``claude`` name match because non-Anthropic providers
+  (OpenAI, Grok, Gemini) 400 on the unknown field.  Moonshot's
+  Anthropic-compat endpoint silently accepts and honours the marker —
+  empirically boosts cache hit rate on continuation turns — but was
+  caught in the non-Anthropic branch of the original gate.
+  :func:`supports_cache_control` lets callers widen the gate to include
+  Moonshot without weakening the ``false`` answer for OpenAI et al.
+
+Detection is prefix-based (``moonshotai/``).  Moonshot routes every Kimi
+SKU through the same Anthropic-compat surface and currently prices them
+identically, so a new ``moonshotai/kimi-k3.0`` slug transparently
+inherits both the rate card and the cache-control gate without editing
+this file.  Per-slug overrides are in :data:`_RATE_OVERRIDES_USD_PER_MTOK`
+for when Moonshot eventually splits prices.
+"""
+
+from __future__ import annotations
+
+# All Moonshot slugs share these rates as of April 2026 — Moonshot prices
+# every Kimi K2.x SKU at $0.60/$2.80 per million (input/output) via
+# OpenRouter.  Cache-read / cache-write discounts are NOT applied here:
+# OpenRouter currently exposes only a single input price per Moonshot
+# endpoint; the real billed amount (with cache savings) lands via the
+# reconcile path.  Keep in sync with https://platform.moonshot.ai/docs/pricing.
+_DEFAULT_MOONSHOT_RATE_USD_PER_MTOK: tuple[float, float] = (0.60, 2.80)
+
+# Per-slug overrides for when Moonshot splits pricing across SKUs.  Empty
+# today — every slug matching ``moonshotai/`` falls back to
+# :data:`_DEFAULT_MOONSHOT_RATE_USD_PER_MTOK`.
+_RATE_OVERRIDES_USD_PER_MTOK: dict[str, tuple[float, float]] = {}
+
+# Vendor prefix — matches any OpenRouter slug Moonshot ships.  Keep as a
+# module constant so the prefix check stays in exactly one place.
+_MOONSHOT_PREFIX = "moonshotai/"
+
+
+def is_moonshot_model(model: str | None) -> bool:
+    """True when *model* is a Moonshot OpenRouter slug.
+
+    Prefix match against ``moonshotai/`` covers every Kimi SKU Moonshot
+    ships today (``kimi-k2``, ``kimi-k2.5``, ``kimi-k2.6``,
+    ``kimi-k2-thinking``) plus any future SKU Moonshot publishes under
+    the same namespace.  Used by both pricing and cache-control gating.
+    """
+    return isinstance(model, str) and model.startswith(_MOONSHOT_PREFIX)
+
+
+def rate_card_usd(model: str) -> tuple[float, float] | None:
+    """Return (input, output) $/Mtok for *model* or None if non-Moonshot.
+
+    Looks up a per-slug override first, falling back to the shared
+    default for anything under ``moonshotai/``.  Returns None for
+    non-Moonshot slugs so callers can skip the override safely.
+    """
+    if not is_moonshot_model(model):
+        return None
+    return _RATE_OVERRIDES_USD_PER_MTOK.get(model, _DEFAULT_MOONSHOT_RATE_USD_PER_MTOK)
+
+
+def override_cost_usd(
+    *,
+    model: str | None,
+    sdk_reported_usd: float,
+    prompt_tokens: int,
+    completion_tokens: int,
+    cache_read_tokens: int,
+    cache_creation_tokens: int,
+) -> float:
+    """Recompute SDK turn cost from the Moonshot rate card.
+
+    Not the canonical cost source — the OpenRouter ``/generation``
+    reconcile (:mod:`openrouter_cost`) lands the authoritative billed
+    amount post-turn.  This helper exists only to improve the CLI's
+    in-turn ``ResultMessage.total_cost_usd``:
+
+    1. So the ``cost_usd`` the client sees before the reconcile completes
+       isn't wildly wrong (the CLI would otherwise ship a Sonnet-rate
+       estimate, ~5x the real Moonshot bill).
+    2. So the reconcile's own lookup-fail fallback records a plausible
+       Moonshot estimate rather than the CLI's Sonnet number.
+
+    For Moonshot slugs we compute cost from the reported token counts;
+    for anything else (including Anthropic) we return the SDK number
+    unchanged — Anthropic slugs are priced accurately by the CLI.
+
+    Cache read / creation tokens are folded into ``prompt_tokens`` at
+    the full input rate because Moonshot's rate card doesn't distinguish
+    them at the OpenRouter surface; the reconcile has the authoritative
+    discount accounting for turns where Moonshot's cache engaged.
+    """
+    if model is None:
+        return sdk_reported_usd
+    rates = rate_card_usd(model)
+    if rates is None:
+        return sdk_reported_usd
+    input_rate, output_rate = rates
+    total_prompt = prompt_tokens + cache_read_tokens + cache_creation_tokens
+    return (total_prompt * input_rate + completion_tokens * output_rate) / 1_000_000
+
+
+def supports_cache_control(model: str | None) -> bool:
+    """True when *model* accepts Anthropic-style ``cache_control`` markers.
+
+    The baseline path ships ``cache_control: {type: ephemeral}`` on the
+    system message and the final tool block to trigger Anthropic prompt
+    caching.  Non-Anthropic providers (OpenAI, Grok, Gemini) 400 on the
+    unknown field — the default gate only allows Anthropic.
+
+    Moonshot's Anthropic-compat endpoint honours the marker.  Without
+    it Moonshot falls back to its own automatic prefix caching, which
+    drifts more readily between turns (internal testing saw 0/4 cache
+    hits across two continuation sessions).  With explicit
+    ``cache_control`` the upstream cache hit rate rises to the same
+    ballpark as Anthropic's ~60-95% on continuations.
+    """
+    return is_moonshot_model(model)
--- a/autogpt_platform/backend/backend/copilot/moonshot_test.py
+++ b/autogpt_platform/backend/backend/copilot/moonshot_test.py
@@ -0,0 +1,173 @@
+"""Unit tests for Moonshot pricing and cache-control helpers."""
+
+from __future__ import annotations
+
+import pytest
+
+from backend.copilot.moonshot import (
+    is_moonshot_model,
+    override_cost_usd,
+    rate_card_usd,
+    supports_cache_control,
+)
+
+
+class TestIsMoonshotModel:
+    """Prefix detection covers every Moonshot SKU without a slug list."""
+
+    @pytest.mark.parametrize(
+        "model",
+        [
+            "moonshotai/kimi-k2.6",
+            "moonshotai/kimi-k2-thinking",
+            "moonshotai/kimi-k2.5",
+            "moonshotai/kimi-k2",
+            "moonshotai/kimi-k3.0",  # Future SKU must match transparently.
+        ],
+    )
+    def test_moonshot_slugs_match(self, model: str) -> None:
+        assert is_moonshot_model(model) is True
+
+    @pytest.mark.parametrize(
+        "model",
+        [
+            "anthropic/claude-sonnet-4.6",
+            "anthropic/claude-opus-4.7",
+            "openai/gpt-4o",
+            "google/gemini-2.5-flash",
+            "xai/grok-4",
+            "deepseek/deepseek-v3",
+            "",  # Empty string — not Moonshot.
+        ],
+    )
+    def test_non_moonshot_slugs_do_not_match(self, model: str) -> None:
+        assert is_moonshot_model(model) is False
+
+    @pytest.mark.parametrize("model", [None, 123, ["moonshotai/kimi-k2.6"]])
+    def test_non_string_returns_false(self, model) -> None:
+        # Type-robust: never raise on unexpected types; callers pass None.
+        assert is_moonshot_model(model) is False
+
+
+class TestRateCardUsd:
+    """Rate card defaults to the shared Moonshot price for every SKU."""
+
+    def test_moonshot_default_rate(self) -> None:
+        assert rate_card_usd("moonshotai/kimi-k2.6") == (0.60, 2.80)
+
+    def test_future_moonshot_sku_inherits_default(self) -> None:
+        # Verifies the prefix-based fallback — new SKUs don't need a code
+        # edit to get a reasonable rate card.
+        assert rate_card_usd("moonshotai/kimi-k3.0") == (0.60, 2.80)
+
+    def test_non_moonshot_returns_none(self) -> None:
+        assert rate_card_usd("anthropic/claude-sonnet-4.6") is None
+        assert rate_card_usd("openai/gpt-4o") is None
+
+
+class TestOverrideCostUsd:
+    """Rate-card override replaces the CLI's Sonnet-rate estimate for
+    Moonshot turns; Anthropic and unknown slugs pass through unchanged."""
+
+    def test_moonshot_recomputes_from_rate_card(self) -> None:
+        """A 29.5K-prompt Kimi turn should land at ~$0.018 on the
+        Moonshot rate card, not the CLI's $0.09 Sonnet-rate estimate."""
+        recomputed = override_cost_usd(
+            model="moonshotai/kimi-k2.6",
+            sdk_reported_usd=0.089862,  # What the CLI reported (Sonnet price).
+            prompt_tokens=29564,
+            completion_tokens=78,
+            cache_read_tokens=0,
+            cache_creation_tokens=0,
+        )
+        expected = (29564 * 0.60 + 78 * 2.80) / 1_000_000
+        assert recomputed == pytest.approx(expected, rel=1e-9)
+        assert 0.017 < recomputed < 0.019  # Sanity against Moonshot's rate card.
+
+    def test_anthropic_passes_through(self) -> None:
+        """Anthropic slugs are priced accurately by the CLI already —
+        the override returns the SDK number unchanged."""
+        assert (
+            override_cost_usd(
+                model="anthropic/claude-sonnet-4.6",
+                sdk_reported_usd=0.089862,
+                prompt_tokens=29564,
+                completion_tokens=78,
+                cache_read_tokens=0,
+                cache_creation_tokens=0,
+            )
+            == 0.089862
+        )
+
+    def test_unknown_non_moonshot_passes_through(self) -> None:
+        """A non-Moonshot, non-Anthropic slug falls back to the SDK value
+        — best-effort rather than leaking a zero or a wrong rate card."""
+        assert (
+            override_cost_usd(
+                model="deepseek/deepseek-v3",
+                sdk_reported_usd=0.05,
+                prompt_tokens=10_000,
+                completion_tokens=500,
+                cache_read_tokens=0,
+                cache_creation_tokens=0,
+            )
+            == 0.05
+        )
+
+    def test_none_model_passes_through(self) -> None:
+        """Subscription mode sets model=None — return the SDK value."""
+        assert (
+            override_cost_usd(
+                model=None,
+                sdk_reported_usd=0.07,
+                prompt_tokens=100,
+                completion_tokens=10,
+                cache_read_tokens=0,
+                cache_creation_tokens=0,
+            )
+            == 0.07
+        )
+
+    def test_cache_tokens_priced_at_input_rate(self) -> None:
+        """OpenRouter's Moonshot endpoints don't expose a discounted
+        cached-input price — cache_read / cache_creation tokens are
+        priced at the full input rate.  The reconcile path has the
+        authoritative discount for turns where Moonshot's cache engaged."""
+        recomputed = override_cost_usd(
+            model="moonshotai/kimi-k2.6",
+            sdk_reported_usd=0.5,
+            prompt_tokens=1000,
+            completion_tokens=0,
+            cache_read_tokens=5000,
+            cache_creation_tokens=2000,
+        )
+        expected = (1000 + 5000 + 2000) * 0.60 / 1_000_000
+        assert recomputed == pytest.approx(expected, rel=1e-9)
+
+
+class TestSupportsCacheControl:
+    """Gate for emitting ``cache_control: {type: ephemeral}`` on message
+    blocks.  True for Moonshot (Anthropic-compat endpoint accepts it)
+    and False for everything else this module knows about — Anthropic
+    callers use their own ``_is_anthropic_model`` check which is
+    combined with this one into a wider gate."""
+
+    def test_moonshot_supports_cache_control(self) -> None:
+        assert supports_cache_control("moonshotai/kimi-k2.6") is True
+
+    def test_future_moonshot_sku_supports_cache_control(self) -> None:
+        assert supports_cache_control("moonshotai/kimi-k3.0") is True
+
+    @pytest.mark.parametrize(
+        "model",
+        [
+            "openai/gpt-4o",
+            "google/gemini-2.5-flash",
+            "xai/grok-4",
+            "deepseek/deepseek-v3",
+            "",
+            None,
+        ],
+    )
+    def test_non_moonshot_does_not_support_cache_control(self, model) -> None:
+        assert supports_cache_control(model) is False
--- a/autogpt_platform/backend/backend/copilot/sdk/service.py
+++ b/autogpt_platform/backend/backend/copilot/sdk/service.py
@@ -57,6 +57,7 @@ from ..constants import (
 from ..session_cleanup import prune_orphan_tool_calls
 from ..context import encode_cwd_for_cli, get_workspace_manager
 from ..graphiti.config import is_enabled_for_user
+from ..moonshot import override_cost_usd as _override_cost_for_moonshot
 from ..model import (
    ChatMessage,
    ChatSession,
@@ -723,55 +724,6 @@ def _normalize_model_name(raw_model: str) -> str:
    return model.replace(".", "-")


-# Per-million-token rates ($USD) for non-Anthropic OpenRouter slugs that
-# the Claude Agent SDK CLI doesn't recognise.  The CLI's bundled pricing
-# table only knows Anthropic models — for anything else its
-# ``ResultMessage.total_cost_usd`` silently falls back to Sonnet rates,
-# over-billing by ~5x for cheaper models like Kimi K2.6.  Values are taken
-# directly from each provider's published rate card and must be kept in
-# sync when prices change.  Cache discounts are not applied — Kimi via
-# OpenRouter does not currently expose a separate cached-input price.
-_NON_ANTHROPIC_RATES_USD_PER_MTOK: dict[str, tuple[float, float]] = {
-    # vendor/model: (input_per_mtok, output_per_mtok)
-    "moonshotai/kimi-k2.6": (0.60, 2.80),
-    "moonshotai/kimi-k2-thinking": (0.60, 2.80),
-    "moonshotai/kimi-k2.5": (0.60, 2.80),
-    "moonshotai/kimi-k2": (0.60, 2.80),
-}
-
-
-def _override_cost_for_non_anthropic(
-    raw_model: str | None,
-    sdk_reported_usd: float,
-    prompt_tokens: int,
-    completion_tokens: int,
-    cache_read_tokens: int,
-    cache_creation_tokens: int,
-) -> float:
-    """Recompute turn cost from a known rate card for non-Anthropic models.
-
-    The Claude Agent SDK CLI's ``total_cost_usd`` is computed from a
-    static Anthropic pricing table baked into the binary — it doesn't
-    know Kimi/DeepSeek/etc rates and silently bills at Sonnet prices,
-    which would over-charge a Kimi-default deployment by ~5x.  Mirror
-    the baseline path's behaviour by computing the real cost from the
-    token counts whenever we recognise the slug; otherwise trust the
-    SDK number (correct for Anthropic models, best-effort for unknown
-    providers).
-    """
-    if raw_model is None:
-        return sdk_reported_usd
-    rates = _NON_ANTHROPIC_RATES_USD_PER_MTOK.get(raw_model)
-    if rates is None:
-        return sdk_reported_usd
-    input_rate, output_rate = rates
-    # Treat cache reads/creation as plain prompt tokens since OpenRouter
-    # does not currently report a discounted cached-input price for the
-    # tracked Moonshot endpoints.
-    total_prompt = prompt_tokens + cache_read_tokens + cache_creation_tokens
-    return (total_prompt * input_rate + completion_tokens * output_rate) / 1_000_000
-
-
 def _resolve_sdk_model() -> str | None:
    """Resolve the model name for the Claude Agent SDK CLI.

@@ -2354,11 +2306,17 @@ async def _run_stream_attempt(
                    # Anthropic model (e.g. Kimi K2.6) the CLI doesn't
                    # know the real per-token price and silently falls
                    # back to Sonnet rates — over-billing the user ~5x.
-                    # Recompute from a known rate card for non-Anthropic
-                    # OpenRouter slugs so the cost row, the rate-limit
-                    # counter, and the UI cost display reflect reality.
-                    state.usage.cost_usd = _override_cost_for_non_anthropic(
-                        raw_model=getattr(state.options, "model", None),
+                    # ``override_cost_usd`` replaces the CLI number with
+                    # the Moonshot rate-card estimate when the slug
+                    # matches, otherwise returns the SDK-reported value
+                    # unchanged.  Reconcile
+                    # (``record_turn_cost_from_openrouter``) overrides
+                    # both with the authoritative bill when every gen-ID
+                    # lookup succeeds; this estimate only ships as the
+                    # sync-path cost or the reconcile's lookup-fail
+                    # fallback.
+                    state.usage.cost_usd = _override_cost_for_moonshot(
+                        model=getattr(state.options, "model", None),
                        sdk_reported_usd=sdk_msg.total_cost_usd,
                        prompt_tokens=state.usage.prompt_tokens,
                        completion_tokens=state.usage.completion_tokens,
--- a/autogpt_platform/backend/backend/copilot/sdk/service_test.py
+++ b/autogpt_platform/backend/backend/copilot/sdk/service_test.py
@@ -15,7 +15,6 @@ from .service import (
    _build_system_prompt_value,
    _is_sdk_disconnect_error,
    _normalize_model_name,
-    _override_cost_for_non_anthropic,
    _prepare_file_attachments,
    _resolve_sdk_model,
    _safe_close_sdk_client,
@@ -707,79 +706,3 @@ class TestIdleTimeoutConstant:

    def test_idle_timeout_is_10_min(self):
        assert _IDLE_TIMEOUT_SECONDS == 10 * 60
-
-
-class TestOverrideCostForNonAnthropic:
-    """Verifies that turn costs routed through OpenRouter to non-Anthropic
-    vendors use the platform's per-model rate card instead of the SDK
-    CLI's static Anthropic pricing table — which silently falls back to
-    Sonnet rates for unknown models and over-bills by ~5x."""
-
-    def test_kimi_cost_recomputed_from_rate_card(self):
-        """Kimi K2.6 @ $0.60 input / $2.80 output per MTok.  29564 prompt
-        tokens + 78 completion should land at ~$0.018, not $0.09 (Sonnet)."""
-        recomputed = _override_cost_for_non_anthropic(
-            raw_model="moonshotai/kimi-k2.6",
-            sdk_reported_usd=0.089862,  # what the SDK CLI reported (Sonnet price)
-            prompt_tokens=29564,
-            completion_tokens=78,
-            cache_read_tokens=0,
-            cache_creation_tokens=0,
-        )
-        expected = (29564 * 0.60 + 78 * 2.80) / 1_000_000
-        assert recomputed == pytest.approx(expected, rel=1e-9)
-        # Sanity-check against a hand-computed magnitude.
-        assert 0.017 < recomputed < 0.019
-
-    def test_anthropic_cost_unchanged(self):
-        """Anthropic slugs pass through the SDK-reported value since the
-        CLI's pricing table is correct for them."""
-        result = _override_cost_for_non_anthropic(
-            raw_model="anthropic/claude-sonnet-4.6",
-            sdk_reported_usd=0.089862,
-            prompt_tokens=29564,
-            completion_tokens=78,
-            cache_read_tokens=0,
-            cache_creation_tokens=0,
-        )
-        assert result == 0.089862
-
-    def test_unknown_non_anthropic_vendor_passes_through(self):
-        """A non-Anthropic slug not in the rate card falls back to the
-        SDK-reported value — best-effort rather than misleading zero."""
-        result = _override_cost_for_non_anthropic(
-            raw_model="deepseek/some-new-model",
-            sdk_reported_usd=0.05,
-            prompt_tokens=10000,
-            completion_tokens=500,
-            cache_read_tokens=0,
-            cache_creation_tokens=0,
-        )
-        assert result == 0.05
-
-    def test_none_model_passes_through(self):
-        """Subscription mode / no-model case returns the SDK value."""
-        result = _override_cost_for_non_anthropic(
-            raw_model=None,
-            sdk_reported_usd=0.07,
-            prompt_tokens=100,
-            completion_tokens=10,
-            cache_read_tokens=0,
-            cache_creation_tokens=0,
-        )
-        assert result == 0.07
-
-    def test_cache_tokens_folded_into_prompt(self):
-        """Since the Moonshot endpoints don't report discounted cached-
-        input pricing, cache_read/creation tokens are priced at the same
-        rate as regular prompt tokens."""
-        recomputed = _override_cost_for_non_anthropic(
-            raw_model="moonshotai/kimi-k2.6",
-            sdk_reported_usd=0.5,
-            prompt_tokens=1000,
-            completion_tokens=0,
-            cache_read_tokens=5000,
-            cache_creation_tokens=2000,
-        )
-        expected = (1000 + 5000 + 2000) * 0.60 / 1_000_000
-        assert recomputed == pytest.approx(expected, rel=1e-9)
--- a/autogpt_platform/backend/backend/copilot/service.py
+++ b/autogpt_platform/backend/backend/copilot/service.py
@@ -34,6 +34,7 @@ from .model import (
    update_session_title,
    upsert_chat_session,
 )
+from .token_tracking import persist_and_record_usage

 logger = logging.getLogger(__name__)

@@ -498,6 +499,13 @@ async def _generate_session_title(
 ) -> str | None:
    """Generate a concise title for a chat session based on the first message.

+    Also persists the title-generation call's cost to ``PlatformCostLog``
+    so the admin dashboard's provider totals match the real OpenRouter
+    bill.  Before this, the title LLM call (a background task, one per
+    session) bypassed the main turn's reconcile entirely and silently
+    wasn't tracked — low per-call cost but 100% of sessions pay it, so
+    it adds up.
+
    Args:
        message: The first user message in the session
        user_id: User ID for OpenRouter tracing (optional)
@@ -507,8 +515,11 @@ async def _generate_session_title(
        A short title (3-6 words) or None if generation fails
    """
    try:
-        # Build extra_body for OpenRouter tracing and PostHog analytics
-        extra_body: dict[str, Any] = {}
+        # Build extra_body for OpenRouter tracing and PostHog analytics.
+        # ``usage: {"include": True}`` asks OR to embed the real billed
+        # cost into the final usage chunk — matches the baseline path's
+        # ``_OPENROUTER_INCLUDE_USAGE_COST`` pattern, same read path.
+        extra_body: dict[str, Any] = {"usage": {"include": True}}
        if user_id:
            extra_body["user"] = user_id[:128]  # OpenRouter limit
            extra_body["posthogDistinctId"] = user_id
@@ -534,6 +545,17 @@ async def _generate_session_title(
            max_tokens=20,
            extra_body=extra_body,
        )
+
+        # Best-effort cost capture — the title call is a one-shot LLM
+        # round we'd otherwise miss in admin totals.  Runs inside the
+        # ``try`` block so a persist failure downgrades to the existing
+        # "title generation failed" warning path rather than raising.
+        await _record_title_generation_cost(
+            response=response,
+            user_id=user_id,
+            session_id=session_id,
+        )
+
        title = response.choices[0].message.content
        if title:
            # Clean up the title
@@ -548,6 +570,63 @@ async def _generate_session_title(
        return None


+async def _record_title_generation_cost(
+    *,
+    response: Any,
+    user_id: str | None,
+    session_id: str | None,
+) -> None:
+    """Persist the title LLM call's cost to ``PlatformCostLog``.
+
+    Title generation runs in a background task per-session — low cost
+    (~$0.0001 per title) but 100% of sessions pay it.  Without this the
+    admin dashboard under-reports total provider spend by the aggregate
+    of those calls.  Separate ``block_name="copilot:title"`` so the row
+    is clearly distinguishable from the turn's main ``copilot:SDK`` /
+    ``copilot:baseline`` attributions.
+
+    Best-effort: any error downgrades to a debug log so title generation
+    itself never breaks on a cost-tracking hiccup.
+    """
+    try:
+        usage = getattr(response, "usage", None)
+        if usage is None:
+            return
+        prompt_tokens = getattr(usage, "prompt_tokens", 0) or 0
+        completion_tokens = getattr(usage, "completion_tokens", 0) or 0
+        # OR piggybacks the real billed cost on ``usage.cost`` when
+        # ``extra_body={"usage":{"include":true}}`` — stashed in
+        # pydantic's ``model_extra``.  Absent for non-OR routes.
+        extras = getattr(usage, "model_extra", None) or {}
+        cost_raw = extras.get("cost") if isinstance(extras, dict) else None
+        cost_usd: float | None
+        try:
+            cost_usd = float(cost_raw) if cost_raw is not None else None
+        except (TypeError, ValueError):
+            cost_usd = None
+
+        # ``persist_and_record_usage`` needs the session object to
+        # append the usage row to the session's per-turn usage list —
+        # load it lazily so the hot title path doesn't pay the DB round
+        # trip unless a cost actually needs recording.
+        session = None
+        if session_id and user_id:
+            session = await get_chat_session(session_id, user_id)
+
+        await persist_and_record_usage(
+            session=session,
+            user_id=user_id,
+            prompt_tokens=prompt_tokens,
+            completion_tokens=completion_tokens,
+            log_prefix="[title]",
+            cost_usd=cost_usd,
+            model=config.title_model,
+            provider="open_router",
+        )
+    except Exception as exc:  # noqa: BLE001
+        logger.debug("Title cost tracking skipped: %s", exc)
+
+
 async def _update_title_async(
    session_id: str, message: str, user_id: str | None = None
 ) -> None: