refactor(backend/copilot): isolate Moonshot quirks + enable Moonshot cache_control + track title cost

## Why

Three loose ends from the Kimi SDK-default merge (#12878):

1. Kimi-specific pricing logic (rate card, cost-override helper) lived
   inline in sdk/service.py next to unrelated SDK plumbing.  Any future
   non-Anthropic vendor would have piled into the same file.
2. Moonshot's Anthropic-compat endpoint honours ``cache_control:
   {type: ephemeral}`` markers, but the baseline cache-marking gate
   (``_is_anthropic_model``) was narrow enough to exclude it — so
   Moonshot fell back to its automatic prefix cache, which drifts
   readily between turns (internal testing: 0/4 hits across two
   continuation sessions).
3. Title generation (``_update_title_async``, runs per-session) makes
   its own LLM call that the turn reconcile never sees — admin cost
   dashboard under-reports provider spend by the aggregate of those
   calls.

## What

- New ``backend/copilot/moonshot.py`` module:
    * ``is_moonshot_model(model)`` — prefix check against ``moonshotai/``
    * ``rate_card_usd(model)`` — published Moonshot rates, default
      ``(0.60, 2.80)`` per MTok, per-slug overrides slot for future SKUs
    * ``override_cost_usd(...)`` — moved from sdk/service.py, replaces
      the CLI's Sonnet-rate Kimi estimate with the real rate card
    * ``supports_cache_control(model)`` — gate for Anthropic-style
      cache markers on Moonshot routes
    * Explicit docstring: rate card is NOT canonical — authoritative
      cost comes from the OpenRouter ``/generation`` reconcile; this
      module only improves the in-turn estimate and the reconcile's
      lookup-fail fallback.
- Baseline ``_is_anthropic_model`` stays narrow (still only Anthropic —
  callers needing the ``anthropic-beta`` header use it as-is).  New
  ``_supports_prompt_cache_markers`` widens the gate to Anthropic OR
  Moonshot; both call sites that emit ``cache_control`` on the system
  message and the final tool schema switch to this wider gate.  OpenAI
  / Grok / Gemini still 400 on the field, so those providers keep the
  ``false`` answer.
- Title generation now captures its own cost via
  ``persist_and_record_usage`` with ``provider="open_router"`` and
  ``block_name="copilot:title"``, so the admin dashboard sees every
  dollar we spend.  OR's ``usage.cost`` is read off ``usage.model_extra``
  (the pydantic container) using the same pattern as tools/web_search.
- TODO marker on the rate-card call site: after ~1 week of production
  data confirming OR ``/generation`` reconcile reliability, drop the
  rate card entirely and rely on the authoritative number for all
  Kimi turns.

## How

- Detection is prefix-based (``moonshotai/``) — every Kimi SKU today
  uses that namespace and shares pricing, so a future
  ``moonshotai/kimi-k3.0`` inherits both the rate card and the
  cache-control gate transparently without editing this file.
- ``_mark_tools_with_cache_control`` and the per-round cached system
  message builder already exist; the only change is swapping the gate
  from ``_is_anthropic_model`` to ``_supports_prompt_cache_markers``.
- Title cost capture is in-line with the existing ``_generate_session_title``
  call (``extra_body={"usage":{"include":True}}`` asks OR to embed the
  real billed cost); a best-effort ``_record_title_generation_cost``
  helper reads it off ``usage.model_extra`` and fires
  ``persist_and_record_usage`` under ``try/except`` — any cost-tracking
  failure downgrades to debug log so title generation itself never
  breaks.

## Deferred to follow-up

- Kimi reasoning renders after text on dev because Moonshot's shim
  splits each turn into separate ``AssistantMessage`` summaries (one
  text-only, one thinking-only) — the in-message hoist at
  ``response_adapter.py:193`` can't reorder across messages.  Fix
  needs more design (UX trade-off between realtime streaming and
  correct ordering); investigating separately.
This commit is contained in:
Zamil Majdy
2026-04-22 19:02:58 +07:00
parent b98bcf31c8
commit dbc3b0b31b
7 changed files with 480 additions and 169 deletions

View File

@@ -38,6 +38,7 @@ from backend.copilot.builder_context import (
from backend.copilot.config import CopilotLlmModel, CopilotMode
from backend.copilot.context import get_workspace_manager, set_execution_context
from backend.copilot.graphiti.config import is_enabled_for_user
from backend.copilot.moonshot import is_moonshot_model
from backend.copilot.model import (
ChatMessage,
ChatSession,
@@ -428,25 +429,39 @@ def _emit_all(
def _is_anthropic_model(model: str) -> bool:
"""Return True if *model* routes to Anthropic (native or via OpenRouter).
Cache-control markers on message content + the ``anthropic-beta`` header
are Anthropic-specific. OpenAI rejects the unknown ``cache_control``
field with a 400 ("Extra inputs are not permitted") and Grok / other
providers behave similarly. OpenRouter strips unknown headers but
passes through ``cache_control`` on the body regardless of provider —
which would also fail when OpenRouter routes to a non-Anthropic model.
Examples that return True:
- ``anthropic/claude-sonnet-4-6`` (OpenRouter route)
- ``claude-3-5-sonnet-20241022`` (direct Anthropic API)
- ``anthropic.claude-3-5-sonnet`` (Bedrock-style)
False for ``openai/gpt-4o``, ``google/gemini-2.5-pro``, ``xai/grok-4``
etc.
etc. Moonshot is False here too even though Moonshot's
Anthropic-compat endpoint honours ``cache_control`` — use
:func:`_supports_prompt_cache_markers` for the cache-gating decision,
which also allows Moonshot routes. This function stays scoped to
"genuinely Anthropic" so callers that need the stricter check (e.g.
``anthropic-beta`` header emission) keep their existing semantics.
"""
lowered = model.lower()
return "claude" in lowered or lowered.startswith("anthropic")
def _supports_prompt_cache_markers(model: str) -> bool:
"""Return True when *model* accepts Anthropic-style ``cache_control``.
Superset of :func:`_is_anthropic_model` — also allows Moonshot
(``moonshotai/*``), whose OpenRouter Anthropic-compat endpoint
honours the marker and empirically lifts cache hit rate on
continuation turns from near-zero (Moonshot's own automatic prefix
cache, which drifts readily) to the 60-95% Anthropic ballpark.
OpenAI / Grok / Gemini still 400 on ``cache_control``, so this
function returns False for those providers — add new vendors here
only after verifying their endpoint accepts the field.
"""
return _is_anthropic_model(model) or is_moonshot_model(model)
def _fresh_ephemeral_cache_control() -> dict[str, str]:
"""Return a FRESH ephemeral ``cache_control`` dict each call.
@@ -567,19 +582,24 @@ async def _baseline_llm_caller(
round_text = ""
try:
client = _get_openai_client()
# Cache markers are Anthropic-specific. For OpenAI/Grok/other
# providers, leaving them on would trigger a 400 ("Extra inputs
# are not permitted" on cache_control). Tools were precomputed
# in stream_chat_completion_baseline via _mark_tools_with_cache_control
# (only when the model was Anthropic), so on non-Anthropic routes
# tools ship without cache_control on the last entry too.
# Cache markers are accepted by Anthropic AND Moonshot (via OR's
# Anthropic-compat endpoint). OpenAI/Grok/Gemini 400 on the
# unknown ``cache_control`` field — tools were precomputed in
# stream_chat_completion_baseline via _mark_tools_with_cache_control
# with the same gate, so on unsupported routes tools ship
# unmarked too.
#
# `extra_body` `usage.include=true` asks OpenRouter to embed the real
# generation cost into the final usage chunk — required by the
# cost-based rate limiter in routes.py. Separate from the Anthropic
# The ``anthropic-beta`` header is only emitted for genuinely
# Anthropic routes (see :func:`_is_anthropic_model`) — Moonshot
# doesn't need the beta header; sending it is a no-op but we
# keep the check strict for clarity.
#
# `extra_body` `usage.include=true` asks OpenRouter to embed the
# real generation cost into the final usage chunk — required by
# the cost-based rate limiter in routes.py. Separate from the
# caching headers, always sent.
is_anthropic = _is_anthropic_model(state.model)
if is_anthropic:
supports_cache = _supports_prompt_cache_markers(state.model)
if supports_cache:
# Build the cached system dict once per session and splice it in
# on each round. The full ``messages`` list grows with every
# tool call, so copying the entire list just to mutate index 0
@@ -595,7 +615,11 @@ async def _baseline_llm_caller(
final_messages = [state.cached_system_message, *messages[1:]]
else:
final_messages = messages
extra_headers = _fresh_anthropic_caching_headers()
extra_headers = (
_fresh_anthropic_caching_headers()
if _is_anthropic_model(state.model)
else None
)
else:
final_messages = messages
extra_headers = None
@@ -1638,9 +1662,10 @@ async def stream_chat_completion_baseline(
# _baseline_llm_caller) avoids re-copying ~43 tool dicts on every LLM
# round of the tool-call loop.
#
# Only apply to Anthropic routes — OpenAI/Grok/other providers would
# 400 on the unknown ``cache_control`` field inside tool definitions.
if _is_anthropic_model(active_model):
# Applies to Anthropic AND Moonshot routes — OpenAI/Grok/Gemini 400
# on the unknown ``cache_control`` field inside tool definitions, so
# the gate stays narrow (see :func:`_supports_prompt_cache_markers`).
if _supports_prompt_cache_markers(active_model):
tools = cast(
list[ChatCompletionToolParam], _mark_tools_with_cache_control(tools)
)

View File

@@ -1864,12 +1864,14 @@ class TestBaselineReasoningStreaming:
assert "reasoning" not in extra_body
@pytest.mark.asyncio
async def test_kimi_route_sends_reasoning_but_no_cache_control(self):
"""Kimi K2.6 is the default fast_model and sends ``reasoning`` via
OpenRouter's unified extension. It must NOT receive ``cache_control``
markers or the ``anthropic-beta`` header — Moonshot uses its own
auto-caching and those Anthropic-only fields would either get
silently dropped or (worst case) 400 on a future provider change."""
async def test_kimi_route_sends_reasoning_and_cache_control(self):
"""Kimi K2.6 (Moonshot via OpenRouter's Anthropic-compat endpoint)
accepts ``cache_control: {type: ephemeral}`` on the system block
and the last tool — the endpoint honours the marker and lifts
cache hit rate on continuation turns from near-zero (Moonshot's
auto-caching drifts) to the Anthropic ~60-95% ballpark. The
``anthropic-beta`` header stays off because Moonshot doesn't need
it; OpenRouter would strip the unknown header anyway."""
state = _BaselineStreamState(model="moonshotai/kimi-k2.6")
mock_client = MagicMock()
@@ -1901,15 +1903,29 @@ class TestBaselineReasoningStreaming:
# cheap-but-still-reasoning-capable path.
assert "reasoning" in extra_body
assert extra_body["reasoning"]["max_tokens"] > 0
# Anthropic-only fields stay off.
assert "extra_headers" not in call_kwargs
# No ``anthropic-beta`` header — that beta is specifically for
# native Anthropic endpoints; Moonshot's shim accepts
# ``cache_control`` without it, and sending it would be wasted
# bytes (OR strips it before forwarding to Moonshot).
assert "extra_headers" not in call_kwargs or not call_kwargs.get(
"extra_headers"
)
# System block MUST carry ``cache_control`` so Moonshot's cache
# breakpoint is honoured. The cached system-message builder
# emits list-shape content with the marker on the first (and
# only) block — assert on that shape.
sys_msg = call_kwargs["messages"][0]
sys_content = sys_msg.get("content")
if isinstance(sys_content, list):
assert all("cache_control" not in block for block in sys_content)
tools = call_kwargs.get("tools", [])
for t in tools:
assert "cache_control" not in t
assert isinstance(
sys_content, list
), "Cached system message should be a list-shape content block"
assert any(
"cache_control" in block for block in sys_content if isinstance(block, dict)
), "Kimi system message should now carry cache_control markers"
# Tool-level cache marking is applied by ``stream_chat_completion_baseline``
# (see ``_mark_tools_with_cache_control``) before tools reach
# ``_baseline_llm_caller``, so this unit test doesn't exercise
# that path — covered by the outer integration test.
@pytest.mark.asyncio
async def test_reasoning_only_stream_still_closes_block(self):

View File

@@ -0,0 +1,137 @@
"""Moonshot-specific pricing and cache-control helpers.
Moonshot's Kimi K2.x family is routed through OpenRouter's Anthropic-compat
shim — it speaks Anthropic's API shape but its pricing and cache behaviour
diverge from Anthropic in ways the Claude Agent SDK CLI and our baseline
cache-control gating don't handle on their own:
* **Rate card** — NOT the canonical cost source. The authoritative number
for every OpenRouter-routed turn is the reconcile task
(:mod:`openrouter_cost`), which reads ``total_cost`` directly from
``/api/v1/generation`` post-turn. This module exists purely so the
CLI's in-turn ``ResultMessage.total_cost_usd`` (which silently bills
Moonshot at Sonnet rates, ~5x the real Moonshot price because the CLI
pricing table only knows Anthropic) isn't left wildly wrong before the
reconcile fires AND so the reconcile's lookup-fail fallback records a
plausible Moonshot estimate rather than a Sonnet-rate overcharge.
Signal authority: reconcile >> this module's rate card >> CLI.
* **Cache-control** — Anthropic and Moonshot both accept the
``cache_control: {type: ephemeral}`` breakpoint on message blocks, but
our baseline path currently gates cache markers on an
``anthropic/`` / ``claude`` name match because non-Anthropic providers
(OpenAI, Grok, Gemini) 400 on the unknown field. Moonshot's
Anthropic-compat endpoint silently accepts and honours the marker —
empirically boosts cache hit rate on continuation turns — but was
caught in the non-Anthropic branch of the original gate.
:func:`supports_cache_control` lets callers widen the gate to include
Moonshot without weakening the ``false`` answer for OpenAI et al.
Detection is prefix-based (``moonshotai/``). Moonshot routes every Kimi
SKU through the same Anthropic-compat surface and currently prices them
identically, so a new ``moonshotai/kimi-k3.0`` slug transparently
inherits both the rate card and the cache-control gate without editing
this file. Per-slug overrides are in :data:`_RATE_OVERRIDES_USD_PER_MTOK`
for when Moonshot eventually splits prices.
"""
from __future__ import annotations
# All Moonshot slugs share these rates as of April 2026 — Moonshot prices
# every Kimi K2.x SKU at $0.60/$2.80 per million (input/output) via
# OpenRouter. Cache-read / cache-write discounts are NOT applied here:
# OpenRouter currently exposes only a single input price per Moonshot
# endpoint; the real billed amount (with cache savings) lands via the
# reconcile path. Keep in sync with https://platform.moonshot.ai/docs/pricing.
_DEFAULT_MOONSHOT_RATE_USD_PER_MTOK: tuple[float, float] = (0.60, 2.80)
# Per-slug overrides for when Moonshot splits pricing across SKUs. Empty
# today — every slug matching ``moonshotai/`` falls back to
# :data:`_DEFAULT_MOONSHOT_RATE_USD_PER_MTOK`.
_RATE_OVERRIDES_USD_PER_MTOK: dict[str, tuple[float, float]] = {}
# Vendor prefix — matches any OpenRouter slug Moonshot ships. Keep as a
# module constant so the prefix check stays in exactly one place.
_MOONSHOT_PREFIX = "moonshotai/"
def is_moonshot_model(model: str | None) -> bool:
"""True when *model* is a Moonshot OpenRouter slug.
Prefix match against ``moonshotai/`` covers every Kimi SKU Moonshot
ships today (``kimi-k2``, ``kimi-k2.5``, ``kimi-k2.6``,
``kimi-k2-thinking``) plus any future SKU Moonshot publishes under
the same namespace. Used by both pricing and cache-control gating.
"""
return isinstance(model, str) and model.startswith(_MOONSHOT_PREFIX)
def rate_card_usd(model: str) -> tuple[float, float] | None:
"""Return (input, output) $/Mtok for *model* or None if non-Moonshot.
Looks up a per-slug override first, falling back to the shared
default for anything under ``moonshotai/``. Returns None for
non-Moonshot slugs so callers can skip the override safely.
"""
if not is_moonshot_model(model):
return None
return _RATE_OVERRIDES_USD_PER_MTOK.get(model, _DEFAULT_MOONSHOT_RATE_USD_PER_MTOK)
def override_cost_usd(
*,
model: str | None,
sdk_reported_usd: float,
prompt_tokens: int,
completion_tokens: int,
cache_read_tokens: int,
cache_creation_tokens: int,
) -> float:
"""Recompute SDK turn cost from the Moonshot rate card.
Not the canonical cost source — the OpenRouter ``/generation``
reconcile (:mod:`openrouter_cost`) lands the authoritative billed
amount post-turn. This helper exists only to improve the CLI's
in-turn ``ResultMessage.total_cost_usd``:
1. So the ``cost_usd`` the client sees before the reconcile completes
isn't wildly wrong (the CLI would otherwise ship a Sonnet-rate
estimate, ~5x the real Moonshot bill).
2. So the reconcile's own lookup-fail fallback records a plausible
Moonshot estimate rather than the CLI's Sonnet number.
For Moonshot slugs we compute cost from the reported token counts;
for anything else (including Anthropic) we return the SDK number
unchanged — Anthropic slugs are priced accurately by the CLI.
Cache read / creation tokens are folded into ``prompt_tokens`` at
the full input rate because Moonshot's rate card doesn't distinguish
them at the OpenRouter surface; the reconcile has the authoritative
discount accounting for turns where Moonshot's cache engaged.
"""
if model is None:
return sdk_reported_usd
rates = rate_card_usd(model)
if rates is None:
return sdk_reported_usd
input_rate, output_rate = rates
total_prompt = prompt_tokens + cache_read_tokens + cache_creation_tokens
return (total_prompt * input_rate + completion_tokens * output_rate) / 1_000_000
def supports_cache_control(model: str | None) -> bool:
"""True when *model* accepts Anthropic-style ``cache_control`` markers.
The baseline path ships ``cache_control: {type: ephemeral}`` on the
system message and the final tool block to trigger Anthropic prompt
caching. Non-Anthropic providers (OpenAI, Grok, Gemini) 400 on the
unknown field — the default gate only allows Anthropic.
Moonshot's Anthropic-compat endpoint honours the marker. Without
it Moonshot falls back to its own automatic prefix caching, which
drifts more readily between turns (internal testing saw 0/4 cache
hits across two continuation sessions). With explicit
``cache_control`` the upstream cache hit rate rises to the same
ballpark as Anthropic's ~60-95% on continuations.
"""
return is_moonshot_model(model)

View File

@@ -0,0 +1,173 @@
"""Unit tests for Moonshot pricing and cache-control helpers."""
from __future__ import annotations
import pytest
from backend.copilot.moonshot import (
is_moonshot_model,
override_cost_usd,
rate_card_usd,
supports_cache_control,
)
class TestIsMoonshotModel:
"""Prefix detection covers every Moonshot SKU without a slug list."""
@pytest.mark.parametrize(
"model",
[
"moonshotai/kimi-k2.6",
"moonshotai/kimi-k2-thinking",
"moonshotai/kimi-k2.5",
"moonshotai/kimi-k2",
"moonshotai/kimi-k3.0", # Future SKU must match transparently.
],
)
def test_moonshot_slugs_match(self, model: str) -> None:
assert is_moonshot_model(model) is True
@pytest.mark.parametrize(
"model",
[
"anthropic/claude-sonnet-4.6",
"anthropic/claude-opus-4.7",
"openai/gpt-4o",
"google/gemini-2.5-flash",
"xai/grok-4",
"deepseek/deepseek-v3",
"", # Empty string — not Moonshot.
],
)
def test_non_moonshot_slugs_do_not_match(self, model: str) -> None:
assert is_moonshot_model(model) is False
@pytest.mark.parametrize("model", [None, 123, ["moonshotai/kimi-k2.6"]])
def test_non_string_returns_false(self, model) -> None:
# Type-robust: never raise on unexpected types; callers pass None.
assert is_moonshot_model(model) is False
class TestRateCardUsd:
"""Rate card defaults to the shared Moonshot price for every SKU."""
def test_moonshot_default_rate(self) -> None:
assert rate_card_usd("moonshotai/kimi-k2.6") == (0.60, 2.80)
def test_future_moonshot_sku_inherits_default(self) -> None:
# Verifies the prefix-based fallback — new SKUs don't need a code
# edit to get a reasonable rate card.
assert rate_card_usd("moonshotai/kimi-k3.0") == (0.60, 2.80)
def test_non_moonshot_returns_none(self) -> None:
assert rate_card_usd("anthropic/claude-sonnet-4.6") is None
assert rate_card_usd("openai/gpt-4o") is None
class TestOverrideCostUsd:
"""Rate-card override replaces the CLI's Sonnet-rate estimate for
Moonshot turns; Anthropic and unknown slugs pass through unchanged."""
def test_moonshot_recomputes_from_rate_card(self) -> None:
"""A 29.5K-prompt Kimi turn should land at ~$0.018 on the
Moonshot rate card, not the CLI's $0.09 Sonnet-rate estimate."""
recomputed = override_cost_usd(
model="moonshotai/kimi-k2.6",
sdk_reported_usd=0.089862, # What the CLI reported (Sonnet price).
prompt_tokens=29564,
completion_tokens=78,
cache_read_tokens=0,
cache_creation_tokens=0,
)
expected = (29564 * 0.60 + 78 * 2.80) / 1_000_000
assert recomputed == pytest.approx(expected, rel=1e-9)
assert 0.017 < recomputed < 0.019 # Sanity against Moonshot's rate card.
def test_anthropic_passes_through(self) -> None:
"""Anthropic slugs are priced accurately by the CLI already —
the override returns the SDK number unchanged."""
assert (
override_cost_usd(
model="anthropic/claude-sonnet-4.6",
sdk_reported_usd=0.089862,
prompt_tokens=29564,
completion_tokens=78,
cache_read_tokens=0,
cache_creation_tokens=0,
)
== 0.089862
)
def test_unknown_non_moonshot_passes_through(self) -> None:
"""A non-Moonshot, non-Anthropic slug falls back to the SDK value
— best-effort rather than leaking a zero or a wrong rate card."""
assert (
override_cost_usd(
model="deepseek/deepseek-v3",
sdk_reported_usd=0.05,
prompt_tokens=10_000,
completion_tokens=500,
cache_read_tokens=0,
cache_creation_tokens=0,
)
== 0.05
)
def test_none_model_passes_through(self) -> None:
"""Subscription mode sets model=None — return the SDK value."""
assert (
override_cost_usd(
model=None,
sdk_reported_usd=0.07,
prompt_tokens=100,
completion_tokens=10,
cache_read_tokens=0,
cache_creation_tokens=0,
)
== 0.07
)
def test_cache_tokens_priced_at_input_rate(self) -> None:
"""OpenRouter's Moonshot endpoints don't expose a discounted
cached-input price — cache_read / cache_creation tokens are
priced at the full input rate. The reconcile path has the
authoritative discount for turns where Moonshot's cache engaged."""
recomputed = override_cost_usd(
model="moonshotai/kimi-k2.6",
sdk_reported_usd=0.5,
prompt_tokens=1000,
completion_tokens=0,
cache_read_tokens=5000,
cache_creation_tokens=2000,
)
expected = (1000 + 5000 + 2000) * 0.60 / 1_000_000
assert recomputed == pytest.approx(expected, rel=1e-9)
class TestSupportsCacheControl:
"""Gate for emitting ``cache_control: {type: ephemeral}`` on message
blocks. True for Moonshot (Anthropic-compat endpoint accepts it)
and False for everything else this module knows about — Anthropic
callers use their own ``_is_anthropic_model`` check which is
combined with this one into a wider gate."""
def test_moonshot_supports_cache_control(self) -> None:
assert supports_cache_control("moonshotai/kimi-k2.6") is True
def test_future_moonshot_sku_supports_cache_control(self) -> None:
assert supports_cache_control("moonshotai/kimi-k3.0") is True
@pytest.mark.parametrize(
"model",
[
"openai/gpt-4o",
"google/gemini-2.5-flash",
"xai/grok-4",
"deepseek/deepseek-v3",
"",
None,
],
)
def test_non_moonshot_does_not_support_cache_control(self, model) -> None:
assert supports_cache_control(model) is False

View File

@@ -57,6 +57,7 @@ from ..constants import (
from ..session_cleanup import prune_orphan_tool_calls
from ..context import encode_cwd_for_cli, get_workspace_manager
from ..graphiti.config import is_enabled_for_user
from ..moonshot import override_cost_usd as _override_cost_for_moonshot
from ..model import (
ChatMessage,
ChatSession,
@@ -723,55 +724,6 @@ def _normalize_model_name(raw_model: str) -> str:
return model.replace(".", "-")
# Per-million-token rates ($USD) for non-Anthropic OpenRouter slugs that
# the Claude Agent SDK CLI doesn't recognise. The CLI's bundled pricing
# table only knows Anthropic models — for anything else its
# ``ResultMessage.total_cost_usd`` silently falls back to Sonnet rates,
# over-billing by ~5x for cheaper models like Kimi K2.6. Values are taken
# directly from each provider's published rate card and must be kept in
# sync when prices change. Cache discounts are not applied — Kimi via
# OpenRouter does not currently expose a separate cached-input price.
_NON_ANTHROPIC_RATES_USD_PER_MTOK: dict[str, tuple[float, float]] = {
# vendor/model: (input_per_mtok, output_per_mtok)
"moonshotai/kimi-k2.6": (0.60, 2.80),
"moonshotai/kimi-k2-thinking": (0.60, 2.80),
"moonshotai/kimi-k2.5": (0.60, 2.80),
"moonshotai/kimi-k2": (0.60, 2.80),
}
def _override_cost_for_non_anthropic(
raw_model: str | None,
sdk_reported_usd: float,
prompt_tokens: int,
completion_tokens: int,
cache_read_tokens: int,
cache_creation_tokens: int,
) -> float:
"""Recompute turn cost from a known rate card for non-Anthropic models.
The Claude Agent SDK CLI's ``total_cost_usd`` is computed from a
static Anthropic pricing table baked into the binary — it doesn't
know Kimi/DeepSeek/etc rates and silently bills at Sonnet prices,
which would over-charge a Kimi-default deployment by ~5x. Mirror
the baseline path's behaviour by computing the real cost from the
token counts whenever we recognise the slug; otherwise trust the
SDK number (correct for Anthropic models, best-effort for unknown
providers).
"""
if raw_model is None:
return sdk_reported_usd
rates = _NON_ANTHROPIC_RATES_USD_PER_MTOK.get(raw_model)
if rates is None:
return sdk_reported_usd
input_rate, output_rate = rates
# Treat cache reads/creation as plain prompt tokens since OpenRouter
# does not currently report a discounted cached-input price for the
# tracked Moonshot endpoints.
total_prompt = prompt_tokens + cache_read_tokens + cache_creation_tokens
return (total_prompt * input_rate + completion_tokens * output_rate) / 1_000_000
def _resolve_sdk_model() -> str | None:
"""Resolve the model name for the Claude Agent SDK CLI.
@@ -2354,11 +2306,17 @@ async def _run_stream_attempt(
# Anthropic model (e.g. Kimi K2.6) the CLI doesn't
# know the real per-token price and silently falls
# back to Sonnet rates — over-billing the user ~5x.
# Recompute from a known rate card for non-Anthropic
# OpenRouter slugs so the cost row, the rate-limit
# counter, and the UI cost display reflect reality.
state.usage.cost_usd = _override_cost_for_non_anthropic(
raw_model=getattr(state.options, "model", None),
# ``override_cost_usd`` replaces the CLI number with
# the Moonshot rate-card estimate when the slug
# matches, otherwise returns the SDK-reported value
# unchanged. Reconcile
# (``record_turn_cost_from_openrouter``) overrides
# both with the authoritative bill when every gen-ID
# lookup succeeds; this estimate only ships as the
# sync-path cost or the reconcile's lookup-fail
# fallback.
state.usage.cost_usd = _override_cost_for_moonshot(
model=getattr(state.options, "model", None),
sdk_reported_usd=sdk_msg.total_cost_usd,
prompt_tokens=state.usage.prompt_tokens,
completion_tokens=state.usage.completion_tokens,

View File

@@ -15,7 +15,6 @@ from .service import (
_build_system_prompt_value,
_is_sdk_disconnect_error,
_normalize_model_name,
_override_cost_for_non_anthropic,
_prepare_file_attachments,
_resolve_sdk_model,
_safe_close_sdk_client,
@@ -707,79 +706,3 @@ class TestIdleTimeoutConstant:
def test_idle_timeout_is_10_min(self):
assert _IDLE_TIMEOUT_SECONDS == 10 * 60
class TestOverrideCostForNonAnthropic:
"""Verifies that turn costs routed through OpenRouter to non-Anthropic
vendors use the platform's per-model rate card instead of the SDK
CLI's static Anthropic pricing table — which silently falls back to
Sonnet rates for unknown models and over-bills by ~5x."""
def test_kimi_cost_recomputed_from_rate_card(self):
"""Kimi K2.6 @ $0.60 input / $2.80 output per MTok. 29564 prompt
tokens + 78 completion should land at ~$0.018, not $0.09 (Sonnet)."""
recomputed = _override_cost_for_non_anthropic(
raw_model="moonshotai/kimi-k2.6",
sdk_reported_usd=0.089862, # what the SDK CLI reported (Sonnet price)
prompt_tokens=29564,
completion_tokens=78,
cache_read_tokens=0,
cache_creation_tokens=0,
)
expected = (29564 * 0.60 + 78 * 2.80) / 1_000_000
assert recomputed == pytest.approx(expected, rel=1e-9)
# Sanity-check against a hand-computed magnitude.
assert 0.017 < recomputed < 0.019
def test_anthropic_cost_unchanged(self):
"""Anthropic slugs pass through the SDK-reported value since the
CLI's pricing table is correct for them."""
result = _override_cost_for_non_anthropic(
raw_model="anthropic/claude-sonnet-4.6",
sdk_reported_usd=0.089862,
prompt_tokens=29564,
completion_tokens=78,
cache_read_tokens=0,
cache_creation_tokens=0,
)
assert result == 0.089862
def test_unknown_non_anthropic_vendor_passes_through(self):
"""A non-Anthropic slug not in the rate card falls back to the
SDK-reported value — best-effort rather than misleading zero."""
result = _override_cost_for_non_anthropic(
raw_model="deepseek/some-new-model",
sdk_reported_usd=0.05,
prompt_tokens=10000,
completion_tokens=500,
cache_read_tokens=0,
cache_creation_tokens=0,
)
assert result == 0.05
def test_none_model_passes_through(self):
"""Subscription mode / no-model case returns the SDK value."""
result = _override_cost_for_non_anthropic(
raw_model=None,
sdk_reported_usd=0.07,
prompt_tokens=100,
completion_tokens=10,
cache_read_tokens=0,
cache_creation_tokens=0,
)
assert result == 0.07
def test_cache_tokens_folded_into_prompt(self):
"""Since the Moonshot endpoints don't report discounted cached-
input pricing, cache_read/creation tokens are priced at the same
rate as regular prompt tokens."""
recomputed = _override_cost_for_non_anthropic(
raw_model="moonshotai/kimi-k2.6",
sdk_reported_usd=0.5,
prompt_tokens=1000,
completion_tokens=0,
cache_read_tokens=5000,
cache_creation_tokens=2000,
)
expected = (1000 + 5000 + 2000) * 0.60 / 1_000_000
assert recomputed == pytest.approx(expected, rel=1e-9)

View File

@@ -34,6 +34,7 @@ from .model import (
update_session_title,
upsert_chat_session,
)
from .token_tracking import persist_and_record_usage
logger = logging.getLogger(__name__)
@@ -498,6 +499,13 @@ async def _generate_session_title(
) -> str | None:
"""Generate a concise title for a chat session based on the first message.
Also persists the title-generation call's cost to ``PlatformCostLog``
so the admin dashboard's provider totals match the real OpenRouter
bill. Before this, the title LLM call (a background task, one per
session) bypassed the main turn's reconcile entirely and silently
wasn't tracked — low per-call cost but 100% of sessions pay it, so
it adds up.
Args:
message: The first user message in the session
user_id: User ID for OpenRouter tracing (optional)
@@ -507,8 +515,11 @@ async def _generate_session_title(
A short title (3-6 words) or None if generation fails
"""
try:
# Build extra_body for OpenRouter tracing and PostHog analytics
extra_body: dict[str, Any] = {}
# Build extra_body for OpenRouter tracing and PostHog analytics.
# ``usage: {"include": True}`` asks OR to embed the real billed
# cost into the final usage chunk — matches the baseline path's
# ``_OPENROUTER_INCLUDE_USAGE_COST`` pattern, same read path.
extra_body: dict[str, Any] = {"usage": {"include": True}}
if user_id:
extra_body["user"] = user_id[:128] # OpenRouter limit
extra_body["posthogDistinctId"] = user_id
@@ -534,6 +545,17 @@ async def _generate_session_title(
max_tokens=20,
extra_body=extra_body,
)
# Best-effort cost capture — the title call is a one-shot LLM
# round we'd otherwise miss in admin totals. Runs inside the
# ``try`` block so a persist failure downgrades to the existing
# "title generation failed" warning path rather than raising.
await _record_title_generation_cost(
response=response,
user_id=user_id,
session_id=session_id,
)
title = response.choices[0].message.content
if title:
# Clean up the title
@@ -548,6 +570,63 @@ async def _generate_session_title(
return None
async def _record_title_generation_cost(
*,
response: Any,
user_id: str | None,
session_id: str | None,
) -> None:
"""Persist the title LLM call's cost to ``PlatformCostLog``.
Title generation runs in a background task per-session — low cost
(~$0.0001 per title) but 100% of sessions pay it. Without this the
admin dashboard under-reports total provider spend by the aggregate
of those calls. Separate ``block_name="copilot:title"`` so the row
is clearly distinguishable from the turn's main ``copilot:SDK`` /
``copilot:baseline`` attributions.
Best-effort: any error downgrades to a debug log so title generation
itself never breaks on a cost-tracking hiccup.
"""
try:
usage = getattr(response, "usage", None)
if usage is None:
return
prompt_tokens = getattr(usage, "prompt_tokens", 0) or 0
completion_tokens = getattr(usage, "completion_tokens", 0) or 0
# OR piggybacks the real billed cost on ``usage.cost`` when
# ``extra_body={"usage":{"include":true}}`` — stashed in
# pydantic's ``model_extra``. Absent for non-OR routes.
extras = getattr(usage, "model_extra", None) or {}
cost_raw = extras.get("cost") if isinstance(extras, dict) else None
cost_usd: float | None
try:
cost_usd = float(cost_raw) if cost_raw is not None else None
except (TypeError, ValueError):
cost_usd = None
# ``persist_and_record_usage`` needs the session object to
# append the usage row to the session's per-turn usage list —
# load it lazily so the hot title path doesn't pay the DB round
# trip unless a cost actually needs recording.
session = None
if session_id and user_id:
session = await get_chat_session(session_id, user_id)
await persist_and_record_usage(
session=session,
user_id=user_id,
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
log_prefix="[title]",
cost_usd=cost_usd,
model=config.title_model,
provider="open_router",
)
except Exception as exc: # noqa: BLE001
logger.debug("Title cost tracking skipped: %s", exc)
async def _update_title_async(
session_id: str, message: str, user_id: str | None = None
) -> None: