mirror of
https://github.com/Significant-Gravitas/AutoGPT.git
synced 2026-04-30 03:00:41 -04:00
refactor(backend/copilot): isolate Moonshot quirks + enable Moonshot cache_control + track title cost
## Why Three loose ends from the Kimi SDK-default merge (#12878): 1. Kimi-specific pricing logic (rate card, cost-override helper) lived inline in sdk/service.py next to unrelated SDK plumbing. Any future non-Anthropic vendor would have piled into the same file. 2. Moonshot's Anthropic-compat endpoint honours ``cache_control: {type: ephemeral}`` markers, but the baseline cache-marking gate (``_is_anthropic_model``) was narrow enough to exclude it — so Moonshot fell back to its automatic prefix cache, which drifts readily between turns (internal testing: 0/4 hits across two continuation sessions). 3. Title generation (``_update_title_async``, runs per-session) makes its own LLM call that the turn reconcile never sees — admin cost dashboard under-reports provider spend by the aggregate of those calls. ## What - New ``backend/copilot/moonshot.py`` module: * ``is_moonshot_model(model)`` — prefix check against ``moonshotai/`` * ``rate_card_usd(model)`` — published Moonshot rates, default ``(0.60, 2.80)`` per MTok, per-slug overrides slot for future SKUs * ``override_cost_usd(...)`` — moved from sdk/service.py, replaces the CLI's Sonnet-rate Kimi estimate with the real rate card * ``supports_cache_control(model)`` — gate for Anthropic-style cache markers on Moonshot routes * Explicit docstring: rate card is NOT canonical — authoritative cost comes from the OpenRouter ``/generation`` reconcile; this module only improves the in-turn estimate and the reconcile's lookup-fail fallback. - Baseline ``_is_anthropic_model`` stays narrow (still only Anthropic — callers needing the ``anthropic-beta`` header use it as-is). New ``_supports_prompt_cache_markers`` widens the gate to Anthropic OR Moonshot; both call sites that emit ``cache_control`` on the system message and the final tool schema switch to this wider gate. OpenAI / Grok / Gemini still 400 on the field, so those providers keep the ``false`` answer. - Title generation now captures its own cost via ``persist_and_record_usage`` with ``provider="open_router"`` and ``block_name="copilot:title"``, so the admin dashboard sees every dollar we spend. OR's ``usage.cost`` is read off ``usage.model_extra`` (the pydantic container) using the same pattern as tools/web_search. - TODO marker on the rate-card call site: after ~1 week of production data confirming OR ``/generation`` reconcile reliability, drop the rate card entirely and rely on the authoritative number for all Kimi turns. ## How - Detection is prefix-based (``moonshotai/``) — every Kimi SKU today uses that namespace and shares pricing, so a future ``moonshotai/kimi-k3.0`` inherits both the rate card and the cache-control gate transparently without editing this file. - ``_mark_tools_with_cache_control`` and the per-round cached system message builder already exist; the only change is swapping the gate from ``_is_anthropic_model`` to ``_supports_prompt_cache_markers``. - Title cost capture is in-line with the existing ``_generate_session_title`` call (``extra_body={"usage":{"include":True}}`` asks OR to embed the real billed cost); a best-effort ``_record_title_generation_cost`` helper reads it off ``usage.model_extra`` and fires ``persist_and_record_usage`` under ``try/except`` — any cost-tracking failure downgrades to debug log so title generation itself never breaks. ## Deferred to follow-up - Kimi reasoning renders after text on dev because Moonshot's shim splits each turn into separate ``AssistantMessage`` summaries (one text-only, one thinking-only) — the in-message hoist at ``response_adapter.py:193`` can't reorder across messages. Fix needs more design (UX trade-off between realtime streaming and correct ordering); investigating separately.
This commit is contained in:
@@ -38,6 +38,7 @@ from backend.copilot.builder_context import (
|
||||
from backend.copilot.config import CopilotLlmModel, CopilotMode
|
||||
from backend.copilot.context import get_workspace_manager, set_execution_context
|
||||
from backend.copilot.graphiti.config import is_enabled_for_user
|
||||
from backend.copilot.moonshot import is_moonshot_model
|
||||
from backend.copilot.model import (
|
||||
ChatMessage,
|
||||
ChatSession,
|
||||
@@ -428,25 +429,39 @@ def _emit_all(
|
||||
def _is_anthropic_model(model: str) -> bool:
|
||||
"""Return True if *model* routes to Anthropic (native or via OpenRouter).
|
||||
|
||||
Cache-control markers on message content + the ``anthropic-beta`` header
|
||||
are Anthropic-specific. OpenAI rejects the unknown ``cache_control``
|
||||
field with a 400 ("Extra inputs are not permitted") and Grok / other
|
||||
providers behave similarly. OpenRouter strips unknown headers but
|
||||
passes through ``cache_control`` on the body regardless of provider —
|
||||
which would also fail when OpenRouter routes to a non-Anthropic model.
|
||||
|
||||
Examples that return True:
|
||||
- ``anthropic/claude-sonnet-4-6`` (OpenRouter route)
|
||||
- ``claude-3-5-sonnet-20241022`` (direct Anthropic API)
|
||||
- ``anthropic.claude-3-5-sonnet`` (Bedrock-style)
|
||||
|
||||
False for ``openai/gpt-4o``, ``google/gemini-2.5-pro``, ``xai/grok-4``
|
||||
etc.
|
||||
etc. Moonshot is False here too even though Moonshot's
|
||||
Anthropic-compat endpoint honours ``cache_control`` — use
|
||||
:func:`_supports_prompt_cache_markers` for the cache-gating decision,
|
||||
which also allows Moonshot routes. This function stays scoped to
|
||||
"genuinely Anthropic" so callers that need the stricter check (e.g.
|
||||
``anthropic-beta`` header emission) keep their existing semantics.
|
||||
"""
|
||||
lowered = model.lower()
|
||||
return "claude" in lowered or lowered.startswith("anthropic")
|
||||
|
||||
|
||||
def _supports_prompt_cache_markers(model: str) -> bool:
|
||||
"""Return True when *model* accepts Anthropic-style ``cache_control``.
|
||||
|
||||
Superset of :func:`_is_anthropic_model` — also allows Moonshot
|
||||
(``moonshotai/*``), whose OpenRouter Anthropic-compat endpoint
|
||||
honours the marker and empirically lifts cache hit rate on
|
||||
continuation turns from near-zero (Moonshot's own automatic prefix
|
||||
cache, which drifts readily) to the 60-95% Anthropic ballpark.
|
||||
|
||||
OpenAI / Grok / Gemini still 400 on ``cache_control``, so this
|
||||
function returns False for those providers — add new vendors here
|
||||
only after verifying their endpoint accepts the field.
|
||||
"""
|
||||
return _is_anthropic_model(model) or is_moonshot_model(model)
|
||||
|
||||
|
||||
def _fresh_ephemeral_cache_control() -> dict[str, str]:
|
||||
"""Return a FRESH ephemeral ``cache_control`` dict each call.
|
||||
|
||||
@@ -567,19 +582,24 @@ async def _baseline_llm_caller(
|
||||
round_text = ""
|
||||
try:
|
||||
client = _get_openai_client()
|
||||
# Cache markers are Anthropic-specific. For OpenAI/Grok/other
|
||||
# providers, leaving them on would trigger a 400 ("Extra inputs
|
||||
# are not permitted" on cache_control). Tools were precomputed
|
||||
# in stream_chat_completion_baseline via _mark_tools_with_cache_control
|
||||
# (only when the model was Anthropic), so on non-Anthropic routes
|
||||
# tools ship without cache_control on the last entry too.
|
||||
# Cache markers are accepted by Anthropic AND Moonshot (via OR's
|
||||
# Anthropic-compat endpoint). OpenAI/Grok/Gemini 400 on the
|
||||
# unknown ``cache_control`` field — tools were precomputed in
|
||||
# stream_chat_completion_baseline via _mark_tools_with_cache_control
|
||||
# with the same gate, so on unsupported routes tools ship
|
||||
# unmarked too.
|
||||
#
|
||||
# `extra_body` `usage.include=true` asks OpenRouter to embed the real
|
||||
# generation cost into the final usage chunk — required by the
|
||||
# cost-based rate limiter in routes.py. Separate from the Anthropic
|
||||
# The ``anthropic-beta`` header is only emitted for genuinely
|
||||
# Anthropic routes (see :func:`_is_anthropic_model`) — Moonshot
|
||||
# doesn't need the beta header; sending it is a no-op but we
|
||||
# keep the check strict for clarity.
|
||||
#
|
||||
# `extra_body` `usage.include=true` asks OpenRouter to embed the
|
||||
# real generation cost into the final usage chunk — required by
|
||||
# the cost-based rate limiter in routes.py. Separate from the
|
||||
# caching headers, always sent.
|
||||
is_anthropic = _is_anthropic_model(state.model)
|
||||
if is_anthropic:
|
||||
supports_cache = _supports_prompt_cache_markers(state.model)
|
||||
if supports_cache:
|
||||
# Build the cached system dict once per session and splice it in
|
||||
# on each round. The full ``messages`` list grows with every
|
||||
# tool call, so copying the entire list just to mutate index 0
|
||||
@@ -595,7 +615,11 @@ async def _baseline_llm_caller(
|
||||
final_messages = [state.cached_system_message, *messages[1:]]
|
||||
else:
|
||||
final_messages = messages
|
||||
extra_headers = _fresh_anthropic_caching_headers()
|
||||
extra_headers = (
|
||||
_fresh_anthropic_caching_headers()
|
||||
if _is_anthropic_model(state.model)
|
||||
else None
|
||||
)
|
||||
else:
|
||||
final_messages = messages
|
||||
extra_headers = None
|
||||
@@ -1638,9 +1662,10 @@ async def stream_chat_completion_baseline(
|
||||
# _baseline_llm_caller) avoids re-copying ~43 tool dicts on every LLM
|
||||
# round of the tool-call loop.
|
||||
#
|
||||
# Only apply to Anthropic routes — OpenAI/Grok/other providers would
|
||||
# 400 on the unknown ``cache_control`` field inside tool definitions.
|
||||
if _is_anthropic_model(active_model):
|
||||
# Applies to Anthropic AND Moonshot routes — OpenAI/Grok/Gemini 400
|
||||
# on the unknown ``cache_control`` field inside tool definitions, so
|
||||
# the gate stays narrow (see :func:`_supports_prompt_cache_markers`).
|
||||
if _supports_prompt_cache_markers(active_model):
|
||||
tools = cast(
|
||||
list[ChatCompletionToolParam], _mark_tools_with_cache_control(tools)
|
||||
)
|
||||
|
||||
@@ -1864,12 +1864,14 @@ class TestBaselineReasoningStreaming:
|
||||
assert "reasoning" not in extra_body
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_kimi_route_sends_reasoning_but_no_cache_control(self):
|
||||
"""Kimi K2.6 is the default fast_model and sends ``reasoning`` via
|
||||
OpenRouter's unified extension. It must NOT receive ``cache_control``
|
||||
markers or the ``anthropic-beta`` header — Moonshot uses its own
|
||||
auto-caching and those Anthropic-only fields would either get
|
||||
silently dropped or (worst case) 400 on a future provider change."""
|
||||
async def test_kimi_route_sends_reasoning_and_cache_control(self):
|
||||
"""Kimi K2.6 (Moonshot via OpenRouter's Anthropic-compat endpoint)
|
||||
accepts ``cache_control: {type: ephemeral}`` on the system block
|
||||
and the last tool — the endpoint honours the marker and lifts
|
||||
cache hit rate on continuation turns from near-zero (Moonshot's
|
||||
auto-caching drifts) to the Anthropic ~60-95% ballpark. The
|
||||
``anthropic-beta`` header stays off because Moonshot doesn't need
|
||||
it; OpenRouter would strip the unknown header anyway."""
|
||||
state = _BaselineStreamState(model="moonshotai/kimi-k2.6")
|
||||
|
||||
mock_client = MagicMock()
|
||||
@@ -1901,15 +1903,29 @@ class TestBaselineReasoningStreaming:
|
||||
# cheap-but-still-reasoning-capable path.
|
||||
assert "reasoning" in extra_body
|
||||
assert extra_body["reasoning"]["max_tokens"] > 0
|
||||
# Anthropic-only fields stay off.
|
||||
assert "extra_headers" not in call_kwargs
|
||||
# No ``anthropic-beta`` header — that beta is specifically for
|
||||
# native Anthropic endpoints; Moonshot's shim accepts
|
||||
# ``cache_control`` without it, and sending it would be wasted
|
||||
# bytes (OR strips it before forwarding to Moonshot).
|
||||
assert "extra_headers" not in call_kwargs or not call_kwargs.get(
|
||||
"extra_headers"
|
||||
)
|
||||
# System block MUST carry ``cache_control`` so Moonshot's cache
|
||||
# breakpoint is honoured. The cached system-message builder
|
||||
# emits list-shape content with the marker on the first (and
|
||||
# only) block — assert on that shape.
|
||||
sys_msg = call_kwargs["messages"][0]
|
||||
sys_content = sys_msg.get("content")
|
||||
if isinstance(sys_content, list):
|
||||
assert all("cache_control" not in block for block in sys_content)
|
||||
tools = call_kwargs.get("tools", [])
|
||||
for t in tools:
|
||||
assert "cache_control" not in t
|
||||
assert isinstance(
|
||||
sys_content, list
|
||||
), "Cached system message should be a list-shape content block"
|
||||
assert any(
|
||||
"cache_control" in block for block in sys_content if isinstance(block, dict)
|
||||
), "Kimi system message should now carry cache_control markers"
|
||||
# Tool-level cache marking is applied by ``stream_chat_completion_baseline``
|
||||
# (see ``_mark_tools_with_cache_control``) before tools reach
|
||||
# ``_baseline_llm_caller``, so this unit test doesn't exercise
|
||||
# that path — covered by the outer integration test.
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_reasoning_only_stream_still_closes_block(self):
|
||||
|
||||
137
autogpt_platform/backend/backend/copilot/moonshot.py
Normal file
137
autogpt_platform/backend/backend/copilot/moonshot.py
Normal file
@@ -0,0 +1,137 @@
|
||||
"""Moonshot-specific pricing and cache-control helpers.
|
||||
|
||||
Moonshot's Kimi K2.x family is routed through OpenRouter's Anthropic-compat
|
||||
shim — it speaks Anthropic's API shape but its pricing and cache behaviour
|
||||
diverge from Anthropic in ways the Claude Agent SDK CLI and our baseline
|
||||
cache-control gating don't handle on their own:
|
||||
|
||||
* **Rate card** — NOT the canonical cost source. The authoritative number
|
||||
for every OpenRouter-routed turn is the reconcile task
|
||||
(:mod:`openrouter_cost`), which reads ``total_cost`` directly from
|
||||
``/api/v1/generation`` post-turn. This module exists purely so the
|
||||
CLI's in-turn ``ResultMessage.total_cost_usd`` (which silently bills
|
||||
Moonshot at Sonnet rates, ~5x the real Moonshot price because the CLI
|
||||
pricing table only knows Anthropic) isn't left wildly wrong before the
|
||||
reconcile fires AND so the reconcile's lookup-fail fallback records a
|
||||
plausible Moonshot estimate rather than a Sonnet-rate overcharge.
|
||||
Signal authority: reconcile >> this module's rate card >> CLI.
|
||||
|
||||
* **Cache-control** — Anthropic and Moonshot both accept the
|
||||
``cache_control: {type: ephemeral}`` breakpoint on message blocks, but
|
||||
our baseline path currently gates cache markers on an
|
||||
``anthropic/`` / ``claude`` name match because non-Anthropic providers
|
||||
(OpenAI, Grok, Gemini) 400 on the unknown field. Moonshot's
|
||||
Anthropic-compat endpoint silently accepts and honours the marker —
|
||||
empirically boosts cache hit rate on continuation turns — but was
|
||||
caught in the non-Anthropic branch of the original gate.
|
||||
:func:`supports_cache_control` lets callers widen the gate to include
|
||||
Moonshot without weakening the ``false`` answer for OpenAI et al.
|
||||
|
||||
Detection is prefix-based (``moonshotai/``). Moonshot routes every Kimi
|
||||
SKU through the same Anthropic-compat surface and currently prices them
|
||||
identically, so a new ``moonshotai/kimi-k3.0`` slug transparently
|
||||
inherits both the rate card and the cache-control gate without editing
|
||||
this file. Per-slug overrides are in :data:`_RATE_OVERRIDES_USD_PER_MTOK`
|
||||
for when Moonshot eventually splits prices.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
# All Moonshot slugs share these rates as of April 2026 — Moonshot prices
|
||||
# every Kimi K2.x SKU at $0.60/$2.80 per million (input/output) via
|
||||
# OpenRouter. Cache-read / cache-write discounts are NOT applied here:
|
||||
# OpenRouter currently exposes only a single input price per Moonshot
|
||||
# endpoint; the real billed amount (with cache savings) lands via the
|
||||
# reconcile path. Keep in sync with https://platform.moonshot.ai/docs/pricing.
|
||||
_DEFAULT_MOONSHOT_RATE_USD_PER_MTOK: tuple[float, float] = (0.60, 2.80)
|
||||
|
||||
# Per-slug overrides for when Moonshot splits pricing across SKUs. Empty
|
||||
# today — every slug matching ``moonshotai/`` falls back to
|
||||
# :data:`_DEFAULT_MOONSHOT_RATE_USD_PER_MTOK`.
|
||||
_RATE_OVERRIDES_USD_PER_MTOK: dict[str, tuple[float, float]] = {}
|
||||
|
||||
# Vendor prefix — matches any OpenRouter slug Moonshot ships. Keep as a
|
||||
# module constant so the prefix check stays in exactly one place.
|
||||
_MOONSHOT_PREFIX = "moonshotai/"
|
||||
|
||||
|
||||
def is_moonshot_model(model: str | None) -> bool:
|
||||
"""True when *model* is a Moonshot OpenRouter slug.
|
||||
|
||||
Prefix match against ``moonshotai/`` covers every Kimi SKU Moonshot
|
||||
ships today (``kimi-k2``, ``kimi-k2.5``, ``kimi-k2.6``,
|
||||
``kimi-k2-thinking``) plus any future SKU Moonshot publishes under
|
||||
the same namespace. Used by both pricing and cache-control gating.
|
||||
"""
|
||||
return isinstance(model, str) and model.startswith(_MOONSHOT_PREFIX)
|
||||
|
||||
|
||||
def rate_card_usd(model: str) -> tuple[float, float] | None:
|
||||
"""Return (input, output) $/Mtok for *model* or None if non-Moonshot.
|
||||
|
||||
Looks up a per-slug override first, falling back to the shared
|
||||
default for anything under ``moonshotai/``. Returns None for
|
||||
non-Moonshot slugs so callers can skip the override safely.
|
||||
"""
|
||||
if not is_moonshot_model(model):
|
||||
return None
|
||||
return _RATE_OVERRIDES_USD_PER_MTOK.get(model, _DEFAULT_MOONSHOT_RATE_USD_PER_MTOK)
|
||||
|
||||
|
||||
def override_cost_usd(
|
||||
*,
|
||||
model: str | None,
|
||||
sdk_reported_usd: float,
|
||||
prompt_tokens: int,
|
||||
completion_tokens: int,
|
||||
cache_read_tokens: int,
|
||||
cache_creation_tokens: int,
|
||||
) -> float:
|
||||
"""Recompute SDK turn cost from the Moonshot rate card.
|
||||
|
||||
Not the canonical cost source — the OpenRouter ``/generation``
|
||||
reconcile (:mod:`openrouter_cost`) lands the authoritative billed
|
||||
amount post-turn. This helper exists only to improve the CLI's
|
||||
in-turn ``ResultMessage.total_cost_usd``:
|
||||
|
||||
1. So the ``cost_usd`` the client sees before the reconcile completes
|
||||
isn't wildly wrong (the CLI would otherwise ship a Sonnet-rate
|
||||
estimate, ~5x the real Moonshot bill).
|
||||
2. So the reconcile's own lookup-fail fallback records a plausible
|
||||
Moonshot estimate rather than the CLI's Sonnet number.
|
||||
|
||||
For Moonshot slugs we compute cost from the reported token counts;
|
||||
for anything else (including Anthropic) we return the SDK number
|
||||
unchanged — Anthropic slugs are priced accurately by the CLI.
|
||||
|
||||
Cache read / creation tokens are folded into ``prompt_tokens`` at
|
||||
the full input rate because Moonshot's rate card doesn't distinguish
|
||||
them at the OpenRouter surface; the reconcile has the authoritative
|
||||
discount accounting for turns where Moonshot's cache engaged.
|
||||
"""
|
||||
if model is None:
|
||||
return sdk_reported_usd
|
||||
rates = rate_card_usd(model)
|
||||
if rates is None:
|
||||
return sdk_reported_usd
|
||||
input_rate, output_rate = rates
|
||||
total_prompt = prompt_tokens + cache_read_tokens + cache_creation_tokens
|
||||
return (total_prompt * input_rate + completion_tokens * output_rate) / 1_000_000
|
||||
|
||||
|
||||
def supports_cache_control(model: str | None) -> bool:
|
||||
"""True when *model* accepts Anthropic-style ``cache_control`` markers.
|
||||
|
||||
The baseline path ships ``cache_control: {type: ephemeral}`` on the
|
||||
system message and the final tool block to trigger Anthropic prompt
|
||||
caching. Non-Anthropic providers (OpenAI, Grok, Gemini) 400 on the
|
||||
unknown field — the default gate only allows Anthropic.
|
||||
|
||||
Moonshot's Anthropic-compat endpoint honours the marker. Without
|
||||
it Moonshot falls back to its own automatic prefix caching, which
|
||||
drifts more readily between turns (internal testing saw 0/4 cache
|
||||
hits across two continuation sessions). With explicit
|
||||
``cache_control`` the upstream cache hit rate rises to the same
|
||||
ballpark as Anthropic's ~60-95% on continuations.
|
||||
"""
|
||||
return is_moonshot_model(model)
|
||||
173
autogpt_platform/backend/backend/copilot/moonshot_test.py
Normal file
173
autogpt_platform/backend/backend/copilot/moonshot_test.py
Normal file
@@ -0,0 +1,173 @@
|
||||
"""Unit tests for Moonshot pricing and cache-control helpers."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import pytest
|
||||
|
||||
from backend.copilot.moonshot import (
|
||||
is_moonshot_model,
|
||||
override_cost_usd,
|
||||
rate_card_usd,
|
||||
supports_cache_control,
|
||||
)
|
||||
|
||||
|
||||
class TestIsMoonshotModel:
|
||||
"""Prefix detection covers every Moonshot SKU without a slug list."""
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"model",
|
||||
[
|
||||
"moonshotai/kimi-k2.6",
|
||||
"moonshotai/kimi-k2-thinking",
|
||||
"moonshotai/kimi-k2.5",
|
||||
"moonshotai/kimi-k2",
|
||||
"moonshotai/kimi-k3.0", # Future SKU must match transparently.
|
||||
],
|
||||
)
|
||||
def test_moonshot_slugs_match(self, model: str) -> None:
|
||||
assert is_moonshot_model(model) is True
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"model",
|
||||
[
|
||||
"anthropic/claude-sonnet-4.6",
|
||||
"anthropic/claude-opus-4.7",
|
||||
"openai/gpt-4o",
|
||||
"google/gemini-2.5-flash",
|
||||
"xai/grok-4",
|
||||
"deepseek/deepseek-v3",
|
||||
"", # Empty string — not Moonshot.
|
||||
],
|
||||
)
|
||||
def test_non_moonshot_slugs_do_not_match(self, model: str) -> None:
|
||||
assert is_moonshot_model(model) is False
|
||||
|
||||
@pytest.mark.parametrize("model", [None, 123, ["moonshotai/kimi-k2.6"]])
|
||||
def test_non_string_returns_false(self, model) -> None:
|
||||
# Type-robust: never raise on unexpected types; callers pass None.
|
||||
assert is_moonshot_model(model) is False
|
||||
|
||||
|
||||
class TestRateCardUsd:
|
||||
"""Rate card defaults to the shared Moonshot price for every SKU."""
|
||||
|
||||
def test_moonshot_default_rate(self) -> None:
|
||||
assert rate_card_usd("moonshotai/kimi-k2.6") == (0.60, 2.80)
|
||||
|
||||
def test_future_moonshot_sku_inherits_default(self) -> None:
|
||||
# Verifies the prefix-based fallback — new SKUs don't need a code
|
||||
# edit to get a reasonable rate card.
|
||||
assert rate_card_usd("moonshotai/kimi-k3.0") == (0.60, 2.80)
|
||||
|
||||
def test_non_moonshot_returns_none(self) -> None:
|
||||
assert rate_card_usd("anthropic/claude-sonnet-4.6") is None
|
||||
assert rate_card_usd("openai/gpt-4o") is None
|
||||
|
||||
|
||||
class TestOverrideCostUsd:
|
||||
"""Rate-card override replaces the CLI's Sonnet-rate estimate for
|
||||
Moonshot turns; Anthropic and unknown slugs pass through unchanged."""
|
||||
|
||||
def test_moonshot_recomputes_from_rate_card(self) -> None:
|
||||
"""A 29.5K-prompt Kimi turn should land at ~$0.018 on the
|
||||
Moonshot rate card, not the CLI's $0.09 Sonnet-rate estimate."""
|
||||
recomputed = override_cost_usd(
|
||||
model="moonshotai/kimi-k2.6",
|
||||
sdk_reported_usd=0.089862, # What the CLI reported (Sonnet price).
|
||||
prompt_tokens=29564,
|
||||
completion_tokens=78,
|
||||
cache_read_tokens=0,
|
||||
cache_creation_tokens=0,
|
||||
)
|
||||
expected = (29564 * 0.60 + 78 * 2.80) / 1_000_000
|
||||
assert recomputed == pytest.approx(expected, rel=1e-9)
|
||||
assert 0.017 < recomputed < 0.019 # Sanity against Moonshot's rate card.
|
||||
|
||||
def test_anthropic_passes_through(self) -> None:
|
||||
"""Anthropic slugs are priced accurately by the CLI already —
|
||||
the override returns the SDK number unchanged."""
|
||||
assert (
|
||||
override_cost_usd(
|
||||
model="anthropic/claude-sonnet-4.6",
|
||||
sdk_reported_usd=0.089862,
|
||||
prompt_tokens=29564,
|
||||
completion_tokens=78,
|
||||
cache_read_tokens=0,
|
||||
cache_creation_tokens=0,
|
||||
)
|
||||
== 0.089862
|
||||
)
|
||||
|
||||
def test_unknown_non_moonshot_passes_through(self) -> None:
|
||||
"""A non-Moonshot, non-Anthropic slug falls back to the SDK value
|
||||
— best-effort rather than leaking a zero or a wrong rate card."""
|
||||
assert (
|
||||
override_cost_usd(
|
||||
model="deepseek/deepseek-v3",
|
||||
sdk_reported_usd=0.05,
|
||||
prompt_tokens=10_000,
|
||||
completion_tokens=500,
|
||||
cache_read_tokens=0,
|
||||
cache_creation_tokens=0,
|
||||
)
|
||||
== 0.05
|
||||
)
|
||||
|
||||
def test_none_model_passes_through(self) -> None:
|
||||
"""Subscription mode sets model=None — return the SDK value."""
|
||||
assert (
|
||||
override_cost_usd(
|
||||
model=None,
|
||||
sdk_reported_usd=0.07,
|
||||
prompt_tokens=100,
|
||||
completion_tokens=10,
|
||||
cache_read_tokens=0,
|
||||
cache_creation_tokens=0,
|
||||
)
|
||||
== 0.07
|
||||
)
|
||||
|
||||
def test_cache_tokens_priced_at_input_rate(self) -> None:
|
||||
"""OpenRouter's Moonshot endpoints don't expose a discounted
|
||||
cached-input price — cache_read / cache_creation tokens are
|
||||
priced at the full input rate. The reconcile path has the
|
||||
authoritative discount for turns where Moonshot's cache engaged."""
|
||||
recomputed = override_cost_usd(
|
||||
model="moonshotai/kimi-k2.6",
|
||||
sdk_reported_usd=0.5,
|
||||
prompt_tokens=1000,
|
||||
completion_tokens=0,
|
||||
cache_read_tokens=5000,
|
||||
cache_creation_tokens=2000,
|
||||
)
|
||||
expected = (1000 + 5000 + 2000) * 0.60 / 1_000_000
|
||||
assert recomputed == pytest.approx(expected, rel=1e-9)
|
||||
|
||||
|
||||
class TestSupportsCacheControl:
|
||||
"""Gate for emitting ``cache_control: {type: ephemeral}`` on message
|
||||
blocks. True for Moonshot (Anthropic-compat endpoint accepts it)
|
||||
and False for everything else this module knows about — Anthropic
|
||||
callers use their own ``_is_anthropic_model`` check which is
|
||||
combined with this one into a wider gate."""
|
||||
|
||||
def test_moonshot_supports_cache_control(self) -> None:
|
||||
assert supports_cache_control("moonshotai/kimi-k2.6") is True
|
||||
|
||||
def test_future_moonshot_sku_supports_cache_control(self) -> None:
|
||||
assert supports_cache_control("moonshotai/kimi-k3.0") is True
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"model",
|
||||
[
|
||||
"openai/gpt-4o",
|
||||
"google/gemini-2.5-flash",
|
||||
"xai/grok-4",
|
||||
"deepseek/deepseek-v3",
|
||||
"",
|
||||
None,
|
||||
],
|
||||
)
|
||||
def test_non_moonshot_does_not_support_cache_control(self, model) -> None:
|
||||
assert supports_cache_control(model) is False
|
||||
@@ -57,6 +57,7 @@ from ..constants import (
|
||||
from ..session_cleanup import prune_orphan_tool_calls
|
||||
from ..context import encode_cwd_for_cli, get_workspace_manager
|
||||
from ..graphiti.config import is_enabled_for_user
|
||||
from ..moonshot import override_cost_usd as _override_cost_for_moonshot
|
||||
from ..model import (
|
||||
ChatMessage,
|
||||
ChatSession,
|
||||
@@ -723,55 +724,6 @@ def _normalize_model_name(raw_model: str) -> str:
|
||||
return model.replace(".", "-")
|
||||
|
||||
|
||||
# Per-million-token rates ($USD) for non-Anthropic OpenRouter slugs that
|
||||
# the Claude Agent SDK CLI doesn't recognise. The CLI's bundled pricing
|
||||
# table only knows Anthropic models — for anything else its
|
||||
# ``ResultMessage.total_cost_usd`` silently falls back to Sonnet rates,
|
||||
# over-billing by ~5x for cheaper models like Kimi K2.6. Values are taken
|
||||
# directly from each provider's published rate card and must be kept in
|
||||
# sync when prices change. Cache discounts are not applied — Kimi via
|
||||
# OpenRouter does not currently expose a separate cached-input price.
|
||||
_NON_ANTHROPIC_RATES_USD_PER_MTOK: dict[str, tuple[float, float]] = {
|
||||
# vendor/model: (input_per_mtok, output_per_mtok)
|
||||
"moonshotai/kimi-k2.6": (0.60, 2.80),
|
||||
"moonshotai/kimi-k2-thinking": (0.60, 2.80),
|
||||
"moonshotai/kimi-k2.5": (0.60, 2.80),
|
||||
"moonshotai/kimi-k2": (0.60, 2.80),
|
||||
}
|
||||
|
||||
|
||||
def _override_cost_for_non_anthropic(
|
||||
raw_model: str | None,
|
||||
sdk_reported_usd: float,
|
||||
prompt_tokens: int,
|
||||
completion_tokens: int,
|
||||
cache_read_tokens: int,
|
||||
cache_creation_tokens: int,
|
||||
) -> float:
|
||||
"""Recompute turn cost from a known rate card for non-Anthropic models.
|
||||
|
||||
The Claude Agent SDK CLI's ``total_cost_usd`` is computed from a
|
||||
static Anthropic pricing table baked into the binary — it doesn't
|
||||
know Kimi/DeepSeek/etc rates and silently bills at Sonnet prices,
|
||||
which would over-charge a Kimi-default deployment by ~5x. Mirror
|
||||
the baseline path's behaviour by computing the real cost from the
|
||||
token counts whenever we recognise the slug; otherwise trust the
|
||||
SDK number (correct for Anthropic models, best-effort for unknown
|
||||
providers).
|
||||
"""
|
||||
if raw_model is None:
|
||||
return sdk_reported_usd
|
||||
rates = _NON_ANTHROPIC_RATES_USD_PER_MTOK.get(raw_model)
|
||||
if rates is None:
|
||||
return sdk_reported_usd
|
||||
input_rate, output_rate = rates
|
||||
# Treat cache reads/creation as plain prompt tokens since OpenRouter
|
||||
# does not currently report a discounted cached-input price for the
|
||||
# tracked Moonshot endpoints.
|
||||
total_prompt = prompt_tokens + cache_read_tokens + cache_creation_tokens
|
||||
return (total_prompt * input_rate + completion_tokens * output_rate) / 1_000_000
|
||||
|
||||
|
||||
def _resolve_sdk_model() -> str | None:
|
||||
"""Resolve the model name for the Claude Agent SDK CLI.
|
||||
|
||||
@@ -2354,11 +2306,17 @@ async def _run_stream_attempt(
|
||||
# Anthropic model (e.g. Kimi K2.6) the CLI doesn't
|
||||
# know the real per-token price and silently falls
|
||||
# back to Sonnet rates — over-billing the user ~5x.
|
||||
# Recompute from a known rate card for non-Anthropic
|
||||
# OpenRouter slugs so the cost row, the rate-limit
|
||||
# counter, and the UI cost display reflect reality.
|
||||
state.usage.cost_usd = _override_cost_for_non_anthropic(
|
||||
raw_model=getattr(state.options, "model", None),
|
||||
# ``override_cost_usd`` replaces the CLI number with
|
||||
# the Moonshot rate-card estimate when the slug
|
||||
# matches, otherwise returns the SDK-reported value
|
||||
# unchanged. Reconcile
|
||||
# (``record_turn_cost_from_openrouter``) overrides
|
||||
# both with the authoritative bill when every gen-ID
|
||||
# lookup succeeds; this estimate only ships as the
|
||||
# sync-path cost or the reconcile's lookup-fail
|
||||
# fallback.
|
||||
state.usage.cost_usd = _override_cost_for_moonshot(
|
||||
model=getattr(state.options, "model", None),
|
||||
sdk_reported_usd=sdk_msg.total_cost_usd,
|
||||
prompt_tokens=state.usage.prompt_tokens,
|
||||
completion_tokens=state.usage.completion_tokens,
|
||||
|
||||
@@ -15,7 +15,6 @@ from .service import (
|
||||
_build_system_prompt_value,
|
||||
_is_sdk_disconnect_error,
|
||||
_normalize_model_name,
|
||||
_override_cost_for_non_anthropic,
|
||||
_prepare_file_attachments,
|
||||
_resolve_sdk_model,
|
||||
_safe_close_sdk_client,
|
||||
@@ -707,79 +706,3 @@ class TestIdleTimeoutConstant:
|
||||
|
||||
def test_idle_timeout_is_10_min(self):
|
||||
assert _IDLE_TIMEOUT_SECONDS == 10 * 60
|
||||
|
||||
|
||||
class TestOverrideCostForNonAnthropic:
|
||||
"""Verifies that turn costs routed through OpenRouter to non-Anthropic
|
||||
vendors use the platform's per-model rate card instead of the SDK
|
||||
CLI's static Anthropic pricing table — which silently falls back to
|
||||
Sonnet rates for unknown models and over-bills by ~5x."""
|
||||
|
||||
def test_kimi_cost_recomputed_from_rate_card(self):
|
||||
"""Kimi K2.6 @ $0.60 input / $2.80 output per MTok. 29564 prompt
|
||||
tokens + 78 completion should land at ~$0.018, not $0.09 (Sonnet)."""
|
||||
recomputed = _override_cost_for_non_anthropic(
|
||||
raw_model="moonshotai/kimi-k2.6",
|
||||
sdk_reported_usd=0.089862, # what the SDK CLI reported (Sonnet price)
|
||||
prompt_tokens=29564,
|
||||
completion_tokens=78,
|
||||
cache_read_tokens=0,
|
||||
cache_creation_tokens=0,
|
||||
)
|
||||
expected = (29564 * 0.60 + 78 * 2.80) / 1_000_000
|
||||
assert recomputed == pytest.approx(expected, rel=1e-9)
|
||||
# Sanity-check against a hand-computed magnitude.
|
||||
assert 0.017 < recomputed < 0.019
|
||||
|
||||
def test_anthropic_cost_unchanged(self):
|
||||
"""Anthropic slugs pass through the SDK-reported value since the
|
||||
CLI's pricing table is correct for them."""
|
||||
result = _override_cost_for_non_anthropic(
|
||||
raw_model="anthropic/claude-sonnet-4.6",
|
||||
sdk_reported_usd=0.089862,
|
||||
prompt_tokens=29564,
|
||||
completion_tokens=78,
|
||||
cache_read_tokens=0,
|
||||
cache_creation_tokens=0,
|
||||
)
|
||||
assert result == 0.089862
|
||||
|
||||
def test_unknown_non_anthropic_vendor_passes_through(self):
|
||||
"""A non-Anthropic slug not in the rate card falls back to the
|
||||
SDK-reported value — best-effort rather than misleading zero."""
|
||||
result = _override_cost_for_non_anthropic(
|
||||
raw_model="deepseek/some-new-model",
|
||||
sdk_reported_usd=0.05,
|
||||
prompt_tokens=10000,
|
||||
completion_tokens=500,
|
||||
cache_read_tokens=0,
|
||||
cache_creation_tokens=0,
|
||||
)
|
||||
assert result == 0.05
|
||||
|
||||
def test_none_model_passes_through(self):
|
||||
"""Subscription mode / no-model case returns the SDK value."""
|
||||
result = _override_cost_for_non_anthropic(
|
||||
raw_model=None,
|
||||
sdk_reported_usd=0.07,
|
||||
prompt_tokens=100,
|
||||
completion_tokens=10,
|
||||
cache_read_tokens=0,
|
||||
cache_creation_tokens=0,
|
||||
)
|
||||
assert result == 0.07
|
||||
|
||||
def test_cache_tokens_folded_into_prompt(self):
|
||||
"""Since the Moonshot endpoints don't report discounted cached-
|
||||
input pricing, cache_read/creation tokens are priced at the same
|
||||
rate as regular prompt tokens."""
|
||||
recomputed = _override_cost_for_non_anthropic(
|
||||
raw_model="moonshotai/kimi-k2.6",
|
||||
sdk_reported_usd=0.5,
|
||||
prompt_tokens=1000,
|
||||
completion_tokens=0,
|
||||
cache_read_tokens=5000,
|
||||
cache_creation_tokens=2000,
|
||||
)
|
||||
expected = (1000 + 5000 + 2000) * 0.60 / 1_000_000
|
||||
assert recomputed == pytest.approx(expected, rel=1e-9)
|
||||
|
||||
@@ -34,6 +34,7 @@ from .model import (
|
||||
update_session_title,
|
||||
upsert_chat_session,
|
||||
)
|
||||
from .token_tracking import persist_and_record_usage
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@@ -498,6 +499,13 @@ async def _generate_session_title(
|
||||
) -> str | None:
|
||||
"""Generate a concise title for a chat session based on the first message.
|
||||
|
||||
Also persists the title-generation call's cost to ``PlatformCostLog``
|
||||
so the admin dashboard's provider totals match the real OpenRouter
|
||||
bill. Before this, the title LLM call (a background task, one per
|
||||
session) bypassed the main turn's reconcile entirely and silently
|
||||
wasn't tracked — low per-call cost but 100% of sessions pay it, so
|
||||
it adds up.
|
||||
|
||||
Args:
|
||||
message: The first user message in the session
|
||||
user_id: User ID for OpenRouter tracing (optional)
|
||||
@@ -507,8 +515,11 @@ async def _generate_session_title(
|
||||
A short title (3-6 words) or None if generation fails
|
||||
"""
|
||||
try:
|
||||
# Build extra_body for OpenRouter tracing and PostHog analytics
|
||||
extra_body: dict[str, Any] = {}
|
||||
# Build extra_body for OpenRouter tracing and PostHog analytics.
|
||||
# ``usage: {"include": True}`` asks OR to embed the real billed
|
||||
# cost into the final usage chunk — matches the baseline path's
|
||||
# ``_OPENROUTER_INCLUDE_USAGE_COST`` pattern, same read path.
|
||||
extra_body: dict[str, Any] = {"usage": {"include": True}}
|
||||
if user_id:
|
||||
extra_body["user"] = user_id[:128] # OpenRouter limit
|
||||
extra_body["posthogDistinctId"] = user_id
|
||||
@@ -534,6 +545,17 @@ async def _generate_session_title(
|
||||
max_tokens=20,
|
||||
extra_body=extra_body,
|
||||
)
|
||||
|
||||
# Best-effort cost capture — the title call is a one-shot LLM
|
||||
# round we'd otherwise miss in admin totals. Runs inside the
|
||||
# ``try`` block so a persist failure downgrades to the existing
|
||||
# "title generation failed" warning path rather than raising.
|
||||
await _record_title_generation_cost(
|
||||
response=response,
|
||||
user_id=user_id,
|
||||
session_id=session_id,
|
||||
)
|
||||
|
||||
title = response.choices[0].message.content
|
||||
if title:
|
||||
# Clean up the title
|
||||
@@ -548,6 +570,63 @@ async def _generate_session_title(
|
||||
return None
|
||||
|
||||
|
||||
async def _record_title_generation_cost(
|
||||
*,
|
||||
response: Any,
|
||||
user_id: str | None,
|
||||
session_id: str | None,
|
||||
) -> None:
|
||||
"""Persist the title LLM call's cost to ``PlatformCostLog``.
|
||||
|
||||
Title generation runs in a background task per-session — low cost
|
||||
(~$0.0001 per title) but 100% of sessions pay it. Without this the
|
||||
admin dashboard under-reports total provider spend by the aggregate
|
||||
of those calls. Separate ``block_name="copilot:title"`` so the row
|
||||
is clearly distinguishable from the turn's main ``copilot:SDK`` /
|
||||
``copilot:baseline`` attributions.
|
||||
|
||||
Best-effort: any error downgrades to a debug log so title generation
|
||||
itself never breaks on a cost-tracking hiccup.
|
||||
"""
|
||||
try:
|
||||
usage = getattr(response, "usage", None)
|
||||
if usage is None:
|
||||
return
|
||||
prompt_tokens = getattr(usage, "prompt_tokens", 0) or 0
|
||||
completion_tokens = getattr(usage, "completion_tokens", 0) or 0
|
||||
# OR piggybacks the real billed cost on ``usage.cost`` when
|
||||
# ``extra_body={"usage":{"include":true}}`` — stashed in
|
||||
# pydantic's ``model_extra``. Absent for non-OR routes.
|
||||
extras = getattr(usage, "model_extra", None) or {}
|
||||
cost_raw = extras.get("cost") if isinstance(extras, dict) else None
|
||||
cost_usd: float | None
|
||||
try:
|
||||
cost_usd = float(cost_raw) if cost_raw is not None else None
|
||||
except (TypeError, ValueError):
|
||||
cost_usd = None
|
||||
|
||||
# ``persist_and_record_usage`` needs the session object to
|
||||
# append the usage row to the session's per-turn usage list —
|
||||
# load it lazily so the hot title path doesn't pay the DB round
|
||||
# trip unless a cost actually needs recording.
|
||||
session = None
|
||||
if session_id and user_id:
|
||||
session = await get_chat_session(session_id, user_id)
|
||||
|
||||
await persist_and_record_usage(
|
||||
session=session,
|
||||
user_id=user_id,
|
||||
prompt_tokens=prompt_tokens,
|
||||
completion_tokens=completion_tokens,
|
||||
log_prefix="[title]",
|
||||
cost_usd=cost_usd,
|
||||
model=config.title_model,
|
||||
provider="open_router",
|
||||
)
|
||||
except Exception as exc: # noqa: BLE001
|
||||
logger.debug("Title cost tracking skipped: %s", exc)
|
||||
|
||||
|
||||
async def _update_title_async(
|
||||
session_id: str, message: str, user_id: str | None = None
|
||||
) -> None:
|
||||
|
||||
Reference in New Issue
Block a user