## Why

Investigation of two reported sessions ([85804387](https://dev-builder.agpt.co/copilot?sessionId=85804387-7708-4fdc-8ec9-64283cdd902d), [19d69dec](https://dev-builder.agpt.co/copilot?sessionId=19d69dec-210f-4439-a94b-2d7d443b9909)) in which Kimi K2.6 via OpenRouter was running ~30 min per turn with no actions completed (Discord report from Toran). Langfuse traces showed:

- 31 generation calls per turn at p90 = 151s, max = 415s
- 2.57M uncached tokens, `cache_create=0`, ~4% cache_read — Moonshot's OpenRouter endpoint silently drops Anthropic-style cache writes
- **3 SDK-internal compactions per turn** — each compaction is itself a slow LLM round-trip
- Reconciled OpenRouter cost was being recorded to a DB row but never surfaced on the Langfuse trace, leaving operators to grep pod logs

## What

Four commits, split by concern.

### 1. `fix(backend/copilot): skip CLAUDE_AUTOCOMPACT_PCT_OVERRIDE for Moonshot/Kimi` (`5fd9c5aa`)

`env.py` was unconditionally setting `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=50` (introduced in #12747 to cap cache-creation cost on Anthropic, where context >200K accounted for 54% of total cost). On Kimi, where `cache_create=0` silently, the cache-cost rationale doesn't apply — but the 50% threshold still made the bundled CLI auto-compact at ~100K tokens, triggering 3+ compactions per turn against Kimi's larger effective window. Each compaction added a slow LLM round-trip (one in our test ran 166s and burned the budget cap before the user got any output).

Threads the resolved `sdk_model` (and `fallback_model`) into `build_sdk_env` and skips the env var when the model matches `is_moonshot_model(...)`. The CLI then uses its default ~93% threshold, cutting compaction passes to 0–1.
### 2. `feat(backend/copilot): backfill OpenRouter reconciled cost to Langfuse trace` (`f3de3624` + follow-ups `5ce3d038`, `d2c1a2cd`, `d8e08525`, `d243bf6c9`)

`record_turn_cost_from_openrouter` runs as a fire-and-forget task after the OTel span closes, so the Langfuse trace UI showed only the SDK CLI's rate-card estimate — for non-Anthropic OpenRouter routes that estimate is Sonnet pricing applied to Kimi tokens (~5x too high). The backfill captures `langfuse.get_current_trace_id()` and threads it into the reconcile task, which emits an `openrouter-cost-reconcile` child event with the authoritative cost + token usage.

**Bug caught during /pr-test:** `propagate_attributes` only annotates an existing OTel span; it doesn't create one — by the time the `finally` block runs, SDK-emitted spans have ended and `get_current_trace_id()` returns None. Fixed in `d8e08525` by wrapping the turn in `langfuse.start_as_current_span(name="copilot-sdk-turn")`. Also tags fallback-path events with `cost_source` so operators can distinguish reconciled vs estimated turns.

### 3. `feat(backend/copilot): expose CLAUDE_AUTOCOMPACT_PCT_OVERRIDE as a config knob` (`72416f73`)

The previously hardcoded `50` is now `claude_agent_autocompact_pct_override` (default 50, env `CHAT_CLAUDE_AGENT_AUTOCOMPACT_PCT_OVERRIDE`). Setting it to 0 omits the env var entirely so the CLI uses its native ~93% threshold — useful when the post-compact floor (system prompt + tool defs ≈ 65–110K) sits close to an aggressive trigger and operators see back-to-back compaction cascades. Moonshot routes still skip the env var unconditionally, regardless of config.

### 4. `fix(backend/copilot): align SDK retry compaction target with CLI autocompact threshold` (`730ad256`)

`_reduce_context` was calling `compact_transcript` without an explicit `target_tokens`, so it fell back to `get_compression_target(model) = context_window - 60K`.
For Sonnet 200K that's 140K — well above the CLI's PCT=50 trigger of 90K — and for Kimi 256K it's 196K, above the CLI's default 167K trigger. Result: a successful retry compaction landed at 140K/196K and the CLI immediately re-compacted on the next call → **two compactions per recovered turn**. New `_compaction_target_tokens(model)` mirrors the CLI's `i6_()` formula (`min(window * pct/100, window - 13K)`) with a 20K safety buffer, so the post-compact context sits comfortably below the CLI's trigger.

## How — empirical validation against the actual long Kimi transcript

Replayed the 199-message transcript from session 85804387 through the bundled CLI in two configurations:

| | Post-fix (no override) | Pre-fix (`PCT_OVERRIDE=50`) |
|---|---|---|
| `autocompact: tokens=` | 126,312 | 126,341 |
| `threshold=` | **167,000** | **90,000** |
| Decision | 126K < 167K → **skip** | 126K > 90K → **COMPACTION FIRES** |
| Duration | 21s | **166s** (8x slower) |
| Cost | $0.34 | **$0.82** (2.4x more) |
| Output | PONG (success) | empty (hit $0.50 budget cap, exit 1) |

The pre-fix configuration burned $0.82 of compaction work over 166s and never produced a user response — exactly the failure mode reported.

**Why the cascade happens at 50% but not at 93%:** post-compaction context is `summary (~5–10K) + system_prompt + tool_definitions + skills + active TodoWrite + memory ≈ 65–110K floor`. With the trigger at 90K, the post-compact floor sits at or above the trigger → the next assistant message tips it over → immediate re-compaction → cascade until the CLI's rapid-refill breaker trips at 3 attempts. With the trigger at 167K, the same floor sits comfortably below it → no cascade.

## Considered but not done

- **Force `cache_control` markers to reach Moonshot**: the bundled CLI sends them by default; Moonshot silently drops them per their own docs (it uses `X-Msh-Context-Cache` headers, not body markers). A real fix needs bypassing OpenRouter — out of scope.
- **Slim the system prompt + tool definitions** to lower the post-compact floor: a real win, but a separate refactor that needs a tool-use accuracy A/B.
- **LD-driven auto-fallback to Sonnet on Kimi degradation**: `claude_agent_fallback_model` already wires `--fallback-model` for overload (529); auto-flipping on slowness needs latency-aggregation infra that doesn't exist yet.

## Test plan

- [x] `poetry run pytest backend/copilot/sdk/env_test.py backend/copilot/sdk/openrouter_cost_test.py backend/copilot/sdk/service_helpers_test.py` — 111 passed (37 env + 23 cost + 51 helpers, including 6 new env tests, 3 backfill tests, and 6 new compaction-target tests)
- [x] `poetry run pytest backend/copilot/sdk/` — 970+ passed
- [x] `poetry run pyright .` — 0 errors
- [x] `poetry run format` — clean
- [x] `/pr-test --fix` end-to-end against dev — 5/5 scenarios PASS, including the Anthropic route ($0.0174 cost, +0.0% delta) and the Moonshot route ($0.028 vs $0.018 → +58.2% delta, validating the reconcile rationale)
- [x] Transcript replay validation: pre-fix vs post-fix on the real 126K-token transcript → 8x slower / 2.4x more expensive / fails entirely pre-fix; clean PONG post-fix
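Stepping back to commit 4, the target calculation can be sketched from the trigger formula quoted above. Only the `i6_()` formula and the 20K buffer come from this PR text; the function names, the integer arithmetic, and the window/pct values in the checks are illustrative assumptions (the real `_compaction_target_tokens(model)` resolves window and pct from the model).

```python
# Sketch of the retry-compaction target aligned with the CLI's autocompact
# trigger. Formula and buffer are quoted from the PR description; everything
# else (names, exact rounding) is an assumption for illustration.

SAFETY_BUFFER = 20_000  # keep the post-compact context well under the trigger


def cli_autocompact_trigger(context_window: int, pct: int) -> int:
    """Mirror of the bundled CLI's i6_() threshold: min(window*pct/100, window-13K)."""
    return min(context_window * pct // 100, context_window - 13_000)


def compaction_target_tokens(context_window: int, pct: int) -> int:
    """Target for SDK retry compaction: SAFETY_BUFFER below the CLI trigger,
    so a successful retry compaction cannot immediately re-trip autocompact."""
    return cli_autocompact_trigger(context_window, pct) - SAFETY_BUFFER
```

The point of the buffer is the invariant, not the exact numbers: for any window/pct pair, `compaction_target_tokens(...) < cli_autocompact_trigger(...)`, which is what prevents the two-compactions-per-recovered-turn pattern.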