Commit Graph

8634 Commits

Author SHA1 Message Date
majdyz
12601f3ab9 fix(copilot): cap sessionModes at 200 entries to prevent localStorage leak 2026-04-13 12:54:53 +00:00
majdyz
47be9c7024 fix(copilot): default to thinking mode for sessions without recorded mode
Sessions created before the mode fix had no recorded mode. Previously
restoreSessionMode would leave the global mode unchanged (whatever it
was set to on another session). Now defaults to extended_thinking when
no mode is recorded — no need to clear localStorage.
2026-04-13 12:52:01 +00:00
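
The fallback described in the commit above might look roughly like the sketch below. It is written in Python for illustration only, since the real store is a frontend Zustand store. The function signature and the `VALID_MODES` set are assumptions; only the two mode names and the `extended_thinking` default come from the commit messages.

```python
# Hypothetical sketch: the mode names come from the log, the rest is assumed.
VALID_MODES = {"fast", "extended_thinking"}
DEFAULT_MODE = "extended_thinking"


def restore_session_mode(session_modes: dict, session_id: str) -> str:
    """Return the recorded mode, or the default when none (or junk) is stored."""
    mode = session_modes.get(session_id)
    return mode if mode in VALID_MODES else DEFAULT_MODE
```

A session created before the fix has no entry in the map, so the lookup falls through to the default instead of inheriting whatever another session last set.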
majdyz
c9fadf20e1 fix(copilot): record current session mode before switching away
Old sessions (created before the mode fix) didn't have a recorded
mode, so switching away and back would lose the mode. Now we record
the current mode for the departing session before switching.
2026-04-13 12:48:03 +00:00
majdyz
7d16258a98 perf(copilot): reduce tool output truncation from 500K to 100K chars
500K chars (~125K tokens) per tool result was too generous — a few
large tool outputs could push context past 200K+ tokens. 100K chars
(~25K tokens) keeps individual results reasonable. The SDK writes
oversized results to tool-results/ files and returns a reference.
2026-04-13 12:24:35 +00:00
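
A minimal sketch of the character cap described in the commit above. The helper names, the truncation marker, and the 4-characters-per-token ratio (implied by "500K chars (~125K tokens)") are assumptions, and the SDK's hand-off of oversized results to tool-results/ files is not reproduced here.

```python
MAX_TOOL_OUTPUT_CHARS = 100_000  # new cap from the commit (previously 500_000)
CHARS_PER_TOKEN = 4  # rough ratio implied by the commit message, not measured


def approx_tokens(num_chars: int) -> int:
    """Very rough token estimate from a character count."""
    return num_chars // CHARS_PER_TOKEN


def truncate_tool_output(text: str, limit: int = MAX_TOOL_OUTPUT_CHARS) -> str:
    """Keep the first `limit` characters and append a short marker.

    The marker itself is not counted against the limit.
    """
    if len(text) <= limit:
        return text
    omitted = len(text) - limit
    return text[:limit] + f"\n[truncated {omitted} chars]"
```

With these assumptions, `approx_tokens(500_000)` gives 125000 and `approx_tokens(100_000)` gives 25000, matching the figures quoted in the message.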
majdyz
ac054c31f6 perf(copilot): trigger compaction at 100K tokens instead of 140K
Set CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=50 to compact at 50% of 200K
context window (100K) instead of the default 70% (140K). Context
>200K accounts for 54% of cost despite being only 3% of calls.
Earlier compaction keeps context smaller and reduces cache creation.
2026-04-13 12:15:52 +00:00
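
The threshold arithmetic in the commit above can be checked with a short sketch. The function names are hypothetical, and how the CLI actually reads `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE` is not shown in the log; the 70% default and the 200K window are taken from the message.

```python
CONTEXT_WINDOW_TOKENS = 200_000  # window size quoted in the commit message


def compaction_threshold(pct: int, window: int = CONTEXT_WINDOW_TOKENS) -> int:
    """Token count at which compaction would trigger for a given percentage."""
    return window * pct // 100


def threshold_from_env(env: dict, default_pct: int = 70) -> int:
    """Assumed behaviour: fall back to the default percentage when unset."""
    pct = int(env.get("CLAUDE_AUTOCOMPACT_PCT_OVERRIDE", default_pct))
    return compaction_threshold(pct)
```

Passing `{"CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "50"}` yields 100000 tokens, and an empty mapping yields 140000.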
majdyz
1d3cce0ebf fix(copilot): strip <internal_reasoning> tags from Sonnet response stream
Models without extended thinking (e.g. Sonnet) sometimes emit
<internal_reasoning>...</internal_reasoning> tags as visible text.
Extract ThinkingStripper to a shared module and apply it to the SDK
streaming path so these tags are stripped before reaching the SSE
client and the persisted message.
2026-04-13 11:50:43 +00:00
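
One way to strip such tags from a chunked stream is sketched below. This is not the repository's ThinkingStripper: the class name, the `feed`/`flush` methods, and the buffering strategy are assumptions chosen to show how a tag split across chunks can still be removed.

```python
OPEN = "<internal_reasoning>"
CLOSE = "</internal_reasoning>"


def _partial_suffix(text: str, tag: str) -> int:
    """Length of the longest proper prefix of `tag` that `text` ends with."""
    for k in range(min(len(text), len(tag) - 1), 0, -1):
        if text.endswith(tag[:k]):
            return k
    return 0


class TagStripper:
    """Hypothetical streaming filter that drops <internal_reasoning> blocks."""

    def __init__(self) -> None:
        self._buf = ""
        self._inside = False

    def feed(self, chunk: str) -> str:
        self._buf += chunk
        out = []
        while True:
            if self._inside:
                end = self._buf.find(CLOSE)
                if end == -1:
                    # Drop hidden text; keep a tail that may begin CLOSE.
                    self._buf = self._buf[-(len(CLOSE) - 1):]
                    break
                self._buf = self._buf[end + len(CLOSE):]
                self._inside = False
            else:
                start = self._buf.find(OPEN)
                if start != -1:
                    out.append(self._buf[:start])
                    self._buf = self._buf[start + len(OPEN):]
                    self._inside = True
                    continue
                # Hold back a trailing fragment that could become OPEN.
                keep = _partial_suffix(self._buf, OPEN)
                cut = len(self._buf) - keep
                out.append(self._buf[:cut])
                self._buf = self._buf[cut:]
                break
        return "".join(out)

    def flush(self) -> str:
        """Emit any held-back visible text at end of stream."""
        rest = "" if self._inside else self._buf
        self._buf, self._inside = "", False
        return rest
```

Feeding the chunks `"Hi <inter"`, `"nal_reasoning>secret</internal_"`, and `"reasoning> there"` and then flushing produces `"Hi  there"`: the hidden text never reaches the output even though both tags were split across chunk boundaries.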
majdyz
ea1d8485f5 fix: resolve openapi.json merge conflict — keep cost_bearing_request_count 2026-04-13 11:39:01 +00:00
majdyz
364d98aab6 fix(copilot): remove effort=low default to prevent internal_reasoning leak
effort=low on Sonnet causes <internal_reasoning> tags to leak into
visible output. Changed default to None (let model decide). Only
passed to SDK when explicitly set via CHAT_CLAUDE_AGENT_THINKING_EFFORT.
2026-04-13 11:36:16 +00:00
majdyz
f121dcd5c8 Resolve merge conflicts in copilot baseline service files
Keep HEAD's pre-drain count logic for transcript loading and drain error
handling, and merge incoming cache token extraction tests from PR #12762.
2026-04-13 10:49:02 +00:00
majdyz
ea0b5f70ad Fix merge conflict in platform_cost.py crashing all new pods
Resolve conflicts between cost dashboard PR (#12757) and cache token
columns PR (#12762). Keep all HEAD-side functionality (percentile
queries, histogram buckets, cost-bearing request counts, unfiltered
aggregate) while retaining cache token fields from the incoming side.
2026-04-13 10:37:49 +00:00
majdyz
dbaaa88e1b perf(copilot): switch default model from Opus to Sonnet
Opus at $15/$75 per M tokens is unsustainable for agentic sessions
(1M+ context after 30+ turns = $7+/turn). Sonnet at $3/$15 per M
is 5x cheaper with comparable quality for most tasks.

Override via CHAT_MODEL=anthropic/claude-opus-4.6 for premium tier.
2026-04-13 10:25:49 +00:00
majdyz
ec2acfb9e3 fix(frontend): add cache token fields to UserCostSummary in openapi.json
The backend added total_cache_read_tokens and total_cache_creation_tokens
to UserCostSummary but the OpenAPI spec was not updated, causing frontend
build failures.
2026-04-13 10:13:18 +00:00
majdyz
69e9a5bb22 fix(frontend): add cache token fields to UserCostSummary in openapi.json
The backend added total_cache_read_tokens and total_cache_creation_tokens
to UserCostSummary but the OpenAPI spec was not updated, causing frontend
build failures.
2026-04-13 10:12:44 +00:00
majdyz
95087cd170 Merge branch 'fix/copilot-mode-per-session' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 09:58:49 +00:00
majdyz
1e7eadce26 fix(copilot): validate persisted session modes, add removeSessionMode, fix useEffect deps
- Validate entries from localStorage before constructing the sessionModes map,
  filtering out corrupt/unknown mode strings (addresses CodeRabbit review)
- Add removeSessionMode action and call it on session delete so the map does
  not grow unboundedly
- Add recordSessionMode to the useEffect dependency array to avoid stale-closure risk
- Add clarifying comment to restoreSessionMode no-op branch
- Extend tests to cover removeSessionMode, no-op, and corrupt-localStorage behaviour
2026-04-13 09:57:14 +00:00
majdyz
1485d1910c Merge branch 'fix/sse-replay-deduplication' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 09:56:12 +00:00
majdyz
89c9c649d8 fix: resolve merge conflicts in UserTable.tsx — keep all columns (avg cost + cache read/write) 2026-04-13 09:55:56 +00:00
majdyz
a17f05f2b1 fix(copilot): scope dedup fingerprint by user message ID instead of text
Using user message text as the context key caused the deduplicator to
drop the second assistant reply when a user asked the same question twice
in one session. Switching to user message ID (which is unique per turn)
fixes the false positive while still preventing SSE-replayed duplicates.

Adds a regression test covering the same-question-twice scenario.
2026-04-13 09:55:54 +00:00
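
The scoping change described above can be illustrated with a small sketch. It is in Python rather than the frontend's own language, the `Message` type is hypothetical, and it assumes user message IDs stay stable across an SSE replay while replayed assistant messages receive fresh IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    id: str
    role: str
    content: str


def deduplicate_messages(messages: list) -> list:
    """Drop replayed duplicates, scoping assistant content by user message ID."""
    seen = set()
    kept = []
    context_id = None  # ID of the most recent user message
    for msg in messages:
        if msg.role == "user":
            context_id = msg.id
            key = ("user", msg.id)
        else:
            # Same content under a different user turn is a different key.
            key = (msg.role, context_id, msg.content)
        if key in seen:
            continue
        seen.add(key)
        kept.append(msg)
    return kept
```

Keying on the user message ID keeps both replies when the same question is asked twice, because the two turns have different IDs, while a replayed copy of a reply under the same turn is still dropped.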
majdyz
62e4a8d3a4 Merge branch 'fix/copilot-mode-per-session' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 09:54:21 +00:00
majdyz
c6af52033d fix(copilot): fix multi-turn cost over-estimation and add cache_creation_tokens extraction
Bug 1: Fallback cost estimation was using accumulated turn_prompt_tokens /
turn_completion_tokens across all tool-call rounds, causing compounding
over-estimation on the 2nd+ turn. Snapshot token counts before each call and
pass only the per-call delta to _estimate_cost_from_tokens.

Bug 2: turn_cache_creation_tokens was defined but never populated. Extract
cache_creation_input_tokens from prompt_tokens_details (available from some
providers such as Anthropic via OpenRouter).

Add regression tests for both fixes.
2026-04-13 09:53:05 +00:00
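
The effect of Bug 1 can be reproduced with a toy model. This is a sketch under assumptions: the function names are hypothetical, the prices are the Sonnet figures quoted elsewhere in this log ($3/$15 per M tokens), and the repository's actual accounting may differ in detail.

```python
IN_PRICE_PER_M = 3.0    # USD per million prompt tokens (Sonnet figure from the log)
OUT_PRICE_PER_M = 15.0  # USD per million completion tokens


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    return (prompt_tokens * IN_PRICE_PER_M
            + completion_tokens * OUT_PRICE_PER_M) / 1_000_000


def cost_from_accumulated(calls: list) -> float:
    """Buggy pattern: each call is charged for the running totals."""
    total_p = total_c = 0
    cost = 0.0
    for p, c in calls:
        total_p += p
        total_c += c
        cost += estimate_cost(total_p, total_c)
    return cost


def cost_from_deltas(calls: list) -> float:
    """Fixed pattern: snapshot before each call, charge only the delta."""
    total_p = total_c = 0
    cost = 0.0
    for p, c in calls:
        before_p, before_c = total_p, total_c
        total_p += p
        total_c += c
        cost += estimate_cost(total_p - before_p, total_c - before_c)
    return cost
```

For two calls of (1000, 100) and (2000, 200) tokens, the delta version charges each token once, while the accumulated version charges the first call's tokens again on the second call, and the gap widens with every further round.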
majdyz
1df9369dc3 perf(copilot): add effort=low thinking control + raise budget to $15
- Add claude_agent_thinking_effort config (default: 'low') to control
  thinking depth. 'low' minimizes thinking token usage — the #1 cost
  driver at 49% of total spend.
- Raise max_budget_usd from $5 to $15 — $5 was below p50 ($5.37),
  causing half of all turns to get budget-killed mid-task.
- Log raw SDK usage dict to discover thinking token fields.
2026-04-13 09:43:26 +00:00
majdyz
f6c7d1eaf7 fix(copilot): baseline cost tracking fallback and dashboard cache token display
When OpenRouter's x-total-cost header is missing, estimate cost from
token counts using a known model pricing table so cost is always logged.
Also extract cache token details from streaming usage chunks
(prompt_tokens_details.cached_tokens) and pass them through to
PlatformCostLog.

On the dashboard side, add cache read/write columns to the logs table
and user table, and include cache tokens in the UserCostSummary backend
model so they surface in the API response.
2026-04-13 09:39:44 +00:00
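
A hedged sketch of the fallback described above: prefer the provider-reported cost, and estimate from token counts only when it is missing. The function name, the table keys, and the way the header value is passed in are assumptions; the prices are the Opus and Sonnet figures quoted elsewhere in this log.

```python
from typing import Optional

# USD per million (prompt, completion) tokens; keys are illustrative labels.
MODEL_PRICING = {
    "opus": (15.0, 75.0),
    "sonnet": (3.0, 15.0),
}


def cost_or_estimate(
    header_cost: Optional[str],
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> Optional[float]:
    """Use the reported cost if present, else estimate from the pricing table."""
    if header_cost is not None:
        return float(header_cost)
    pricing = MODEL_PRICING.get(model)
    if pricing is None:
        return None  # unknown model: nothing to estimate from
    in_price, out_price = pricing
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000
```

Under these prices the same token counts cost five times more on the Opus row than on the Sonnet row, which is the ratio cited in the model-switch commit.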
majdyz
85f76230a9 debug(copilot): log raw SDK usage dict to discover thinking token fields
Temporary debug logging to see all fields in ResultMessage.usage —
need to confirm if thinking_tokens or similar is available but not
being captured.
2026-04-13 09:35:05 +00:00
majdyz
f63440e955 fix(copilot): store mode per session so indicator updates on switch
The copilot mode (fast/extended_thinking) was stored as a single global
value. When switching between sessions, the mode indicator stayed on
whatever was last set globally rather than reflecting the mode each
session was created with.

Add a sessionModes map to the Zustand store that records the active
copilotMode when a session is created and restores it when the user
switches back to that session.
2026-04-13 09:32:45 +00:00
majdyz
f52c1e1f24 fix(copilot): raise max_budget_usd from $5 to $15
$5 was too aggressive — p50 cost is $5.37 so half of all turns were
getting budget-killed mid-task with no value delivered. $15 covers p75
($13.07) so ~75% of tasks complete. The thinking token cap is the
better cost lever but needs verification first.
2026-04-13 08:47:16 +00:00
majdyz
b5216da2d8 fix(copilot): disable gzip on API responses to prevent ZlibError
Add Accept-Encoding: identity to ANTHROPIC_CUSTOM_HEADERS in
build_sdk_env() to prevent ZlibError decompression failures in the
CLI subprocess. Appended after any existing custom headers (OpenRouter
trace headers).

See: oven-sh/bun#23149, anthropics/claude-code#18302
2026-04-13 08:26:01 +00:00
majdyz
ffa74177d0 fix: add ::timestamptz casts to raw SQL datetime comparisons in _build_raw_where
The raw SQL WHERE clause builder was passing datetime parameters without
explicit type casts, causing PostgreSQL to fail with "operator does not
exist: timestamp without time zone >= text".
2026-04-13 08:23:43 +00:00
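
The cast described above could be applied in a WHERE-clause builder along these lines. This is a sketch, not the repository's `_build_raw_where`: the function name, the `"createdAt"` column, and the `$n` placeholder style are assumptions.

```python
from datetime import datetime, timezone
from typing import Optional


def build_raw_where(
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    column: str = '"createdAt"',  # hypothetical column name
):
    """Build a parameterised WHERE fragment with explicit timestamptz casts."""
    clauses = []
    params = []
    if start is not None:
        params.append(start)
        # The explicit cast stops PostgreSQL from treating the parameter as text.
        clauses.append(f"{column} >= ${len(params)}::timestamptz")
    if end is not None:
        params.append(end)
        clauses.append(f"{column} <= ${len(params)}::timestamptz")
    where = " AND ".join(clauses) if clauses else "TRUE"
    return where, params


where, params = build_raw_where(
    datetime(2026, 4, 1, tzinfo=timezone.utc),
    datetime(2026, 4, 13, tzinfo=timezone.utc),
)
print(where)
```

Without the `::timestamptz` suffix, a driver that sends the parameter as text leaves PostgreSQL comparing a timestamp column against text, which is the operator error quoted in the commit.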
majdyz
b6b94a2244 Merge branch 'fix/sse-replay-deduplication' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 08:06:29 +00:00
majdyz
7cadce4c7b fix(copilot): deduplicate SSE-replayed messages by content fingerprint
When the SSE connection reconnects, resume_session_stream replays from
"0-0" and the replayed UIMessage objects get new IDs from useChat,
bypassing the adjacent-only content dedup. Switch deduplicateMessages
to track all seen role+context+content fingerprints globally, scoped
by the preceding user message to avoid false positives when the
assistant legitimately gives identical answers to different prompts.
2026-04-13 08:04:04 +00:00
majdyz
00a20bdfe6 fix(copilot): deduplicate SSE-replayed messages by content fingerprint
When the SSE connection reconnects, resume_session_stream replays from
"0-0" and the replayed UIMessage objects get new IDs from useChat,
bypassing the adjacent-only content dedup. Switch deduplicateMessages
to track all seen role+context+content fingerprints globally, scoped
by the preceding user message to avoid false positives when the
assistant legitimately gives identical answers to different prompts.
2026-04-13 08:03:51 +00:00
majdyz
e0ddb7d4d4 Merge branch 'feat/enhanced-cost-dashboard' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs
# Conflicts:
#	autogpt_platform/backend/backend/data/platform_cost_test.py
2026-04-13 08:03:15 +00:00
majdyz
d8d0f752b5 Merge branch 'feat/builder-chat-panel' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs
# Conflicts:
#	autogpt_platform/backend/backend/data/platform_cost_test.py
2026-04-13 08:02:58 +00:00
majdyz
c64d5a9c92 Merge branch 'perf/copilot-prompt-caching' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 08:02:37 +00:00
majdyz
f8bca6f4bc Merge commit '2cf737dc0508a7753d067ed8425cfc0ef657b29f' into preview/all-prs
# Conflicts:
#	autogpt_platform/backend/backend/copilot/config.py
2026-04-13 08:02:31 +00:00
majdyz
6c21e58d31 Merge branch 'fix/orchestrator-per-iteration-cost' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 08:01:50 +00:00
majdyz
895c9a0d29 Merge branch 'feat/copilot-pending-messages' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 08:01:45 +00:00
majdyz
84e877e36d Merge branch 'fix/schedule-agent-cred-setup-ux' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 08:01:40 +00:00
majdyz
a504ad6e1e Merge branch 'chore/sdk-dev-preview-0.1.58-with-proxy' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 08:01:33 +00:00
majdyz
ca0c95b593 fix(frontend): add SUBSCRIPTION to CreditTransactionType enum in openapi.json
Syncs the OpenAPI spec with the Prisma schema which already includes the
SUBSCRIPTION enum value in CreditTransactionType.
2026-04-13 07:13:21 +00:00
majdyz
cbf309c9e4 Merge branch 'dev' of https://github.com/Significant-Gravitas/AutoGPT into feat/copilot-pending-messages 2026-04-13 07:12:49 +00:00
majdyz
6ccb44e0d5 fix(copilot): add 404/429 to route decorator, reformat routes.py, regenerate openapi.json
Add responses={404, 429} to the pending endpoint's @router.post decorator
so FastAPI auto-generates them in the OpenAPI spec. Previously these were
only manually added to openapi.json and the CI schema-check (export +
diff) stripped them. Also apply black formatting to the long warning
line that was failing the backend lint check.
2026-04-13 07:04:07 +00:00
majdyz
e558c60104 fix(orchestrator): don't propagate non-billing charge errors as tool failures
Non-IBE exceptions from charge_node_usage (e.g. DB timeout) were
re-raised and caught by the outer generic handler, incorrectly marking
a successful tool execution as failed. This could cause the LLM to
retry side-effectful operations. Now logs the error and continues to
the success path since the tool itself completed successfully.
2026-04-13 07:02:10 +00:00
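
The control flow described above might be sketched as follows. The exception class and function names are hypothetical, and reading "IBE" as an insufficient-balance error is an assumption based on the notification helper named in a neighbouring commit.

```python
import logging

logger = logging.getLogger(__name__)


class InsufficientBalanceError(Exception):
    """Stand-in for the billing error the log abbreviates as IBE (assumed)."""


def run_tool_and_charge(run_tool, charge):
    """Run a tool, then bill for it without turning billing glitches into failures."""
    result = run_tool()  # the tool has completed, side effects included
    try:
        charge()
    except InsufficientBalanceError:
        raise  # billing errors still propagate to the caller
    except Exception:
        # e.g. a DB timeout: log it and continue down the success path
        logger.exception("charging failed after a successful tool run")
    return result
```

Reporting the run as failed here would invite the LLM to retry a side-effectful operation that already succeeded, so only the insufficient-balance case is allowed to interrupt the flow.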
majdyz
fbad856538 fix(backend/copilot): relax schedule race test assertion for setup_test_data fixture
The setup_test_data fixture creates a graph with credentials already
embedded in node defaults. The DB-stored credential schema may not
surface these as "missing" in build_missing_credentials_from_graph,
so assert the key exists rather than asserting non-empty count.
2026-04-13 06:59:18 +00:00
majdyz
3ebfa3d68b fix(backend/copilot): address round-6 review — DRY validation handler, improve tests
- Extract duplicated GraphValidationError handler from _run_agent and
  _schedule_agent into _handle_graph_validation_race helper method
- Use generator expressions instead of list comprehension for
  short-circuit evaluation in _build_setup_requirements_from_validation_error
- Improve mixed-error fallback message to be more user-friendly
- Add test for empty node_errors={} edge case
- Pin expected credential count in firecrawl fixture tests
- Add missing_credentials assertion to schedule race E2E test
- Add test for extras present with node_errors=None in service_test
2026-04-13 06:45:28 +00:00
majdyz
5ff46ff207 fix(backend): address review feedback on orchestrator billing
- Extract post-execution billing into _handle_post_execution_billing()
- Deduplicate IBE notification into _try_send_insufficient_funds_notif()
- Combine _charge_usage + _handle_low_balance into single thread dispatch
- Sanitize error messages to LLM (no internal details leaked)
- Default _is_error to True (fail-closed) for tool responses
- Add IBE propagation contract to OrchestratorBlock class docstring
- Reduce per-site IBE comments to one-liners referencing class docstring
- Fix _resolve_block_cost return type annotation (Block | None)
- Move test imports to module level, fix test_default_block_returns_zero
- Add tests for non-IBE billing failure and _charge_usage(count=0)
- Fix Black formatting (CI lint blocker)
2026-04-13 06:44:20 +00:00
majdyz
2cf737dc05 fix(backend): address review comments on cross-user prompt caching PR
- Add TODO(#12747) to _SystemPromptPreset for cleanup tracking
- Update docstring to note SDK version and migration path
- Add debug logging in _build_system_prompt_value for observability
- Document empty-string edge case in docstring
- Trim redundant block comment at call site to single line
- Add test for empty-string system_prompt with cache enabled
- Add test for CHAT_CLAUDE_AGENT_CROSS_USER_PROMPT_CACHE=false env var
2026-04-13 06:43:57 +00:00
majdyz
040637dd68 fix: force cost_usd for percentile/histogram queries, fix test + prettier
- Backend: always pass tracking_type=None to _build_raw_where for
  percentile and histogram queries so they compute stats on cost_usd
  rows regardless of the caller's tracking_type filter.
- Frontend test: use getAllByText for "5" which appears in both the
  Active Users card and the $1-2 bucket count.
- Frontend: fix prettier formatting in PlatformCostContent.tsx.
2026-04-13 06:36:59 +00:00
majdyz
90d8ae0ae2 fix(copilot): map non-E2B file tools in permissions and fix lint formatting
In non-E2B mode, to_sdk_names() failed to map whitelisted SDK built-in
file tool names (Write, Edit, Read) to their MCP-prefixed equivalents
(mcp__copilot__Write, etc.), causing them to be incorrectly filtered out
when users configured tool whitelists.

Add _SDK_TO_MCP mapping for non-E2B mode that maps Read->read_file,
Write->Write, Edit->Edit. Add test coverage for this case.

Also fix black formatting in permissions_test.py that was causing CI lint
failure.
2026-04-13 06:34:55 +00:00
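
A rough sketch of the mapping described above. The table contents and the `mcp__copilot__` prefix come from the commit message; the function body, and in particular passing unmapped names through unchanged, are assumptions rather than the repository's `to_sdk_names()`.

```python
MCP_PREFIX = "mcp__copilot__"

# SDK built-in file tool name -> MCP tool name (non-E2B mode, per the commit).
_SDK_TO_MCP = {
    "Read": "read_file",
    "Write": "Write",
    "Edit": "Edit",
}


def to_sdk_names(whitelist: list) -> list:
    """Translate whitelisted built-in file tools to their MCP-prefixed names."""
    names = []
    for name in whitelist:
        if name in _SDK_TO_MCP:
            names.append(MCP_PREFIX + _SDK_TO_MCP[name])
        else:
            names.append(name)  # assumption: other tools pass through unchanged
    return names
```

With this table, a whitelist of `["Read", "Write", "Edit"]` maps to `mcp__copilot__read_file`, `mcp__copilot__Write`, and `mcp__copilot__Edit`, so these tools are no longer filtered out when a whitelist is configured.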
majdyz
967f0c97c4 fix(copilot): fix black formatting for single-line ValueError raise 2026-04-13 06:29:25 +00:00
majdyz
7dc4319125 fix: correct group_by count in test_passes_filters_to_queries
The 6th group_by (total agg no-tracking-type) only runs when
tracking_type is set. This test doesn't pass tracking_type, so the
expected count is 5, not 6.
2026-04-13 05:28:12 +00:00