Commit Graph

8622 Commits

Author SHA1 Message Date
majdyz
ec2acfb9e3 fix(frontend): add cache token fields to UserCostSummary in openapi.json
The backend added total_cache_read_tokens and total_cache_creation_tokens
to UserCostSummary but the OpenAPI spec was not updated, causing frontend
build failures.
2026-04-13 10:13:18 +00:00
majdyz
95087cd170 Merge branch 'fix/copilot-mode-per-session' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 09:58:49 +00:00
majdyz
1e7eadce26 fix(copilot): validate persisted session modes, add removeSessionMode, fix useEffect deps
- Validate entries from localStorage before constructing the sessionModes map,
  filtering out corrupt/unknown mode strings (addresses CodeRabbit review)
- Add removeSessionMode action and call it on session delete so the map does
  not grow unboundedly
- Add recordSessionMode to the useEffect dependency array to avoid stale-closure risk
- Add clarifying comment to restoreSessionMode no-op branch
- Extend tests to cover removeSessionMode, no-op, and corrupt-localStorage behaviour
2026-04-13 09:57:14 +00:00
majdyz
1485d1910c Merge branch 'fix/sse-replay-deduplication' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 09:56:12 +00:00
majdyz
89c9c649d8 fix: resolve merge conflicts in UserTable.tsx — keep all columns (avg cost + cache read/write) 2026-04-13 09:55:56 +00:00
majdyz
a17f05f2b1 fix(copilot): scope dedup fingerprint by user message ID instead of text
Using user message text as the context key caused the deduplicator to
drop the second assistant reply when a user asked the same question twice
in one session. Switching to user message ID (which is unique per turn)
fixes the false positive while still preventing SSE-replayed duplicates.

Adds a regression test covering the same-question-twice scenario.
2026-04-13 09:55:54 +00:00
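The scoping change described in a17f05f2b1 can be illustrated with a minimal sketch. It is written in Python for brevity, although the actual deduplicator lives in the frontend. The message shape (`id`, `role`, `content` keys), the function name, and the assumption that user message IDs stay stable across an SSE replay are all illustrative and not taken from the codebase.

```python
from typing import Dict, List, Set, Tuple


def deduplicate_messages(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop replayed duplicates, scoping each fingerprint by the last user message ID."""
    seen: Set[Tuple[str, str, str]] = set()
    context_id = ""  # ID of the most recent user message (assumed stable across replays)
    kept: List[Dict[str, str]] = []
    for message in messages:
        if message["role"] == "user":
            context_id = message["id"]
        # Keying on the user message ID (not its text) keeps two identical
        # questions in one session from sharing a fingerprint.
        fingerprint = (message["role"], context_id, message["content"])
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        kept.append(message)
    return kept
```

With this keying, a replayed assistant reply that arrives under a fresh ID is still dropped, while a second reply to a repeated question survives because its preceding user message has a different ID.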
majdyz
62e4a8d3a4 Merge branch 'fix/copilot-mode-per-session' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 09:54:21 +00:00
majdyz
c6af52033d fix(copilot): fix multi-turn cost over-estimation and add cache_creation_tokens extraction
Bug 1: Fallback cost estimation was using accumulated turn_prompt_tokens /
turn_completion_tokens across all tool-call rounds, causing compounding
over-estimation on the 2nd+ turn. Snapshot token counts before each call and
pass only the per-call delta to _estimate_cost_from_tokens.

Bug 2: turn_cache_creation_tokens was defined but never populated. Extract
cache_creation_input_tokens from prompt_tokens_details (available from some
providers such as Anthropic via OpenRouter).

Add regression tests for both fixes.
2026-04-13 09:53:05 +00:00
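The per-call delta approach from Bug 1 in c6af52033d might look roughly like the sketch below. The prices, the `TurnUsage` container, and `record_call` are hypothetical; only the idea of snapshotting the accumulated counters and estimating cost from the difference comes from the commit message.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens, for illustration only.
PROMPT_PRICE_PER_M = 3.0
COMPLETION_PRICE_PER_M = 15.0


def estimate_cost_from_tokens(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate cost for one call from its own token counts."""
    total = prompt_tokens * PROMPT_PRICE_PER_M + completion_tokens * COMPLETION_PRICE_PER_M
    return total / 1_000_000


@dataclass
class TurnUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost_usd: float = 0.0


def record_call(usage: TurnUsage, call_prompt: int, call_completion: int) -> None:
    """Accumulate token counts, but price only the delta added by this call."""
    # Snapshot the running totals before this call is added.
    prompt_before = usage.prompt_tokens
    completion_before = usage.completion_tokens

    usage.prompt_tokens += call_prompt
    usage.completion_tokens += call_completion

    # Pricing the running totals here would re-charge every earlier round.
    usage.cost_usd += estimate_cost_from_tokens(
        usage.prompt_tokens - prompt_before,
        usage.completion_tokens - completion_before,
    )
```

Pricing the accumulated totals on every round would charge the first round's tokens again on the second round, which is the compounding over-estimation the commit describes.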
majdyz
1df9369dc3 perf(copilot): add effort=low thinking control + raise budget to $15
- Add claude_agent_thinking_effort config (default: 'low') to control
  thinking depth. 'low' minimizes thinking token usage — the #1 cost
  driver at 49% of total spend.
- Raise max_budget_usd from $5 to $15 — $5 was below p50 ($5.37),
  causing half of all turns to get budget-killed mid-task.
- Log raw SDK usage dict to discover thinking token fields.
2026-04-13 09:43:26 +00:00
majdyz
f6c7d1eaf7 fix(copilot): baseline cost tracking fallback and dashboard cache token display
When OpenRouter's x-total-cost header is missing, estimate cost from
token counts using a known model pricing table so cost is always logged.
Also extract cache token details from streaming usage chunks
(prompt_tokens_details.cached_tokens) and pass them through to
PlatformCostLog.

On the dashboard side, add cache read/write columns to the logs table
and user table, and include cache tokens in the UserCostSummary backend
model so they surface in the API response.
2026-04-13 09:39:44 +00:00
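A minimal sketch of the fallback described in f6c7d1eaf7, assuming the response headers are available as a plain mapping with lowercase keys. The pricing table contents, the function names, and the choice to return `None` for an unknown model are assumptions made for illustration; the `x-total-cost` header and the `prompt_tokens_details.cached_tokens` field are the ones named in the commit message.

```python
from typing import Any, Dict, Mapping, Optional, Tuple

# Hypothetical pricing table: model -> (prompt, completion) USD per million tokens.
MODEL_PRICING_PER_M: Dict[str, Tuple[float, float]] = {
    "example/model": (1.0, 2.0),
}


def resolve_cost_usd(
    headers: Mapping[str, str],
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> Optional[float]:
    """Prefer the provider-reported cost; otherwise estimate from token counts."""
    raw = headers.get("x-total-cost")
    if raw is not None:
        try:
            return float(raw)
        except ValueError:
            pass  # Malformed header: fall through to the estimate.
    pricing = MODEL_PRICING_PER_M.get(model)
    if pricing is None:
        return None  # Assumption: unknown models yield no estimate in this sketch.
    prompt_price, completion_price = pricing
    return (prompt_tokens * prompt_price + completion_tokens * completion_price) / 1_000_000


def extract_cached_tokens(usage: Mapping[str, Any]) -> int:
    """Read prompt_tokens_details.cached_tokens, tolerating missing or null fields."""
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)
```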
majdyz
85f76230a9 debug(copilot): log raw SDK usage dict to discover thinking token fields
Temporary debug logging to see all fields in ResultMessage.usage —
need to confirm if thinking_tokens or similar is available but not
being captured.
2026-04-13 09:35:05 +00:00
majdyz
f63440e955 fix(copilot): store mode per session so indicator updates on switch
The copilot mode (fast/extended_thinking) was stored as a single global
value. When switching between sessions, the mode indicator stayed on
whatever was last set globally rather than reflecting the mode each
session was created with.

Add a sessionModes map to the Zustand store that records the active
copilotMode when a session is created and restores it when the user
switches back to that session.
2026-04-13 09:32:45 +00:00
majdyz
f52c1e1f24 fix(copilot): raise max_budget_usd from $5 to $15
$5 was too aggressive — p50 cost is $5.37 so half of all turns were
getting budget-killed mid-task with no value delivered. $15 covers p75
($13.07) so ~75% of tasks complete. The thinking token cap is the
better cost lever but needs verification first.
2026-04-13 08:47:16 +00:00
majdyz
b5216da2d8 fix(copilot): disable gzip on API responses to prevent ZlibError
Add Accept-Encoding: identity to ANTHROPIC_CUSTOM_HEADERS in
build_sdk_env() to prevent ZlibError decompression failures in the
CLI subprocess. Appended after any existing custom headers (OpenRouter
trace headers).

See: oven-sh/bun#23149, anthropics/claude-code#18302
2026-04-13 08:26:01 +00:00
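The header change in b5216da2d8 amounts to appending one more entry to an existing header string. The sketch below assumes that ANTHROPIC_CUSTOM_HEADERS holds newline-separated `Name: Value` lines; that format and the helper name `append_custom_header` are assumptions, not confirmed by the commit message.

```python
def append_custom_header(existing: str, name: str, value: str) -> str:
    """Append a 'Name: Value' line after any headers that are already set."""
    line = f"{name}: {value}"
    # Assumption: multiple headers are separated by newlines.
    return f"{existing}\n{line}" if existing else line
```

Appending rather than overwriting keeps whatever headers were set earlier, which matches the commit's note that the new header goes after the existing OpenRouter trace headers.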
majdyz
ffa74177d0 fix: add ::timestamptz casts to raw SQL datetime comparisons in _build_raw_where
The raw SQL WHERE clause builder was passing datetime parameters without
explicit type casts, causing PostgreSQL to fail with "operator does not
exist: timestamp without time zone >= text".
2026-04-13 08:23:43 +00:00
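A rough sketch of the kind of cast ffa74177d0 describes. The column name `"createdAt"`, the `$n` placeholder style, and the function signature are assumptions; the point is only that each datetime placeholder carries an explicit `::timestamptz` cast so PostgreSQL does not treat the parameter as text.

```python
from datetime import datetime
from typing import Any, List, Optional, Tuple


def build_raw_where(
    start: Optional[datetime], end: Optional[datetime]
) -> Tuple[str, List[Any]]:
    """Build a WHERE fragment and its positional parameters."""
    clauses: List[str] = []
    params: List[Any] = []
    if start is not None:
        params.append(start)
        # Explicit cast avoids "operator does not exist: timestamp ... >= text".
        clauses.append(f'"createdAt" >= ${len(params)}::timestamptz')
    if end is not None:
        params.append(end)
        clauses.append(f'"createdAt" < ${len(params)}::timestamptz')
    where = " AND ".join(clauses) if clauses else "TRUE"
    return where, params
```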
majdyz
b6b94a2244 Merge branch 'fix/sse-replay-deduplication' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 08:06:29 +00:00
majdyz
7cadce4c7b fix(copilot): deduplicate SSE-replayed messages by content fingerprint
When the SSE connection reconnects, resume_session_stream replays from
"0-0" and the replayed UIMessage objects get new IDs from useChat,
bypassing the adjacent-only content dedup. Switch deduplicateMessages
to track all seen role+context+content fingerprints globally, scoped
by the preceding user message to avoid false positives when the
assistant legitimately gives identical answers to different prompts.
2026-04-13 08:04:04 +00:00
majdyz
00a20bdfe6 fix(copilot): deduplicate SSE-replayed messages by content fingerprint
When the SSE connection reconnects, resume_session_stream replays from
"0-0" and the replayed UIMessage objects get new IDs from useChat,
bypassing the adjacent-only content dedup. Switch deduplicateMessages
to track all seen role+context+content fingerprints globally, scoped
by the preceding user message to avoid false positives when the
assistant legitimately gives identical answers to different prompts.
2026-04-13 08:03:51 +00:00
majdyz
e0ddb7d4d4 Merge branch 'feat/enhanced-cost-dashboard' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs
# Conflicts:
#	autogpt_platform/backend/backend/data/platform_cost_test.py
2026-04-13 08:03:15 +00:00
majdyz
d8d0f752b5 Merge branch 'feat/builder-chat-panel' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs
# Conflicts:
#	autogpt_platform/backend/backend/data/platform_cost_test.py
2026-04-13 08:02:58 +00:00
majdyz
c64d5a9c92 Merge branch 'perf/copilot-prompt-caching' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 08:02:37 +00:00
majdyz
f8bca6f4bc Merge commit '2cf737dc0508a7753d067ed8425cfc0ef657b29f' into preview/all-prs
# Conflicts:
#	autogpt_platform/backend/backend/copilot/config.py
2026-04-13 08:02:31 +00:00
majdyz
6c21e58d31 Merge branch 'fix/orchestrator-per-iteration-cost' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 08:01:50 +00:00
majdyz
895c9a0d29 Merge branch 'feat/copilot-pending-messages' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 08:01:45 +00:00
majdyz
84e877e36d Merge branch 'fix/schedule-agent-cred-setup-ux' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 08:01:40 +00:00
majdyz
a504ad6e1e Merge branch 'chore/sdk-dev-preview-0.1.58-with-proxy' of https://github.com/Significant-Gravitas/AutoGPT into preview/all-prs 2026-04-13 08:01:33 +00:00
majdyz
ca0c95b593 fix(frontend): add SUBSCRIPTION to CreditTransactionType enum in openapi.json
Syncs the OpenAPI spec with the Prisma schema which already includes the
SUBSCRIPTION enum value in CreditTransactionType.
2026-04-13 07:13:21 +00:00
majdyz
cbf309c9e4 Merge branch 'dev' of https://github.com/Significant-Gravitas/AutoGPT into feat/copilot-pending-messages 2026-04-13 07:12:49 +00:00
majdyz
6ccb44e0d5 fix(copilot): add 404/429 to route decorator, reformat routes.py, regenerate openapi.json
Add responses={404, 429} to the pending endpoint's @router.post decorator
so FastAPI auto-generates them in the OpenAPI spec. Previously these were
only manually added to openapi.json and the CI schema-check (export +
diff) stripped them. Also apply black formatting to the long warning
line that was failing the backend lint check.
2026-04-13 07:04:07 +00:00
majdyz
e558c60104 fix(orchestrator): don't propagate non-billing charge errors as tool failures
Non-IBE (InsufficientBalanceError) exceptions from charge_node_usage (e.g. DB timeout) were
re-raised and caught by the outer generic handler, incorrectly marking
a successful tool execution as failed. This could cause the LLM to
retry side-effectful operations. Now logs the error and continues to
the success path since the tool itself completed successfully.
2026-04-13 07:02:10 +00:00
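The control flow described in e558c60104 could be sketched as follows. The exception class is redefined locally and `run_tool_then_charge` is a hypothetical stand-in for the orchestrator code; only the behaviour (billing errors still propagate, other charge failures are logged and ignored) follows the commit message.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


class InsufficientBalanceError(Exception):
    """Local stand-in for the billing error that must still propagate."""


def run_tool_then_charge(
    run_tool: Callable[[], Any], charge: Callable[[], None]
) -> Dict[str, Any]:
    """Run the tool, then charge; only billing errors may fail the call."""
    result = run_tool()
    try:
        charge()
    except InsufficientBalanceError:
        raise  # A genuine billing failure stops execution.
    except Exception:
        # E.g. a DB timeout: the tool already succeeded, so log and continue
        # rather than reporting a failure that could trigger a retry.
        logger.exception("Charging failed after successful tool execution")
    return {"is_error": False, "result": result}
```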
majdyz
fbad856538 fix(backend/copilot): relax schedule race test assertion for setup_test_data fixture
The setup_test_data fixture creates a graph with credentials already
embedded in node defaults. The DB-stored credential schema may not
surface these as "missing" in build_missing_credentials_from_graph,
so assert the key exists rather than asserting non-empty count.
2026-04-13 06:59:18 +00:00
majdyz
3ebfa3d68b fix(backend/copilot): address round-6 review — DRY validation handler, improve tests
- Extract duplicated GraphValidationError handler from _run_agent and
  _schedule_agent into _handle_graph_validation_race helper method
- Use generator expressions instead of list comprehension for
  short-circuit evaluation in _build_setup_requirements_from_validation_error
- Improve mixed-error fallback message to be more user-friendly
- Add test for empty node_errors={} edge case
- Pin expected credential count in firecrawl fixture tests
- Add missing_credentials assertion to schedule race E2E test
- Add test for extras present with node_errors=None in service_test
2026-04-13 06:45:28 +00:00
majdyz
5ff46ff207 fix(backend): address review feedback on orchestrator billing
- Extract post-execution billing into _handle_post_execution_billing()
- Deduplicate IBE notification into _try_send_insufficient_funds_notif()
- Combine _charge_usage + _handle_low_balance into single thread dispatch
- Sanitize error messages to LLM (no internal details leaked)
- Default _is_error to True (fail-closed) for tool responses
- Add IBE propagation contract to OrchestratorBlock class docstring
- Reduce per-site IBE comments to one-liners referencing class docstring
- Fix _resolve_block_cost return type annotation (Block | None)
- Move test imports to module level, fix test_default_block_returns_zero
- Add tests for non-IBE billing failure and _charge_usage(count=0)
- Fix Black formatting (CI lint blocker)
2026-04-13 06:44:20 +00:00
majdyz
2cf737dc05 fix(backend): address review comments on cross-user prompt caching PR
- Add TODO(#12747) to _SystemPromptPreset for cleanup tracking
- Update docstring to note SDK version and migration path
- Add debug logging in _build_system_prompt_value for observability
- Document empty-string edge case in docstring
- Trim redundant block comment at call site to single line
- Add test for empty-string system_prompt with cache enabled
- Add test for CHAT_CLAUDE_AGENT_CROSS_USER_PROMPT_CACHE=false env var
2026-04-13 06:43:57 +00:00
majdyz
040637dd68 fix: force cost_usd for percentile/histogram queries, fix test + prettier
- Backend: always pass tracking_type=None to _build_raw_where for
  percentile and histogram queries so they compute stats on cost_usd
  rows regardless of the caller's tracking_type filter.
- Frontend test: use getAllByText for "5" which appears in both the
  Active Users card and the $1-2 bucket count.
- Frontend: fix prettier formatting in PlatformCostContent.tsx.
2026-04-13 06:36:59 +00:00
majdyz
90d8ae0ae2 fix(copilot): map non-E2B file tools in permissions and fix lint formatting
In non-E2B mode, to_sdk_names() failed to map whitelisted SDK built-in
file tool names (Write, Edit, Read) to their MCP-prefixed equivalents
(mcp__copilot__Write, etc.), causing them to be incorrectly filtered out
when users configured tool whitelists.

Add _SDK_TO_MCP mapping for non-E2B mode that maps Read->read_file,
Write->Write, Edit->Edit. Add test coverage for this case.

Also fix black formatting in permissions_test.py that was causing CI lint
failure.
2026-04-13 06:34:55 +00:00
majdyz
967f0c97c4 fix(copilot): fix black formatting for single-line ValueError raise 2026-04-13 06:29:25 +00:00
majdyz
7dc4319125 fix: correct group_by count in test_passes_filters_to_queries
The 6th group_by (total agg no-tracking-type) only runs when
tracking_type is set. This test doesn't pass tracking_type, so the
expected count is 5, not 6.
2026-04-13 05:28:12 +00:00
majdyz
a8cfe27f6b fix: use real temp files in CLI path env var tests
The path validator rejects non-existent paths, so tests must create
real executable temp files via tmp_path instead of hardcoded paths.
2026-04-13 05:28:08 +00:00
majdyz
4cc8ef4409 fix(platform-cost): address PR review — deduplicate filter logic, skip redundant query, improve frontend
Backend:
- Extract _build_raw_where() helper so raw SQL and Prisma WHERE share
  filter logic (review item #4 — duplicated filter logic)
- Skip redundant total_agg_no_tracking_type_groups query when
  tracking_type is None since it duplicates total_agg_groups (item #3)
- Convert CostBucket from TypedDict to BaseModel for consistency (nit #1)
- Replace fragile 8-way positional tuple unpack with indexed list access

Frontend:
- Make 12 SummaryCards data-driven via a cards config array (item #5)
- Use friendlier percentile labels: Typical/Upper/High/Peak Cost (P50/P75/P95/P99)
- Update test fixtures with all new dashboard fields (item #1)
- Add test assertions for new summary card labels, cost buckets, token
  values, and user table columns
2026-04-13 05:16:55 +00:00
majdyz
359b7f1b81 fix(copilot): address PR reviewer feedback on CLI path validation and defaults
- Reject non-existent and non-file CLI paths at config validation time
  instead of letting them fail with opaque OS errors at runtime
- Add negative test coverage for CLI path validator (non-existent,
  non-executable, directory paths)
- Document breaking default changes (max_turns 1000->50, max_budget
  $100->$5) in field descriptions with env var override instructions
- Narrow broad `except Exception` to `except (ImportError, AttributeError)`
  in cli_openrouter_compat_test.py
2026-04-13 05:13:56 +00:00
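The path checks listed in 359b7f1b81 can be approximated with the standard library alone. This is a sketch under the assumption that the validator raises `ValueError`; the function name `validate_cli_path` and the error messages are hypothetical.

```python
import os
from pathlib import Path


def validate_cli_path(value: str) -> str:
    """Reject paths that would otherwise fail later with an opaque OS error."""
    path = Path(value)
    if not path.exists():
        raise ValueError(f"CLI path does not exist: {value}")
    if not path.is_file():
        raise ValueError(f"CLI path is not a file: {value}")
    if not os.access(path, os.X_OK):
        raise ValueError(f"CLI path is not executable: {value}")
    return str(path)
```

Because the validator touches the filesystem, tests for it need real files on disk, which is consistent with the later switch to `tmp_path` fixtures in a8cfe27f6b.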
Zamil Majdy
a3b0cea942 fix(frontend/builder): route text parts through MessagePartRenderer
Text parts in assistant messages were being rendered as plain <span>
elements, bypassing MessagePartRenderer's case "text" handler and
parseSpecialMarkers(). This broke styled error/system messages
([ERROR:], [RETRYABLE_ERROR:], [SYSTEM:] markers) and markdown
rendering in the builder chat panel.

Route all assistant message parts (text and tool) through
MessagePartRenderer so parseSpecialMarkers() runs on text content.
2026-04-13 04:42:18 +00:00
majdyz
ae1600a99d fix(copilot): rename SDK read_tool_result tool and fix path leak in error message
- Rename `_READ_TOOL_NAME` from `"Read"` to `"read_tool_result"` so the LLM
  can distinguish it from `read_file` (working-directory tool).  The new name
  plus an updated description make its narrow scope (tool-results/ paths and
  workspace:// URIs) unambiguous.
- Fix path leak in `_read_file_handler`: use `os.path.basename(file_path)` in
  the "Path not allowed" error, consistent with write/edit handlers.
- Update `permissions.py` comment and all `permissions_test.py` assertions to
  use the new `mcp__copilot__read_tool_result` name.
2026-04-13 04:27:17 +00:00
majdyz
45f96d5769 fix(copilot): wrap baseline turn-start drain in try/except; add 404/429 to OpenAPI spec
Baseline turn-start drain_pending_messages was unprotected — a transient
Redis error would propagate up and kill the entire turn stream, unlike the
already-protected mid-loop and SDK paths. Wrap with try/except + fallback
to [] so a Redis hiccup degrades gracefully.

Also adds 404 (session not found) and 429 (rate-limit exceeded) response
codes to the pending endpoint's OpenAPI spec so TypeScript clients can
handle these error paths correctly.
2026-04-13 04:24:29 +00:00
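A minimal sketch of the graceful-degradation wrapper described in 45f96d5769, assuming the drain is an async callable that returns a list. The wrapper name `drain_or_empty` and its signature are hypothetical.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, List

logger = logging.getLogger(__name__)


async def drain_or_empty(drain: Callable[[], Awaitable[List[Any]]]) -> List[Any]:
    """Return pending messages, or [] if the backing store is unavailable."""
    try:
        return await drain()
    except Exception:
        # A transient store error should not kill the whole turn stream.
        logger.warning("Draining pending messages failed; continuing", exc_info=True)
        return []
```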
majdyz
5dbbdf9b27 fix(copilot): address round-6 review nits
- Remove redundant inner `ChatConfig` import in `_prewarm_cli` — it was
  already imported at module scope on line 16 (style guide: inner imports
  only for heavy optional deps)
- Correct stale comment in `sdk_compat_test.py`: 2.1.63/2.1.70 pre-date
  the context-management regression and are OpenRouter-safe without any
  env var; only 2.1.97+ requires CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1
- Update `_assert_no_forbidden_patterns` error message in
  `cli_openrouter_compat_test.py`: remove the stale "above 0.1.45" ceiling
  (we've already upgraded to 0.1.58) and point at the correct remediation
  steps (add to _KNOWN_GOOD_BUNDLED_CLI_VERSIONS after bisect verification)
- Plug test coverage gap in `env_test.py`: add
  `CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS == "1"` assertions to three
  OpenRouter test methods that were missing it
  (test_strips_trailing_v1, test_strips_trailing_v1_and_slash,
  test_no_v1_suffix_left_alone) — guards against the env var being
  accidentally dropped from a code path that the main test didn't exercise
2026-04-13 04:23:54 +00:00
majdyz
e901b64bed fix(test): fix _handle_low_balance mock signature to accept positional args
The gated_processor fixture's fake_low_balance mock used **kwargs, but
production code calls _handle_low_balance with positional args via
asyncio.to_thread. This caused a silent TypeError caught by the broad
except handler, making the handle_low_balance assertion fail (0 calls
instead of 1). Updated mock to match the actual method signature.
2026-04-13 04:22:03 +00:00
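The failure mode in e901b64bed is easy to reproduce in isolation. In the sketch below, the two fake handlers and the `dispatch` helper are hypothetical; `asyncio.to_thread` forwards positional arguments to the callable, so a fake that accepts only `**kwargs` raises `TypeError`, which a broad `except` then hides.

```python
import asyncio
from typing import Any, Callable, List

calls: List[Any] = []


def fake_low_balance_kwargs_only(**kwargs: Any) -> None:
    """Mismatched fake: cannot accept positional arguments."""
    calls.append(kwargs)


def fake_low_balance(*args: Any, **kwargs: Any) -> None:
    """Fixed fake: accepts the positional arguments the caller passes."""
    calls.append((args, kwargs))


async def dispatch(handler: Callable[..., None]) -> bool:
    """Call the handler the way the caller does, swallowing errors broadly."""
    try:
        await asyncio.to_thread(handler, "user-1", 100)
    except Exception:
        return False  # The TypeError disappears here, so the fake records nothing.
    return True
```

Giving the fake the same signature as the real method, or building it with a spec-aware mock, makes this kind of mismatch surface as a test failure rather than a missing call.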
majdyz
64c3ef45df chore: apply Prettier formatting to BuilderChatPanel files
Three files were flagged by the CI lint/format check — apply prettier
--write to bring them into compliance.
2026-04-13 04:15:37 +00:00
majdyz
77ed619613 fix(frontend/builder): add flowID to tool-call effect deps for correct navigation guard 2026-04-13 04:09:05 +00:00
majdyz
626fe17aac fix(orchestrator): resolve None future on swallowed errors; add missing tests
- Move tool_node_stats None guard before node_exec_future.set_result so
  that when on_node_execution returns None (swallowed by @async_error_logged),
  the future carries set_exception(RuntimeError) rather than set_result(None),
  giving the tracking system an accurate error state
- Remove redundant `tool_node_stats is not None` check that was dead code
  after the early-return guard was added
- Add explanatory comment in _charge_extra_iterations_sync docstring explaining
  why the block lookup is intentionally repeated rather than cached from
  _charge_usage (two separate thread-pool workers, no shared mutable state)
- Add assertion to test_on_node_execution_charges_extra_iterations_when_gate_passes
  verifying _handle_low_balance is called when extra_cost > 0
- Add test_on_node_execution_failed_ibe_sends_notification covering the
  FAILED + InsufficientBalanceError path in on_node_execution (lines 822-836)
  that was previously untested
2026-04-13 04:03:08 +00:00
majdyz
3b7e678b97 fix(frontend/builder): address round-5 review comments on BuilderChatPanel
- Add type="button" and focus-visible ring to Stop/Send buttons in PanelInput
- Add type="button" to Retry button in MessageList and Apply button in ActionList
- Fix MessageList to render plain text directly and only pass dynamic-tool parts
  to MessagePartRenderer (text parts were being misrouted through a tool renderer)
- Replace clearGraphSessionCacheForTesting export with _graphSessionCache for
  tests — avoids leaking test scaffolding into the production bundle
- Add toast notification in undo restore when target node was deleted between
  apply and undo (prevents silent no-op)
- Fix misleading test: remove red-herring mockNodes.push from 'no auto-send' test
  since the guard is isGraphLoaded===false, not the node array
- Add truncation-path coverage to helpers.test.ts (MAX_NODES/MAX_EDGES branches)
- Add deleted-node undo test to actionApplicators.test.ts
2026-04-13 04:01:42 +00:00