Compare commits

84 Commits

Author SHA1 Message Date
Zamil Majdy
80104fbb3b Merge branch 'dev' of github.com:Significant-Gravitas/AutoGPT into feat/ask-question-tool 2026-04-02 15:55:52 +02:00
Zamil Majdy
6b031085bd feat(platform): add generic ask_question copilot tool (#12647)
### Why / What / How

**Why:** The copilot can ask clarifying questions in plain text, but
that text gets collapsed into hidden "reasoning" UI when the LLM also
calls tools in the same turn. This makes clarification questions
invisible to users. The existing `ClarificationNeededResponse` model and
`ClarificationQuestionsCard` UI component were built for this purpose
but had no tool wiring them up.

**What:** Adds a generic `ask_question` tool that produces a visible,
interactive clarification card instead of collapsible plain text. Unlike
the agent-generation-specific `clarify_agent_request` proposed in
#12601, this tool is workflow-agnostic — usable for agent building,
editing, troubleshooting, or any flow needing user input.

**How:** 
- Backend: New `AskQuestionTool` reuses existing
`ClarificationNeededResponse` model. Registered in `TOOL_REGISTRY` and
`ToolName` permissions.
- Frontend: New `AskQuestion/` renderer reuses
`ClarificationQuestionsCard` from CreateAgent. Registered in
`CUSTOM_TOOL_TYPES` (prevents collapse into reasoning) and
`MessagePartRenderer`.
- Guide: `agent_generation_guide.md` updated to reference `ask_question`
for the clarification step.

### Changes 🏗️

- **`copilot/tools/ask_question.py`** — New generic tool: takes
`question`, optional `options[]` and `keyword`, returns
`ClarificationNeededResponse`
- **`copilot/tools/__init__.py`** — Register `ask_question` in
`TOOL_REGISTRY`
- **`copilot/permissions.py`** — Add `ask_question` to `ToolName`
literal
- **`copilot/sdk/agent_generation_guide.md`** — Reference `ask_question`
tool in clarification step
- **`ChatMessagesContainer/helpers.ts`** — Add `tool-ask_question` to
`CUSTOM_TOOL_TYPES`
- **`MessagePartRenderer.tsx`** — Add switch case for
`tool-ask_question`
- **`AskQuestion/AskQuestion.tsx`** — Renderer reusing
`ClarificationQuestionsCard`
- **`AskQuestion/helpers.ts`** — Output parsing and animation text

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Backend format + pyright pass
  - [x] Frontend lint + types pass
  - [x] Pre-commit hooks pass
  - [ ] Manual test: copilot uses `ask_question` and card renders visibly
    (not collapsed)
2026-04-02 12:56:48 +00:00
Zamil Majdy
988edd6fe9 fix(platform): validate options/keyword params and pass sessionId to ClarificationQuestionsCard
- Validate options is actually a list (LLM may send string); coerce gracefully
- Validate keyword is a string
- Pass sessionId prop to ClarificationQuestionsCard for localStorage persistence
- Add test for invalid options coercion
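
The validation above, together with the non-empty question check from 078e89f8a6, might look roughly like the sketch below. `build_clarification` and the returned dict shape are hypothetical stand-ins; the real tool returns a `ClarificationNeededResponse`. How a non-list `options` value is coerced is not stated here, so dropping it to an empty list is an assumption.

```python
from typing import Any


def build_clarification(question: Any, options: Any = None, keyword: Any = None) -> dict:
    """Hypothetical stand-in for the tool body; not the real AskQuestionTool."""
    # Reject empty or whitespace-only questions instead of rendering an empty card.
    if not isinstance(question, str) or not question.strip():
        raise ValueError("question must be a non-empty string")

    # The LLM may send a string (or anything else) where a list is expected.
    # Assumption: non-list values are dropped rather than parsed.
    if not isinstance(options, list):
        options = []
    clean_options = [str(o) for o in options if str(o).strip()]

    # Assumption: a non-string keyword falls back to an empty string.
    if not isinstance(keyword, str):
        keyword = ""

    return {"question": question.strip(), "options": clean_options, "keyword": keyword}
```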
2026-04-02 14:34:58 +02:00
Zamil Majdy
078e89f8a6 fix(backend): validate non-empty question in ask_question tool
Reject empty/whitespace-only question strings with ValueError instead
of producing an empty clarification card.
2026-04-02 14:18:58 +02:00
Zamil Majdy
9990e9e841 fix(backend): update prompting_test to match renamed guide section
The clarification section was renamed from "Clarifying Before Building"
to "Clarifying — Before or During Building". Update assertions and add
a test verifying ask_question is referenced.
2026-04-02 14:14:57 +02:00
Zamil Majdy
032fb061bb refactor(platform): consolidate clarification on ask_question tool
Remove dead clarification handling from CreateAgent and EditAgent —
both the isClarificationNeededOutput type guard and the
ClarificationQuestionsCard rendering. All clarification now goes
through the generic ask_question tool.

Update agent_generation_guide.md to allow ask_question at any point
in the workflow (not just before building), covering mid-flow
ambiguity during block discovery or JSON generation.
2026-04-02 14:04:33 +02:00
Zamil Majdy
5706c78341 fix(frontend): use normalized keywords in handleAnswers to match answer keys
The ClarificationQuestionsCard normalizes question keywords (dedup
suffixes), so answers are keyed by normalized keywords. handleAnswers
must use the same normalized questions to look up answers correctly.
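
To illustrate why both sides must share one normalization, here is a small sketch in Python (the actual component is frontend code). The `_2`, `_3` suffix scheme and both function names are assumptions for illustration only.

```python
def normalize_keywords(keywords: list[str]) -> list[str]:
    """Make duplicate keywords unique by appending a numeric suffix (assumed scheme)."""
    seen: dict[str, int] = {}
    result = []
    for kw in keywords:
        count = seen.get(kw, 0) + 1
        seen[kw] = count
        result.append(kw if count == 1 else f"{kw}_{count}")
    return result


def collect_answers(keywords: list[str], answers: dict[str, str]) -> list[tuple[str, str]]:
    """Look answers up by the same normalized keys the card used to store them."""
    return [(kw, answers.get(kw, "")) for kw in normalize_keywords(keywords)]
```

Looking answers up by the raw keywords instead would find only the first of two questions that share a keyword.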
2026-04-02 13:58:13 +02:00
Zamil Majdy
ffa5a5b0a7 fix(frontend): align isErrorOutput guard with parseOutput logic
Match the `"error" in output` check used in `parseOutput` so error
payloads without `type: "error"` are consistently recognized.
2026-04-02 13:50:40 +02:00
Zamil Majdy
2f50facfa9 refactor(frontend): remove re-export shim, import ClarificationQuestionsCard directly
Update CreateAgent and EditAgent to import ClarificationQuestionsCard
from the shared copilot/components/ location directly. Delete the
re-export shim to comply with no-barrel-files guideline.
2026-04-02 13:49:12 +02:00
Zamil Majdy
48b849934f fix(platform): address PR review — add tests, remove unused import, lift shared component
- Remove unused `logger` import from ask_question.py
- Add colocated ask_question_test.py with 3 tests covering options,
  no-options, and keyword-only cases
- Move ClarificationQuestionsCard to shared copilot/components/ location
  so AskQuestion imports from there instead of CreateAgent internals
2026-04-02 13:45:51 +02:00
Zamil Majdy
9678c4a86d feat(platform): add generic ask_question copilot tool
Add a generic `ask_question` tool that lets the copilot ask the user
clarifying questions via a dedicated UI card instead of plain text
(which gets collapsed into hidden reasoning). Reuses the existing
`ClarificationNeededResponse` model and `ClarificationQuestionsCard`
component.

Backend:
- New `AskQuestionTool` in `copilot/tools/ask_question.py`
- Registered in `TOOL_REGISTRY` and `ToolName` permissions literal
- Updated `agent_generation_guide.md` to reference `ask_question`

Frontend:
- Added `tool-ask_question` to `CUSTOM_TOOL_TYPES` (prevents collapse)
- New `AskQuestion/` renderer reusing `ClarificationQuestionsCard`
- Registered in `MessagePartRenderer` switch
2026-04-02 13:40:09 +02:00
Toran Bruce Richards
11b846dd49 fix(blocks): rename placeholder_values to options on AgentDropdownInputBlock (#12595)
## Summary

Resolves [REQ-78](https://linear.app/autogpt/issue/REQ-78): The
`placeholder_values` field on `AgentDropdownInputBlock` is misleadingly
named. In every major UI framework "placeholder" means non-binding hint
text that disappears on focus, but this field actually creates a
dropdown selector that restricts the user to only those values.

## Changes

### Core rename (`autogpt_platform/backend/backend/blocks/io.py`)
- Renamed `placeholder_values` → `options` on
`AgentDropdownInputBlock.Input`
- Added clear field description: *"If provided, renders the input as a
dropdown selector restricted to these values. Leave empty for free-text
input."*
- Updated class docstring to describe actual behavior
- Overrode `model_construct()` to remap legacy `placeholder_values` →
`options` for **backward compatibility** with existing persisted agent
JSON

### Tests (`autogpt_platform/backend/backend/blocks/test/test_block.py`)
- Updated existing tests to use canonical `options` field name
- Added 2 new backward-compat tests verifying legacy
`placeholder_values` still works through both `model_construct()` and
`Graph._generate_schema()` paths

### Documentation
- Updated
`autogpt_platform/backend/backend/copilot/sdk/agent_generation_guide.md`
— changed field name in CoPilot SDK guide
- Updated `docs/integrations/block-integrations/basic.md` — changed
field name and description in public docs

### Load tests (`autogpt_platform/backend/load-tests/tests/api/graph-execution-test.js`)
- Removed spurious `placeholder_values: {}` from AgentInputBlock node
(this field never existed on AgentInputBlock)
- Fixed execution input to use `value` instead of `placeholder_values`

## Backward Compatibility

Existing agents with `placeholder_values` in their persisted
`input_default` JSON will continue to work — the `model_construct()`
override transparently remaps the old key to `options`. No database
migration needed since the field is stored inside a JSON blob, not as a
dedicated column.
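
The remapping idea can be sketched with a plain dict helper. This is not the actual `model_construct()` override: `remap_legacy_fields` is a hypothetical name, and letting the canonical key win when both keys are present is an assumption.

```python
def remap_legacy_fields(values: dict) -> dict:
    """Return a copy of persisted input data with the legacy key renamed."""
    remapped = dict(values)  # never mutate the persisted blob in place
    if "placeholder_values" in remapped:
        legacy = remapped.pop("placeholder_values")
        # Assumption: if both keys exist, the canonical `options` value wins.
        remapped.setdefault("options", legacy)
    return remapped
```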

## Testing

- All existing tests updated and passing
- 2 new backward-compat tests added
- No frontend changes needed (frontend reads `enum` from generated JSON
Schema, not the field name directly)

---------

Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
2026-04-02 05:56:17 +00:00
Zamil Majdy
b9e29c96bd fix(backend/copilot): detect prompt-too-long in AssistantMessage content and ResultMessage success subtype (#12642)
## Why

PR #12625 fixed the prompt-too-long retry mechanism for most paths, but
two SDK-specific paths were still broken. The dev session `d2f7cba3`
kept accumulating synthetic "Prompt is too long" error entries on every
turn, growing the transcript from 2.5 MB → 3.2 MB, making recovery
impossible.

Root causes identified from production logs (`[T25]`, `[T28]`):

**Path 1 — AssistantMessage content check:**
When the Claude API rejects a prompt, the SDK surfaces it as
`AssistantMessage(error="invalid_request", content=[TextBlock("Prompt is
too long")])`. Our check only inspected `error_text = str(sdk_error)`
which is `"invalid_request"` — not a prompt-too-long pattern. The
content was then streamed out as `StreamText`, setting `events_yielded =
1`, which blocked retry even when the ResultMessage fired.

**Path 2 — ResultMessage success subtype:**
After the SDK auto-compacts internally (via `PreCompact` hook) and the
compacted transcript is _still_ too long, the SDK returns
`ResultMessage(subtype="success", result="Prompt is too long")`. Our
check only ran for `subtype="error"`. With `subtype="success"`, the
stream "completed normally", appended the synthetic error entry to the
transcript via `transcript_builder`, and uploaded it to GCS — causing
the transcript to grow on each failed turn.

## What

- **AssistantMessage handler**: when `sdk_error` is set, also check the
content text. `sdk_error` being non-`None` confirms this is an API error
message (not user-generated content), so content inspection is safe.
- **ResultMessage handler**: check `result` for prompt-too-long patterns
regardless of `subtype`, covering the SDK auto-compact path where
`subtype="success"` with `result="Prompt is too long"`.
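
A minimal sketch of the two checks, using simplified stand-in dataclasses rather than the real SDK message types. The pattern list and the helper names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumption: a single lowercase substring is enough for this illustration.
PROMPT_TOO_LONG_PATTERNS = ("prompt is too long",)


def is_prompt_too_long(text: Optional[str]) -> bool:
    return bool(text) and any(p in text.lower() for p in PROMPT_TOO_LONG_PATTERNS)


@dataclass
class AssistantMessage:  # simplified stand-in, content is a list of text strings
    error: Optional[str] = None
    content: list = field(default_factory=list)


@dataclass
class ResultMessage:  # simplified stand-in
    subtype: str = "success"
    result: Optional[str] = None


def assistant_needs_retry(msg: AssistantMessage) -> bool:
    # Content is inspected only when the SDK flagged an error, so ordinary
    # assistant text that mentions the phrase never triggers a retry.
    if msg.error is None:
        return False
    return is_prompt_too_long(msg.error) or any(is_prompt_too_long(t) for t in msg.content)


def result_needs_retry(msg: ResultMessage) -> bool:
    # Checked regardless of subtype, covering the auto-compact "success" path.
    return is_prompt_too_long(msg.result)
```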

## How

Two targeted one-line condition expansions in `_run_stream_attempt`,
plus two new integration tests in `retry_scenarios_test.py` that
reproduce each broken path and verify retry fires correctly.

## Changes

- `backend/copilot/sdk/service.py`: fix AssistantMessage content check +
ResultMessage subtype-independent check
- `backend/copilot/sdk/retry_scenarios_test.py`: add 2 integration tests
for the new scenarios

## Checklist

- [x] Tests added for both new scenarios (45 total, all pass)
- [x] Formatted (`poetry run format`)
- [x] No false-positive risk: AssistantMessage check gated behind
`sdk_error is not None`
- [x] Root cause verified from production pod logs
2026-04-01 22:32:09 +00:00
Zamil Majdy
4ac0ba570a fix(backend): fix copilot credential loading across event loops (#12628)
## Why

CoPilot autopilot sessions are inconsistently failing to load user
credentials (specifically GitHub OAuth). Some sessions proceed normally,
some show "provide credentials" prompts despite the user having valid
creds, and some are completely blocked.

Production logs confirmed the root cause: `RuntimeError: Task got Future
<Future pending> attached to a different loop` in the credential refresh
path, cascading into null-cache poisoning that blocks credential lookups
for 60 seconds.

## What

Three interrelated bugs in the credential system:

1. **`refresh_if_needed` always acquired Redis locks even with
`lock=False`** — The `lock` parameter only controlled the inner
credential lock, but the outer "refresh" scope lock was always acquired.
The copilot executor uses multiple worker threads with separate event
loops; the `asyncio.Lock` inside `AsyncRedisKeyedMutex` was bound to one
loop and failed on others.

2. **Stale event loop in `locks()` singleton** — Both
`IntegrationCredentialsManager` and `IntegrationCredentialsStore` cached
their `AsyncRedisKeyedMutex` without tracking which event loop created
it. When a different worker thread (with a different loop) reused the
singleton, it got the "Future attached to different loop" error.

3. **Null-cache poisoning on refresh failure** — When OAuth refresh
failed (due to the event loop error), the code fell through to cache "no
credentials found" for 60 seconds via `_null_cache`. This blocked ALL
subsequent credential lookups for that user+provider, even though the
credentials existed and could refresh fine on retry.

## How

- Split `refresh_if_needed` into `_refresh_locked` / `_refresh_unlocked`
so `lock=False` truly skips ALL Redis locking (safe for copilot's
best-effort background injection)
- Added event loop tracking to `locks()` in both
`IntegrationCredentialsManager` and `IntegrationCredentialsStore` —
recreates the mutex when the running loop changes
- Only populate `_null_cache` when the user genuinely has no
credentials; skip caching when OAuth refresh failed transiently
- Updated existing test to verify null-cache is not poisoned on refresh
failure
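
The event loop tracking can be sketched as follows. `LoopAwareHolder` is a hypothetical name, and `asyncio.Lock` stands in for `AsyncRedisKeyedMutex`; the real `locks()` implementation may differ.

```python
import asyncio
from typing import Any, Callable, Optional


class LoopAwareHolder:
    """Caches one loop-bound object and rebuilds it when the running loop changes."""

    def __init__(self, factory: Callable[[], Any]):
        self._factory = factory
        self._value: Optional[Any] = None
        self._loop: Optional[asyncio.AbstractEventLoop] = None

    def get(self) -> Any:
        loop = asyncio.get_running_loop()
        # Rebuild when nothing is cached yet or the cached object belongs
        # to a loop other than the one calling us now.
        if self._value is None or self._loop is not loop:
            self._value = self._factory()
            self._loop = loop
        return self._value
```

Within one loop the holder returns the same object on every call; a worker thread running a different loop gets a fresh one instead of an object bound to the old loop.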

## Test plan

- [x] All 14 existing `integration_creds_test.py` tests pass
- [x] Updated
`test_oauth2_refresh_failure_returns_none_without_null_cache` verifies
null-cache is not populated on refresh failure
- [x] Format, lint, and typecheck pass
- [ ] Deploy to staging and verify copilot sessions consistently load
GitHub credentials
2026-04-02 00:11:38 +07:00
Zamil Majdy
d61a2c6cd0 Revert "fix(backend/copilot): detect prompt-too-long in AssistantMessage content and ResultMessage success subtype"
This reverts commit 1c301b4b61.
2026-04-01 18:59:38 +02:00
Zamil Majdy
1c301b4b61 fix(backend/copilot): detect prompt-too-long in AssistantMessage content and ResultMessage success subtype
The SDK returns AssistantMessage(error="invalid_request", content=[TextBlock("Prompt is too long")])
followed by ResultMessage(subtype="success", result="Prompt is too long") when the transcript is
rejected after internal auto-compaction. Both paths bypassed the retry mechanism:

- AssistantMessage handler only checked error_text ("invalid_request"), not the content which
  holds the actual error description. The content was then streamed as text, setting events_yielded=1,
  which blocked retry even when ResultMessage fired.
- ResultMessage handler only triggered prompt-too-long detection for subtype="error", not
  subtype="success". The stream "completed normally", stored the synthetic error entry in the
  transcript, and uploaded it — causing the transcript to grow unboundedly on each failed turn.

Fixes:
1. AssistantMessage handler: when sdk_error is set (confirmed error message), also check content
   text. sdk_error being set guarantees this is an API error, not user-generated content, so
   content inspection is safe.
2. ResultMessage handler: check result for prompt-too-long regardless of subtype, covering the
   case where the SDK auto-compacts internally but the result is still too long.

Adds integration tests for both new scenarios.
2026-04-01 18:28:46 +02:00
Zamil Majdy
24d0c35ed3 fix(backend/copilot): prompt-too-long retry, compaction churn, model-aware compression, and truncated tool call recovery (#12625)
## Why

CoPilot has several context management issues that degrade long
sessions:
1. "Prompt is too long" errors crash the session instead of triggering
retry/compaction
2. Stale thinking blocks bloat transcripts, causing unnecessary
compaction every turn
3. Compression target is hardcoded regardless of model context window
size
4. Truncated tool calls (empty `{}` args from max_tokens) kill the
session instead of guiding the model to self-correct

## What

**Fix 1: Prompt-too-long retry bypass (SENTRY-1207)**
The SDK surfaces "prompt too long" via `AssistantMessage.error` and
`ResultMessage.result` — neither triggered the retry/compaction loop
(only Python exceptions did). Now both paths are intercepted and
re-raised.

**Fix 2: Strip stale thinking blocks before upload**
Thinking/redacted_thinking blocks in non-last assistant entries are
10-50K tokens each but only needed for API signature verification in the
*last* message. Stripping before upload reduces transcript size and
prevents per-turn compaction.
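
A rough sketch of the stripping step, assuming transcript entries are dicts shaped like `{"type": "assistant", "message": {"id": ..., "content": [...]}}`. That shape is an assumption, and this is not the actual `strip_stale_thinking_blocks()` code.

```python
THINKING_TYPES = {"thinking", "redacted_thinking"}


def strip_stale_thinking_blocks(entries: list[dict]) -> list[dict]:
    """Drop thinking blocks from every assistant entry except the last message."""
    # Reverse-scan for the id of the last assistant message.
    last_id = None
    for entry in reversed(entries):
        if entry.get("type") == "assistant":
            last_id = entry.get("message", {}).get("id")
            break

    stripped = []
    for entry in entries:
        msg = entry.get("message", {})
        content = msg.get("content")
        is_stale = entry.get("type") == "assistant" and msg.get("id") != last_id
        if is_stale and isinstance(content, list):
            kept = [
                block for block in content
                if not (isinstance(block, dict) and block.get("type") in THINKING_TYPES)
            ]
            # Build a new entry so the caller's transcript is left untouched.
            entry = {**entry, "message": {**msg, "content": kept}}
        stripped.append(entry)
    return stripped
```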

**Fix 3: Model-aware compression target**
`compress_context()` now computes `target_tokens` from the model's
context window (e.g. 140K for Opus 200K) instead of a hardcoded 120K
default. Larger models retain more history; smaller models compress more
aggressively.
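
The arithmetic can be sketched as below, using the 60K overhead mentioned in the How section of this message. The model names, the window table, and the fallback to the old 120K default for unknown models are assumptions.

```python
# Hypothetical model table; real context windows come from model configuration.
CONTEXT_WINDOWS = {"opus-200k": 200_000, "small-128k": 128_000}
OVERHEAD_TOKENS = 60_000   # headroom reserved outside the compressed history
DEFAULT_TARGET = 120_000   # the previous hardcoded default


def get_compression_target(model: str) -> int:
    window = CONTEXT_WINDOWS.get(model)
    if window is None:
        # Assumption: unknown models keep the old default.
        return DEFAULT_TARGET
    return max(window - OVERHEAD_TOKENS, 0)
```

With these numbers a 200K window gives a 140K target, matching the Opus example above.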

**Fix 4: Self-correcting truncated tool calls**
When the model's response exceeds max_tokens, tool call inputs get
silently truncated to `{}`. Previously this tripped a circuit breaker
after 3 attempts. Now the MCP wrapper detects empty args and returns
guidance: "write in chunks with `cat >>`, pass via
`@@agptfile:filename`". The model can self-correct instead of the
session dying.
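
The detection step might look like this sketch. `guard_truncated_args` and the exact guidance wording are hypothetical; only the idea of returning guidance for empty args on tools with required params comes from this PR.

```python
from typing import Optional


def guard_truncated_args(tool_name: str, args: dict, required: list[str]) -> Optional[str]:
    """Return guidance text when a tool with required params arrives with empty args."""
    if required and not args:
        # Empty args on a tool that needs input almost certainly mean the
        # call was cut off by max_tokens, so guide the model instead of failing.
        return (
            f"Arguments for `{tool_name}` were empty, likely truncated by max_tokens. "
            "Write large content in chunks with `cat >>` and pass it via "
            "`@@agptfile:filename`."
        )
    return None  # args look fine; run the tool normally
```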

## How

- **service.py**: `_is_prompt_too_long` checks in both
`AssistantMessage.error` and `ResultMessage` error handlers. Circuit
breaker limit raised from 3→5.
- **transcript.py**: `strip_stale_thinking_blocks()` reverse-scans for
last assistant `message.id`, strips thinking blocks from all others.
Called in `upload_transcript()`.
- **prompt.py**: `get_compression_target(model)` computes
`context_window - 60K overhead`. `compress_context()` uses it when
`target_tokens` is None.
- **tool_adapter.py**: `_truncating` wrapper intercepts empty args on
tools with required params, returns actionable guidance instead of
failing.

## Related

- Fixes SENTRY-1207
- Sessions: `d2f7cba3` (repeated compaction), `08b807d4` (prompt too
long), `130d527c` (truncated tool calls)
- Extends #12413, consolidates #12626

## Test plan

- [x] 6 unit tests for `strip_stale_thinking_blocks`
- [x] 1 integration test for ResultMessage prompt-too-long → compaction
retry
- [x] Pyright clean (0 errors), all pre-commit hooks pass
- [ ] E2E: Load transcripts from affected sessions and verify behavior
2026-04-01 15:10:57 +00:00
Zamil Majdy
8aae7751dc fix(backend/copilot): prevent duplicate block execution from pre-launch arg mismatch (#12632)
## Why

CoPilot sessions are duplicating Linear tickets and GitHub PRs.
Investigation of 5 production sessions (March 31st) found that 3/5
created duplicate Linear issues — each with consecutive IDs at the exact
same timestamp, but only one visible in Langfuse traces.

Production gcloud logs confirm: **279 arg mismatch warnings per day**,
**37 duplicate block execution pairs**, and all `LinearCreateIssueBlock`
failures occurring in pairs.

Related: SECRT-2204

## What

Replace the speculative pre-launch mechanism with the SDK's native
parallel dispatch via `readOnlyHint` tool annotations. Remove ~580 lines
of pre-launch infrastructure code.

## How

### Root cause
The pre-launch mechanism had three compounding bugs:
1. **Arg mismatch**: The SDK CLI normalises args between the
`AssistantMessage` (used for pre-launch) and the MCP `tools/call`
dispatch, causing frequent mismatches (279/day in prod)
2. **FIFO desync on denial**: Security hooks can deny tool calls,
causing the CLI to skip the MCP dispatch — but the pre-launched task
stays in the FIFO queue, misaligning all subsequent matches
3. **Cancel race**: `task.cancel()` is best-effort in asyncio — if the
HTTP call to Linear/GitHub already completed, the side effect is
irreversible
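
The cancel race in item 3 can be reproduced with plain asyncio. In this sketch `create_issue` is a hypothetical stand-in whose list append plays the role of the external HTTP call.

```python
import asyncio

created: list[str] = []


async def create_issue(title: str) -> str:
    created.append(title)      # side effect happens first (stands in for the HTTP call)
    await asyncio.sleep(1)     # the task is still "running" afterwards
    return title


async def main() -> bool:
    task = asyncio.create_task(create_issue("duplicate ticket"))
    await asyncio.sleep(0)     # yield once so the task runs up to its first await
    task.cancel()              # cancellation lands after the side effect
    try:
        await task
    except asyncio.CancelledError:
        pass
    return task.cancelled()


was_cancelled = asyncio.run(main())
```

The task reports as cancelled, yet `created` still holds the entry: cancelling cannot undo work that already completed.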

### Fix
- **Removed** `pre_launch_tool_call()`, `cancel_pending_tool_tasks()`,
`_tool_task_queues` ContextVar, all FIFO queue logic, and all 4
`cancel_pending_tool_tasks()` calls in `service.py`
- **Added** `readOnlyHint=True` annotations on 15+ read-only tools
(`find_block`, `search_docs`, `list_workspace_files`, etc.) — the SDK
CLI natively dispatches these in parallel ([ref:
anthropics/claude-code#14353](https://github.com/anthropics/claude-code/issues/14353))
- Side-effect tools (`run_block`, `bash_exec`, `create_agent`, etc.)
have no annotation → CLI runs them sequentially → no duplicate execution
risk

### Net change: -578 lines, +105 lines
2026-04-01 13:42:54 +00:00
An Vy Le
725da7e887 dx(backend/copilot): clarify ambiguous agent goals using find_block before generation (#12601)
### Why / What / How

**Why:** When a user asks CoPilot to build an agent with an ambiguous
goal (output format, delivery channel, data source, or trigger
unspecified), the agent generator previously made assumptions and jumped
straight into JSON generation. This produced agents that didn't match
what the user actually wanted, requiring multiple correction cycles.

**What:** Adds a "Clarifying Before Building" section to the agent
generation guide. When the goal is ambiguous, CoPilot first calls
`find_block` to discover what the platform actually supports for the
ambiguous dimension, then asks the user one concrete question grounded
in real platform options (e.g. "The platform supports Gmail, Slack, and
Google Docs — which should the agent use for delivery?"). Only after the
user answers does the full agent generation workflow proceed.

**How:** The clarification instruction is added to
`agent_generation_guide.md` — the guide loaded on-demand via
`get_agent_building_guide` when the LLM is about to build an agent. This
avoids polluting the system prompt supplement (which loads for every
CoPilot conversation, not just agent building). No dedicated tool is
needed — the LLM asks naturally in conversation text after discovering
real platform options via `find_block`.

### Changes 🏗️

- `backend/copilot/sdk/agent_generation_guide.md`: Adds "Clarifying
Before Building" section before the workflow steps. Instructs the model
to call `find_block` for the ambiguous dimension, ask the user one
grounded question, wait for the answer, then proceed to generation.
- `backend/copilot/prompting_test.py`: New test file verifying the guide
contains the clarification section and references `find_block`.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [ ] Ask CoPilot to "build an agent to send a report" (ambiguous
    output) and verify it calls `find_block` for delivery options and asks
    one grounded question before generating JSON
  - [ ] Ask CoPilot to "build an agent to scrape prices from Amazon and
    email me daily" (specific goal) and verify it skips clarification and
    proceeds directly to agent generation
  - [ ] Verify the clarification question lists real block options (e.g.
    Gmail, Slack, Google Docs) rather than abstract options
---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
2026-04-01 13:32:12 +00:00
seer-by-sentry[bot]
bd9e9ec614 fix(frontend): remove LaunchDarkly local storage bootstrapping (#12606)
### Why / What / How

**Why:** This PR fixes
[BUILDER-7HD](https://sentry.io/organizations/significant-gravitas/issues/7374387984/).
The LaunchDarkly SDK failed to construct its streaming URL because
malformed `localStorage` bootstrap data produced a non-string `_url`.

**What:** Removed the `bootstrap: "localStorage"` option from the
LaunchDarkly provider configuration.

**How:** LaunchDarkly no longer attempts to load initial feature flag
values from local storage. Flag values are now always fetched directly
from the LaunchDarkly service, which avoids problems with stale or
malformed local storage data.

### Changes 🏗️

- Removed the `bootstrap: "localStorage"` option from the LaunchDarkly
provider configuration.
- LaunchDarkly will now always fetch flag values directly from its
service, bypassing local storage.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [ ] I have made a test plan
- [ ] I have tested my changes according to the test plan:
  - [ ] Verify that LaunchDarkly flags are loaded correctly without
    issues.
  - [ ] Ensure no errors related to `localStorage` or streaming URL
    construction appear in the console.

#### For configuration changes:

- [ ] `.env.default` is updated or already compatible with my changes
- [ ] `docker-compose.yml` is updated or already compatible with my
changes
- [ ] I have included a list of my configuration changes in the PR
description (under **Changes**)

---------

Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
Co-authored-by: seer-by-sentry[bot] <157164994+seer-by-sentry[bot]@users.noreply.github.com>
2026-04-01 19:12:54 +07:00
Nicholas Tindle
88589764b5 dx(platform): normalize agent instructions for Claude and Codex (#12592)
### Why / What / How

Why: repo guidance was split between Claude-specific `CLAUDE.md` files
and Codex-specific `AGENTS.md` files, which duplicated instruction
content and made the same repository behave differently across agents.
The repo also had Claude skills under `.claude/skills` but no
Codex-visible repo skill path.

What: this PR bridges the repo's Claude skills into Codex and normalizes
shared instruction files so `AGENTS.md` becomes the canonical source
while each `CLAUDE.md` imports its sibling `AGENTS.md`.

How: add a repo-local `.agents/skills` symlink pointing to
`../.claude/skills`; move nested `CLAUDE.md` content into sibling
`AGENTS.md` files; replace each repo `CLAUDE.md` with a one-line
`@AGENTS.md` shim so Claude and Codex read the same scoped guidance
without duplicating text. The root `CLAUDE.md` now imports the root
`AGENTS.md` rather than symlinking to it.

Note: the instruction-file normalization commit was created with
`--no-verify` because the repo's frontend pre-commit `tsc` hook
currently fails on unrelated existing errors, mostly caused by missing
`autogpt_platform/frontend/src/app/api/__generated__/*` modules.

### Changes 🏗️

- Add `.agents/skills` as a repo-local symlink to `../.claude/skills` so
Codex discovers the existing Claude repo skills.
- Add a real root `CLAUDE.md` shim that imports the canonical root
`AGENTS.md`.
- Promote nested scoped instruction content into sibling `AGENTS.md`
files under `autogpt_platform/`, `autogpt_platform/backend/`,
`autogpt_platform/frontend/`, `autogpt_platform/frontend/src/tests/`,
and `docs/`.
- Replace the corresponding nested `CLAUDE.md` files with one-line
`@AGENTS.md` shims.
- Preserve the existing scoped instruction hierarchy while making the
shared content cross-compatible between Claude and Codex.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Verified `.agents/skills` resolves to `../.claude/skills`
  - [x] Verified each repo `CLAUDE.md` now contains only `@AGENTS.md`
  - [x] Verified the expected `AGENTS.md` files exist at the root and
    nested scoped directories
  - [x] Verified the branch contains only the intended agent-guidance
    commits relative to `dev` and the working tree is clean

#### For configuration changes:

- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)

No runtime configuration changes are included in this PR.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk: documentation/instruction-file reshuffle plus an
`.agents/skills` pointer; no runtime code paths are modified.
> 
> **Overview**
> Unifies agent guidance so **`AGENTS.md` becomes canonical** and all
corresponding `CLAUDE.md` files become 1-line shims (`@AGENTS.md`) at
the repo root, `autogpt_platform/`, backend, frontend, frontend tests,
and `docs/`.
> 
> Adds `.agents/skills` pointing to `../.claude/skills` so non-Claude
agents discover the same shared skills/instructions, eliminating
duplicated/agent-specific guidance content.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
839483c3b6. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
2026-04-01 09:08:51 +00:00
Zamil Majdy
c659f3b058 fix(copilot): fix dry-run simulation showing INCOMPLETE/error status (#12580)
## Summary
- **Backend**: Strip empty `error` pins from dry-run simulation outputs
that the simulator always includes (set to `""` meaning "no error").
This was causing the LLM to misinterpret successful simulations as
failures and report "INCOMPLETE" status to users
- **Backend**: Add explicit "Status: COMPLETED" to dry-run response
message to prevent LLM misinterpretation
- **Backend**: Update simulation prompt to exclude `error` from the
"MUST include" keys list, and instruct LLM to omit error unless
simulating a logical failure
- **Frontend**: Fix `isRunBlockErrorOutput()` type guard that was too
broad (`"error" in output` matched BlockOutputResponse objects, not just
ErrorResponse), causing dry-run results to be displayed as errors
- **Frontend**: Fix `parseOutput()` fallback matching to not classify
BlockOutputResponse as ErrorResponse
- **Frontend**: Filter out empty error pins from `BlockOutputCard`
display and accordion metadata output key counting
- **Frontend**: Clear stale execution results before dry-run/no-input
runs so the UI shows fresh output
- **Frontend**: Fix first-click simulate race condition by invalidating
execution details query after WebSocket subscription confirms
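
The backend stripping step could be sketched as below, assuming simulation outputs are a flat dict of pin name to value. That shape and the function name are assumptions; the real simulator output may be structured differently.

```python
def strip_empty_error_pins(outputs: dict) -> dict:
    """Drop an `error` pin whose value is the empty string (meaning "no error")."""
    return {
        pin: value
        for pin, value in outputs.items()
        if not (pin == "error" and value == "")
    }
```

A non-empty `error` value is kept, so a simulated logical failure still reaches the LLM.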

## Test plan
- [x] All 12 existing + 5 new dry-run tests pass (`poetry run pytest
backend/copilot/tools/test_dry_run.py -x -v`)
- [x] All 23 helpers tests pass (`poetry run pytest
backend/copilot/tools/helpers_test.py -x -v`)
- [x] All 13 run_block tests pass (`poetry run pytest
backend/copilot/tools/run_block_test.py -x -v`)
- [x] Backend linting passes (ruff check + format)
- [x] Frontend linting passes (next lint)
- [ ] Manual: trigger dry-run on a block with error output pin (e.g.
Komodo Image Generator) — should show "Simulated" status with clean
output, no misleading "error" section
- [ ] Manual: first click on Simulate button should immediately show
results (no race condition)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-03-31 21:03:00 +00:00
Zamil Majdy
80581a8364 fix(copilot): add tool call circuit breakers and intermediate persistence (#12604)
## Why

CoPilot session `d2f7cba3` took **82 minutes** and cost **$20.66** for a
single user message. Root causes:
1. Redis session meta key expired after 1h, making the session invisible
to the resume endpoint and causing an empty page on reload
2. Redis stream key also expired during sub-agent gaps (task_progress
events produced no chunks)
3. No intermediate persistence — session messages only saved to DB after
the entire turn completes
4. Sub-agents retried similar WebSearch queries (addressed via prompt
guidance)

## What

### Redis TTL fixes (root cause of empty session on reload)
- `publish_chunk()` now periodically refreshes **both** the session meta
key AND stream key TTL (every 60s).
- `task_progress` SDK events now emit `StreamHeartbeat` chunks, ensuring
`publish_chunk` is called even during long sub-agent gaps where no real
chunks are produced.
- Without this fix, turns exceeding the 1h `stream_ttl` lose their
"running" status and stream data, making `get_active_session()` return
False.
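
A minimal sketch of the throttled refresh described above, not the actual `publish_chunk()` code. The key names, the `FakeRedis` stand-in, and the `maybe_refresh_ttl` helper are assumptions made for illustration; only the 60s cadence and the 1h TTL come from the description.

```python
import time
from typing import Optional

STREAM_TTL_SECONDS = 3600      # the 1h stream_ttl mentioned above
REFRESH_INTERVAL_SECONDS = 60  # refresh cadence mentioned above

# Module-level tracker: session_id -> time of the last TTL refresh.
_last_refresh: dict = {}


class FakeRedis:
    """In-memory stand-in that only records expire() calls (assumption)."""

    def __init__(self) -> None:
        self.expire_calls: list = []

    def expire(self, key: str, ttl: int) -> None:
        self.expire_calls.append((key, ttl))


def maybe_refresh_ttl(redis: FakeRedis, session_id: str, now: Optional[float] = None) -> bool:
    """Refresh the TTL of both keys at most once per interval.

    Returns True when a refresh was performed.
    """
    now = time.monotonic() if now is None else now
    last = _last_refresh.get(session_id)
    if last is not None and now - last < REFRESH_INTERVAL_SECONDS:
        return False
    # Hypothetical key names, for illustration only.
    redis.expire(f"session:{session_id}:meta", STREAM_TTL_SECONDS)
    redis.expire(f"session:{session_id}:stream", STREAM_TTL_SECONDS)
    _last_refresh[session_id] = now
    return True
```

Calling a helper like this on every chunk, heartbeats included, keeps both keys alive for as long as the turn keeps producing events.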

### Intermediate DB persistence
- Session messages flushed to DB every **30 seconds** or **10 new
messages** during the stream loop.
- Uses `asyncio.shield(upsert_chat_session())` matching the existing
`finally` block pattern.
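
The flush policy can be sketched roughly as follows. `IntermediateFlusher` and its `save` callback are hypothetical names; in the PR the save is `upsert_chat_session()` and the logic lives inside the stream loop.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional

FLUSH_INTERVAL_SECONDS = 30.0
FLUSH_MESSAGE_COUNT = 10


class IntermediateFlusher:
    """Tracks unsaved messages and flushes on a time or count threshold."""

    def __init__(self, save: Callable[[], Awaitable[None]]) -> None:
        self._save = save
        self._unsaved = 0
        self._last_flush = time.monotonic()

    def should_flush(self, now: float) -> bool:
        if self._unsaved == 0:
            return False
        return (
            self._unsaved >= FLUSH_MESSAGE_COUNT
            or now - self._last_flush >= FLUSH_INTERVAL_SECONDS
        )

    async def on_message(self, now: Optional[float] = None) -> bool:
        """Record one new message and flush if a threshold is reached."""
        now = time.monotonic() if now is None else now
        self._unsaved += 1
        if not self.should_flush(now):
            return False
        # shield() lets the save finish even if the surrounding task is cancelled.
        await asyncio.shield(self._save())
        self._unsaved = 0
        self._last_flush = now
        return True
```

Wrapping the save in `asyncio.shield()` means a cancelled stream does not leave a half-written session behind.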

### Orphaned message cleanup on rollback
- On stream attempt rollback, orphaned messages persisted by
intermediate flushes are now cleaned up from the DB via
`delete_messages_from_sequence`.
- Prevents stale messages from resurfacing on page reload after a failed
retry.

### Prompt guidance
- Added web search best practices to code supplement (search efficiency,
sub-agent scope separation).

### Approach: root cause fixes, not capability limits
- **No tool call caps** — artificial limits on WebSearch or total tool
calls would reduce autopilot capability without addressing why searches
were redundant.
- **Task tool remains enabled** — sub-agent delegation via Task is a
core capability. The existing `max_subtasks` concurrency guard is
sufficient.
- The real fixes (TTL refresh, persistence, prompt guidance) address the
underlying bugs and behavioral issues.

## How

### Files changed
- `stream_registry.py` — Redis meta + stream key TTL refresh in
`publish_chunk()`, module-level keepalive tracker
- `response_adapter.py` — `task_progress` SystemMessage →
StreamHeartbeat emission
- `service.py` — Intermediate DB persistence in `_run_stream_attempt`
stream loop, orphan cleanup on rollback
- `db.py` — `delete_messages_from_sequence` for rollback cleanup
- `prompting.py` — Web search best practices

### GCP log evidence
```
# Meta key expired during 82-min turn:
09:49 — GET_SESSION: active_session=False, msg_count=1  ← meta gone
10:18 — Session persisted in finally with 189 messages   ← turn completed

# T13 (1h45min) same bug reproduced live:
16:20 — task_progress events still arriving, but active_session=False

# Actual cost:
Turn usage: cache_read=347916, cache_create=212472, output=12375, cost_usd=20.66
```

### Test plan
- [x] task_progress emits StreamHeartbeat
- [x] Task background blocked, foreground allowed, slot release on
completion/failure
- [x] CI green (lint, type-check, tests, e2e, CodeQL)

---------

Co-authored-by: Zamil Majdy <majdy.zamil@gmail.com>
2026-03-31 21:01:56 +00:00
lif
3c046eb291 fix(frontend): show all agent outputs instead of only the last one (#12504)
Fixes #9175

### Changes 🏗️

The Agent Outputs panel only displayed the last execution result per
output node, discarding all prior outputs during a run.

**Root cause:** In `AgentOutputs.tsx`, the `outputs` useMemo extracted
only the last element from `nodeExecutionResults`:
```tsx
const latestResult = executionResults[executionResults.length - 1];
```

**Fix:** Changed `.map()` to `.flatMap()` over output nodes, iterating
through all `executionResults` for each node. Each execution result now
gets its own renderer lookup and metadata entry, so the panel shows
every output produced during the run.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Verified TypeScript compiles without errors
  - [x] Confirmed the flatMap logic correctly iterates all execution
    results
  - [x] Verified existing filter for null renderers is preserved
  - [x] Run an agent with multiple outputs and confirm all show in the
    panel

---------

Signed-off-by: majiayu000 <1835304752@qq.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
2026-03-31 20:31:12 +00:00
Zamil Majdy
3e25488b2d feat(copilot): add session-level dry_run flag to autopilot sessions (#12582)
## Summary
- Adds a session-level `dry_run` flag that forces ALL tool calls
(`run_block`, `run_agent`) in a copilot/autopilot session to use dry-run
simulation mode
- Stores the flag in a typed `ChatSessionMetadata` JSON model on the
`ChatSession` DB row, accessed via `session.dry_run` property
- Adds `dry_run` to the AutoPilot block Input schema so graph builders
can create dry-run autopilot nodes
- Refactors multiple copilot tools from `**kwargs` to explicit
parameters for type safety

## Changes
- **Prisma schema**: Added `metadata` JSON column to `ChatSession` model
with migration
- **Python models**: Added `ChatSessionMetadata` model with `dry_run`
field, added `metadata` field to `ChatSessionInfo` and `ChatSession`,
updated `from_db()`, `new()`, and `create_chat_session()`
- **Session propagation**: `set_execution_context(user_id, session)`
called from `baseline/service.py` so tool handlers can read
session-level flags via `session.dry_run`
- **Tool enforcement**: `run_block` and `run_agent` check
`session.dry_run` and force `dry_run=True` when set; `run_agent` blocks
scheduling in dry-run sessions
- **AutoPilot block**: Added `dry_run` input field, passes it when
creating sessions
- **Chat API**: Added `CreateSessionRequest` model with `dry_run` field
to `POST /sessions` endpoint; added `metadata` to session responses
- **Frontend**: Updated `useChatSession.ts` to pass body to the create
session mutation
- **Tool refactoring**: Multiple copilot tools refactored from
`**kwargs` to explicit named parameters (agent_browser, manage_folders,
workspace_files, connect_integration, agent_output, bash_exec, etc.) for
better type safety

## Test plan
- [x] Unit tests for `ChatSession.new()` with dry_run parameter
- [x] Unit tests for `RunBlockTool` session dry_run override
- [x] Unit tests for `RunAgentTool` session dry_run override
- [x] Unit tests for session dry_run blocks scheduling
- [x] Existing dry_run tests still pass (12/12)
- [x] Existing permissions tests still pass
- [x] All pre-commit hooks pass (ruff, isort, pyright, tsc)
- [ ] Manual: Create autopilot session with `dry_run=True`, verify
run_block/run_agent calls use simulation

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 16:27:36 +00:00
Abhimanyu Yadav
57b17dc8e1 feat(platform): generic managed credential system with AgentMail auto-provisioning (#12537)
### Why / What / How

**Why:** We need a third credential type: **system-provided but unique
per user** (managed credentials). Currently we have system credentials
(same for all users) and user credentials (user provides their own
keys). Managed credentials bridge the gap — the platform provisions them
automatically, one per user, for integrations like AgentMail where each
user needs their own pod-scoped API key.

**What:**
- Generic **managed credential provider registry** — any integration can
register a provider that auto-provisions per-user credentials
- **AgentMail** is the first consumer: creates a pod + pod-scoped API
key using the org-level API key
- Managed credentials appear in the credential dropdown like normal API
keys but with `autogpt_managed=True` — users **cannot update or delete**
them
- **Auto-provisioning** on `GET /credentials` — lazily creates managed
credentials when users browse their credential list
- **Account deletion cleanup** utility — revokes external resources
(pods, API keys) before user deletion
- **Frontend UX** — hides the delete button for managed credentials on
the integrations page

**How:**

### Backend

**New files:**
- `backend/integrations/managed_credentials.py` —
`ManagedCredentialProvider` ABC, global registry,
`ensure_managed_credentials()` (with per-user asyncio lock +
`asyncio.gather` for concurrency), `cleanup_managed_credentials()`
- `backend/integrations/managed_providers/__init__.py` —
`register_all()` called at startup
- `backend/integrations/managed_providers/agentmail.py` —
`AgentMailManagedProvider` with `provision()` (creates pod + API key via
agentmail SDK) and `deprovision()` (deletes pod)

**Modified files:**
- `credentials_store.py` — `autogpt_managed` guards on update/delete,
`has_managed_credential()` / `add_managed_credential()` helpers
- `model.py` — `autogpt_managed: bool` + `metadata: dict` on
`_BaseCredentials`
- `router.py` — calls `ensure_managed_credentials()` in list endpoints,
removed explicit `/agentmail/connect` endpoint
- `user.py` — `cleanup_user_managed_credentials()` for account deletion
- `rest_api.py` — registers managed providers at startup
- `settings.py` — `agentmail_api_key` setting

### Frontend
- Added `autogpt_managed` to `CredentialsMetaResponse` type
- Conditionally hides delete button on integrations page for managed
credentials

### Key design decisions
- **Auto-provision in API layer, not data layer** — keeps
`get_all_creds()` side-effect-free
- **Race-safe** — per-(user, provider) asyncio lock with double-check
pattern prevents duplicate pods
- **Idempotent** — AgentMail SDK `client_id` ensures pod creation is
idempotent; `add_managed_credential()` uses upsert under Redis lock
- **Error-resilient** — provisioning failures are logged but never block
credential listing

### Changes 🏗️

| File | Action | Description |
|------|--------|-------------|
| `backend/integrations/managed_credentials.py` | NEW | ABC, registry, ensure/cleanup |
| `backend/integrations/managed_providers/__init__.py` | NEW | Registers all providers at startup |
| `backend/integrations/managed_providers/agentmail.py` | NEW | AgentMail provisioning/deprovisioning |
| `backend/integrations/credentials_store.py` | MODIFY | Guards + managed credential helpers |
| `backend/data/model.py` | MODIFY | `autogpt_managed` + `metadata` fields |
| `backend/api/features/integrations/router.py` | MODIFY | Auto-provision on list, removed `/agentmail/connect` |
| `backend/data/user.py` | MODIFY | Account deletion cleanup |
| `backend/api/rest_api.py` | MODIFY | Provider registration at startup |
| `backend/util/settings.py` | MODIFY | `agentmail_api_key` setting |
| `frontend/.../integrations/page.tsx` | MODIFY | Hide delete for managed creds |
| `frontend/.../types.ts` | MODIFY | `autogpt_managed` field |

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] 23 tests pass in `router_test.py` (9 new tests for
    ensure/cleanup/auto-provisioning)
  - [x] `poetry run format && poetry run lint` — clean
  - [x] OpenAPI schema regenerated
  - [x] Manual: verify managed credential appears in AgentMail block
    dropdown
  - [x] Manual: verify delete button hidden for managed credentials
  - [x] Manual: verify managed credential cannot be deleted via API (403)

#### For configuration changes:
- [x] `.env.default` is updated with `AGENTMAIL_API_KEY=`

---------

Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
2026-03-31 12:56:18 +00:00
Krishna Chaitanya
a20188ae59 fix(blocks): validate non-empty input in AIConversationBlock before LLM call (#12545)
### Why / What / How

**Why:** When `AIConversationBlock` receives an empty messages list and
an empty prompt, the block blindly forwards the empty array to the
downstream LLM API, which returns a cryptic `400 Bad Request` error:
`"Invalid 'messages': empty array. Expected an array with minimum length
1."` This is confusing for users who don't understand why their agent
failed.

**What:** Add early input validation in `AIConversationBlock.run()` that
raises a clear `ValueError` when both `messages` and `prompt` are empty.
Also add three unit tests covering the validation logic.

**How:** A simple guard clause at the top of the `run` method checks `if
not input_data.messages and not input_data.prompt` before the LLM call
is made. If both are empty, a descriptive `ValueError` is raised. If
either one has content, the block proceeds normally.
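
A minimal sketch of the guard, assuming a simplified `ConversationInput` stand-in for the block's input schema and a hypothetical error message. The actual wording in `llm.py` may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationInput:
    """Hypothetical stand-in for the block's input schema."""

    messages: list = field(default_factory=list)
    prompt: str = ""


def validate_conversation_input(input_data: ConversationInput) -> None:
    """Raise a descriptive error before any LLM call is attempted."""
    if not input_data.messages and not input_data.prompt:
        raise ValueError(
            "AIConversationBlock requires at least one message or a non-empty prompt."
        )
```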

### Changes

- `autogpt_platform/backend/backend/blocks/llm.py`: Add validation guard
in `AIConversationBlock.run()` to reject empty messages + empty prompt
before calling the LLM
- `autogpt_platform/backend/backend/blocks/test/test_llm.py`: Add
`TestAIConversationBlockValidation` with three tests:
  - `test_empty_messages_and_empty_prompt_raises_error` — validates the
    guard clause
  - `test_empty_messages_with_prompt_succeeds` — ensures prompt-only usage
    still works
  - `test_nonempty_messages_with_empty_prompt_succeeds` — ensures
    messages-only usage still works

### Checklist

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Lint passes (`ruff check`)
  - [x] Formatting passes (`ruff format`)
  - [x] New unit tests validate the empty-input guard and the happy paths

Closes #11875

---------

Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
2026-03-31 12:43:42 +00:00
goingforstudying-ctrl
c410be890e fix: add empty choices guard in extract_openai_tool_calls() (#12540)
## Summary

`extract_openai_tool_calls()` in `llm.py` crashes with `IndexError` when
the LLM provider returns a response with an empty `choices` list.

### Changes 🏗️

- Added a guard check `if not response.choices: return None` before
accessing `response.choices[0]`
- This is consistent with the function's existing pattern of returning
`None` when no tool calls are found
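
The guard can be sketched as below. `extract_tool_calls` is a simplified, hypothetical version of the real function, and the response object is mocked with `SimpleNamespace` rather than an actual provider SDK type.

```python
from types import SimpleNamespace
from typing import Any, Optional


def extract_tool_calls(response: Any) -> Optional[list]:
    """Return the first choice's tool calls, or None if there are none."""
    if not response.choices:  # guard: empty or missing choices list
        return None
    tool_calls = getattr(response.choices[0].message, "tool_calls", None)
    return list(tool_calls) if tool_calls else None


# A response with no choices no longer raises IndexError.
empty_response = SimpleNamespace(choices=[])
print(extract_tool_calls(empty_response))  # None
```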

### Bug Details

When an LLM provider returns a response with an empty choices list
(e.g., due to content filtering, rate limiting, or API errors),
`response.choices[0]` raises `IndexError`. This can crash the entire
agent execution pipeline.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - Verified that the function returns `None` when `response.choices` is
    empty
  - Verified existing behavior is unchanged when `response.choices` is
    non-empty

---------

Co-authored-by: goingforstudying-ctrl <forgithubuse@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
2026-03-31 20:10:27 +07:00
Zamil Majdy
37d9863552 feat(platform): add extended thinking execution mode to OrchestratorBlock (#12512)
## Summary
- Adds `ExecutionMode` enum with `BUILT_IN` (default built-in tool-call
loop) and `EXTENDED_THINKING` (delegates to Claude Agent SDK for richer
reasoning)
- Extracts shared `tool_call_loop` into `backend/util/tool_call_loop.py`
— reusable by both OrchestratorBlock agent mode and copilot baseline
- Refactors copilot baseline to use the shared `tool_call_loop` with
callback-driven iteration

## ExecutionMode enum
`ExecutionMode` (`backend/blocks/orchestrator.py`) controls how
OrchestratorBlock executes tool calls:
- **`BUILT_IN`** — Default mode. Runs the built-in tool-call loop
(supports all LLM providers).
- **`EXTENDED_THINKING`** — Delegates to the Claude Agent SDK for
extended thinking and multi-step planning. Requires Anthropic-compatible
providers (`anthropic` / `open_router`) and direct API credentials
(subscription mode not supported). Validates both provider and model
name at runtime.

## Shared tool_call_loop
`backend/util/tool_call_loop.py` provides a generic, provider-agnostic
conversation loop:
1. Call LLM with tools → 2. Extract tool calls → 3. Execute tools → 4.
Update conversation → 5. Repeat

Callers provide three callbacks:
- `llm_call`: wraps any LLM provider (OpenAI streaming, Anthropic,
llm.llm_call, etc.)
- `execute_tool`: wraps any tool execution (TOOL_REGISTRY, graph block
execution, etc.)
- `update_conversation`: formats messages for the specific protocol
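
The loop and its three callbacks can be sketched as follows. This is a synchronous simplification with assumed signatures, an assumed dict-based message shape, and a hypothetical `max_iterations` parameter. It is not the API of `backend/util/tool_call_loop.py`.

```python
from typing import Any, Callable

Message = dict
ToolCall = dict


def tool_call_loop(
    messages: list,
    llm_call: Callable[[list], Message],
    execute_tool: Callable[[ToolCall], Any],
    update_conversation: Callable[[list, Message, list], None],
    max_iterations: int = 10,
) -> Message:
    """Repeat call -> extract -> execute -> update until no tools are requested."""
    for _ in range(max_iterations):
        response = llm_call(messages)                      # 1. call the LLM
        tool_calls = response.get("tool_calls") or []      # 2. extract tool calls
        if not tool_calls:
            return response                                # final answer, stop
        results = [execute_tool(tc) for tc in tool_calls]  # 3. execute tools
        update_conversation(messages, response, results)   # 4. update conversation
    raise RuntimeError("tool_call_loop exceeded max_iterations")
```

Because the loop only talks to its callbacks, the same function can serve a streaming chat client and a graph executor without knowing about either.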

## OrchestratorBlock EXTENDED_THINKING mode
- `_create_graph_mcp_server()` converts graph-connected blocks to MCP
tools
- `_execute_tools_sdk_mode()` runs `ClaudeSDKClient` with those MCP
tools
- Agent mode refactored to use shared `tool_call_loop`

## Copilot baseline refactored
- Streaming callbacks buffer `Stream*` events during loop execution
- Events are drained after `tool_call_loop` returns
- Same conversation logic, less code duplication

## SDK environment builder extraction
- `build_sdk_env()` extracted to `backend/copilot/sdk/env.py` for reuse
by both copilot SDK service and OrchestratorBlock

## Provider validation
EXTENDED_THINKING mode validates `provider in ('anthropic',
'open_router')` and `model_name.startswith('claude')` because the Claude
Agent SDK requires an Anthropic API key or OpenRouter key. Subscription
mode is not supported — it uses the platform's internal credit system
which doesn't provide raw API keys needed by the SDK. The validation
raises a clear `ValueError` if an unsupported provider or model is used.
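
A sketch of the two checks, with a hypothetical function name and error messages. Only the provider tuple and the `startswith('claude')` test come from the description above.

```python
SUPPORTED_PROVIDERS = ("anthropic", "open_router")


def validate_extended_thinking(provider: str, model_name: str) -> None:
    """Reject provider/model combinations the SDK path cannot serve."""
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"EXTENDED_THINKING requires one of {SUPPORTED_PROVIDERS}, got '{provider}'."
        )
    if not model_name.startswith("claude"):
        raise ValueError(
            f"EXTENDED_THINKING requires a Claude model, got '{model_name}'."
        )
```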

## PR Dependencies
This PR builds on #12511 (Claude SDK client). It can be reviewed
independently — #12511 only adds the SDK client module which this PR
imports. If #12511 merges first, this PR will have no conflicts.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] All pre-commit hooks pass (typecheck, lint, format)
  - [x] Existing OrchestratorBlock tests still pass
  - [x] Copilot baseline behavior unchanged (same stream events, same tool
    execution)
  - [x] Manual: OrchestratorBlock with execution_mode=EXTENDED_THINKING +
    downstream blocks → SDK calls tools
  - [x] Agent mode regression test (non-SDK path works as before)
  - [x] SDK mode error handling (invalid provider raises ValueError)
2026-03-31 20:04:13 +07:00
Krishna Chaitanya
2f42ff9b47 fix(blocks): validate email recipients in Gmail blocks before API call (#12546)
### Why / What / How

**Why:** When a user or LLM supplies a malformed recipient string (e.g.
a bare username, a JSON blob, or an empty value) to `GmailSendBlock`,
`GmailCreateDraftBlock`, or any reply block, the Gmail API returns an
opaque `HttpError 400: "Invalid To header"`. This surfaces as a
`BlockUnknownError` with no actionable guidance, making it impossible
for the LLM to self-correct. (Fixes #11954)

**What:** Adds a lightweight `validate_email_recipients()` function that
checks every recipient against a simplified RFC 5322 pattern
(`local@domain.tld`) and raises a clear `ValueError` listing all invalid
entries before any API call is made.

**How:** The validation is called in two shared code paths —
`create_mime_message()` (used by send and draft blocks) and
`_build_reply_message()` (used by reply blocks) — so all Gmail blocks
that compose outgoing email benefit from it with zero per-block changes.
The regex is intentionally permissive (any `x@y.z` passes) to avoid
false positives on unusual but valid addresses.
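
A minimal sketch of such a validator. The regex, the signature, and the error wording are assumptions chosen to match the behavior described (any `x@y.z` passes, all invalid entries are listed); the compiled pattern in `gmail.py` may differ.

```python
import re

# Permissive pattern: something@something.something, no whitespace or extra '@'.
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_email_recipients(recipients: list, field_name: str = "to") -> None:
    """Raise ValueError naming every malformed recipient in the given field."""
    invalid = [r for r in recipients if not _EMAIL_RE.match(r.strip())]
    if invalid:
        listed = ", ".join(repr(r) for r in invalid)
        raise ValueError(f"Invalid email address(es) in '{field_name}': {listed}")
```

Collecting every bad entry before raising gives the LLM one complete error to correct instead of one failure per retry.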

### Changes 🏗️

- Added `validate_email_recipients()` helper in `gmail.py` with a
compiled regex
- Hooked validation into `create_mime_message()` for `to`, `cc`, and
`bcc` fields
- Hooked validation into `_build_reply_message()` for reply/draft-reply
blocks
- Added `TestValidateEmailRecipients` test class covering valid,
invalid, mixed, empty, JSON-string, and field-name scenarios

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Verified `validate_email_recipients` correctly accepts valid
    emails (`user@example.com`, `a@b.com`, `test@sub.domain.co`)
  - [x] Verified it rejects malformed entries (bare names, missing domain
    dot, empty strings, JSON strings)
  - [x] Verified error messages include the field name and all invalid
    entries
  - [x] Verified empty recipient lists pass without error
  - [x] Confirmed `gmail.py` and test file parse correctly (AST check)

---------

Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
2026-03-31 12:37:33 +00:00
Zamil Majdy
914efc53e5 fix(backend): disambiguate duplicate tool names in OrchestratorBlock (#12555)
## Why
The OrchestratorBlock fails with `Tool names must be unique` when
multiple nodes use the same block type (e.g., two "Web Search" blocks
connected as tools). The Anthropic API rejects the request because
duplicate tool names are sent.

## What
- Detect duplicate tool names after building tool signatures
- Append `_1`, `_2`, etc. suffixes to disambiguate
- Enrich descriptions of duplicate tools with their hardcoded default
values so the LLM can distinguish between them
- Clean up internal `_hardcoded_defaults` metadata before sending to API
- Exclude sensitive/credential fields from default value descriptions

## How
- After `_create_tool_node_signatures` builds all tool functions, count
name occurrences
- For duplicates: rename with suffix and append `[Pre-configured:
key=value]` to description using the node's `input_default` (excluding
linked fields that the LLM provides)
- Added defensive `isinstance(defaults, dict)` check for compatibility
with test mocks
- Suffix collision avoidance: skips candidates that collide with
existing tool names
- Long tool names truncated to fit within 64-character API limit
- 47 unit tests covering: basic dedup, description enrichment, unique
names unchanged, no metadata leaks, single tool, triple duplicates,
linked field exclusion, mixed unique/duplicate scenarios, sensitive
field exclusion, long name truncation, suffix collision, malformed
tools, missing description, empty list, 10-tool all-same-name, multiple
distinct groups, large default truncation, suffix collision cascade,
parameter preservation, boundary name lengths, nested dict/list
defaults, null defaults, customized name priority, required fields
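
The renaming step alone can be sketched like this, assuming a hypothetical `disambiguate_tool_names` helper that works on plain name strings. Description enrichment and credential filtering are left out.

```python
from collections import Counter

MAX_TOOL_NAME_LENGTH = 64  # API limit mentioned above


def disambiguate_tool_names(names: list) -> list:
    """Suffix duplicated names with _1, _2, ... while keeping every name unique."""
    counts = Counter(names)
    taken = {n for n in names if counts[n] == 1}  # unique names stay unchanged
    result = []
    for name in names:
        if counts[name] == 1:
            result.append(name)
            continue
        index = 1
        while True:
            suffix = f"_{index}"
            # Truncate the base so that name + suffix fits the length limit.
            candidate = name[: MAX_TOOL_NAME_LENGTH - len(suffix)] + suffix
            if candidate not in taken:
                break
            index += 1  # skip candidates that collide with existing names
        taken.add(candidate)
        result.append(candidate)
    return result
```

For example, `["web", "web", "web_1"]` becomes `["web_2", "web_3", "web_1"]` in this sketch, because the suffix `_1` is already taken by an existing tool.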

## Test plan
- [x] All 47 tests in `test_orchestrator_tool_dedup.py` pass
- [x] All 11 existing orchestrator unit tests pass (dict, dynamic
fields, responses API)
- [x] Pre-commit hooks pass (ruff, black, isort, pyright)
- [ ] Manual test: connect two same-type blocks to an orchestrator and
verify the LLM call succeeds

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 11:54:10 +00:00
Carson Kahn
17e78ca382 fix(docs): remove extraneous whitespace in README (#12587)
### Why / What / How

Remove extraneous whitespace in README.md:
- "Workflow Management" description: extra spaces between "block" and
"performs"
- "Agent Interaction" description: extra spaces between "user-friendly"
and "interface"

---------

Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
2026-03-31 08:38:45 +00:00
Ubbe
7ba05366ed feat(platform/copilot): live timer stats with persisted duration (#12583)
## Why

The copilot chat had no indication of how long the AI spent "thinking"
on a response. Users couldn't tell if a long wait was normal or
something was stuck. Additionally, the thinking duration was lost on
page reload since it was only tracked client-side.

## What

- **Live elapsed timer**: Shows elapsed time ("23s", "1m 5s") in the
ThinkingIndicator while the AI is processing (appears after 20s to avoid
spam on quick responses)
- **Frozen "Thought for Xm Ys"**: Displays the final thinking duration
in TurnStatsBar after the response completes
- **Persisted duration**: Saves `durationMs` on the last assistant
message in the DB so the timer survives page reloads

## How

**Backend:**
- Added `durationMs Int?` column to `ChatMessage` (Prisma migration)
- `mark_session_completed` in `stream_registry.py` computes wall-clock
duration from Redis session `created_at` and saves it via
`DatabaseManager.set_turn_duration()`
- Invalidates Redis session cache after writing so GET returns fresh
data

**Frontend:**
- `useElapsedTimer` hook tracks client-side elapsed seconds during
streaming
- `ThinkingIndicator` shows only the elapsed time (no phrases) after
20s, with `font-mono text-sm` styling
- `TurnStatsBar` displays "Thought for Xs" after completion, preferring
live `elapsedSeconds` and falling back to persisted `durationMs`
- `convertChatSessionToUiMessages` extracts `duration_ms` from
historical messages into a `Map<string, number>` threaded through to
`ChatMessagesContainer`

## Test plan

- [ ] Send a message in copilot — verify ThinkingIndicator shows elapsed
time after 20s
- [ ] After response completes — verify "Thought for Xs" appears below
the response
- [ ] Refresh the page — verify "Thought for Xs" still appears
(persisted from DB)
- [ ] Check older conversations — they should NOT show timer (no
historical data)
- [ ] Verify no Zod/SSE validation errors in browser console

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 16:46:31 +07:00
Zamil Majdy
ca74f980c1 fix(copilot): resolve host-scoped credentials for authenticated web requests (#12579)
## Summary
- Fixed `_resolve_discriminated_credentials()` in `helpers.py` to handle
URL/host-based credential discrimination (used by
`SendAuthenticatedWebRequestBlock`)
- Previously, only provider-based discrimination (with
`discriminator_mapping`) was handled; URL-based discrimination (with
`discriminator` set but no `discriminator_mapping`) was silently skipped
- This caused host-scoped credentials to either match the wrong host or
fail to match at all when the CoPilot called `run_block` for
authenticated HTTP requests
- Added 14 targeted tests covering discriminator resolution, host
matching, credential resolution integration, and RunBlockTool end-to-end
flows

## Root Cause
`_resolve_discriminated_credentials()` checked `if
field_info.discriminator and field_info.discriminator_mapping:` which
excluded host-scoped credentials where `discriminator="url"` but
`discriminator_mapping=None`. The URL from `input_data` was never added
to `discriminator_values`, so `_credential_is_for_host()` received empty
`discriminator_values` and returned `True` for **any** host-scoped
credential regardless of URL match.

## Fix
When `discriminator` is set without `discriminator_mapping`, the URL
value from `input_data` is now copied into `discriminator_values` on a
shallow copy of the field info (to avoid mutating the cached schema).
This enables `_credential_is_for_host()` to properly match the
credential's host against the target URL.
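
A simplified sketch of the fix. `CredentialsFieldInfo` here is a hypothetical dataclass stand-in, the function names are shortened, and host matching is reduced to an exact hostname comparison (wildcard hosts are omitted).

```python
import copy
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlparse


@dataclass
class CredentialsFieldInfo:
    """Hypothetical, simplified stand-in for the real field info model."""

    discriminator: Optional[str] = None
    discriminator_mapping: Optional[dict] = None
    discriminator_values: set = field(default_factory=set)


def resolve_discriminated(field_info: CredentialsFieldInfo, input_data: dict) -> CredentialsFieldInfo:
    """Copy the URL from input_data into discriminator_values for host-scoped fields."""
    if field_info.discriminator and not field_info.discriminator_mapping:
        value = input_data.get(field_info.discriminator)
        if value:
            resolved = copy.copy(field_info)  # shallow copy; cached schema untouched
            resolved.discriminator_values = {value}
            return resolved
    return field_info


def credential_is_for_host(credential_host: str, field_info: CredentialsFieldInfo) -> bool:
    # Empty values mean "no constraint", which is why the unfixed path matched any host.
    if not field_info.discriminator_values:
        return True
    return any(
        urlparse(url).hostname == credential_host
        for url in field_info.discriminator_values
    )
```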

## Test plan
- [x] `TestResolveDiscriminatedCredentials` - 4 tests verifying URL
discriminator populates values, handles missing URL, doesn't mutate
original, preserves provider/type
- [x] `TestFindMatchingHostScopedCredential` - 5 tests verifying
correct/wrong host matching, wildcard hosts, multiple credential
selection
- [x] `TestResolveBlockCredentials` - 3 integration tests verifying full
credential resolution with matching/wrong/missing hosts
- [x] `TestRunBlockToolAuthenticatedHttp` - 2 end-to-end tests verifying
SetupRequirementsResponse when creds missing and BlockDetailsResponse
when creds matched
- [x] All 28 existing + new tests pass
- [x] Ruff lint, isort, Black formatting, pyright typecheck all pass

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 08:12:33 +00:00
Zamil Majdy
68f5d2ad08 fix(blocks): raise AIConditionBlock errors instead of swallowing them (#12593)
## Why

Sentry alert
[AUTOGPT-SERVER-8C8](https://significant-gravitas.sentry.io/issues/7367978095/)
— `AIConditionBlock` failing in prod with:

```
Invalid 'max_output_tokens': integer below minimum value.
Expected a value >= 16, but got 10 instead.
```

Two problems:
1. `max_tokens=10` is below OpenAI's new minimum of 16
2. The `except Exception` handler was calling `logger.error()` which
triggered Sentry for what are known block errors, AND silently
defaulting to `result=False` — making the block appear to succeed with
an incorrect answer

## What

- Bump `max_tokens` from 10 to 16 (fixes the root cause)
- Remove the `try/except` entirely — the executor already handles
exceptions correctly (`ValueError` = known/no Sentry, everything else =
unknown/Sentry). The old handler was just swallowing errors and
producing wrong results.

## Test plan

- [x] Existing `AIConditionBlock` tests pass (block only expects
"true"/"false", 16 tokens is plenty)
- [x] No more silent `result=False` on errors
- [x] No more spurious Sentry alerts from `logger.error()`

Fixes AUTOGPT-SERVER-8C8

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 10:28:14 +00:00
Nicholas Tindle
2b3d730ca9 dx(skills): add /open-pr and /setup-repo skills (#12591)
### Why / What / How

**Why:** Agents working in worktrees lack guidance on two of the most
common workflows: properly opening PRs (using the repo template,
validating test coverage, triggering the review bot) and bootstrapping
the repo from scratch with a worktree-based layout. Without these
skills, agents either skip steps (no test plan, wrong template) or
require manual hand-holding for setup.

**What:** Adds two new Claude Code skills under `.claude/skills/`:
- `/open-pr` — A structured PR creation workflow that enforces the
canonical `.github/PULL_REQUEST_TEMPLATE.md`, validates test coverage
for existing and new behaviors, supports a configurable base branch, and
integrates the `/review` bot workflow for agents without local testing
capability. Cross-references `/pr-test`, `/pr-review`, and `/pr-address`
for the full PR lifecycle.
- `/setup-repo` — An interactive repo bootstrapping skill that creates a
worktree-based layout (main + reviews + N numbered work branches).
Handles .env file provisioning with graceful fallbacks (.env.default,
.env.example), copies branchlet config, installs dependencies, and is
fully idempotent (safe to re-run).

**How:** Markdown-based SKILL.md files following the existing skill
conventions. Both skills use proper bash patterns (seq-based loops
instead of brace expansion with variables, existence checks before
branch/worktree creation, error reporting on install failures).
`/open-pr` delegates to AskUserQuestion-style prompts for base branch
selection. `/setup-repo` uses AskUserQuestion for interactive branch
count and base branch selection.

### Changes 🏗️

- Added `.claude/skills/open-pr/SKILL.md` — PR creation workflow with:
  - Pre-flight checks (committed, pushed, formatted)
  - Test coverage validation (existing behavior not broken, new behavior
    covered)
  - Canonical PR template enforcement (read and fill verbatim, no
    pre-checked boxes)
  - Configurable base branch (defaults to dev)
  - Review bot workflow (`/review` comment + 30min wait) for agents
    without local testing
  - Related skills table linking `/pr-test`, `/pr-review`, `/pr-address`

- Added `.claude/skills/setup-repo/SKILL.md` — Repo bootstrap workflow
with:
  - Interactive setup (branch count: 4/8/16/custom, base branch selection)
  - Idempotent branch creation (skips existing branches with info message)
  - Idempotent worktree creation (skips existing directories)
  - .env provisioning with fallback chain (.env → .env.default →
    .env.example → warning)
  - Branchlet config propagation
  - Dependency installation with success/failure reporting per worktree

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Verified SKILL.md frontmatter follows existing skill conventions
  - [x] Verified trigger conditions match expected user intents
  - [x] Verified cross-references to existing skills are accurate
  - [x] Verified PR template section matches
    `.github/PULL_REQUEST_TEMPLATE.md`
  - [x] Verified bash snippets use correct patterns (seq, show-ref, quoted
    vars)
  - [x] Pre-commit hooks pass on all commits
  - [x] Addressed all CodeRabbit, Sentry, and Cursor review comments

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk documentation-only change: adds new markdown skills without
modifying runtime code. Main risk is workflow guidance drift (e.g.,
`.env`/worktree steps) if it diverges from actual repo conventions.
> 
> **Overview**
> Adds two new Claude Code skills under `.claude/skills/` to standardize
common developer workflows.
> 
> `/open-pr` documents a PR creation flow that enforces using
`.github/PULL_REQUEST_TEMPLATE.md` verbatim, calls out required test
coverage, and describes how to trigger/poll the `/review` bot when local
testing isn’t available.
> 
> `/setup-repo` documents an idempotent, interactive bootstrap for a
multi-worktree layout (creates `reviews` and `branch1..N`, provisions
`.env` files with `.env.default`/`.env.example` fallbacks, copies
`.branchlet.json`, and installs dependencies), complementing the
existing `/worktree` skill.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
80dbeb1596. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-03-27 10:22:03 +00:00
Zamil Majdy
f28628e34b fix(backend): preserve thinking blocks during transcript compaction (#12574)
## Why

AutoPilot users hit `invalid_request_error` ("thinking or
redacted_thinking blocks in the latest assistant message cannot be
modified") when sessions get long enough to trigger transcript
compaction. The Anthropic API requires thinking blocks in the last
assistant message to be byte-for-byte identical to the original response
— our compaction was flattening them to plain text, destroying the
cryptographic signatures.

Reported in Discord `#breakage` by John Ababseh with session
`31d3f08a-cb94-45eb-9fce-56b3f0287ef4`.

## What

- **`compact_transcript`** now splits the transcript into a compressible
prefix and a preserved tail (last assistant entry + trailing entries).
Only the prefix is compressed; the tail is re-appended verbatim,
preserving thinking blocks exactly.
- **`_flatten_assistant_content`** now silently drops `thinking` and
`redacted_thinking` blocks instead of creating `[__thinking__]`
placeholders — they carry no useful context for compression summaries.
- **`response_adapter`** explicitly handles `ThinkingBlock` (skip
gracefully instead of silently falling through the isinstance chain).
- **`_format_sdk_content_blocks`** now passes through raw dict blocks
(e.g. `redacted_thinking` that the SDK may not have a typed class for)
verbatim to the transcript.

## How

The key insight is the Anthropic API's asymmetric constraint:
- **Last assistant message**: thinking/redacted_thinking blocks must be
preserved byte-for-byte
- **Older assistant messages**: thinking blocks can be removed entirely

`compact_transcript` uses `_find_last_assistant_entry()` to split the
JSONL into two parts:
1. **Prefix** (everything before the last assistant): flattened and
compressed normally
2. **Tail** (last assistant + any trailing user message): preserved
verbatim and re-chained via `_rechain_tail()` to maintain the
`parentUuid` chain

This ensures the API always sees the original thinking blocks in the
last assistant message while still achieving meaningful compression on
older turns.
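
The split-and-rechain idea above can be sketched as follows. This is a minimal illustration, not the PR's code: the entry shape (`type`, `uuid`, `parentUuid`, `message`), the helper names without leading underscores, and the `summarize` callback are all assumptions made for the example.

```python
from typing import Any, Callable, Optional

# Assumed entry shape: {"type", "uuid", "parentUuid", "message"} (hypothetical).
Entry = dict[str, Any]


def find_last_assistant_index(entries: list[Entry]) -> Optional[int]:
    """Return the index of the last assistant entry, or None if there is none."""
    for i in range(len(entries) - 1, -1, -1):
        if entries[i].get("type") == "assistant":
            return i
    return None


def rechain_tail(tail: list[Entry], parent_uuid: Optional[str]) -> list[Entry]:
    """Re-point the first tail entry at the new parent; message content is untouched."""
    if not tail:
        return []
    return [{**tail[0], "parentUuid": parent_uuid}, *tail[1:]]


def compact(entries: list[Entry], summarize: Callable[[list[Entry]], Entry]) -> list[Entry]:
    """Compress everything before the last assistant entry and keep the rest verbatim."""
    idx = find_last_assistant_index(entries)
    if idx is None or idx == 0:
        # Sketch-only choice: with no compressible prefix, return the input unchanged.
        return list(entries)
    prefix, tail = entries[:idx], entries[idx:]
    summary = summarize(prefix)
    return [summary, *rechain_tail(tail, summary["uuid"])]
```

Because the tail entries keep their original `message` objects, any thinking blocks in the last assistant entry reach the API exactly as they were received.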

## Test plan
- [x] 25 new tests in `thinking_blocks_test.py` (TDD: written before
implementation)
  - [x] `_find_last_assistant_entry` splits correctly at last assistant,
    handles edges (no assistant, index 0, trailing user)
  - [x] `_rechain_tail` patches parentUuid chain, handles empty tail
  - [x] `_flatten_assistant_content` strips thinking/redacted_thinking
    blocks, handles mixed content
  - [x] `compact_transcript` preserves last assistant's thinking blocks
  - [x] `compact_transcript` strips thinking from older assistant messages
  - [x] Edge cases: trailing user message, single assistant, no thinking
    blocks
  - [x] `response_adapter` handles ThinkingBlock without crash
  - [x] `_format_sdk_content_blocks` preserves thinking block format and
    raw dict blocks
- [x] All existing copilot SDK tests pass
- [x] Pre-commit hooks (lint, format, typecheck) all pass
2026-03-27 06:36:52 +00:00
Zamil Majdy
b6a027fd2b fix(platform): fix prod Sentry errors and reduce on-call alert noise (#12565)
## Why

Multiple Sentry issues paging on-call in prod:

1. **AUTOGPT-SERVER-8BP**: `ConversionError: Failed to convert
anthropic/claude-sonnet-4-6 to <enum 'LlmModel'>` — the copilot passes
OpenRouter-style provider-prefixed model names
(`anthropic/claude-sonnet-4-6`) to blocks, but the `LlmModel` enum only
recognizes the bare model ID (`claude-sonnet-4-6`).

2. **BUILDER-7GF**: `Error invoking postEvent: Method not found` —
Sentry SDK internal error on Chrome Mobile Android, not a platform bug.

3. **XMLParserBlock**: `BlockUnknownError raised by XMLParserBlock with
message: Error in input xml syntax` — user sent bad XML but the block
raised `SyntaxError`, which gets wrapped as `BlockUnknownError`
(unexpected) instead of `BlockExecutionError` (expected).

4. **AUTOGPT-SERVER-8BS**: `Virus scanning failed for Screenshot
2026-03-26 091900.png: range() arg 3 must not be zero` — empty (0-byte)
file upload causes `range(0, 0, 0)` in the virus scanner chunking loop,
and the failure is logged at `error` level which pages on-call.

5. **AUTOGPT-SERVER-8BT**: `ValueError: <Token var=<ContextVar
name='current_context'>> was created in a different Context` —
OpenTelemetry `context.detach()` fails when the SDK streaming async
generator is garbage-collected in a different context than where it was
created (client disconnect mid-stream).

6. **AUTOGPT-SERVER-8BW**: `RuntimeError: Attempted to exit cancel scope
in a different task than it was entered in` — anyio's
`TaskGroup.__aexit__` detects cancel scope entered in one task but
exited in another when `GeneratorExit` interrupts the SDK cleanup during
client disconnect.

7. **Workspace UniqueViolationError**: `UniqueViolationError: Unique
constraint failed on (workspaceId, path)` — race condition during
concurrent file uploads, which `WorkspaceManager._persist_db_record`
retry logic already handles, but Sentry still captures the exception at
the raise site.

8. **Library UniqueViolationError**: `UniqueViolationError` on
`LibraryAgent (userId, agentGraphId, agentGraphVersion)` — race
conditions in `add_graph_to_library` and `create_library_agent` caused
crashes or silent data loss.

9. **Graph version collision**: `UniqueViolationError` on `AgentGraph
(id, version)` — copilot re-saving an agent at an existing version
collides with the primary key.

## What

### Backend: `LlmModel._missing_()` for provider-prefixed model names
- Adds `_missing_` classmethod to `LlmModel` enum that strips the
provider prefix (e.g., `anthropic/`) when direct lookup fails
- Self-contained in the enum — no changes to the generic type conversion
system
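
A minimal sketch of the `_missing_` hook, assuming a toy enum with a single member. The class name `Model` is a hypothetical stand-in for `LlmModel`; the real enum has many more members.

```python
from enum import Enum


class Model(str, Enum):
    """Toy stand-in for the real model enum (hypothetical, one member only)."""

    CLAUDE_SONNET_4_6 = "claude-sonnet-4-6"

    @classmethod
    def _missing_(cls, value):
        # Called by Enum only after the direct value lookup has failed.
        if isinstance(value, str) and "/" in value:
            bare = value.split("/", 1)[1]  # drop the provider prefix
            for member in cls:
                if member.value == bare:
                    return member
        return None  # Enum then raises ValueError as usual
```

Iterating over the members instead of calling `cls(bare)` avoids re-entering `_missing_` for an unknown bare name.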

### Frontend: Filter Sentry SDK noise
- Adds `postEvent: Method not found` to `ignoreErrors` — a known Sentry
SDK issue on certain mobile browsers

### Backend: XMLParserBlock — raise ValueError instead of SyntaxError
- Changed `_validate_tokens()` to raise `ValueError` instead of
`SyntaxError`
- Changed the `except SyntaxError` handler in `run()` to re-raise as
`ValueError`
- This ensures `Block.execute()` wraps XML parsing failures as
`BlockExecutionError` (expected/user-caused) instead of
`BlockUnknownError` (unexpected/alerts Sentry)
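
The effect of switching exception types can be illustrated with a simplified wrapper. The two error classes and the `execute` function below are hypothetical stand-ins that only mirror the mapping described above; they are not the platform's actual `Block.execute()`.

```python
class BlockExecutionError(Exception):
    """Expected, user-caused failure (stand-in class for this sketch)."""


class BlockUnknownError(Exception):
    """Unexpected failure that should alert (stand-in class for this sketch)."""


def execute(run):
    """Run a block body and classify failures the way the description outlines."""
    try:
        return run()
    except ValueError as exc:
        raise BlockExecutionError(str(exc)) from exc
    except Exception as exc:
        # SyntaxError lands here because it is not a subclass of ValueError.
        raise BlockUnknownError(str(exc)) from exc


def parse_raising_value_error():
    raise ValueError("Error in input xml syntax")


def parse_raising_syntax_error():
    raise SyntaxError("Error in input xml syntax")
```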

### Backend: Virus scanner — handle empty files + reduce alert noise
- Added early return for empty (0-byte) files in `scan_file()` to avoid
`range() arg 3 must not be zero` when `chunk_size` is 0
- Added `max(1, len(content))` guard on `chunk_size` as defense-in-depth
- Downgraded `scan_content_safe` failure log from `error` to `warning`
so single-file scan failures don't page on-call via Sentry
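
A small sketch of the failure mode and both guards. How the real scanner derives `chunk_size` is not shown in this description, so the `min(max_chunk, len(content))` formula and the function name are assumptions.

```python
def split_chunks(content: bytes, max_chunk: int = 4) -> list[bytes]:
    """Split content into chunks; an empty input yields no chunks at all."""
    if not content:
        return []  # early return: nothing to scan
    # Defense in depth: the step passed to range() can never be zero.
    chunk_size = max(1, min(max_chunk, len(content)))
    return [content[i:i + chunk_size] for i in range(0, len(content), chunk_size)]


def unguarded_step_fails() -> bool:
    """Show that a zero step is what triggers the reported ValueError."""
    try:
        range(0, 0, 0)
    except ValueError:
        return True
    return False
```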

### Backend: Suppress SDK client cleanup errors on SSE disconnect
- Replaced `async with ClaudeSDKClient` in `_run_stream_attempt` with
manual `__aenter__`/`__aexit__` wrapped in new
`_safe_close_sdk_client()` helper
- `_safe_close_sdk_client()` catches `ValueError` (OTEL context token
mismatch) and `RuntimeError` (anyio cancel scope in wrong task) during
`__aexit__` and logs at `debug` level — these are expected when SSE
client disconnects mid-stream
- Added `_is_sdk_disconnect_error()` helper for defense-in-depth at the
outer `except BaseException` handler in `stream_chat_completion_sdk`
- Both Sentry errors (8BT and 8BW) are now suppressed without affecting
normal cleanup flow
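
The manual enter/exit pattern might look roughly like this. `FakeClient`, `safe_close`, and `run_stream` are hypothetical names for the sketch, and only the two exception types named above are handled here.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class FakeClient:
    """Stand-in async context manager whose cleanup can be made to fail."""

    def __init__(self, error=None):
        self.error = error
        self.closed = False

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.closed = True
        if self.error is not None:
            raise self.error
        return False


async def safe_close(client) -> None:
    """Exit the client, logging disconnect-time cleanup errors at debug level."""
    try:
        await client.__aexit__(None, None, None)
    except (ValueError, RuntimeError) as exc:
        logger.debug("Ignoring SDK cleanup error: %s", exc)


async def run_stream(client) -> str:
    """Enter manually so that cleanup goes through safe_close instead of `async with`."""
    await client.__aenter__()
    try:
        return "streamed"
    finally:
        await safe_close(client)
```

With `async with`, an exception raised by `__aexit__` would propagate to the caller; the manual form gives one place to decide which cleanup errors are expected.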

### Backend: Filter workspace UniqueViolationError from Sentry alerts
- Added `before_send` filter in `_before_send()` to drop
`UniqueViolationError` events where the message contains `workspaceId`
and `path`
- The error is already handled by `WorkspaceManager._persist_db_record`
retry logic — it must propagate for the retry logic to work, so the fix
is at the Sentry filter level rather than catching/suppressing at source
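
A hedged sketch of such a filter. Sentry's `before_send` hook receives `(event, hint)` and drops the event when it returns `None`; the local `UniqueViolationError` class and the name-based type check are assumptions made to keep the example self-contained.

```python
class UniqueViolationError(Exception):
    """Local stand-in for the database client's exception type."""


def before_send(event: dict, hint: dict):
    """Drop workspace unique-constraint events; pass everything else through."""
    exc_info = hint.get("exc_info")
    if exc_info:
        exc_type, exc_value, _tb = exc_info
        message = str(exc_value)
        if (
            exc_type.__name__ == "UniqueViolationError"
            and "workspaceId" in message
            and "path" in message
        ):
            return None  # returning None tells Sentry to discard the event
    return event
```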

### Backend: Library agent race condition fixes
- **`add_graph_to_library`**: Replaced check-then-create pattern with
create-then-catch-`UniqueViolationError`-then-update. On collision,
updates the existing row (restoring soft-deleted/archived agents)
instead of crashing.
- **`create_library_agent`**: Replaced `create` with `upsert` on the
`(userId, agentGraphId, agentGraphVersion)` composite unique constraint,
so concurrent adds restore soft-deleted entries instead of throwing.
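
The create-then-catch-then-update pattern, sketched against an in-memory table. `FakeTable`, the field names `isDeleted` and `isArchived`, and the function name are hypothetical; the real code talks to the database client.

```python
class UniqueViolationError(Exception):
    """Local stand-in for the database client's exception type."""


class FakeTable:
    """In-memory table with a unique key, for illustration only."""

    def __init__(self):
        self.rows = {}

    def create(self, key, data):
        if key in self.rows:
            raise UniqueViolationError(str(key))
        self.rows[key] = dict(data)
        return self.rows[key]

    def update(self, key, data):
        self.rows[key].update(data)
        return self.rows[key]


def add_to_library(table: FakeTable, key: tuple) -> dict:
    """Try to insert first; on a unique collision, restore the existing row."""
    fresh = {"isDeleted": False, "isArchived": False}
    try:
        return table.create(key, fresh)
    except UniqueViolationError:
        return table.update(key, fresh)
```

Unlike check-then-create, this leaves no window between the existence check and the insert in which a concurrent request can slip in.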

### Backend: Graph version auto-increment on collision
- `__create_graph` now checks if the `(id, version)` already exists
before `create_many`, and auto-increments the version to `max_existing +
1` to avoid `UniqueViolationError` when the copilot re-saves an agent.
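
The version bump reduces to a small rule, sketched here with a hypothetical helper that takes the set of versions already stored for a graph id.

```python
def next_free_version(requested: int, existing: set[int]) -> int:
    """Keep the requested version unless it collides, then use max_existing + 1."""
    if requested in existing:
        return max(existing) + 1
    return requested
```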

### Backend: Workspace `get_or_create_workspace` upsert
- Changed from find-then-create to `upsert` to atomically handle
concurrent workspace creation.

## Test plan

- [x] `LlmModel("anthropic/claude-sonnet-4-6")` resolves correctly
- [x] `LlmModel("claude-sonnet-4-6")` still works (no regression)
- [x] `LlmModel("invalid/nonexistent-model")` still raises `ValueError`
- [x] XMLParserBlock: unclosed tags, extra closing tags, empty XML all
raise `ValueError`
- [x] XMLParserBlock: `SyntaxError` from gravitasml library is caught
and re-raised as `ValueError`
- [x] Virus scanner: empty file (0 bytes) returns clean without hitting
ClamAV
- [x] Virus scanner: single-byte file scans normally (regression test)
- [x] Virus scanner: `scan_content_safe` logs at WARNING not ERROR on
failure
- [x] SDK disconnect: `_is_sdk_disconnect_error` correctly identifies
cancel scope and context var errors
- [x] SDK disconnect: `_is_sdk_disconnect_error` rejects unrelated
errors
- [x] SDK disconnect: `_safe_close_sdk_client` suppresses ValueError,
RuntimeError, and unexpected exceptions
- [x] SDK disconnect: `_safe_close_sdk_client` calls `__aexit__` on
clean exit
- [x] Library: `add_graph_to_library` creates new agent on first call
- [x] Library: `add_graph_to_library` updates existing on
UniqueViolationError
- [x] Library: `create_library_agent` uses upsert to handle concurrent
adds
- [x] All existing workspace overwrite tests still pass
- [x] All tests passing (existing + 4 XML syntax + 3 virus scanner + 10
SDK disconnect + library tests)
2026-03-27 06:09:42 +00:00
Zamil Majdy
fb74fcf4a4 feat(platform): add shared admin user search + rate-limit modal on spending page (#12577)
## Why
Admin rate-limit management required manually entering user UUIDs. The
spending page already had user search but it wasn't reusable.

## What
- Extract `AdminUserSearch` as shared component from spending page
search
- Add rate-limit modal (usage bars + reset) to spending page user rows
- Add email/name/UUID search to standalone rate-limits page
- Backend: add email query parameter to rate-limit endpoint

## How
- `AdminUserSearch` in `admin/components/` — reused by both spending and
rate-limits
- `RateLimitModal` opens from spending page "Rate Limits" button
- Backend `_resolve_user_id()` accepts email or user_id
- Smart routing: exact email → direct lookup, UUID → direct, partial →
fuzzy search
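
The routing rule can be sketched as a small classifier. The function name, the returned labels, and the deliberately loose email pattern are assumptions for illustration.

```python
import re
import uuid

# Loose pattern: something@something.something, no whitespace (sketch only).
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def classify_query(query: str) -> str:
    """Return "uuid", "email", or "partial" to pick the lookup strategy."""
    q = query.strip()
    try:
        uuid.UUID(q)
        return "uuid"  # direct lookup by user id
    except ValueError:
        pass
    if EMAIL_RE.fullmatch(q):
        return "email"  # direct lookup by exact email
    return "partial"  # fall back to fuzzy search
```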

### Follow-up
- `AdminUserSearch` is a plain text input with no typeahead/fuzzy
suggestions — consider adding autocomplete dropdown with debounced
search

### Checklist 📋
- [x] Shared search component extracted and reused
- [x] Tests pass
- [x] Type-checked
2026-03-27 05:53:04 +00:00
Zamil Majdy
28b26dde94 feat(platform): spend credits to reset CoPilot daily rate limit (#12526)
## Summary
- When users hit their daily CoPilot token limit, they can now spend
credits ($2.00 default) to reset it and continue working
- Adds a dialog prompt when rate limit error occurs, offering the
credit-based reset option
- Adds a "Reset daily limit" button in the usage limits panel when the
daily limit is reached
- Backend: new `POST /api/chat/usage/reset` endpoint,
`reset_daily_usage()` Redis helper, `rate_limit_reset_cost` config
- Frontend: `RateLimitResetDialog` component, updated
`UsagePanelContent` with reset button, `useCopilotStream` exposes rate
limit state
- **NEW: Resetting the daily limit also reduces weekly usage by the
daily limit amount**, effectively granting 1 extra day's worth of weekly
capacity (e.g., daily_limit=10000 → weekly usage reduced by 10000,
clamped to 0)
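
The counter arithmetic described in the last bullet, as a sketch. The function name and the tuple return are hypothetical; the real counters live in Redis.

```python
def reset_daily(weekly_used: int, daily_limit: int) -> tuple[int, int]:
    """Return (new_daily_used, new_weekly_used) after a paid daily reset."""
    new_weekly = max(0, weekly_used - daily_limit)  # clamp at zero
    return 0, new_weekly
```

With `daily_limit=10000`, a weekly usage of 25000 drops to 15000, and a weekly usage of 4000 clamps to 0. A daily limit of 0 leaves the weekly counter unchanged.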

## Context
Users have been confused about having credits available but being
blocked by rate limits (REQ-63, REQ-61). This provides a short-term
solution allowing users to spend credits to bypass their daily limit.

The weekly usage reduction ensures that a paid daily reset doesn't just
move the bottleneck to the weekly limit — users get genuine additional
capacity for the day they paid to unlock.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Hit daily rate limit → dialog appears with reset option
  - [x] Click "Reset for $2.00" → credits charged, daily counter reset,
    dialog closes
  - [x] Usage panel shows "Reset daily limit" button when at 100% daily
    usage
  - [x] When `rate_limit_reset_cost=0` (disabled), rate limit shows toast
    instead of dialog
  - [x] Insufficient credits → error toast shown
  - [x] Verify existing rate limit tests pass
  - [x] Unit tests: weekly counter reduced by daily_limit on reset
  - [x] Unit tests: weekly counter clamped to 0 when usage < daily_limit
  - [x] Unit tests: no weekly reduction when daily_token_limit=0

#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
(new config fields `rate_limit_reset_cost` and `max_daily_resets` have
defaults in code)
- [x] `docker-compose.yml` is updated or already compatible with my
changes (no Docker changes needed)
2026-03-26 13:52:08 +00:00
Zamil Majdy
d677978c90 feat(platform): admin rate limit check and reset with LD-configurable global limits (#12566)
## Why
Admins need visibility into per-user CoPilot rate limit usage and the
ability to reset a user's counters when needed (e.g., after a false
positive or for debugging). Additionally, the global rate limits were
hardcoded deploy-time constants with no way to adjust without
redeploying.

## What
- Admin endpoints to **check** a user's current rate limit usage and
**reset** their daily/weekly counters to zero
- Global rate limits are now **LaunchDarkly-configurable** via
`copilot-daily-token-limit` and `copilot-weekly-token-limit` flags,
falling back to existing `ChatConfig` values
- Frontend admin page at `/admin/rate-limits` with user lookup, usage
visualization, and reset capability
- Chat routes updated to source global limits from LD flags

## How
- **Backend**: Added `reset_user_usage()` to `rate_limit.py` that
deletes Redis usage keys. New admin routes in
`rate_limit_admin_routes.py` (GET `/api/copilot/admin/rate_limit` and
POST `/api/copilot/admin/rate_limit/reset`). Added
`COPILOT_DAILY_TOKEN_LIMIT` and `COPILOT_WEEKLY_TOKEN_LIMIT` to the
`Flag` enum. Chat routes use `_get_global_rate_limits()` helper that
checks LD first.
- **Frontend**: New `/admin/rate-limits` page with `RateLimitManager`
(user lookup) and `RateLimitDisplay` (usage bars + reset button). Added
`getUserRateLimit` and `resetUserRateLimit` to `BackendAPI` client.
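
A sketch of the flag-first lookup with a config fallback. A plain dict stands in for the LaunchDarkly client here, and the function signature is an assumption; only the two flag keys come from the description above.

```python
def get_global_rate_limits(
    flags: dict[str, int], config_daily: int, config_weekly: int
) -> tuple[int, int]:
    """Prefer flag values when present; otherwise use the configured defaults."""
    daily = flags.get("copilot-daily-token-limit", config_daily)
    weekly = flags.get("copilot-weekly-token-limit", config_weekly)
    return daily, weekly
```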

## Test plan
- [x] Backend: 4 tests covering get, reset, redis failure, and
admin-only access
- [ ] Manual: Look up a user's rate limits in the admin UI
- [ ] Manual: Reset a user's usage counters
- [ ] Manual: Verify LD flag overrides are respected for global limits
2026-03-26 08:29:40 +00:00
Otto
a347c274b7 fix(frontend): replace unrealistic CoPilot suggestion prompt (#12564)
Replaces "Sort my bookmarks into categories" with "Summarize my unread
emails" in the Organize suggestion category. CoPilot has no access to
browser bookmarks or local files, so the original prompt was misleading.

---
Co-authored-by: Toran Bruce Richards (@Torantulino)
<Torantulino@users.noreply.github.com>
2026-03-26 08:10:28 +00:00
Zamil Majdy
f79d8f0449 fix(backend): move placeholder_values exclusively to AgentDropdownInputBlock (#12551)
## Why

`AgentInputBlock` has a `placeholder_values` field whose
`generate_schema()` converts it into a JSON schema `enum`. The frontend
renders any field with `enum` as a dropdown/select. This means
AI-generated agents that populate `placeholder_values` with example
values (e.g. URLs) on regular `AgentInputBlock` nodes end up with
dropdowns instead of free-text inputs — users can't type custom values.

Only `AgentDropdownInputBlock` should produce dropdown behavior.

## What

- Removed `placeholder_values` field from `AgentInputBlock.Input`
- Moved the `enum` generation logic to
`AgentDropdownInputBlock.Input.generate_schema()`
- Cleaned up test data for non-dropdown input blocks
- Updated copilot agent generation guide to stop suggesting
`placeholder_values` for `AgentInputBlock`

## How

The base `AgentInputBlock.Input.generate_schema()` no longer converts
`placeholder_values` → `enum`. Only `AgentDropdownInputBlock.Input`
defines `placeholder_values` and overrides `generate_schema()` to
produce the `enum`.

**Backward compatibility**: Existing agents with `placeholder_values` on
`AgentInputBlock` nodes load fine — `model_construct()` silently ignores
extra fields not defined on the model. Those inputs will now render as
text fields (desired behavior).
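
The schema split can be pictured with plain dicts. This is a simplified sketch with hypothetical function names, not the Pydantic-based `generate_schema()` the blocks actually use.

```python
from typing import Any


def base_input_schema(node_input: dict[str, Any]) -> dict[str, Any]:
    """Plain input: never emits `enum`, so the frontend renders free text."""
    return {"type": "string", "title": node_input.get("name", "")}


def dropdown_input_schema(node_input: dict[str, Any]) -> dict[str, Any]:
    """Dropdown input: the only place where placeholder_values becomes `enum`."""
    schema = base_input_schema(node_input)
    values = node_input.get("placeholder_values") or []
    if values:
        schema["enum"] = list(values)
    return schema
```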

## Test plan
- [x] `poetry run pytest backend/blocks/test/test_block.py -xvs` — all
block tests pass
- [x] `poetry run format && poetry run lint` — clean
- [ ] Import an agent JSON with `placeholder_values` on an
`AgentInputBlock` — verify it loads and renders as text input
- [ ] Create an agent with `AgentDropdownInputBlock` — verify dropdown
still works
2026-03-26 08:09:38 +00:00
Otto
1bc48c55d5 feat(copilot): add copy button to user prompt messages [SECRT-2172] (#12571)
Requested by @itsababseh

Users can copy assistant output messages but not their own prompts. This
adds the same copy button to user messages — appears on hover,
right-aligned, using the existing `CopyButton` component.

## Why

Users write long prompts and need to copy them to reuse or share.
Currently requires manual text selection. ChatGPT shows copy on hover
for user messages — this matches that pattern.

## What

- Added `CopyButton` to user prompt messages in
`ChatMessagesContainer.tsx`
- Shows on hover (`group-hover:opacity-100`), positioned right-aligned
below the message
- Reuses the existing `CopyButton` and `MessageActions` components —
only the wiring is new

## How

One file changed, 11 lines added:
1. Import `MessageActions` and `CopyButton`
2. Render them after user `MessageContent`, gated on `message.role ===
"user"` and having text parts

---
Co-authored-by: itsababseh (@itsababseh)
<36419647+itsababseh@users.noreply.github.com>
2026-03-26 08:02:28 +00:00
Abhimanyu Yadav
9d0a31c0f1 fix(frontend/builder): fix array field item layout and add FormRenderer stories (#12532)
Fix broken UI when selecting nodes with array fields (list[str],
list[Enum]) in the builder. The select/input inside array items was
squeezed by the Remove button instead of taking full width.
<img width="2559" height="1077" alt="Screenshot 2026-03-26 at 10 23 34 AM"
src="https://github.com/user-attachments/assets/2ffc28a2-8d6c-428c-897c-021b1575723c"
/>

### Changes 🏗️

- **ArrayFieldItemTemplate**: Changed layout from horizontal flex-row to
vertical flex-col so the input takes full width and Remove button sits
below aligned left, with tighter spacing between them
- **Storybook config**: Added `renderers/**` glob to
`.storybook/main.ts` so renderer stories are discoverable
- **FormRenderer stories**: Added comprehensive Storybook stories
covering all backend field types (string, int, float, bool, enum,
date/time, list[str], list[int], list[Enum], list[bool], nested objects,
Optional, anyOf unions, oneOf discriminated unions, multi-select, list
of objects, and a kitchen sink). Includes exact Twitter GetUserBlock
schema for realistic oneOf + multi-select testing.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Verified array field items render with full-width input and Remove
    button below in Storybook
  - [x] Verified list[Enum] select dropdown takes full width
  - [x] Verified list[str] text input takes full width
  - [x] Verified all FormRenderer stories render without errors in
    Storybook
  - [x] Verified multi-select and oneOf discriminated union stories match
    real backend schemas
2026-03-26 06:15:30 +00:00
Abhimanyu Yadav
9b086e39c6 fix(frontend): hide placeholder text when copilot voice recording is active (#12534)
### Why / What / How

**Why:** When voice recording is active in the CoPilot chat input, the
recording UI (waveform + timer) overlays on top of the placeholder/hint
text, creating a visually broken appearance. Reported by a user via
SECRT-2163.

**What:** Hide the textarea placeholder text while voice recording is
active so it doesn't bleed through the `RecordingIndicator` overlay.

**How:** When `isRecording` is true, the placeholder is set to an empty
string. The existing `RecordingIndicator` overlay (waveform animation +
elapsed time) then displays cleanly without the hint text showing
underneath.

### Changes 🏗️

- Clear the `PromptInputTextarea` placeholder to `""` when voice
recording is active, preventing it from rendering behind the
`RecordingIndicator` overlay

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Open CoPilot chat at /copilot
  - [x] Click the microphone button or press Space to start voice
    recording
  - [x] Verify the placeholder text ("Type your message..." / "What else
    can I help with?") is hidden during recording
  - [x] Verify the RecordingIndicator (waveform + timer) displays cleanly
    without overlapping text
  - [x] Stop recording and verify placeholder text reappears
  - [x] Verify "Transcribing..." placeholder shows during transcription
2026-03-26 05:41:09 +00:00
Zamil Majdy
5867e4d613 Merge branch 'master' of github.com:Significant-Gravitas/AutoGPT into dev 2026-03-26 07:30:56 +07:00
Zamil Majdy
85f0d8353a fix(platform): fix prod Sentry errors and reduce on-call alert noise (#12560)
## Summary
Hotfix targeting master for production Sentry errors that are triggering
on-call pages. Fixes actual bugs and expands Sentry filters to suppress
user-caused errors that are not platform issues.

### Bug Fixes
- **Workspace race condition** (`get_or_create_workspace`): Replaced
Prisma's non-atomic `upsert` with find-then-create pattern. Prisma's
upsert translates to SELECT + INSERT (not PostgreSQL's native `INSERT
... ON CONFLICT`), causing `UniqueViolationError` when concurrent
requests hit for the same user (e.g. copilot + file upload
simultaneously).
- **ChatSidebar crash**: Added null-safe `?.` for `sessions` which can
be `undefined` during error/loading states, preventing `TypeError:
Cannot read properties of undefined (reading 'length')`.
- **UsageLimits crash**: Added null-safe `?.` for
`usage.daily`/`usage.weekly` which can be `undefined` when the API
returns partial data, preventing `TypeError: Cannot read properties of
undefined (reading 'limit')`.

### Sentry Filter Improvements
Expanded backend `_before_send` to stop user-caused errors from reaching
Sentry and triggering on-call alerts:
- **Consolidated auth keywords** into a shared `_USER_AUTH_KEYWORDS`
list used by both exception-based and log-based filters (previously
duplicated).
- **Added missing auth keywords**: `"unauthorized"`, `"bad
credentials"`, `"insufficient authentication scopes"` — these were
leaking through.
- **Added user integration HTTP error filter**: `"http 401 error"`,
`"http 403 error"`, `"http 404 error"` — catches `BlockUnknownError` and
`HTTPClientError` from user integrations (expired GitHub tokens, wrong
Airtable IDs, etc.).
- **Fixed log-based event gap**: User auth errors logged via
`logger.error()` (not raised as exceptions) were bypassing the
`exc_info` filter. Now the same `_USER_AUTH_KEYWORDS` list is checked
against log messages too.
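
A sketch of a shared keyword check that both the exception path and the log path could call. The function name is hypothetical, and the lists contain only the keywords named in this description, so the real lists are likely longer.

```python
USER_AUTH_KEYWORDS = [
    "unauthorized",
    "bad credentials",
    "insufficient authentication scopes",
]
USER_HTTP_ERROR_KEYWORDS = ["http 401 error", "http 403 error", "http 404 error"]


def is_user_caused(message: str) -> bool:
    """Case-insensitive substring match against the shared keyword lists."""
    text = message.lower()
    return any(k in text for k in USER_AUTH_KEYWORDS + USER_HTTP_ERROR_KEYWORDS)
```

Keeping one list and one predicate means an exception message and a `logger.error()` message are judged by the same rule.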

## On-Call Alerts Addressed

### Fixed (actual bugs)
| Alert | Issue | Root Cause |
|-------|-------|------------|
| `Unique constraint failed on the fields: (userId)` |
[AUTOGPT-SERVER-8BM](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-8BM)
| Prisma upsert race condition |
| `Unique constraint failed on the fields: (userId)` |
[AUTOGPT-SERVER-8BK](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-8BK)
| Same — via `/api/workspace/files/upload` |
| `Unique constraint failed on the fields: (userId)` |
[AUTOGPT-SERVER-8BN](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-8BN)
| Same — via `tools/call run_block` |
| `Upload failed (500): Unique constraint failed` |
[BUILDER-7GA](https://significant-gravitas.sentry.io/issues/BUILDER-7GA)
| Frontend surface of same workspace bug |
| `Cannot read properties of undefined (reading 'length')` |
[BUILDER-7GD](https://significant-gravitas.sentry.io/issues/BUILDER-7GD)
| `sessions` undefined in ChatSidebar |
| `Cannot read properties of undefined (reading 'limit')` |
[BUILDER-7GB](https://significant-gravitas.sentry.io/issues/BUILDER-7GB)
| `usage.daily` undefined in UsageLimits |

### Filtered (user-caused, not platform bugs)
| Alert | Issue | Why it's not a platform bug |
|-------|-------|-----------------------------|
| `Anthropic API error: invalid x-api-key` |
[AUTOGPT-SERVER-8B6](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-8B6),
8B7, 8B8 | User provided invalid Anthropic API key |
| `AI condition evaluation failed: Incorrect API key` |
[AUTOGPT-SERVER-83Y](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-83Y)
| User's OpenAI key is wrong (4.5K events, 1 user) |
| `GithubListIssuesBlock: HTTP 401 Bad credentials` |
[AUTOGPT-SERVER-8BF](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-8BF)
| User's GitHub token expired |
| `HTTPClientError: HTTP 401 Unauthorized` |
[AUTOGPT-SERVER-8BG](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-8BG)
| Same — credential check endpoint |
| `GithubReadIssueBlock: HTTP 401 Bad credentials` |
[AUTOGPT-SERVER-8BH](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-8BH)
| Same — different block |
| `AirtableCreateBaseBlock: HTTP 404 MODEL_ID_NOT_FOUND` |
[AUTOGPT-SERVER-8BC](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-8BC)
| User's Airtable model ID is wrong |

### Not addressed in this PR
| Alert | Issue | Reason |
|-------|-------|--------|
| `Unexpected token '<', "<html><hea"...` |
[BUILDER-7GC](https://significant-gravitas.sentry.io/issues/BUILDER-7GC)
| Transient — backend briefly returned HTML error page |
| `undefined is not an object (activeResponse.state)` |
[BUILDER-71J](https://significant-gravitas.sentry.io/issues/BUILDER-71J)
| Bug in Vercel AI SDK `ai@6.0.59`, already resolved |
| `Last Tool Output is needed` |
[AUTOGPT-SERVER-72T](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-72T)
| User graph misconfiguration (1 user, 21 events) |
| `Cannot set property ethereum` |
[BUILDER-7G6](https://significant-gravitas.sentry.io/issues/BUILDER-7G6)
| Browser wallet extension conflict |
| `File already exists at path` |
[BUILDER-7FS](https://significant-gravitas.sentry.io/issues/BUILDER-7FS)
| Expected 409 conflict |

## Test plan
- [ ] Verify workspace creation works for new users
- [ ] Verify concurrent workspace access (e.g. copilot + file upload)
doesn't error
- [ ] Verify copilot ChatSidebar and UsageLimits load correctly when API
returns partial/error data
- [ ] Verify user auth errors (invalid API keys, expired tokens) no
longer appear in Sentry after deployment
2026-03-25 23:25:32 +07:00
An Vy Le
f871717f68 fix(backend): add sink input validation to AgentValidator (#12514)
## Summary

- Added `validate_sink_input_existence` method to `AgentValidator` to
ensure all sink names in links and input defaults reference valid input
schema fields in the corresponding block
- Added comprehensive tests covering valid/invalid sink names, nested
inputs, and default key handling
- Updated `ReadDiscordMessagesBlock` description to clarify it reads new
messages and triggers on new posts
- Removed leftover test function file

## Test plan

- [ ] Run `pytest` on `validator_test.py` to verify all sink input
validation cases pass
- [ ] Verify existing agent validation flow is unaffected
- [ ] Confirm `ReadDiscordMessagesBlock` description update is accurate

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
2026-03-25 16:08:17 +00:00
Ubbe
f08e52dc86 fix(frontend): marketplace card description 3 lines + fallback color (#12557)
## Summary
- Increase the marketplace StoreCard description from 2 lines to 3 lines
for better readability
- Change fallback background colour for missing agent images from
`bg-violet-50` to `rgb(216, 208, 255)`

<img width="933" height="458" alt="Screenshot 2026-03-25 at 20 25 41"
src="https://github.com/user-attachments/assets/ea433741-1397-4585-b64c-c7c3b8109584"
/>
<img width="350" height="457" alt="Screenshot 2026-03-25 at 20 25 55"
src="https://github.com/user-attachments/assets/e2029c09-518a-4404-aa95-e202b4064d0b"
/>


## Test plan
- [x] Verified `pnpm format`, `pnpm lint`, `pnpm types` all pass
- [x] Visually confirmed description shows 3 lines on marketplace cards
- [x] Visually confirmed fallback color renders correctly for cards
without images

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:58:45 +08:00
Ubbe
500b345b3b fix(frontend): auto-reconnect copilot chat after device sleep/wake (#12519)
## Summary

- Adds `visibilitychange`-based sleep/wake detection to the copilot chat
— when the page becomes visible after >30s hidden, automatically refetch
the session and either resume an active stream or hydrate completed
messages
- Blocks chat input during re-sync (`isSyncing` state) to prevent users
from accidentally sending a message that overwrites the agent's
completed work
- Replaces `PulseLoader` with a spinning `CircleNotch` icon on sidebar
session names for background streaming sessions (closer to ChatGPT's UX)

## How it works

1. When the page goes hidden, we record a timestamp
2. When the page becomes visible, we check elapsed time
3. If >30s elapsed (indicating sleep or long background), we refetch the
session from the API
4. If backend still has `active_stream=true` → remove stale assistant
message and resume SSE
5. If backend is done → the refetch triggers React Query invalidation
which hydrates the completed messages
6. Chat input stays disabled (`isSyncing=true`) until re-sync completes

## Test plan

- [ ] Open copilot, start a long-running agent task
- [ ] Close laptop lid / lock screen for >30 seconds
- [ ] Wake device — verify chat shows the agent's completed response (or
resumes streaming)
- [ ] Verify chat input is temporarily disabled during re-sync, then
re-enables
- [ ] Verify sidebar shows spinning icon (not pulse loader) for
background sessions
- [ ] Verify no duplicate messages appear after wake
- [ ] Verify normal streaming (no sleep) still works as expected

Resolves: [SECRT-2159](https://linear.app/autogpt/issue/SECRT-2159)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:15:33 +08:00
Ubbe
995dd1b5f3 feat(platform): replace suggestion pills with themed prompt categories (#12515)
## Summary

<img width="700" height="575" alt="Screenshot 2026-03-23 at 21 40 07"
src="https://github.com/user-attachments/assets/f6138c63-dd5e-4bde-a2e4-7434d0d3ec72"
/>

Re-applies #12452 which was reverted as collateral in #12485 (invite
system revert).

Replaces the flat list of suggestion pills in the CoPilot empty session
with themed prompt categories (Learn, Create, Automate, Organize), each
shown as a popover with contextual prompts.

- **Backend**: Adds `suggested_prompts` as a themed `dict[str,
list[str]]` keyed by category. Updates Tally extraction LLM prompt to
generate prompts per theme, and the `/suggested-prompts` API to return
grouped themes. Legacy `list[str]` rows are preserved under a
`"General"` key for backward compatibility.
- **Frontend**: Replaces inline pill buttons with a `SuggestionThemes`
popover component. Each theme button (with icon) opens a dropdown of 5
relevant prompts. Falls back to hardcoded defaults when the API has no
personalized prompts. Normalizes partial API responses by padding
missing themes with defaults. Legacy `"General"` prompts are distributed
round-robin across themes.

### Changes 🏗️

- `backend/data/understanding.py`: `suggested_prompts` field added as
`dict[str, list[str]]`; legacy list rows preserved under `"General"` key
via `_json_to_themed_prompts`
- `backend/data/tally.py`: LLM prompt updated to generate themed
prompts; validation now per-theme with blank-string rejection
- `backend/api/features/chat/routes.py`: New `SuggestedTheme` model;
endpoint returns `themes[]`
- `frontend/copilot/components/EmptySession/EmptySession.tsx`: Uses
generated API hooks for suggested prompts
- `frontend/copilot/components/EmptySession/helpers.ts`:
`DEFAULT_THEMES` replaces `DEFAULT_QUICK_ACTIONS`; `getSuggestionThemes`
normalizes partial API responses
- `frontend/copilot/components/EmptySession/components/SuggestionThemes/`:
New popover component with theme icons and loading states
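
The backward-compatibility rule for `suggested_prompts` can be sketched roughly as below. This is an illustration only, not the code in `understanding.py`: the function name `to_themed_prompts` and the exact cleaning rules are assumptions. Only the `dict[str, list[str]]` shape, the `"General"` key, and the blank-string rejection come from the description above.

```python
from typing import Any

LEGACY_KEY = "General"  # key named in the PR description


def to_themed_prompts(raw: Any) -> dict[str, list[str]]:
    """Normalize stored prompts into a theme -> prompts mapping (hypothetical helper)."""
    if isinstance(raw, list):
        # Legacy rows stored a flat list; keep them under the "General" key.
        prompts = [p for p in raw if isinstance(p, str) and p.strip()]
        return {LEGACY_KEY: prompts} if prompts else {}
    if isinstance(raw, dict):
        themed: dict[str, list[str]] = {}
        for theme, prompts in raw.items():
            if not isinstance(theme, str) or not isinstance(prompts, list):
                continue  # skip malformed entries instead of failing
            cleaned = [p for p in prompts if isinstance(p, str) and p.strip()]
            if cleaned:
                themed[theme] = cleaned
        return themed
    return {}
```

Returning an empty mapping for unusable data lets the caller fall back to its default themes.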

### Checklist 📋

- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Verify themed suggestion buttons render on CoPilot empty session
  - [x] Click each theme button and confirm popover opens with prompts
  - [x] Click a prompt and confirm it sends the message
  - [x] Verify fallback to default themes when API returns no custom
    prompts
  - [x] Verify legacy users' personalized prompts are preserved and
    visible

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 15:32:49 +08:00
Zamil Majdy
336114f217 fix(backend): prevent graph execution stuck + steer SDK away from bash_exec (#12548)
## Summary

Two backend fixes for CoPilot stability:

1. **Steer model away from bash_exec for SDK tool-result files** — When
the SDK returns tool results as file paths, the copilot model was
attempting to use `bash_exec` to read them instead of treating the
content directly. Added system prompt guidance to prevent this.

2. **Guard against missing 'name' in execution input_data** —
`GraphExecution.from_db()` assumed all INPUT/OUTPUT block node
executions have a `name` field in `input_data`. This crashes with
`KeyError: 'name'` when non-standard blocks (e.g., OrchestratorBlock)
produce node executions without this field. Added `"name" in
exec.input_data` guards.

## Why

- The bash_exec issue causes copilot to fail when processing SDK tool
outputs
- The KeyError crashes the `update_graph_execution_stats` endpoint,
causing graph executions to appear stuck (retries 35+ times, never
completes)

## How

- Added system prompt instruction to treat tool result file contents
directly
- Added `"name" in exec.input_data` guard in both input extraction (line
340) and output extraction (line 365) in `execution.py`
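
The guard pattern described above might look like the following sketch. The function name `collect_named_values` and the `"value"` key are assumptions made for illustration. The real `GraphExecution.from_db()` works on database models rather than plain dictionaries.

```python
from typing import Any


def collect_named_values(input_data_rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Map each named input/output to its value, skipping rows without a name."""
    named: dict[str, Any] = {}
    for input_data in input_data_rows:
        # Guard: non-standard blocks may omit "name"; plain indexing would
        # raise KeyError and abort the whole stats update.
        if "name" in input_data:
            named[input_data["name"]] = input_data.get("value")  # "value" key is assumed
    return named
```

Rows without a `name` are simply ignored, so one unusual node execution no longer breaks the whole extraction.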

### Changes
- `backend/copilot/sdk/service.py` — system prompt guidance
- `backend/data/execution.py` — KeyError guard for missing `name` field

### Checklist 📋
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan

#### Test plan:
- [x] OrchestratorBlock graph execution no longer gets stuck
- [x] Standard Agent Input/Output blocks still work correctly
- [x] Copilot SDK tool results are processed without bash_exec
2026-03-25 13:58:24 +07:00
Nicholas Tindle
866563ad25 feat(platform): admin preview marketplace submissions before approving (#12536)
## Why

Admins reviewing marketplace submissions currently approve blindly —
they can see raw metadata in the admin table but cannot see what the
listing actually looks like (images, video, branding, layout). This
risks approving inappropriate content. With full-scale production
approaching, this is critical.

Additionally, when a creator un-publishes an agent, users who already
added it to their library lose access — breaking their workflows.
Product decided on a "you added it, you keep it" model.

## What

- **Admin preview page** at `/admin/marketplace/preview/[id]` — renders
the listing exactly as it would appear on the public marketplace
- **Add to Library** for admins to test-run pending agents before
approving
- **Library membership grants graph access** — if you added an agent to
your library, you keep access even if it's un-published or rejected
- **Preview button** on every submission row in the admin marketplace
table
- **Cross-reference comments** on original functions to prevent
SECRT-2162-style regressions

## How

### Backend

**Admin preview (`store/db.py`):**
- `get_store_agent_details_as_admin()` queries `StoreListingVersion`
directly, bypassing the APPROVED-only `StoreAgent` DB view
- Validates `CreatorProfile` FK integrity, reads all fields including
`recommendedScheduleCron`

**Admin add-to-library (`library/_add_to_library.py`):**
- Extracted shared logic into `resolve_graph_for_library()` +
`add_graph_to_library()` — eliminates duplication between public and
admin paths
- Admin path uses `get_graph_as_admin()` to bypass marketplace status
checks
- Handles concurrent double-click race via `UniqueViolationError` catch

**Library membership grants graph access (`data/graph.py`):**
- `get_graph()` now falls back to `LibraryAgent` lookup if ownership and
marketplace checks fail
- Only for authenticated users with non-deleted, non-archived library
records
- `validate_graph_execution_permissions()` updated to match — library
membership grants execution access too
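
The fallback order described above (ownership, then marketplace approval, then library membership) can be sketched as follows. All names here (`LibraryEntry`, `can_access_graph`, the boolean flags) are hypothetical. The real `get_graph()` queries the database, and this simplified version ignores graph versions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LibraryEntry:
    """Simplified stand-in for a library record (hypothetical shape)."""
    user_id: str
    graph_id: str
    is_deleted: bool = False
    is_archived: bool = False


def can_access_graph(
    user_id: Optional[str],
    graph_id: str,
    owner_id: str,
    is_approved: bool,
    library: list[LibraryEntry],
) -> bool:
    if user_id is not None and user_id == owner_id:
        return True  # owners always have access
    if is_approved:
        return True  # approved marketplace listings are public
    if user_id is None:
        return False  # anonymous users get no library fallback
    # Fallback: "you added it, you keep it", unless the record is deleted or archived.
    return any(
        entry.user_id == user_id
        and entry.graph_id == graph_id
        and not entry.is_deleted
        and not entry.is_archived
        for entry in library
    )
```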

**New endpoints (`store_admin_routes.py`):**
- `GET /admin/submissions/{id}/preview` — returns `StoreAgentDetails`
- `POST /admin/submissions/{id}/add-to-library` — creates `LibraryAgent`
via admin path

### Frontend

- Preview page reuses `AgentInfo` + `AgentImages` with admin banner
- Shows instructions, recommended schedule, and slug
- "Add to My Library" button wired to admin endpoint
- Preview button added to `ExpandableRow` (header + version history)
- Categories column uncommented in version history table

### Testing (19 tests)

**Graph access control (9 in `graph_test.py`):** Owner access,
marketplace access, library member access (unpublished),
deleted/archived/anonymous denied, null FK denied, efficiency checks

**Admin bypass (5 in `store_admin_routes_test.py`):** Preview uses
StoreListingVersion not StoreAgent, admin path uses get_graph_as_admin,
regular path uses get_graph, library member can view in builder

**Security (3):** Non-admin 403 on preview, non-admin 403 on
add-to-library, nonexistent 404

**SECRT-2162 regression (2):** Admin access to pending agent, export
with sub-graphs

### Checklist
- [x] Changes clearly listed
- [x] Test plan made
- [x] 19 backend tests pass
- [x] Frontend lints and types clean

## Test plan
- [x] Navigate to `/admin/marketplace`, click Preview on a PENDING
submission
- [x] Verify images, video, description, categories, instructions,
schedule render correctly
- [x] Click "Add to My Library", verify agent appears in library and
opens in builder
- [x] Verify non-admin users get 403
- [x] Verify un-publishing doesn't break access for users who already
added it

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **High Risk**
> Adds new admin-only endpoints that bypass marketplace
approval/ownership checks and changes `get_graph`/execution
authorization to grant access via library membership, which impacts
security-sensitive access control paths.
> 
> **Overview**
> Adds **admin preview + review workflow support** for marketplace
submissions: new admin routes to `GET /admin/submissions/{id}/preview`
(querying `StoreListingVersion` directly) and `POST
/admin/submissions/{id}/add-to-library` (admin bypass to pull pending
graphs into an admin’s library).
> 
> Refactors library add-from-store logic into shared helpers
(`resolve_graph_for_library`, `add_graph_to_library`) and introduces an
admin variant `add_store_agent_to_library_as_admin`, including restore
of archived/deleted entries and dedup/race handling.
> 
> Changes core graph access rules: `get_graph()` now falls back to
**library membership** (non-deleted/non-archived, version-specific) when
ownership and marketplace approval don’t apply, and
`validate_graph_execution_permissions()` is updated accordingly.
Frontend adds a preview link and a dedicated admin preview page with
“Add to My Library”; tests expand significantly to lock in the new
bypass and access-control behavior.
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 04:26:36 +00:00
Zamil Majdy
e79928a815 fix(backend): prevent logging sensitive data in SafeJson fallback (#12547)
### Why / What / How

**Why:** GitHub's code scanning detected a HIGH severity security
vulnerability in `/autogpt_platform/backend/backend/util/json.py:172`.
The error handler in `sanitize_json()` was logging sensitive data
(potentially including secrets, API keys, credentials) as clear text
when serialization fails.

**What:** This PR removes the logging of actual data content from the
error handler while preserving useful debugging metadata (error type,
error message, and data type).

**How:** Removed the `"Data preview: %s"` format parameter and the
corresponding `truncate(str(data), 100)` argument from the
logger.error() call. The error handler now logs only safe metadata that
helps debugging without exposing sensitive information.

### Changes 🏗️

- **Security Fix**: Modified `sanitize_json()` function in
`backend/util/json.py`
  - Removed logging of data content (`truncate(str(data), 100)`) from the
    error handler
  - Retained logging of error type (`type(e).__name__`)
  - Retained logging of truncated error message (`truncate(str(e), 200)`)
  - Retained logging of data type (`type(data).__name__`)
  - Error handler still provides useful debugging information without
    exposing secrets
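
A minimal sketch of the metadata-only logging described above is shown here. The helper names `describe_failure` and `log_failure`, the message format, and this stand-in `truncate` are assumptions. The actual handler lives inside `sanitize_json()` and may be shaped differently.

```python
import logging

logger = logging.getLogger(__name__)


def truncate(text: str, limit: int) -> str:
    """Stand-in for the project's truncate helper; its exact behaviour is assumed."""
    return text if len(text) <= limit else text[:limit] + "..."


def describe_failure(error: Exception, data: object) -> str:
    """Build a log message from metadata only; the data's content is never read."""
    return "error_type=%s error=%s data_type=%s" % (
        type(error).__name__,
        truncate(str(error), 200),
        type(data).__name__,
    )


def log_failure(error: Exception, data: object) -> None:
    # No "Data preview" field: the payload may contain secrets or credentials.
    logger.error("SafeJson fallback failed: %s", describe_failure(error, data))
```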

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Verified the code passes type checking (`poetry run pyright
    backend/util/json.py`)
  - [x] Verified the code passes linting (`poetry run ruff check
    backend/util/json.py`)
  - [x] Verified all pre-commit hooks pass
  - [x] Reviewed the diff to ensure only the sensitive data logging was
    removed
  - [x] Confirmed that useful debugging information (error type, error
    message, data type) is still logged

#### For configuration changes:
- N/A - No configuration changes required
2026-03-25 04:21:21 +00:00
Zamil Majdy
1771ed3bef dx(skills): codify PR workflow rules in skill docs and CLAUDE.md (#12531)
## Summary

- **pr-address skill**: Add explicit rule against empty commits for CI
re-triggers, and strengthen push-immediately guidance with rationale
- **Platform CLAUDE.md**: Add "split PRs by concern" guideline under
Creating Pull Requests

### Changes
- Updated `.claude/skills/pr-address/SKILL.md`
- Updated `autogpt_platform/CLAUDE.md`

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan

#### Test plan:
- [x] Documentation-only changes — no functional tests needed
- [x] Verified markdown renders correctly
2026-03-25 10:19:30 +07:00
Zamil Majdy
550fa5a319 fix(backend): register AutoPilot sessions with stream registry for SSE updates (#12500)
### Changes 🏗️
- When the AutoPilot block executes a copilot session via
`collect_copilot_response`, it calls `stream_chat_completion_sdk`
directly, bypassing the copilot executor and stream registry. This means
the frontend sees no `active_stream` on the session and cannot connect
via SSE — users see a frozen chat with no updates until the turn fully
completes.
- Fix: register a `stream_registry` session in
`collect_copilot_response` and publish each chunk to Redis as events are
consumed. This allows the frontend to detect `active_stream=true` and
connect via the SSE reconnect endpoint for live streaming updates during
AutoPilot execution.
- Error handling is graceful — if stream registry fails, AutoPilot still
works normally, just without real-time frontend updates.

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Trigger an AutoPilot block execution that creates a new chat
    session
  - [x] Verify the new session appears in the sidebar with streaming
    indicator
  - [x] Click on the session while AutoPilot is still executing — verify
    SSE connects and messages stream in real-time
  - [x] Verify that after AutoPilot completes, the session shows as
    complete (no active_stream)
  - [x] Test reconnection: disconnect and reconnect while AutoPilot is
    running — verify stream resumes (found and fixed GeneratorExit bug that
    caused stuck sessions)
  - [x] E2E: 10 stream events published to Redis (StreamStart,
    3×ToolInput, 3×ToolOutput, TextStart, TextEnd, StreamFinish)
  - [x] E2E: Redis xadd latency 0.2–3.4ms per chunk
  - [x] E2E: Chat sessions registered in Redis (confirmed via redis-cli)
2026-03-25 01:08:49 +00:00
Zamil Majdy
8528dffbf2 fix(backend): allow /tmp as valid path in E2B sandbox file tools (#12501)
## Summary
- Allow `/tmp` as a valid writable directory in E2B sandbox file tools
(`write_file`, `read_file`, `edit_file`, `glob`, `grep`)
- The E2B sandbox is already fully isolated, so restricting writes to
only `/home/user` was unnecessarily limiting — scripts and tools
commonly use `/tmp` for temporary files
- Extract `is_within_allowed_dirs()` helper in `context.py` to
centralize the allowed-directory check for both path resolution and
symlink escape detection

## Changes
- `context.py`: Add `E2B_ALLOWED_DIRS` tuple and `E2B_ALLOWED_DIRS_STR`,
introduce `is_within_allowed_dirs()`, update `resolve_sandbox_path()` to
use it
- `e2b_file_tools.py`: Update `_check_sandbox_symlink_escape()` to use
`is_within_allowed_dirs()`, update tool descriptions
- Tests: Add coverage for `/tmp` paths in both `context_test.py` and
`e2b_file_tools_test.py`
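
One way a centralized allowed-directory check could work is sketched below. The function name is taken from the description, but the body is an assumption, not the code in `context.py`. It is purely lexical, so symlink escapes still need the separate check mentioned above.

```python
import posixpath

# Directories named in the PR description; the real constant is E2B_ALLOWED_DIRS.
ALLOWED_DIRS = ("/home/user", "/tmp")


def is_within_allowed_dirs(path: str) -> bool:
    """Return True if the normalized path is an allowed dir or lies inside one."""
    normalized = posixpath.normpath(path)  # collapses "a/../b" and "." segments
    return any(
        normalized == base or normalized.startswith(base + "/")
        for base in ALLOWED_DIRS
    )
```

Comparing against `base + "/"` rather than `base` alone keeps sibling paths such as `/tmpfoo` from matching the `/tmp` prefix.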

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] All 59 existing + new tests pass (`poetry run pytest
    backend/copilot/context_test.py
    backend/copilot/sdk/e2b_file_tools_test.py`)
  - [x] `poetry run format` and `poetry run lint` pass clean
  - [x] Verify `/tmp` write works in live E2B sandbox
  - [x] E2E: Write file to /tmp/test.py in E2B sandbox via copilot
  - [x] E2E: Execute script from /tmp — output "Hello, World!"
  - [x] E2E: E2B sandbox lifecycle (create, use, pause) works correctly
2026-03-25 00:52:58 +00:00
Zamil Majdy
8fbf6a4b09 Merge branch 'master' of github.com:Significant-Gravitas/AutoGPT into dev 2026-03-25 06:55:47 +07:00
Zamil Majdy
239148596c fix(backend): filter SDK default credentials from credentials API responses (#12544)
## Summary

- Filter SDK-provisioned default credentials from credentials API list
endpoints
- Reuse `CredentialsMetaResponse` model from internal router in external
API (removes duplicate `CredentialSummary`)
- Add `is_sdk_default()` helper for identifying platform-provisioned
credentials
- Add `provider_matches()` to credential store for consistent provider
filtering
- Add tests for credential filtering behavior

### Changes
- `backend/data/model.py` — add `is_sdk_default()` helper
- `backend/api/features/integrations/router.py` — filter SDK defaults
from list endpoints
- `backend/api/external/v1/integrations.py` — reuse
`CredentialsMetaResponse`, filter SDK defaults
- `backend/integrations/credentials_store.py` — add `provider_matches()`
- `backend/sdk/registry.py` — update credential registration
- `backend/api/features/integrations/router_test.py` — new tests
- `backend/api/features/integrations/conftest.py` — test fixtures

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan

#### Test plan:
- [x] Unit tests for credential filtering (`router_test.py`)
- [x] Verify SDK default credentials excluded from API responses
- [x] Verify user-created credentials still returned normally
2026-03-25 06:54:54 +07:00
Zamil Majdy
a880d73481 feat(platform): dry-run execution mode with LLM block simulation (#12483)
## Why

Agent generation and building needs a way to test-run agents without
requiring real credentials or producing side effects. Currently, every
execution hits real APIs, consumes credits, and requires valid
credentials — making it impossible to debug or validate agent graphs
during the build phase without real consequences.

## Summary

Adds a `dry_run` execution mode to the copilot's `run_block` and
`run_agent` tools. When `dry_run=True`, every block execution is
simulated by an LLM instead of calling the real service — no real API
calls, no credentials consumed, no side effects.

Inspired by
[Significant-Gravitas/agent-simulator](https://github.com/Significant-Gravitas/agent-simulator).

### How it works

- **`backend/executor/simulator.py`** (new): `simulate_block()` builds a
prompt from the block's name, description, input/output schemas, and
actual input values, then calls `gpt-4o-mini` via the existing
OpenRouter client with JSON mode. Retries up to 5 times on JSON parse
failures. Missing output pins are filled with `None` (or `""` for the
`error` pin). Long inputs (>20k chars) are truncated before sending to
the LLM.
- **`ExecutionContext`**: Added `dry_run: bool = False` field; threaded
through `add_graph_execution()` so graph-level dry runs propagate to
every block execution.
- **`execute_block()` helper**: When `dry_run=True`, the function
short-circuits before any credential injection or credit checks, calls
`simulate_block()`, and returns a `[DRY RUN]`-prefixed
`BlockOutputResponse`.
- **`RunBlockTool`**: New `dry_run` boolean parameter.
- **`RunAgentTool`**: New `dry_run` boolean parameter; passes
`ExecutionContext(dry_run=True)` to graph execution.
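
Two of the simulator's post-processing rules described above, input truncation and filling missing output pins, can be sketched like this. The helper names and the plain-slice truncation are assumptions. The actual `simulate_block()` may format truncated inputs differently.

```python
from typing import Any

MAX_INPUT_CHARS = 20_000  # threshold from the PR description


def truncate_input(value: str, limit: int = MAX_INPUT_CHARS) -> str:
    """Cut long inputs before they are placed in the LLM prompt (assumed: plain slice)."""
    return value if len(value) <= limit else value[:limit]


def fill_missing_pins(simulated: dict[str, Any], output_pins: list[str]) -> dict[str, Any]:
    """Ensure every declared output pin has a value in the simulated result."""
    return {
        # Missing pins default to None, except "error", which defaults to "".
        pin: simulated.get(pin, "" if pin == "error" else None)
        for pin in output_pins
    }
```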

### Tests

11 tests in `backend/copilot/tools/test_dry_run.py`:
- Correct output tuples from LLM response
- JSON retry logic (3 total calls when first 2 fail)
- All-retries-exhausted yields `SIMULATOR ERROR`
- Missing output pins filled with `None`/`""`
- No-client case
- Input truncation at 20k chars
- `execute_block(dry_run=True)` skips real `block.execute()`
- Response format: `[DRY RUN]` message, `success=True`
- `dry_run=False` unchanged (real path)
- `RunBlockTool` parameter presence
- `dry_run` kwarg forwarding

## Test plan
- [x] Run `pytest backend/copilot/tools/test_dry_run.py -v` — all 11
pass
- [x] Call `run_block` with `dry_run=true` in copilot; verify no real
API calls occur and output contains `[DRY RUN]`
- [x] Call `run_agent` with `dry_run=true`; verify execution is created
with `dry_run=True` in context
- [x] E2E: Simulate button (flask icon) present in builder alongside
play button
- [x] E2E: Simulated run labeled with "(Simulated)" suffix and badge in
Library
- [x] E2E: No credits consumed during dry-run
2026-03-24 22:36:47 +00:00
Zamil Majdy
80bfd64ffa Merge branch 'master' of github.com:Significant-Gravitas/AutoGPT into dev 2026-03-24 21:18:11 +07:00
Zamil Majdy
0076ad2a1a hotfix(blocks): bump stagehand ^0.5.1 → ^3.4.0 to fix yanked litellm (#12539)
## Summary

**Critical CI fix** — litellm was compromised in a supply chain attack
(versions 1.82.7/1.82.8 contained infostealer malware) and PyPI
subsequently yanked many litellm versions including the 1.7x range that
stagehand 0.5.x depended on. This breaks `poetry lock` in CI for all
PRs.

- Bump `stagehand` from `^0.5.1` to `^3.4.0` — Stagehand v3 is a
Stainless-generated HTTP API client that **no longer depends on
litellm**, completely removing litellm from our dependency tree
- Migrate stagehand blocks to use `AsyncStagehand` + session-based API
(`sessions.start`, `session.navigate/act/observe/extract`)
- Net reduction of ~430 lines in `poetry.lock` from dropping litellm and
its transitive dependencies

## Why

All CI pipelines are blocked because `poetry lock` fails to resolve
yanked litellm versions that stagehand 0.5.x required.

## Test plan

- [x] CI passes (poetry lock resolves, backend tests green)
- [ ] Verify stagehand blocks still function with the new session-based
API
2026-03-24 21:17:19 +07:00
Zamil Majdy
edb3d322f0 feat(backend/copilot): parallel block execution via infrastructure-level pre-launch (#12472)
## Summary

- Implements **infrastructure-level parallel tool execution** for
CoPilot: all tools called in a single LLM turn now execute concurrently
with zero changes to individual tool implementations or LLM prompts.
- Adds `pre_launch_tool_call()` to `tool_adapter.py`: when an
`AssistantMessage` with `ToolUseBlock`s arrives, all tools are
immediately fired as `asyncio.Task`s before the SDK dispatches MCP
handlers. Each MCP handler then awaits its pre-launched task instead of
executing fresh.
- Adds a `_tool_task_queues` `ContextVar` (initialized per-session in
`set_execution_context()`) so concurrent sessions never share task
queues.
- DRY refactor: extracts `prepare_block_for_execution()`,
`check_hitl_review()`, and `BlockPreparation` dataclass into
`helpers.py` so the execution pipeline is reusable.
- 10 unit tests for the parallel pre-launch infrastructure (queue
enqueue/dequeue, MCP prefix stripping, fallback path, `CancelledError`
handling, multi-same-tool FIFO ordering).

## Root cause

The Claude Agent SDK CLI sends MCP tool calls as sequential
request-response pairs: it waits for each `control_response` before
issuing the next `mcp_message`. Even though Python dispatches handlers
with `start_soon`, the CLI never issues call B until call A's response
is sent — blocks always ran sequentially. The pre-launch pattern fixes
this at the infrastructure level by starting all tasks before the SDK
even dispatches the first handler.
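
The pre-launch idea can be illustrated with plain `asyncio`. Every task is started as soon as the tool calls are known, and each sequential handler later awaits the task that is already running. The class and function names below are hypothetical and far simpler than `tool_adapter.py`. There is no argument matching, cancellation, or per-session `ContextVar` here.

```python
import asyncio
from collections import defaultdict, deque
from typing import Any, Coroutine


class PreLaunchQueue:
    """FIFO queues of already-started tasks, keyed by tool name (illustrative)."""

    def __init__(self) -> None:
        self._tasks: dict[str, deque] = defaultdict(deque)

    def pre_launch(self, tool_name: str, coro: Coroutine[Any, Any, Any]) -> None:
        # create_task schedules the coroutine right away, before any handler runs.
        self._tasks[tool_name].append(asyncio.create_task(coro))

    async def await_result(self, tool_name: str) -> Any:
        # Handlers arrive one at a time; each picks up the oldest pre-launched task.
        return await self._tasks[tool_name].popleft()


async def fake_tool(label: str, delay: float, started: list[str]) -> str:
    started.append(label)
    await asyncio.sleep(delay)
    return label.upper()


async def demo() -> tuple[list[str], list[str]]:
    queue = PreLaunchQueue()
    started: list[str] = []
    queue.pre_launch("run_block", fake_tool("a", 0.02, started))
    queue.pre_launch("run_block", fake_tool("b", 0.01, started))
    await asyncio.sleep(0)  # yield once so both tasks begin running
    started_before_await = list(started)
    results = [await queue.await_result("run_block") for _ in range(2)]
    return started_before_await, results
```

Running `asyncio.run(demo())` shows both tools starting before the first handler awaits anything, while results still come back in call order.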

## Test plan

- [x] `poetry run pytest backend/copilot/sdk/tool_adapter_test.py` — 27
tests pass (10 new parallel infra tests)
- [x] `poetry run pytest backend/copilot/tools/helpers_test.py` — 20
tests pass
- [x] `poetry run pytest backend/copilot/tools/run_block_test.py
backend/copilot/tools/test_run_block_details.py` — all pass
- [x] Manually test in CoPilot: ask the agent to run two blocks
simultaneously — verify both start executing before either completes
- [x] E2E: Both GetCurrentTimeBlock and CalculatorBlock executed
concurrently (time=09:35:42, 42×7=294)
- [x] E2E: Pre-launch mechanism active — two run_block events at same
timestamp (3ms apart)
- [x] E2E: Arg-mismatch fallback tested — system correctly cancels and
falls back to direct execution
2026-03-24 20:27:46 +07:00
Zamil Majdy
9381057079 refactor(platform): rename SmartDecisionMakerBlock to OrchestratorBlock (#12511)
## Summary
- Renames `SmartDecisionMakerBlock` to `OrchestratorBlock` across the
entire codebase
- The block supports iteration/agent mode and general tool
orchestration, so "Smart Decision Maker" no longer accurately describes
its capabilities
- Block UUID (`3b191d9f-356f-482d-8238-ba04b6d18381`) remains unchanged
— fully backward compatible with existing graphs

## Changes
- Renamed block class, constants, file names, test files, docs, and
frontend enum
- Updated copilot agent generator (helpers, validator, fixer) references
- Updated agent generation guide documentation
- No functional changes — pure rename refactor

### For code changes
- [x] I have clearly listed my changes in the PR description
- [x] I have made corresponding changes to the documentation
- [x] My changes do not generate new warnings or errors
- [x] New and existing unit tests pass locally with my changes

## Test plan
- [x] All pre-commit hooks pass (typecheck, lint, format)
- [x] Existing graphs with this block continue to load and execute (same
UUID)
- [x] Agent mode / iteration mode works as before
- [x] Copilot agent generator correctly references the renamed block
2026-03-24 19:16:42 +07:00
Otto
f21a36ca37 fix(backend): downgrade user-caused LLM API errors to warning level (#12516)
Requested by @majdyz

Follow-up to #12513. Anthropic/OpenAI 401, 403, and 429 errors are
user-caused (bad API keys, forbidden, rate limits) and should not hit
Sentry as exceptions.

### Changes

**Changes in `blocks/llm.py`:**
- Anthropic `APIError` handler (line ~950): check `status_code` — use
`logger.warning()` for 401/403/429, keep `logger.error()` for server
errors
- Generic `Exception` handler in LLM block `run()` (line ~1467): same
pattern — `logger.warning()` for user-caused status codes,
`logger.exception()` for everything else
- Extracted `USER_ERROR_STATUS_CODES = (401, 403, 429)` module-level
constant
- Added `break` to short-circuit retry loop for user-caused errors
- Removed double-logging from inner Anthropic handler
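
The status-code split described above can be sketched as two small helpers. `USER_ERROR_STATUS_CODES` is the constant named in the list. The helper names `log_level_for` and `should_retry` are assumptions for illustration and do not appear in `blocks/llm.py`.

```python
import logging
from typing import Optional

USER_ERROR_STATUS_CODES = (401, 403, 429)  # constant named in the PR


def log_level_for(status_code: Optional[int]) -> int:
    """User-caused errors log at WARNING; everything else stays at ERROR."""
    if status_code in USER_ERROR_STATUS_CODES:
        return logging.WARNING
    return logging.ERROR


def should_retry(status_code: Optional[int]) -> bool:
    """User-caused errors exit the retry loop early, mirroring the added break."""
    return status_code not in USER_ERROR_STATUS_CODES
```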

**Changes in `blocks/test/test_llm.py`:**
- Added 8 regression tests covering 401/403/429 fast-exit and 500 retry
behavior

**Sentry issues addressed:**
- AUTOGPT-SERVER-8B6, 8B7, 8B8 — `[LLM-Block] Anthropic API error: Error
code: 401 - invalid x-api-key`
- Any OpenAI 401/403/429 errors hitting the generic exception handler

Part of SECRT-2166

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan

#### Test plan:
- [x] Unit tests for 401/403/429 Anthropic errors → warning log, no
retry
- [x] Unit tests for 500 Anthropic errors → error log, retry
- [x] Unit tests for 401/403/429 OpenAI errors → warning log, no retry
- [x] Unit tests for 500 OpenAI errors → error log, retry
- [x] Verified USER_ERROR_STATUS_CODES constant is used consistently
- [x] Verified no double-logging in Anthropic handler path

---------

Co-authored-by: Zamil Majdy (@majdyz) <zamil.majdy@agpt.co>
2026-03-24 10:59:04 +00:00
Zamil Majdy
ee5382a064 feat(copilot): add tool/block capability filtering to AutoPilotBlock (#12482)
## Summary

- Adds `CopilotPermissions` model (`copilot/permissions.py`) — a
capability filter that restricts which tools and blocks the
AutoPilot/Copilot may use during a single execution
- Exposes 4 new `advanced=True` fields on `AutoPilotBlock`: `tools`,
`tools_exclude`, `blocks`, `blocks_exclude`
- Threads permissions through the full execution path: `AutoPilotBlock`
→ `collect_copilot_response` → `stream_chat_completion_sdk` →
`run_block`
- Implements recursion inheritance via contextvar: sub-agent executions
can only be *more* restrictive than their parent

## Design

**Tool filtering** (`tools` + `tools_exclude`):
- `tools_exclude=True` (default): `tools` is a **blacklist** — listed
tools denied, all others allowed. Empty list = allow all.
- `tools_exclude=False`: `tools` is a **whitelist** — only listed tools
are allowed.
- Users specify short names (`run_block`, `web_fetch`, `Read`, `Task`,
…) — mapped to full SDK format internally.
- Validated eagerly at block-run time with a clear error listing valid
names.

**Block filtering** (`blocks` + `blocks_exclude`):
- Same semantics as tool filtering, applied inside `run_block` via
contextvar.
- Each entry can be a full UUID, an 8-char partial UUID (first segment),
or a case-insensitive block name.
- Validated against the live block registry; invalid identifiers surface
a helpful error before the session is created.

**Recursion inheritance**:
- `_inherited_permissions` contextvar stores the parent execution's
permissions.
- On each `AutoPilotBlock.run()`, the child's permissions are merged
with the parent via `merged_with_parent()` — effective allowed sets are
intersected (tools) and the parent chain is kept for block checks.
- Sub-agents can never expand what the parent allowed.
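
The filtering semantics above can be sketched in a few lines. The function names are hypothetical and this is not the `CopilotPermissions` model. It only shows how an exclude flag switches between blacklist and whitelist behaviour, how intersecting with the parent keeps a child from gaining tools, and how a block identifier might be matched.

```python
def effective_tools(all_tools: set[str], tools: list[str], tools_exclude: bool = True) -> set[str]:
    """Blacklist when tools_exclude is True (empty list allows all), else whitelist."""
    listed = set(tools)
    return all_tools - listed if tools_exclude else all_tools & listed


def merge_with_parent(parent_allowed: set[str], child_allowed: set[str]) -> set[str]:
    """A child may only use tools that its parent also allows."""
    return parent_allowed & child_allowed


def block_matches(identifier: str, block_id: str, block_name: str) -> bool:
    """Match a full UUID, its first 8-char segment, or a case-insensitive name."""
    ident = identifier.strip().lower()
    block_id = block_id.lower()
    return ident in (block_id, block_id.split("-")[0], block_name.lower())
```

For example, a parent whitelisting `web_fetch` and `Read` combined with a child that excludes `Read` leaves only `web_fetch` available to the child.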

## Test plan

- [x] 68 new unit tests in `copilot/permissions_test.py` and
`blocks/autopilot_permissions_test.py`
- [x] Block identifier matching: full UUID, partial UUID, name,
case-insensitivity
- [x] Tool allow/deny list semantics including edge cases (empty list,
unknown tool)
- [x] Parent/child merging and recursion ceiling correctness
- [x] `validate_tool_names` / `validate_block_identifiers` with mock
block registry
- [x] `apply_tool_permissions` SDK tool-list integration
- [x] `AutoPilotBlock.run()` — invalid tool/block yields error before
session creation
- [x] `AutoPilotBlock.run()` — valid permissions forwarded to
`execute_copilot`
- [x] Existing `AutoPilotBlock` block tests still pass (2/2)
- [x] All hooks pass (pyright, ruff, black, isort)
- [x] E2E: CoPilot chat works end-to-end with E2B sandbox (12s stream)
- [x] E2E: Permission fields render in Builder UI (Tools combobox,
exclude toggles)
- [x] E2E: Agent with restricted permissions (whitelist web_fetch only)
executes correctly
- [x] E2E: Permission values preserved through API round-trip
2026-03-24 07:49:58 +00:00
Nicholas Tindle
b80e5ea987 fix(backend): allow admins to download submitted agents pending review (#12535)
## Why

Admins cannot download submitted-but-not-yet-approved agents from
`/admin/marketplace`. Clicking "Download" fails silently with a Server
Components render error. This blocks admins from reviewing agents that
companies have submitted.

## What

Remove the redundant ownership/marketplace check from
`get_graph_as_admin()` that was silently tightened in PR #11323 (Nov
2025). Add regression tests for both the admin download path and the
non-admin marketplace access control.

## How

**Root cause:** In PR #11323, Reinier refactored an inline
`StoreListingVersion` query (which had no status filter) into a call to
`is_graph_published_in_marketplace()` (which requires `submissionStatus:
APPROVED`). This was collateral cleanup — his PR focused on sub-agent
execution permissions — but it broke admin download of pending agents.

**Fix:** Remove the ownership/marketplace check from
`get_graph_as_admin()`, keeping only the null guard. This is safe
because `get_graph_as_admin` is only callable through admin-protected
routes (`requires_admin_user` at router level).

**Tests added:**
- `test_admin_can_access_pending_agent_not_owned` — admin can access a
graph they don't own that isn't APPROVED
- `test_admin_download_pending_agent_with_subagents` — admin export
includes sub-graphs
- `test_get_graph_non_owner_approved_marketplace_agent` — protects PR
#11323: non-owners CAN access APPROVED agents
- `test_get_graph_non_owner_pending_marketplace_agent_denied` — protects
PR #11323: non-owners CANNOT access PENDING agents

### Checklist

- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] 4 regression tests pass locally
  - [x] Admin can download pending agents (verified via unit test)
  - [x] Non-admin marketplace access control preserved

## Test plan
- [ ] Verify admin can download a submitted-but-not-approved agent from
`/admin/marketplace`
- [ ] Verify non-admin users still cannot access admin endpoints
- [ ] Verify the download succeeds without console errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes access-control behavior for admin graph retrieval; risk is
mitigated by route-level admin auth but misuse of `get_graph_as_admin()`
outside admin-protected routes would expose non-approved graphs.
> 
> **Overview**
> Admins can now download/review **submitted-but-not-approved**
marketplace agents: `get_graph_as_admin()` no longer enforces ownership
or *marketplace APPROVED* checks, only returning `None` when the graph
doesn’t exist.
> 
> Adds regression tests covering the admin download/export path
(including sub-graphs) and confirming non-admin behavior is unchanged:
non-owners can fetch **APPROVED** marketplace graphs but cannot access
**pending** ones.
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 07:40:17 +00:00
Zamil Majdy
3d4fcfacb6 fix(backend): add circuit breaker for infinite tool call retry loops (#12499)
## Summary
- Adds a two-layer circuit breaker to prevent AutoPilot from looping
infinitely when tool calls fail with empty parameters
- **Tool-level**: After 3 consecutive identical failures per tool,
returns a hard-stop message instructing the model to output content as
text instead of retrying
- **Stream-level**: After 6 consecutive empty tool calls (`input: {}`),
aborts the stream entirely with a user-visible error and retry button

## Background
In session `c5548b48`, the model completed all research successfully but
then spent 51+ minutes in an infinite loop trying to write output: every
tool call was sent with `input: {}` (likely because context saturation
prevented argument serialization). It made 21+ identical failing tool
calls, and no circuit breaker existed to stop it.

## Changes
- `tool_adapter.py`: Added `_check_circuit_breaker`,
`_record_tool_failure`, `_clear_tool_failures` functions with a
`ContextVar`-based tracker. Integrated into both `create_tool_handler`
(BaseTool) and the `_truncating` wrapper (all tools).
- `service.py`: Added empty-tool-call detection in the main stream loop
that counts consecutive `AssistantMessage`s with empty
`ToolUseBlock.input` and aborts after the limit.
- `test_circuit_breaker.py`: 7 unit tests covering threshold behavior,
per-args tracking, reset on success, and uninitialized tracker safety.

## Test plan
- [x] Unit tests pass (`pytest
backend/copilot/sdk/test_circuit_breaker.py` — 8/8 passing)
- [x] Pre-commit hooks pass (Ruff, Black, isort, typecheck all pass)
- [x] E2E: CoPilot tool calls work normally (GetCurrentTimeBlock
returned 09:16:39 UTC)
- [x] E2E: Circuit breaker pass-through verified (successful calls don't
trigger breaker)
- [x] E2E: Circuit breaker code integrated into tool_adapter truncating
wrapper
2026-03-24 05:45:12 +00:00
Zamil Majdy
32eac6d52e dx(skills): improve /pr-test to require screenshots, state verification, and fix accountability (#12527)
## Summary
- Add "Critical Requirements" section making screenshots at every step,
PR comment posting, state verification, negative tests, and full
evidence reports non-negotiable
- Add "State Manipulation for Realistic Testing" section with Redis CLI,
DB query, and API before/after patterns
- Strengthen fix mode to require before/after screenshot pairs, rebuild
only affected services, and commit after each fix
- Expand test report format to include API evidence and screenshot
evidence columns
- Bump version to 2.0.0

## Test plan
- [x] Run `/pr-test` on an existing PR and verify it follows the new
critical requirements
- [x] Verify screenshots are posted to PR comment
- [x] Verify fix mode produces before/after screenshot pairs
2026-03-24 12:35:05 +07:00
dependabot[bot]
9762f4cde7 chore(libs/deps-dev): bump the development-dependencies group across 1 directory with 2 updates (#12523)
Bumps the development-dependencies group with 2 updates in the
/autogpt_platform/autogpt_libs directory:
[pytest-cov](https://github.com/pytest-dev/pytest-cov) and
[ruff](https://github.com/astral-sh/ruff).

Updates `pytest-cov` from 7.0.0 to 7.1.0
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/pytest-dev/pytest-cov/blob/master/CHANGELOG.rst">pytest-cov's
changelog</a>.</em></p>
<blockquote>
<h2>7.1.0 (2026-03-21)</h2>
<ul>
<li>
<p>Fixed total coverage computation to always be consistent, regardless
of reporting settings.
Previously some reports could produce different total counts, and
consequently can make --cov-fail-under behave different depending on
reporting options.
See <code>[#641](https://github.com/pytest-dev/pytest-cov/issues/641)
&lt;https://github.com/pytest-dev/pytest-cov/issues/641&gt;</code>_.</p>
</li>
<li>
<p>Improve handling of ResourceWarning from sqlite3.</p>
<p>The plugin adds warning filter for sqlite3
<code>ResourceWarning</code> unclosed database (since 6.2.0).
It checks if there is already existing plugin for this message by
comparing filter regular expression.
When filter is specified on command line the message is escaped and does
not match an expected message.
A check for an escaped regular expression is added to handle this
case.</p>
<p>With this fix one can suppress <code>ResourceWarning</code> from
sqlite3 from command line::</p>
<p>pytest -W &quot;ignore:unclosed database in &lt;sqlite3.Connection
object at:ResourceWarning&quot; ...</p>
</li>
<li>
<p>Various improvements to documentation.
Contributed by Art Pelling in
<code>[#718](https://github.com/pytest-dev/pytest-cov/issues/718)
&lt;https://github.com/pytest-dev/pytest-cov/pull/718&gt;</code>_ and
&quot;vivodi&quot; in
<code>[#738](https://github.com/pytest-dev/pytest-cov/issues/738)
&lt;https://github.com/pytest-dev/pytest-cov/pull/738&gt;</code><em>.
Also closed
<code>[#736](https://github.com/pytest-dev/pytest-cov/issues/736)
&lt;https://github.com/pytest-dev/pytest-cov/issues/736&gt;</code></em>.</p>
</li>
<li>
<p>Fixed some assertions in tests.
Contributed by in Markéta Machová in
<code>[#722](https://github.com/pytest-dev/pytest-cov/issues/722)
&lt;https://github.com/pytest-dev/pytest-cov/pull/722&gt;</code>_.</p>
</li>
<li>
<p>Removed unnecessary coverage configuration copying (meant as a backup
because reporting commands had configuration side-effects before
coverage 5.0).</p>
</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="66c8a526b1"><code>66c8a52</code></a>
Bump version: 7.0.0 → 7.1.0</li>
<li><a
href="f707662478"><code>f707662</code></a>
Make the examples use pypy 3.11.</li>
<li><a
href="6049a78478"><code>6049a78</code></a>
Make context test use the old ctracer (seems the new sysmon tracer
behaves di...</li>
<li><a
href="8ebf20bbbc"><code>8ebf20b</code></a>
Update changelog.</li>
<li><a
href="861d30e60d"><code>861d30e</code></a>
Remove the backup context manager - shouldn't be needed since coverage
5.0, ...</li>
<li><a
href="fd4c956014"><code>fd4c956</code></a>
Pass the precision on the nulled total (seems that there's some caching
goion...</li>
<li><a
href="78c9c4ecb0"><code>78c9c4e</code></a>
Only run the 3.9 on older deps.</li>
<li><a
href="4849a922e8"><code>4849a92</code></a>
Punctuation.</li>
<li><a
href="197c35e2f3"><code>197c35e</code></a>
Update changelog and hopefully I don't forget to publish release again
:))</li>
<li><a
href="14dc1c92d4"><code>14dc1c9</code></a>
Update examples to use 3.11 and make the adhoc layout example look a bit
more...</li>
<li>Additional commits viewable in <a
href="https://github.com/pytest-dev/pytest-cov/compare/v7.0.0...v7.1.0">compare
view</a></li>
</ul>
</details>
<br />

Updates `ruff` from 0.15.0 to 0.15.7
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/ruff/releases">ruff's
releases</a>.</em></p>
<blockquote>
<h2>0.15.7</h2>
<h2>Release Notes</h2>
<p>Released on 2026-03-19.</p>
<h3>Preview features</h3>
<ul>
<li>Display output severity in preview (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23845">#23845</a>)</li>
<li>Don't show <code>noqa</code> hover for non-Python documents (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24040">#24040</a>)</li>
</ul>
<h3>Rule changes</h3>
<ul>
<li>[<code>pycodestyle</code>] Recognize <code>pyrefly:</code> as a
pragma comment (<code>E501</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24019">#24019</a>)</li>
</ul>
<h3>Server</h3>
<ul>
<li>Don't return code actions for non-Python documents (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23905">#23905</a>)</li>
</ul>
<h3>Documentation</h3>
<ul>
<li>Add company AI policy to contributing guide (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24021">#24021</a>)</li>
<li>Document editor features for Markdown code formatting (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23924">#23924</a>)</li>
<li>[<code>pylint</code>] Improve phrasing (<code>PLC0208</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24033">#24033</a>)</li>
</ul>
<h3>Other changes</h3>
<ul>
<li>Use PEP 639 license information (<a
href="https://redirect.github.com/astral-sh/ruff/pull/19661">#19661</a>)</li>
</ul>
<h3>Contributors</h3>
<ul>
<li><a
href="https://github.com/tmimmanuel"><code>@​tmimmanuel</code></a></li>
<li><a
href="https://github.com/DimitriPapadopoulos"><code>@​DimitriPapadopoulos</code></a></li>
<li><a
href="https://github.com/amyreese"><code>@​amyreese</code></a></li>
<li><a href="https://github.com/statxc"><code>@​statxc</code></a></li>
<li><a href="https://github.com/dylwil3"><code>@​dylwil3</code></a></li>
<li><a
href="https://github.com/hunterhogan"><code>@​hunterhogan</code></a></li>
<li><a
href="https://github.com/renovate"><code>@​renovate</code></a></li>
</ul>
<h2>Install ruff 0.15.7</h2>
<h3>Install prebuilt binaries via shell script</h3>
<pre lang="sh"><code>curl --proto '=https' --tlsv1.2 -LsSf
https://releases.astral.sh/github/ruff/releases/download/0.15.7/ruff-installer.sh
| sh
</code></pre>
<h3>Install prebuilt binaries via powershell script</h3>
<pre lang="sh"><code>powershell -ExecutionPolicy Bypass -c &quot;irm
https://releases.astral.sh/github/ruff/releases/download/0.15.7/ruff-installer.ps1
| iex&quot;
</code></pre>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md">ruff's
changelog</a>.</em></p>
<blockquote>
<h2>0.15.7</h2>
<p>Released on 2026-03-19.</p>
<h3>Preview features</h3>
<ul>
<li>Display output severity in preview (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23845">#23845</a>)</li>
<li>Don't show <code>noqa</code> hover for non-Python documents (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24040">#24040</a>)</li>
</ul>
<h3>Rule changes</h3>
<ul>
<li>[<code>pycodestyle</code>] Recognize <code>pyrefly:</code> as a
pragma comment (<code>E501</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24019">#24019</a>)</li>
</ul>
<h3>Server</h3>
<ul>
<li>Don't return code actions for non-Python documents (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23905">#23905</a>)</li>
</ul>
<h3>Documentation</h3>
<ul>
<li>Add company AI policy to contributing guide (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24021">#24021</a>)</li>
<li>Document editor features for Markdown code formatting (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23924">#23924</a>)</li>
<li>[<code>pylint</code>] Improve phrasing (<code>PLC0208</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/24033">#24033</a>)</li>
</ul>
<h3>Other changes</h3>
<ul>
<li>Use PEP 639 license information (<a
href="https://redirect.github.com/astral-sh/ruff/pull/19661">#19661</a>)</li>
</ul>
<h3>Contributors</h3>
<ul>
<li><a
href="https://github.com/tmimmanuel"><code>@​tmimmanuel</code></a></li>
<li><a
href="https://github.com/DimitriPapadopoulos"><code>@​DimitriPapadopoulos</code></a></li>
<li><a
href="https://github.com/amyreese"><code>@​amyreese</code></a></li>
<li><a href="https://github.com/statxc"><code>@​statxc</code></a></li>
<li><a href="https://github.com/dylwil3"><code>@​dylwil3</code></a></li>
<li><a
href="https://github.com/hunterhogan"><code>@​hunterhogan</code></a></li>
<li><a
href="https://github.com/renovate"><code>@​renovate</code></a></li>
</ul>
<h2>0.15.6</h2>
<p>Released on 2026-03-12.</p>
<h3>Preview features</h3>
<ul>
<li>Add support for <code>lazy</code> import parsing (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23755">#23755</a>)</li>
<li>Add support for star-unpacking of comprehensions (PEP 798) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23788">#23788</a>)</li>
<li>Reject semantic syntax errors for lazy imports (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23757">#23757</a>)</li>
<li>Drop a few rules from the preview default set (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23879">#23879</a>)</li>
<li>[<code>airflow</code>] Flag <code>Variable.get()</code> calls
outside of task execution context (<code>AIR003</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23584">#23584</a>)</li>
<li>[<code>airflow</code>] Flag runtime-varying values in DAG/task
constructor arguments (<code>AIR304</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23631">#23631</a>)</li>
<li>[<code>flake8-bugbear</code>] Implement
<code>delattr-with-constant</code> (<code>B043</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/23737">#23737</a>)</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="0ef39de46c"><code>0ef39de</code></a>
Bump 0.15.7 (<a
href="https://redirect.github.com/astral-sh/ruff/issues/24049">#24049</a>)</li>
<li><a
href="beb543b5c6"><code>beb543b</code></a>
[ty] ecosystem-analyzer: Fail on newly panicking projects (<a
href="https://redirect.github.com/astral-sh/ruff/issues/24043">#24043</a>)</li>
<li><a
href="378fe73092"><code>378fe73</code></a>
Don't show noqa hover for non-Python documents (<a
href="https://redirect.github.com/astral-sh/ruff/issues/24040">#24040</a>)</li>
<li><a
href="b5665bd18e"><code>b5665bd</code></a>
[<code>pylint</code>] Improve phrasing (<code>PLC0208</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/issues/24033">#24033</a>)</li>
<li><a
href="6e20f22190"><code>6e20f22</code></a>
test: migrate <code>show_settings</code> and <code>version</code> tests
to use <code>CliTest</code> (<a
href="https://redirect.github.com/astral-sh/ruff/issues/23702">#23702</a>)</li>
<li><a
href="f99b284c1f"><code>f99b284</code></a>
Drain file watcher events during test setup (<a
href="https://redirect.github.com/astral-sh/ruff/issues/24030">#24030</a>)</li>
<li><a
href="744c996c35"><code>744c996</code></a>
[ty] Filter out unsatisfiable inference attempts during generic call
narrowin...</li>
<li><a
href="16160958bd"><code>1616095</code></a>
[ty] Avoid inferring intersection types for call arguments (<a
href="https://redirect.github.com/astral-sh/ruff/issues/23933">#23933</a>)</li>
<li><a
href="7f275f431b"><code>7f275f4</code></a>
[ty] Pin mypy_primer in <code>setup_primer_project.py</code> (<a
href="https://redirect.github.com/astral-sh/ruff/issues/24020">#24020</a>)</li>
<li><a
href="7255e362e4"><code>7255e36</code></a>
[<code>pycodestyle</code>] Recognize <code>pyrefly:</code> as a pragma
comment (<code>E501</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/issues/24019">#24019</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/astral-sh/ruff/compare/0.15.0...0.15.7">compare
view</a></li>
</ul>
</details>
<br />


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-24 01:36:45 +00:00
Otto
76901ba22f docs: add Why/What/How structure to PR template, CLAUDE.md, and PR skills (#12525)
Requested by @majdyz

### Why / What / How

**Why:** PR descriptions currently explain the *what* and *how* but not
the *why*. Without motivation context, reviewers can't judge whether an
approach fits the problem. Nick flagged this in standup: "The PR
descriptions you use are explaining the what not the why."

**What:** Adds a consistent Why / What / How structure to PR
descriptions across the entire workflow — template, CLAUDE.md guidance,
and all PR-related skills (`/pr-review`, `/pr-test`, `/pr-address`).

**How:**
- **`.github/PULL_REQUEST_TEMPLATE.md`**: Replaced the old vague
`Changes` heading with a single `Why / What / How` section with guiding
comments
- **`autogpt_platform/CLAUDE.md`**: Added bullet under "Creating Pull
Requests" requiring the Why/What/How structure
- **`.claude/skills/pr-review/SKILL.md`**: Added "Read the PR
description" step before reading the diff, and "Description quality" to
the review checklist
- **`.claude/skills/pr-test/SKILL.md`**: Updated Step 1 to read the PR
description and understand Why/What/How before testing
- **`.claude/skills/pr-address/SKILL.md`**: Added "Read the PR
description" step before fetching comments

## Test plan
- [x] All five files reviewed for correct formatting and consistency

---
Co-authored-by: Zamil Majdy (@majdyz) <zamil.majdy@agpt.co>
2026-03-24 01:35:39 +00:00
Zamil Majdy
23b65939f3 fix(backend/db): add DB_STATEMENT_CACHE_SIZE env var for Prisma engine (#12521)
## Summary
- Add `DB_STATEMENT_CACHE_SIZE` env var support for the Prisma query engine
- Wire it through as the `statement_cache_size` URL parameter, which
controls the per-connection LRU prepared statement cache in the Rust
binary engine

## Why
Live investigation on dev pods showed the Prisma Rust engine growing
from 34MB to 932MB over ~1hr due to an unbounded query plan cache.
Despite `pgbouncer=true` in the DATABASE_URL (which should disable
caching), the engine still caches.

This gives explicit control: setting `DB_STATEMENT_CACHE_SIZE=0`
disables the cache entirely.
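
One way the wiring could look, as a sketch only: read the env var and append it to the connection URL as `statement_cache_size`. The helper name `with_statement_cache_size` and the example URL are assumptions; the PR's actual code path is not shown here.

```python
import os
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_statement_cache_size(database_url: str, env=os.environ) -> str:
    """Return database_url with statement_cache_size set from the environment.

    Hypothetical helper: if DB_STATEMENT_CACHE_SIZE is unset, the URL is
    returned unchanged.
    """
    size = env.get("DB_STATEMENT_CACHE_SIZE")
    if size is None:
        return database_url
    parts = urlsplit(database_url)
    # Preserve existing parameters such as pgbouncer=true.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["statement_cache_size"] = size
    return urlunsplit(parts._replace(query=urlencode(query)))
```

With `DB_STATEMENT_CACHE_SIZE=0`, a URL ending in `?pgbouncer=true` would gain `&statement_cache_size=0`, leaving the existing parameter in place.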

## Live data (dev)
```
Fresh pod:  Python=693MB, Engine=34MB,  Total=727MB
Bloated:    Python=2.1GB, Engine=932MB, Total=3GB
```

## Infra companion PR

[AutoGPT_cloud_infrastructure#299](https://github.com/Significant-Gravitas/AutoGPT_cloud_infrastructure/pull/299)
sets `DB_STATEMENT_CACHE_SIZE=0` along with `PYTHONMALLOC=malloc` and
memory limit changes.

## Test plan
- [ ] Deploy to dev and monitor Prisma engine memory over 1hr
- [ ] Verify queries still work correctly with cache disabled
- [ ] Compare engine RSS on fresh vs aged pods
2026-03-23 23:57:28 +07:00
Zamil Majdy
1c27eaac53 dx(skills): improve /pr-test skill to show screenshots with explanations (#12518)
## Summary
- Update /pr-test skill to consistently show screenshots inline to the
user with explanations
- Post PR comments with inline images and per-screenshot descriptions
(not just local file paths)
- Simplify GitHub Git API upload flow for screenshot hosting

## Changes
- Step 5: Take screenshots at every significant test step (aim for 1+
per scenario)
- Step 6 (new): Show every screenshot to the user via Read tool with 2-3
sentence explanations
- Step 7: Post PR comment with inline images, summary table, and
per-screenshot context

## Test plan
- [x] Tested end-to-end on PR #12512 — screenshots uploaded and rendered
correctly in PR comment
2026-03-23 23:11:21 +07:00
Zamil Majdy
923b164794 fix(backend): use system chromium for agent-browser on all architectures (#12473)
## Summary

- Replaces the arch-conditional chromium install (ARM64 vs AMD64) with a
single approach: always use the distro-packaged `chromium` and set
`AGENT_BROWSER_EXECUTABLE_PATH=/usr/bin/chromium`
- Removes `agent-browser install` entirely (it downloads Chrome for
Testing, which has no ARM64 binary)
- Removes the `entrypoint.sh` wrapper script that was setting the env
var at runtime
- Updates `autogpt_platform/db/docker/docker-compose.yml`: removes
`external: true` from the network declarations so the Supabase stack can
be brought up standalone (needed for the Docker integration tests in the
test plan below — without this, `docker compose up` fails unless the
platform stack is already running); also sets
`GOTRUE_MAILER_AUTOCONFIRM: true` for local dev convenience (no SMTP
setup required on first run — this compose file is not used in
production)
- Updates `autogpt_platform/docker-compose.platform.yml`: mounts the
`workspace` volume so agent-browser results (screenshots, snapshots) are
accessible from other services; without this the copilot workspace write
fails in Docker

## Verification

Tested via Docker build on arm64 (Apple Silicon):
```
=== Testing agent-browser with system chromium ===
✓ Example Domain
  https://example.com/
=== SUCCESS: agent-browser launched with system chromium ===
```
agent-browser navigated to example.com in ~1.5s using system chromium
(v146 from Debian trixie).

## Test plan

- [x] Docker build test on arm64: `agent-browser open
https://example.com` succeeds with system chromium
- [x] Verify amd64 Docker build still works (CI)
2026-03-23 20:54:03 +07:00
Zamil Majdy
e86ac21c43 feat(platform): add workflow import from other tools (n8n, Make.com, Zapier) (#12440)
## Summary
- Enable one-click import of workflows from other platforms (n8n,
Make.com, Zapier, etc.) into AutoGPT via CoPilot
- **No backend endpoint** — import is entirely client-side: the dialog
reads the file or fetches the n8n template URL, uploads the JSON to the
workspace via `uploadFileDirect`, stores the file reference in
`sessionStorage`, and redirects to CoPilot with `autosubmit=true`
- CoPilot receives the workflow JSON as a proper file attachment and
uses the existing agent-generator pipeline to convert it
- Library dialog redesigned: 2 tabs — "AutoGPT agent" (upload exported
agent JSON) and "Another platform" (file upload + optional n8n URL)

## How it works
1. User uploads a workflow JSON (or pastes an n8n template URL)
2. Frontend fetches/reads the JSON and uploads it to the user's
workspace via the existing file upload API
3. User is redirected to `/copilot?source=import&autosubmit=true`
4. CoPilot picks up the file from `sessionStorage` and sends it as a
`FileUIPart` attachment with a prompt to recreate the workflow as an
AutoGPT agent

## Test plan
- [x] Manual test: import a real n8n workflow JSON via the dialog
- [x] Manual test: paste an n8n template URL and verify it fetches +
converts
- [x] Manual test: import Make.com / Zapier workflow export JSON
- [x] Repeated imports don't cause 409 conflicts (filenames use
`crypto.randomUUID()`)
- [x] E2E: Import dialog has 2 tabs (AutoGPT agent + Another platform)
- [x] E2E: n8n quick-start template buttons present
- [x] E2E: n8n URL input enables Import button on valid URL
- [x] E2E: Workspace upload API returns file_id
2026-03-23 13:03:02 +00:00
Lluis Agusti
94224be841 Merge remote-tracking branch 'origin/master' into dev 2026-03-23 20:42:32 +08:00
Otto
da4bdc7ab9 fix(backend+frontend): reduce Sentry noise from user-caused errors (#12513)
Requested by @majdyz

User-caused errors (no payment method, webhook agent invocation, missing
credentials, bad API keys) were hitting Sentry via `logger.exception()`
in the `ValueError` handler, creating noise that obscures real bugs.
Additionally, a frontend crash on the copilot page (BUILDER-71J) needed
fixing.

**Changes:**

**Backend — rest_api.py**
- Set `log_error=False` for the `ValueError` exception handler (line
278), consistent with how `FolderValidationError` and `NotFoundError`
are already handled. User-caused 400 errors no longer trigger
`logger.exception()` → Sentry.

**Backend — executor/manager.py**
- Downgrade `ExecutionManager` input validation skip errors from `error`
to `warning` level. Missing credentials is expected user behavior, not
an internal error.

**Backend — blocks/llm.py**
- Sanitize unpaired surrogates in LLM prompt content before sending to
provider APIs. Prevents `UnicodeEncodeError: surrogates not allowed`
when httpx encodes the JSON body (AUTOGPT-SERVER-8AX).
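
A minimal sketch of surrogate sanitization, assuming the replacement character U+FFFD is acceptable; the function name is hypothetical and the change in `blocks/llm.py` may use a different strategy. In a Python `str`, any code point in the range U+D800 to U+DFFF is a lone surrogate and cannot be encoded as UTF-8.

```python
import re

# Any surrogate code point inside a Python str is unpaired by definition.
_SURROGATES = re.compile(r"[\ud800-\udfff]")


def sanitize_surrogates(text: str, replacement: str = "\ufffd") -> str:
    """Replace lone surrogates so the text can be encoded as UTF-8."""
    return _SURROGATES.sub(replacement, text)
```

Well-formed characters outside the Basic Multilingual Plane (emoji, for instance) are single code points in Python and pass through untouched.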

**Frontend — package.json**
- Upgrade `ai` SDK from `6.0.59` to `6.0.134` to fix BUILDER-71J
(`TypeError: undefined is not an object (evaluating
'this.activeResponse.state')` on /copilot page). This is a known issue
in the Vercel AI SDK fixed in later patch versions.

**Sentry issues addressed:**
- `No payment method found` (ValueError → 400)
- `This agent is triggered by an external event (webhook)` (ValueError →
400)
- `Node input updated with non-existent credentials` (ValueError → 400)
- `[ExecutionManager] Skip execution, input validation error: missing
input {credentials}`
- `UnicodeEncodeError: surrogates not allowed` (AUTOGPT-SERVER-8AX)
- `TypeError: activeResponse.state` (BUILDER-71J)

Resolves SECRT-2166

---
Co-authored-by: Zamil Majdy (@majdyz) <zamil.majdy@agpt.co>

2026-03-23 12:22:49 +00:00
Zamil Majdy
7176cecf25 perf(copilot): reduce tool schema token cost by 34% (#12398)
## Summary

Reduce CoPilot per-turn token overhead by systematically trimming tool
descriptions, parameter schemas, and system prompt content. All 35 MCP
tool schemas are passed on every SDK call — this PR reduces their size.

### Strategy

1. **Tool descriptions**: Trimmed verbose multi-sentence explanations to
concise single-sentence summaries while preserving meaning
2. **Parameter schemas**: Shortened parameter descriptions to essential
info, removed some `default` values (handled in code)
3. **System prompt**: Condensed `_SHARED_TOOL_NOTES` and storage
supplement template in `prompting.py`
4. **Cross-tool references**: Removed duplicate workflow hints (e.g.
"call find_block before run_block" appeared in BOTH tools — kept only in
the dependent tool). Critical cross-tool references retained (e.g.
`continue_run_block` in `run_block`, `fix_agent_graph` in
`validate_agent`, `get_doc_page` in `search_docs`, `web_fetch`
preference in `browser_navigate`)

### Token Impact

| Metric | Before | After | Reduction |
|--------|--------|-------|-----------|
| System Prompt | ~865 tokens | ~497 tokens | 43% |
| Tool Schemas | ~9,744 tokens | ~6,470 tokens | 34% |
| **Grand Total** | **~10,609 tokens** | **~6,967 tokens** | **34%** |

Saves **~3,642 tokens per conversation turn**.
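
The figures in the table can be re-derived with a few lines of arithmetic; the snippet below uses only the token counts reported above, and the variable names are illustrative.

```python
# Approximate token counts taken from the table above.
before = {"system_prompt": 865, "tool_schemas": 9_744}
after = {"system_prompt": 497, "tool_schemas": 6_470}


def reduction_pct(old: int, new: int) -> int:
    """Percentage reduction from old to new, rounded to a whole percent."""
    return round(100 * (old - new) / old)


total_before = sum(before.values())
total_after = sum(after.values())
saved_per_turn = total_before - total_after

for name in before:
    print(name, f"{reduction_pct(before[name], after[name])}%")
print("total", f"{reduction_pct(total_before, total_after)}%", saved_per_turn)
```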

### Key Decisions

- **Mostly description changes**: Tool logic, parameters, and types
unchanged. However, some schema-level `default` fields were removed
(e.g. `save` in `customize_agent`) — these are machine-readable
metadata, not just prose, and may affect LLM behavior.
- **Quality preserved**: All descriptions still convey what the tool
does and essential usage patterns
- **Cross-references trimmed carefully**: Kept prerequisite hints in the
dependent tool (run_block mentions find_block) but removed the reverse
(find_block no longer mentions run_block). Critical cross-tool guidance
retained where removal would degrade model behavior.
- **`run_time` description fixed**: Added missing supported values
(today, last 30 days, ISO datetime) per review feedback

### Future Optimization

The SDK passes all 35 tools on every call. The MCP protocol's
`list_tools()` handler supports dynamic tool registration — a follow-up
PR could implement lazy tool loading (register core tools + a discovery
meta-tool) to further reduce per-turn token cost.

### Changes

- Trimmed descriptions across 25 tool files
- Condensed `_SHARED_TOOL_NOTES` and `_build_storage_supplement` in
`prompting.py`
- Fixed `run_time` schema description in `agent_output.py`

### Checklist

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] All 273 copilot tests pass locally
  - [x] All 35 tools load and produce valid schemas
  - [x] Before/after token dumps compared
  - [x] Formatting passes (`poetry run format`)
  - [x] CI green
2026-03-23 08:27:24 +00:00
Zamil Majdy
f35210761c feat(devops): add /pr-test skill + subscription mode auto-provisioning (#12507)
## Summary
- Adds `/pr-test` skill for automated E2E testing of PRs using docker
compose, agent-browser, and API calls
- Covers full environment setup (copy .env, configure copilot auth,
ARM64 Docker fix)
- Includes browser UI testing, direct API testing, screenshot capture,
and test report generation
- Has `--fix` mode for auto-fixing bugs found during testing (similar to
`/pr-address`)
- **Screenshot uploads use GitHub Git API** (blobs → tree → commit →
ref) — no local git operations, safe for worktrees
- **Subscription mode improvements:**
- Extract subscription auth logic to `sdk/subscription.py` — uses SDK's
bundled CLI binary instead of requiring `npm install -g
@anthropic-ai/claude-code`
- Auto-provision `~/.claude/.credentials.json` from
`CLAUDE_CODE_OAUTH_TOKEN` env var on container startup — no `claude
login` needed in Docker
- Add `scripts/refresh_claude_token.sh` — cross-platform helper
(macOS/Linux/Windows) to extract OAuth tokens from host and update
`backend/.env`

## Test plan
- [x] Validated skill on multiple PRs (#12482, #12483, #12499, #12500,
#12501, #12440, #12472) — all test scenarios passed
- [x] Confirmed screenshot upload via GitHub Git API renders correctly
on all 7 PRs
- [x] Verified subscription mode E2E in Docker:
`refresh_claude_token.sh` → `docker compose up` → copilot chat responds
correctly with no API keys (pure OAuth subscription)
- [x] Verified auto-provisioning of credentials file inside container
from `CLAUDE_CODE_OAUTH_TOKEN` env var
- [x] Confirmed bundled CLI detection
(`claude_agent_sdk._bundled/claude`) works without system-installed
`claude`
- [x] `poetry run pytest backend/copilot/sdk/service_test.py` — 24/24
tests pass
2026-03-23 15:29:00 +07:00
Zamil Majdy
1ebcf85669 fix(platform): resolve 5 production Sentry alerts (#12496)
## Summary

Fixes 5 high-priority Sentry alerts from production:

- **AUTOGPT-SERVER-8AM**: Fix `TypeError: TypedDict does not support
instance and class checks` — `_value_satisfies_type` in `type.py` now
handles TypedDict classes that don't support `isinstance()` checks
- **AUTOGPT-SERVER-8AN**: Fix `ValueError: No payment method found`
triggering Sentry error — catch the expected ValueError in the
auto-top-up endpoint and return HTTP 422 instead
- **BUILDER-7F5**: Fix `Upload failed (409): File already exists` — add
`overwrite` query param to workspace upload endpoint and set it to
`true` from the frontend direct-upload
- **BUILDER-7F0**: Fix `LaTeX-incompatible input` KaTeX warnings
flooding Sentry — set `strict: false` on rehype-katex plugin to suppress
warnings for unrecognized Unicode characters
- **AUTOGPT-SERVER-89N**: Fix `Tool execution with manager failed:
validation error for dict[str,list[any]]` — make RPC return type
validation resilient (log warning instead of crash) and downgrade
SmartDecisionMaker tool execution errors to warnings

## Test plan
- [ ] Verify TypedDict type coercion works for
GithubMultiFileCommitBlock inputs
- [ ] Verify auto-top-up without payment method returns 422, not 500
- [ ] Verify file re-upload in copilot succeeds (overwrites instead of
409)
- [ ] Verify LaTeX rendering with Unicode characters doesn't produce
console warnings
- [ ] Verify SmartDecisionMaker tool execution failures are logged at
warning level
2026-03-23 08:05:08 +00:00
Otto
ab7c38bda7 fix(frontend): detect closed OAuth popup and allow dismissing waiting modal (#12443)
Requested by @kcze

When a user closes the OAuth sign-in popup without completing
authentication, the 'Waiting on sign-in process' modal was stuck open
with no way to dismiss it, forcing a page refresh.

Two bugs caused this:

1. `oauth-popup.ts` had no detection for the popup being closed by the
user. The promise would hang until the 5-minute timeout.

2. The modal's cancel button aborted a disconnected `AbortController`
instead of the actual OAuth flow's abort function, so clicking
cancel/close did nothing.

### Changes

- Add `popup.closed` polling (500ms) in `openOAuthPopup()` that rejects
the promise when the user closes the auth window
- Add reject-on-abort so the cancel button properly terminates the flow
- Replace the disconnected `oAuthPopupController` with a direct
`cancelOAuthFlow()` function that calls the real abort ref
- Handle popup-closed and user-canceled as silent cancellations (no
error toast)

### Testing

Tested manually 
- [x] Start OAuth flow → close popup window → modal dismisses
automatically 
- [x] Start OAuth flow → click cancel on modal → popup closes, modal
dismisses 
- [x] Complete OAuth flow normally → works as before 

Resolves SECRT-2054

---
Co-authored-by: Krzysztof Czerwinski (@kcze)
<krzysztof.czerwinski@agpt.co>

---------

Co-authored-by: Krzysztof Czerwinski <kpczerwinski@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 14:41:09 +00:00
Ubbe
b9ce37600e refactor(frontend/marketplace): move download below Add to library with contextual text (#12486)
## Summary

<img width="1487" height="670" alt="Screenshot 2026-03-20 at 00 52 58"
src="https://github.com/user-attachments/assets/f09de2a0-3c5b-4bce-b6f4-8a853f6792cf"
/>


- Move the download button from inline next to "Add to library" to a
separate line below it
- Add contextual text: "Want to use this agent locally? Download here"
- Style the "Download here" as a violet ghost button link with the
download icon

## Test plan
- [ ] Visit a marketplace agent page
- [ ] Verify "Add to library" button renders in its row
- [ ] Verify "Want to use this agent locally? Download here" appears
below it
- [ ] Click "Download here" and confirm the agent downloads correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 13:13:59 +00:00
Otto
3921deaef1 fix(frontend): truncate marketplace card description to 2 lines (#12494)
Reduces `line-clamp` from 3 to 2 on the marketplace `StoreCard`
description to prevent text from overlapping with the
absolutely-positioned run count and +Add button at the bottom of the
card.

Resolves SECRT-2156.

---
Co-authored-by: Abhimanyu Yadav (@Abhi1992002)
<122007096+Abhi1992002@users.noreply.github.com>
2026-03-20 09:10:21 +00:00
2371 changed files with 757378 additions and 4794 deletions

1
.agents/skills Symbolic link

@@ -0,0 +1 @@
../.claude/skills


@@ -0,0 +1,106 @@
---
name: open-pr
description: Open a pull request with proper PR template, test coverage, and review workflow. Guides agents through creating a PR that follows repo conventions, ensures existing behaviors aren't broken, covers new behaviors with tests, and handles review via bot when local testing isn't possible. TRIGGER when user asks to "open a PR", "create a PR", "make a PR", "submit a PR", "open pull request", "push and create PR", or any variation of opening/submitting a pull request.
user-invocable: true
args: "[base-branch] — optional target branch (defaults to dev)."
metadata:
author: autogpt-team
version: "1.0.0"
---
# Open a Pull Request
## Step 1: Pre-flight checks
Before opening the PR:
1. Ensure all changes are committed
2. Ensure the branch is pushed to the remote (`git push -u origin <branch>`)
3. Run linters/formatters across the whole repo (not just changed files) and commit any fixes
## Step 2: Test coverage
**This is critical.** Before opening the PR, verify:
### Existing behavior is not broken
- Identify which modules/components your changes touch
- Run the existing test suites for those areas
- If tests fail, fix them before opening the PR — do not open a PR with known regressions
### New behavior has test coverage
- Every new feature, endpoint, or behavior change needs tests
- If you added a new block, add tests for that block
- If you changed API behavior, add or update API tests
- If you changed frontend behavior, verify it doesn't break existing flows
If you cannot run the full test suite locally, note which tests you ran and which you couldn't in the test plan.
## Step 3: Create the PR using the repo template
Read the canonical PR template at `.github/PULL_REQUEST_TEMPLATE.md` and use it **verbatim** as your PR body:
1. Read the template: `cat .github/PULL_REQUEST_TEMPLATE.md`
2. Preserve the exact section titles and formatting, including:
- `### Why / What / How`
- `### Changes 🏗️`
- `### Checklist 📋`
3. Replace HTML comment prompts (`<!-- ... -->`) with actual content; do not leave them in
4. **Do not pre-check boxes** — leave all checkboxes as `- [ ]` until each step is actually completed
5. Do not alter the template structure, rename sections, or remove any checklist items
**PR title must use conventional commit format** (e.g., `feat(backend): add new block`, `fix(frontend): resolve routing bug`, `dx(skills): update PR workflow`). See CLAUDE.md for the full list of scopes.
Use `gh pr create` with the base branch (defaults to `dev` if no `[base-branch]` was provided). Use `--body-file` to avoid shell interpretation of backticks and special characters:
```bash
BASE_BRANCH="${BASE_BRANCH:-dev}"
PR_BODY=$(mktemp)
cat > "$PR_BODY" << 'PREOF'
<filled-in template from .github/PULL_REQUEST_TEMPLATE.md>
PREOF
gh pr create --base "$BASE_BRANCH" --title "<type>(scope): short description" --body-file "$PR_BODY"
rm "$PR_BODY"
```
## Step 4: Review workflow
### If you have a workspace that allows testing (docker, running backend, etc.)
- Run `/pr-test` to do E2E manual testing of the PR using docker compose, agent-browser, and API calls. This is the most thorough way to validate your changes before review.
- After testing, run `/pr-review` to self-review the PR for correctness, security, code quality, and testing gaps before requesting human review.
### If you do NOT have a workspace that allows testing
This is common for agents running in worktrees without a full stack. In this case:
1. Run `/pr-review` locally to catch obvious issues before pushing
2. **Comment `/review` on the PR** after creating it to trigger the review bot
3. **Poll for the review** rather than blindly waiting — check for new review comments every 30 seconds using `gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/reviews --paginate` and the GraphQL inline threads query. The bot typically responds within 30 minutes, but polling lets the agent react as soon as it arrives.
4. Do NOT proceed or merge until the bot review comes back
5. Address any issues the bot raises — use `/pr-address` which has a full polling loop with CI + comment tracking
```bash
# After creating the PR:
PR_NUMBER=$(gh pr view --json number -q .number)
gh pr comment "$PR_NUMBER" --body "/review"
# Then use /pr-address to poll for and address the review when it arrives
```
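The polling in item 3 can be sketched as a bounded loop. This is a minimal sketch, not part of the skill: `poll_for_review` and the `FETCH_CMD` override are hypothetical names, and it only counts reviews on the REST endpoint (it does not cover the GraphQL inline-thread query). The defaults of 30 seconds and 60 attempts roughly match the 30-minute window mentioned above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill): wait until the PR has at least
# one review, checking every INTERVAL seconds for up to MAX_ATTEMPTS tries.
# FETCH_CMD is an assumed override so the loop can be dry-run without GitHub.
poll_for_review() {
  local pr_number="$1" interval="${2:-30}" max_attempts="${3:-60}"
  local fetch_cmd="${FETCH_CMD:-gh api repos/Significant-Gravitas/AutoGPT/pulls/${pr_number}/reviews --paginate --jq length}"
  local attempt count
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    # With --paginate the jq filter may run once per page, so sum the counts.
    count=$($fetch_cmd 2>/dev/null | awk '{ total += $1 } END { print total + 0 }')
    if [ "$count" -gt 0 ]; then
      echo "Found $count review(s) on attempt $attempt"
      return 0
    fi
    sleep "$interval"
  done
  echo "No review after $max_attempts attempts" >&2
  return 1
}
```

Calling `poll_for_review "$PR_NUMBER"` returns non-zero on timeout, so the caller can stop instead of proceeding without a review.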
## Step 5: Address review feedback
Once the review bot or human reviewers leave comments:
- Run `/pr-address` to address review comments. It will loop until CI is green and all comments are resolved.
- Do not merge without human approval.
## Related skills
| Skill | When to use |
|---|---|
| `/pr-test` | E2E testing with docker compose, agent-browser, API calls — use when you have a running workspace |
| `/pr-review` | Review for correctness, security, code quality — use before requesting human review |
| `/pr-address` | Address reviewer comments and loop until CI green — use after reviews come in |
## Step 6: Post-creation
After the PR is created and review is triggered:
- Share the PR URL with the user
- If waiting on the review bot, let the user know the expected wait time (~30 min)
- Do not merge without human approval


@@ -17,6 +17,14 @@ gh pr list --head $(git branch --show-current) --repo Significant-Gravitas/AutoG
gh pr view {N}
```
## Read the PR description
Understand the **Why / What / How** before addressing comments — you need context to make good fixes:
```bash
gh pr view {N} --json body --jq '.body'
```
## Fetch comments (all sources)
### 1. Inline review threads — GraphQL (primary source of actionable items)
@@ -105,7 +113,9 @@ kill $REST_PID 2>/dev/null; trap - EXIT
```
Never manually edit files in `src/app/api/__generated__/`.
Then commit and **push immediately** — never batch commits without pushing.
Then commit and **push immediately** — never batch commits without pushing. Each fix should be visible on GitHub right away so CI can start and reviewers can see progress.
**Never push empty commits** (`git commit --allow-empty`) to re-trigger CI or bot checks. When a check fails, investigate the root cause (unchecked PR checklist, unaddressed review comments, code issues) and fix those directly. Empty commits add noise to git history.
For backend commits in worktrees: `poetry run git commit` (pre-commit hooks).


@@ -17,6 +17,16 @@ gh pr list --head $(git branch --show-current) --repo Significant-Gravitas/AutoG
gh pr view {N}
```
## Read the PR description
Before reading code, understand the **why**, **what**, and **how** from the PR description:
```bash
gh pr view {N} --json body --jq '.body'
```
Every PR should have a Why / What / How structure. If any of these are missing, note it as feedback.
## Read the diff
```bash
@@ -34,6 +44,8 @@ gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/reviews
## What to check
**Description quality:** Does the PR description cover Why (motivation/problem), What (summary of changes), and How (approach/implementation details)? If any are missing, request them — you can't judge the approach without understanding the problem and intent.
**Correctness:** logic errors, off-by-one, missing edge cases, race conditions (TOCTOU in file access, credit charging), error handling gaps, async correctness (missing `await`, unclosed resources).
**Security:** input validation at boundaries, no injection (command, XSS, SQL), secrets not logged, file paths sanitized (`os.path.basename()` in error messages).


@@ -0,0 +1,754 @@
---
name: pr-test
description: "E2E manual testing of PRs/branches using docker compose, agent-browser, and API calls. TRIGGER when user asks to manually test a PR, test a feature end-to-end, or run integration tests against a running system."
user-invocable: true
argument-hint: "[worktree path or PR number] — tests the PR in the given worktree. Optional flags: --fix (auto-fix issues found)"
metadata:
author: autogpt-team
version: "2.0.0"
---
# Manual E2E Test
Test a PR/branch end-to-end by building the full platform, interacting via browser and API, capturing screenshots, and reporting results.
## Critical Requirements
These are NON-NEGOTIABLE. Every test run MUST satisfy ALL of the following:
### 1. Screenshots at Every Step
- Take a screenshot at EVERY significant test step — not just at the end
- Every test scenario MUST have at least one BEFORE and one AFTER screenshot
- Name screenshots sequentially: `{NN}-{action}-{state}.png` (e.g., `01-credits-before.png`, `02-credits-after.png`)
- If a screenshot is missing for a scenario, the test is INCOMPLETE — go back and take it
### 2. Screenshots MUST Be Posted to PR
- Push ALL screenshots to a temp branch `test-screenshots/pr-{N}`
- Post a PR comment with ALL screenshots embedded inline using GitHub raw URLs
- This is NOT optional — every test run MUST end with a PR comment containing screenshots
- If screenshot upload fails, retry. If it still fails, list failed files and require manual drag-and-drop/paste attachment in the PR comment
### 3. State Verification with Before/After Evidence
- For EVERY state-changing operation (API call, user action), capture the state BEFORE and AFTER
- Log the actual API response values (e.g., `credits_before=100, credits_after=95`)
- Screenshot MUST show the relevant UI state change
- Compare expected vs actual values explicitly — do not just eyeball it
### 4. Negative Test Cases Are Mandatory
- Test at least ONE negative case per feature (e.g., insufficient credits, invalid input, unauthorized access)
- Verify error messages are user-friendly and accurate
- Verify the system state did NOT change after a rejected operation
### 5. Test Report Must Include Full Evidence
Each test scenario in the report MUST have:
- **Steps**: What was done (exact commands or UI actions)
- **Expected**: What should happen
- **Actual**: What actually happened
- **API Evidence**: Before/after API response values for state-changing operations
- **Screenshot Evidence**: Before/after screenshots with explanations
## State Manipulation for Realistic Testing
When testing features that depend on specific states (rate limits, credits, quotas):
1. **Use Redis CLI to set counters directly:**
```bash
# Find the Redis container
REDIS_CONTAINER=$(docker ps --format '{{.Names}}' | grep redis | head -1)
# Set a key with expiry
docker exec $REDIS_CONTAINER redis-cli SET key value EX ttl
# Example: Set rate limit counter to near-limit
docker exec $REDIS_CONTAINER redis-cli SET "rate_limit:user:test@test.com" 99 EX 3600
# Example: Check current value
docker exec $REDIS_CONTAINER redis-cli GET "rate_limit:user:test@test.com"
```
2. **Use API calls to check before/after state:**
```bash
# BEFORE: Record current state
BEFORE=$(curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8006/api/credits | jq '.credits')
echo "Credits BEFORE: $BEFORE"
# Perform the action...
# AFTER: Record new state and compare
AFTER=$(curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8006/api/credits | jq '.credits')
echo "Credits AFTER: $AFTER"
echo "Delta: $(( BEFORE - AFTER ))"
```
3. **Take screenshots BEFORE and AFTER state changes** — the UI must reflect the backend state change
4. **Never rely on mocked/injected browser state** — always use real backend state. Do NOT use `agent-browser eval` to fake UI state. The backend must be the source of truth.
5. **Use direct DB queries when needed:**
```bash
# Query via Supabase's PostgREST or docker exec into the DB
docker exec supabase-db psql -U supabase_admin -d postgres -c "SELECT credits FROM user_credits WHERE user_id = '...';"
```
6. **After every API test, verify the state change actually persisted:**
```bash
# Example: After a credits purchase, verify DB matches API
API_CREDITS=$(curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8006/api/credits | jq '.credits')
DB_CREDITS=$(docker exec supabase-db psql -U supabase_admin -d postgres -t -c "SELECT credits FROM user_credits WHERE user_id = '...';" | tr -d ' ')
[ "$API_CREDITS" = "$DB_CREDITS" ] && echo "CONSISTENT" || echo "MISMATCH: API=$API_CREDITS DB=$DB_CREDITS"
```
## Arguments
- `$ARGUMENTS` — worktree path (e.g. `$REPO_ROOT`) or PR number
- If `--fix` flag is present, auto-fix bugs found and push fixes (like pr-address loop)
## Step 0: Resolve the target
```bash
# If argument is a PR number, look up its head branch, then find the worktree checked out on that branch
gh pr view {N} --json headRefName --jq '.headRefName'
# If argument is a path, use it directly
```
Determine:
- `REPO_ROOT` — the root repo directory: `git -C "$WORKTREE_PATH" worktree list | head -1 | awk '{print $1}'` (or `git rev-parse --show-toplevel` if not a worktree)
- `WORKTREE_PATH` — the worktree directory
- `PLATFORM_DIR` — `$WORKTREE_PATH/autogpt_platform`
- `BACKEND_DIR` — `$PLATFORM_DIR/backend`
- `FRONTEND_DIR` — `$PLATFORM_DIR/frontend`
- `PR_NUMBER` — the PR number (from `gh pr list --head $(git branch --show-current)`)
- `PR_TITLE` — the PR title, slugified (e.g. "Add copilot permissions" → "add-copilot-permissions")
- `RESULTS_DIR` — `$REPO_ROOT/test-results/PR-{PR_NUMBER}-{slugified-title}`
Create the results directory:
```bash
PR_NUMBER=$(cd $WORKTREE_PATH && gh pr list --head $(git branch --show-current) --repo Significant-Gravitas/AutoGPT --json number --jq '.[0].number')
PR_TITLE=$(cd $WORKTREE_PATH && gh pr list --head $(git branch --show-current) --repo Significant-Gravitas/AutoGPT --json title --jq '.[0].title' | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9]/-/g' | sed 's/--*/-/g' | sed 's/^-//;s/-$//' | head -c 50)
RESULTS_DIR="$REPO_ROOT/test-results/PR-${PR_NUMBER}-${PR_TITLE}"
mkdir -p $RESULTS_DIR
```
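The slug step can be checked in isolation before it is used in a directory name. The sketch below wraps the same pipeline in a function; `slugify` is a hypothetical name, and the final `sed` is an addition not present in the command above (it drops a hyphen that truncation to 50 characters can leave at the end).

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the PR_TITLE pipeline above.
slugify() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed 's/[^a-z0-9]/-/g' \
    | sed 's/--*/-/g' \
    | sed 's/^-//;s/-$//' \
    | head -c 50 \
    | sed 's/-$//'   # extra step (an addition): drop a hyphen exposed by truncation
}

# Example from the list above:
slugify "Add copilot permissions"   # prints: add-copilot-permissions
```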
**Test user credentials** (for logging into the UI or verifying results manually):
- Email: `test@test.com`
- Password: `testtest123`
## Step 1: Understand the PR
Before testing, understand what changed:
```bash
cd $WORKTREE_PATH
# Read PR description to understand the WHY
gh pr view {N} --json body --jq '.body'
git log --oneline dev..HEAD | head -20
git diff dev --stat
```
Read the PR description (Why / What / How) and changed files to understand:
0. **Why** does this PR exist? What problem does it solve?
1. **What** feature/fix does this PR implement?
2. **How** does it work? What's the approach?
3. What components are affected? (backend, frontend, copilot, executor, etc.)
4. What are the key user-facing behaviors to test?
## Step 2: Write test scenarios
Based on the PR analysis, write a test plan to `$RESULTS_DIR/test-plan.md`:
```markdown
# Test Plan: PR #{N} — {title}
## Scenarios
1. [Scenario name] — [what to verify]
2. ...
## API Tests (if applicable)
1. [Endpoint] — [expected behavior]
- Before state: [what to check before]
- After state: [what to verify changed]
## UI Tests (if applicable)
1. [Page/component] — [interaction to test]
- Screenshot before: [what to capture]
- Screenshot after: [what to capture]
## Negative Tests (REQUIRED — at least one per feature)
1. [What should NOT happen] — [how to trigger it]
- Expected error: [what error message/code]
- State unchanged: [what to verify did NOT change]
```
**Be critical** — include edge cases, error paths, and security checks. Every scenario MUST specify what screenshots to take and what state to verify.
## Step 3: Environment setup
### 3a. Copy .env files from the root worktree
The root worktree (`$REPO_ROOT`) has the canonical `.env` files with all API keys. Copy them to the target worktree:
```bash
# CRITICAL: .env files are NOT checked into git. They must be copied manually.
cp $REPO_ROOT/autogpt_platform/.env $PLATFORM_DIR/.env
cp $REPO_ROOT/autogpt_platform/backend/.env $BACKEND_DIR/.env
cp $REPO_ROOT/autogpt_platform/frontend/.env $FRONTEND_DIR/.env
```
### 3b. Configure copilot authentication
The copilot needs an LLM API to function. Two approaches (try subscription first):
#### Option 1: Subscription mode (preferred — uses your Claude Max/Pro subscription)
The `claude_agent_sdk` Python package **bundles its own Claude CLI binary** — no need to install `@anthropic-ai/claude-code` via npm. The backend auto-provisions credentials from environment variables on startup.
Run the helper script to extract tokens from your host and auto-update `backend/.env` (works on macOS, Linux, and Windows/WSL):
```bash
# Extracts OAuth tokens and writes CLAUDE_CODE_OAUTH_TOKEN + CLAUDE_CODE_REFRESH_TOKEN into .env
bash $BACKEND_DIR/scripts/refresh_claude_token.sh --env-file $BACKEND_DIR/.env
```
**How it works:** The script reads the OAuth token from:
- **macOS**: system keychain (`"Claude Code-credentials"`)
- **Linux/WSL**: `~/.claude/.credentials.json`
- **Windows**: `%APPDATA%/claude/.credentials.json`
It sets `CLAUDE_CODE_OAUTH_TOKEN`, `CLAUDE_CODE_REFRESH_TOKEN`, and `CHAT_USE_CLAUDE_CODE_SUBSCRIPTION=true` in the `.env` file. On container startup, the backend auto-provisions `~/.claude/.credentials.json` inside the container from these env vars. The SDK's bundled CLI then authenticates using that file. No `claude login`, no npm install needed.
**Note:** The OAuth token expires (~24h). If copilot returns auth errors, re-run the script and restart: `$BACKEND_DIR/scripts/refresh_claude_token.sh --env-file $BACKEND_DIR/.env && docker compose up -d copilot_executor`
#### Option 2: OpenRouter API key mode (fallback)
If subscription mode doesn't work, switch to API key mode using OpenRouter:
```bash
# In $BACKEND_DIR/.env, ensure these are set:
CHAT_USE_CLAUDE_CODE_SUBSCRIPTION=false
CHAT_API_KEY=<value of OPEN_ROUTER_API_KEY from the same .env>
CHAT_BASE_URL=https://openrouter.ai/api/v1
CHAT_USE_CLAUDE_AGENT_SDK=true
```
Use `grep` and `perl` to update these values:
```bash
ORKEY=$(grep "^OPEN_ROUTER_API_KEY=" $BACKEND_DIR/.env | cut -d= -f2-)
[ -n "$ORKEY" ] || { echo "ERROR: OPEN_ROUTER_API_KEY is missing in $BACKEND_DIR/.env"; exit 1; }
perl -i -pe 's/CHAT_USE_CLAUDE_CODE_SUBSCRIPTION=true/CHAT_USE_CLAUDE_CODE_SUBSCRIPTION=false/' $BACKEND_DIR/.env
# Add or update CHAT_API_KEY and CHAT_BASE_URL
grep -q "^CHAT_API_KEY=" $BACKEND_DIR/.env && perl -i -pe "s|^CHAT_API_KEY=.*|CHAT_API_KEY=$ORKEY|" $BACKEND_DIR/.env || echo "CHAT_API_KEY=$ORKEY" >> $BACKEND_DIR/.env
grep -q "^CHAT_BASE_URL=" $BACKEND_DIR/.env && perl -i -pe 's|^CHAT_BASE_URL=.*|CHAT_BASE_URL=https://openrouter.ai/api/v1|' $BACKEND_DIR/.env || echo "CHAT_BASE_URL=https://openrouter.ai/api/v1" >> $BACKEND_DIR/.env
```
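The add-or-update pattern above repeats for each key. A minimal sketch of the same idea as a reusable function follows. `set_env_var` is a hypothetical name, not something the repo provides, and unlike the `perl -i` form it moves an updated key to the end of the file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: set KEY=VALUE in an env file, replacing any existing
# assignment for KEY. The file is rewritten in place via a temp copy so its
# permissions are preserved.
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp=$(mktemp)
  # Copy every line except existing assignments for the key.
  # grep exits 1 when nothing is left to print, hence "|| true".
  grep -v "^${key}=" "$file" > "$tmp" 2>/dev/null || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage (same keys as above):
#   set_env_var "$BACKEND_DIR/.env" CHAT_API_KEY "$ORKEY"
#   set_env_var "$BACKEND_DIR/.env" CHAT_BASE_URL "https://openrouter.ai/api/v1"
```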
### 3c. Stop conflicting containers
```bash
# Stop any running app containers (keep infra: supabase, redis, rabbitmq, clamav)
docker ps --format "{{.Names}}" | grep -E "rest_server|executor|copilot|websocket|database_manager|scheduler|notification|frontend|migrate" | while read name; do
docker stop "$name" 2>/dev/null
done
```
### 3e. Build and start
```bash
cd $PLATFORM_DIR && docker compose build --no-cache 2>&1 | tail -20
if [ ${PIPESTATUS[0]} -ne 0 ]; then echo "ERROR: Docker build failed"; exit 1; fi
cd $PLATFORM_DIR && docker compose up -d 2>&1 | tail -20
if [ ${PIPESTATUS[0]} -ne 0 ]; then echo "ERROR: Docker compose up failed"; exit 1; fi
```
**Note:** The build above passes `--no-cache` to force a full rebuild, because Docker BuildKit may sometimes reuse cached `COPY` layers from a previous build on a different branch, which leaves the container running old code (e.g. missing PR changes).
**Expected time: 3-8 minutes** for build, 5-10 minutes with `--no-cache`.
### 3f. Wait for services to be ready
```bash
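# Poll until backend and frontend respond (up to 60 attempts, 5s apart)
READY=0
for i in $(seq 1 60); do
BACKEND=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8006/docs 2>/dev/null)
FRONTEND=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:3000 2>/dev/null)
if [ "$BACKEND" = "200" ] && [ "$FRONTEND" = "200" ]; then
echo "Services ready"
READY=1
break
fi
sleep 5
done
# Fail instead of continuing silently if the services never came up
[ "$READY" = "1" ] || { echo "ERROR: services not ready after 60 attempts"; exit 1; }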
```
### 3h. Create test user and get auth token
```bash
ANON_KEY=$(grep "NEXT_PUBLIC_SUPABASE_ANON_KEY=" $FRONTEND_DIR/.env | sed 's/.*NEXT_PUBLIC_SUPABASE_ANON_KEY=//' | tr -d '[:space:]')
# Signup (idempotent — returns "User already registered" if exists)
RESULT=$(curl -s -X POST 'http://localhost:8000/auth/v1/signup' \
-H "apikey: $ANON_KEY" \
-H 'Content-Type: application/json' \
-d '{"email":"test@test.com","password":"testtest123"}')
# If "Database error finding user", restart supabase-auth and retry
if echo "$RESULT" | grep -q "Database error"; then
docker restart supabase-auth && sleep 5
curl -s -X POST 'http://localhost:8000/auth/v1/signup' \
-H "apikey: $ANON_KEY" \
-H 'Content-Type: application/json' \
-d '{"email":"test@test.com","password":"testtest123"}'
fi
# Get auth token
TOKEN=$(curl -s -X POST 'http://localhost:8000/auth/v1/token?grant_type=password' \
-H "apikey: $ANON_KEY" \
-H 'Content-Type: application/json' \
-d '{"email":"test@test.com","password":"testtest123"}' | jq -r '.access_token // ""')
```
**Use this token for ALL API calls:**
```bash
curl -H "Authorization: Bearer $TOKEN" http://localhost:8006/api/...
```
## Step 4: Run tests
### Service ports reference
| Service | Port | URL |
|---------|------|-----|
| Frontend | 3000 | http://localhost:3000 |
| Backend REST | 8006 | http://localhost:8006 |
| Supabase Auth (via Kong) | 8000 | http://localhost:8000 |
| Executor | 8002 | http://localhost:8002 |
| Copilot Executor | 8008 | http://localhost:8008 |
| WebSocket | 8001 | http://localhost:8001 |
| Database Manager | 8005 | http://localhost:8005 |
| Redis | 6379 | localhost:6379 |
| RabbitMQ | 5672 | localhost:5672 |
### API testing
Use `curl` with the auth token for backend API tests. **For EVERY API call that changes state, record before/after values:**
```bash
# Example: List agents
curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8006/api/graphs | jq . | head -20
# Example: Create an agent
curl -s -X POST http://localhost:8006/api/graphs \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{...}' | jq .
# Example: Run an agent
curl -s -X POST "http://localhost:8006/api/graphs/{graph_id}/execute" \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"data": {...}}'
# Example: Get execution results
curl -s -H "Authorization: Bearer $TOKEN" \
"http://localhost:8006/api/graphs/{graph_id}/executions/{exec_id}" | jq .
```
**State verification pattern (use for EVERY state-changing API call):**
```bash
# 1. Record BEFORE state
BEFORE_STATE=$(curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8006/api/{resource} | jq '{relevant_fields}')
echo "BEFORE: $BEFORE_STATE"
# 2. Perform the action
ACTION_RESULT=$(curl -s -X POST ... | jq .)
echo "ACTION RESULT: $ACTION_RESULT"
# 3. Record AFTER state
AFTER_STATE=$(curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8006/api/{resource} | jq '{relevant_fields}')
echo "AFTER: $AFTER_STATE"
# 4. Log the comparison
echo "=== STATE CHANGE VERIFICATION ==="
echo "Before: $BEFORE_STATE"
echo "After: $AFTER_STATE"
echo "Expected change: {describe what should have changed}"
```
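To compare expected and actual values explicitly rather than by eye, the pattern above can end in a check that fails on a mismatch. This is a minimal sketch: `assert_delta` is a hypothetical helper, and it assumes the values being compared are plain integers (such as a credit balance).

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify that (before - after) equals the expected change.
# Assumes integer inputs; prints PASS/FAIL and returns non-zero on mismatch.
assert_delta() {
  local label="$1" before="$2" after="$3" expected="$4"
  local actual=$(( before - after ))
  if [ "$actual" -eq "$expected" ]; then
    echo "PASS: $label (before=$before after=$after delta=$actual)"
  else
    echo "FAIL: $label (before=$before after=$after delta=$actual expected=$expected)"
    return 1
  fi
}

# Usage:
#   assert_delta credits "$BEFORE" "$AFTER" 5    # a charge of 5 credits
#   assert_delta credits "$BEFORE" "$AFTER" 0    # negative test: state unchanged
```

An expected delta of `0` covers the negative-test requirement that a rejected operation leaves the state untouched.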
### Browser testing with agent-browser
```bash
# Close any existing session
agent-browser close 2>/dev/null || true
# Use --session-name to persist cookies across navigations
# This means login only needs to happen once per test session
agent-browser --session-name pr-test open 'http://localhost:3000/login' --timeout 15000
# Get interactive elements
agent-browser --session-name pr-test snapshot | grep "textbox\|button"
# Login
agent-browser --session-name pr-test fill {email_ref} "test@test.com"
agent-browser --session-name pr-test fill {password_ref} "testtest123"
agent-browser --session-name pr-test click {login_button_ref}
sleep 5
# Dismiss cookie banner if present
agent-browser --session-name pr-test click 'text=Accept All' 2>/dev/null || true
# Navigate — cookies are preserved so login persists
agent-browser --session-name pr-test open 'http://localhost:3000/copilot' --timeout 10000
# Take screenshot
agent-browser --session-name pr-test screenshot $RESULTS_DIR/01-page.png
# Interact with elements
agent-browser --session-name pr-test fill {ref} "text"
agent-browser --session-name pr-test press "Enter"
agent-browser --session-name pr-test click {ref}
agent-browser --session-name pr-test click 'text=Button Text'
# Read page content
agent-browser --session-name pr-test snapshot | grep "text:"
```
**Key pages:**
- `/copilot` — CoPilot chat (for testing copilot features)
- `/build` — Agent builder (for testing block/node features)
- `/build?flowID={id}` — Specific agent in builder
- `/library` — Agent library (for testing listing/import features)
- `/library/agents/{id}` — Agent detail with run history
- `/marketplace` — Marketplace
### Checking logs
```bash
# Backend REST server
docker logs autogpt_platform-rest_server-1 2>&1 | tail -30
# Executor (runs agent graphs)
docker logs autogpt_platform-executor-1 2>&1 | tail -30
# Copilot executor (runs copilot chat sessions)
docker logs autogpt_platform-copilot_executor-1 2>&1 | tail -30
# Frontend
docker logs autogpt_platform-frontend-1 2>&1 | tail -30
# Filter for errors
docker logs autogpt_platform-executor-1 2>&1 | grep -i "error\|exception\|traceback" | tail -20
```
### Copilot chat testing
The copilot uses SSE streaming. To test via API:
```bash
# Create a session
SESSION_ID=$(curl -s -X POST 'http://localhost:8006/api/chat/sessions' \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{}' | jq -r '.id // .session_id // ""')
# Stream a message (SSE - will stream chunks)
curl -N -X POST "http://localhost:8006/api/chat/sessions/$SESSION_ID/stream" \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"message": "Hello, what can you help me with?"}' \
--max-time 60 2>/dev/null | head -50
```
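When only the streamed payloads matter, the raw output can be filtered. The sketch below assumes the endpoint emits standard `text/event-stream` lines prefixed with `data:`. The source does not show the event format, so the payload is passed through unparsed. `sse_data` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep only SSE "data:" lines and strip the field name
# plus one optional space. Other fields (event:, id:, comments) are dropped.
sse_data() {
  sed -n 's/^data: \{0,1\}//p'
}

# Usage (same request as above, piped through the filter):
#   curl -N -X POST "http://localhost:8006/api/chat/sessions/$SESSION_ID/stream" \
#     -H "Authorization: Bearer $TOKEN" \
#     -H 'Content-Type: application/json' \
#     -d '{"message": "Hello, what can you help me with?"}' \
#     --max-time 60 2>/dev/null | sse_data | head -50
```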
Or test via browser (preferred for UI verification):
```bash
agent-browser --session-name pr-test open 'http://localhost:3000/copilot' --timeout 10000
# ... fill chat input and press Enter, wait 20-30s for response
```
## Step 5: Record results and take screenshots
**Take a screenshot at EVERY significant test step** — before and after interactions, on success, and on failure. This is NON-NEGOTIABLE.
**Required screenshot pattern for each test scenario:**
```bash
# BEFORE the action
agent-browser --session-name pr-test screenshot $RESULTS_DIR/{NN}-{scenario}-before.png
# Perform the action...
# AFTER the action
agent-browser --session-name pr-test screenshot $RESULTS_DIR/{NN}-{scenario}-after.png
```
**Naming convention:**
```bash
# Examples:
# $RESULTS_DIR/01-login-page-before.png
# $RESULTS_DIR/02-login-page-after.png
# $RESULTS_DIR/03-credits-page-before.png
# $RESULTS_DIR/04-credits-purchase-after.png
# $RESULTS_DIR/05-negative-insufficient-credits.png
# $RESULTS_DIR/06-error-state.png
```
**Minimum requirements:**
- At least TWO screenshots per test scenario (before + after)
- At least ONE screenshot for each negative test case showing the error state
- If a test fails, screenshot the failure state AND any error logs visible in the UI
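The before/after requirement above can be checked mechanically before the report is built. The following is a minimal sketch, not part of the skill itself: the function name `missing_after_screenshots` is hypothetical, and it assumes files follow the `{NN}-{scenario}-before.png` / `{NN}-{scenario}-after.png` pattern, where the numeric prefix may differ between the two files.

```bash
# Hypothetical helper: print every scenario that has a "-before.png"
# screenshot but no matching "-after.png" in the given directory.
missing_after_screenshots() {
  local dir="$1" before scenario found after
  for before in "$dir"/*-before.png; do
    [ -e "$before" ] || continue            # glob matched nothing
    scenario=$(basename "$before" -before.png)
    scenario=${scenario#*-}                 # drop the leading NN- prefix
    found=false
    for after in "$dir"/*-"$scenario"-after.png; do
      [ -e "$after" ] && found=true && break
    done
    [ "$found" = true ] || echo "$scenario"
  done
}
```

Running `missing_after_screenshots "$RESULTS_DIR"` prints one scenario name per line; empty output means every "before" screenshot has a counterpart.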
## Step 6: Show results to user with screenshots
**CRITICAL: After all tests complete, you MUST show every screenshot to the user using the Read tool, with an explanation of what each screenshot shows.** This is the most important part of the test report — the user needs to visually verify the results.
For each screenshot:
1. Use the `Read` tool to display the PNG file (Claude can read images)
2. Write a 1-2 sentence explanation below it describing:
- What page/state is being shown
- What the screenshot proves (which test scenario it validates)
- Any notable details visible in the UI
Format the output like this:
```markdown
### Screenshot 1: {descriptive title}
[Read the PNG file here]
**What it shows:** {1-2 sentence explanation of what this screenshot proves}
---
```
After showing all screenshots, output a **detailed** summary table:
| # | Scenario | Result | API Evidence | Screenshot Evidence |
|---|----------|--------|-------------|-------------------|
| 1 | {name} | PASS/FAIL | Before: X, After: Y | 01-before.png, 02-after.png |
| 2 | ... | ... | ... | ... |
**IMPORTANT:** As you show each screenshot and record test results, persist them in shell variables for Step 7:
```bash
# Build these variables during Step 6 — they are required by Step 7's script
# NOTE: declare -A requires Bash 4.0+. Linux typically ships bash 5.x, but macOS's
# /bin/bash is 3.2 (its default shell is zsh), so use Homebrew bash 5.x there.
# If running on Bash <4, use a plain variable with a lookup function instead.
declare -A SCREENSHOT_EXPLANATIONS=(
["01-login-page.png"]="Shows the login page loaded successfully with SSO options visible."
["02-builder-with-block.png"]="The builder canvas displays the newly added block connected to the trigger."
# ... one entry per screenshot, using the same explanations you showed the user above
)
TEST_RESULTS_TABLE="| 1 | Login flow | PASS | N/A | 01-login-before.png, 02-login-after.png |
| 2 | Credits purchase | PASS | Before: 100, After: 95 | 03-credits-before.png, 04-credits-after.png |
| 3 | Insufficient credits (negative) | PASS | Credits: 0, rejected | 05-insufficient-credits-error.png |"
# ... one row per test scenario with actual results
```
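For shells older than Bash 4, the lookup-function fallback mentioned in the note could look roughly like this. It is a sketch under assumptions: the function name `screenshot_explanation` is hypothetical, and the two entries simply reuse the example explanations above.

```bash
# Hypothetical Bash 3.2-compatible stand-in for the associative array:
# prints the explanation for a screenshot file name, or returns 1 if
# no explanation has been recorded for it.
screenshot_explanation() {
  case "$1" in
    01-login-page.png)
      echo "Shows the login page loaded successfully with SSO options visible." ;;
    02-builder-with-block.png)
      echo "The builder canvas displays the newly added block connected to the trigger." ;;
    *)
      return 1 ;;
  esac
}
```

With this variant, Step 7 would read the value with `EXPLANATION=$(screenshot_explanation "$BASENAME") || EXPLANATION=""` instead of indexing the array.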
## Step 7: Post test report as PR comment with screenshots
Upload screenshots to the PR using the GitHub Git API (no local git operations — safe for worktrees), then post a comment with inline images and per-screenshot explanations.
**This step is MANDATORY. Every test run MUST post a PR comment with screenshots. No exceptions.**
```bash
# Upload screenshots via GitHub Git API (creates blobs, tree, commit, and ref remotely)
REPO="Significant-Gravitas/AutoGPT"
SCREENSHOTS_BRANCH="test-screenshots/pr-${PR_NUMBER}"
SCREENSHOTS_DIR="test-screenshots/PR-${PR_NUMBER}"
# Step 1: Create blobs for each screenshot and build tree JSON
# Retry each blob upload up to 3 times. If still failing, list them at end of report.
shopt -s nullglob
SCREENSHOT_FILES=("$RESULTS_DIR"/*.png)
if [ ${#SCREENSHOT_FILES[@]} -eq 0 ]; then
echo "ERROR: No screenshots found in $RESULTS_DIR. Test run is incomplete."
exit 1
fi
TREE_JSON='['
FIRST=true
FAILED_UPLOADS=()
for img in "${SCREENSHOT_FILES[@]}"; do
BASENAME=$(basename "$img")
B64=$(base64 < "$img")
BLOB_SHA=""
for attempt in 1 2 3; do
BLOB_SHA=$(gh api "repos/${REPO}/git/blobs" -f content="$B64" -f encoding="base64" --jq '.sha' 2>/dev/null || true)
[ -n "$BLOB_SHA" ] && break
sleep 1
done
if [ -z "$BLOB_SHA" ]; then
FAILED_UPLOADS+=("$img")
continue
fi
if [ "$FIRST" = true ]; then FIRST=false; else TREE_JSON+=','; fi
TREE_JSON+="{\"path\":\"${SCREENSHOTS_DIR}/${BASENAME}\",\"mode\":\"100644\",\"type\":\"blob\",\"sha\":\"${BLOB_SHA}\"}"
done
TREE_JSON+=']'
# Step 2: Create tree, commit, and branch ref
TREE_SHA=$(echo "$TREE_JSON" | jq -c '{tree: .}' | gh api "repos/${REPO}/git/trees" --input - --jq '.sha')
COMMIT_SHA=$(gh api "repos/${REPO}/git/commits" \
-f message="test: add E2E test screenshots for PR #${PR_NUMBER}" \
-f tree="$TREE_SHA" \
--jq '.sha')
gh api "repos/${REPO}/git/refs" \
-f ref="refs/heads/${SCREENSHOTS_BRANCH}" \
-f sha="$COMMIT_SHA" 2>/dev/null \
|| gh api "repos/${REPO}/git/refs/heads/${SCREENSHOTS_BRANCH}" \
-X PATCH -f sha="$COMMIT_SHA" -F force=true
```
Then post the comment with **inline images AND explanations for each screenshot**:
```bash
REPO_URL="https://raw.githubusercontent.com/${REPO}/${SCREENSHOTS_BRANCH}"
# Build image markdown using uploaded image URLs; skip FAILED_UPLOADS (listed separately)
IMAGE_MARKDOWN=""
for img in "${SCREENSHOT_FILES[@]}"; do
BASENAME=$(basename "$img")
TITLE=$(echo "${BASENAME%.png}" | sed 's/^[0-9]*-//' | sed 's/-/ /g' | awk '{for(i=1;i<=NF;i++) $i=toupper(substr($i,1,1)) tolower(substr($i,2))}1')
# Skip images that failed to upload — they will be listed at the end
IS_FAILED=false
for failed in "${FAILED_UPLOADS[@]}"; do
[ "$(basename "$failed")" = "$BASENAME" ] && IS_FAILED=true && break
done
if [ "$IS_FAILED" = true ]; then
continue
fi
EXPLANATION="${SCREENSHOT_EXPLANATIONS[$BASENAME]}"
if [ -z "$EXPLANATION" ]; then
echo "ERROR: Missing screenshot explanation for $BASENAME. Add it to SCREENSHOT_EXPLANATIONS in Step 6."
exit 1
fi
IMAGE_MARKDOWN="${IMAGE_MARKDOWN}
### ${TITLE}
![${BASENAME}](${REPO_URL}/${SCREENSHOTS_DIR}/${BASENAME})
${EXPLANATION}
"
done
# Write comment body to file to avoid shell interpretation issues with special characters
COMMENT_FILE=$(mktemp)
# If any uploads failed, append a section listing them with instructions
FAILED_SECTION=""
if [ ${#FAILED_UPLOADS[@]} -gt 0 ]; then
FAILED_SECTION="
## ⚠️ Failed Screenshot Uploads
The following screenshots could not be uploaded via the GitHub API after 3 retries.
**To add them:** drag-and-drop or paste these files into a PR comment manually:
"
for failed in "${FAILED_UPLOADS[@]}"; do
FAILED_SECTION="${FAILED_SECTION}
- \`$(basename "$failed")\` (local path: \`$failed\`)"
done
FAILED_SECTION="${FAILED_SECTION}
**Run status:** INCOMPLETE until the files above are manually attached and visible inline in the PR."
fi
cat > "$COMMENT_FILE" <<INNEREOF
## E2E Test Report
| # | Scenario | Result | API Evidence | Screenshot Evidence |
|---|----------|--------|-------------|-------------------|
${TEST_RESULTS_TABLE}
${IMAGE_MARKDOWN}
${FAILED_SECTION}
INNEREOF
gh api "repos/${REPO}/issues/$PR_NUMBER/comments" -F body=@"$COMMENT_FILE"
rm -f "$COMMENT_FILE"
```
**The PR comment MUST include:**
1. A summary table of all scenarios with PASS/FAIL and before/after API evidence
2. Every successfully uploaded screenshot rendered inline; any failed uploads listed with manual attachment instructions
3. A 1-2 sentence explanation below each screenshot describing what it proves
This approach uses the GitHub Git API to create blobs, trees, commits, and refs entirely server-side. No local `git checkout` or `git push` — safe for worktrees and won't interfere with the PR branch.
## Fix mode (--fix flag)
When `--fix` is present, the standard is HIGHER. Do not just note issues — FIX them immediately.
### Fix protocol for EVERY issue found (including UX issues):
1. **Identify** the root cause in the code — read the relevant source files
2. **Write a failing test first** (TDD): For backend bugs, write a test marked with `pytest.mark.xfail(reason="...")`. For frontend/Playwright bugs, write a test with `.fixme` annotation. Run it to confirm it fails as expected.
3. **Screenshot** the broken state: `agent-browser --session-name pr-test screenshot $RESULTS_DIR/{NN}-broken-{description}.png`
4. **Fix** the code in the worktree
5. **Rebuild** ONLY the affected service (not the whole stack):
```bash
cd $PLATFORM_DIR && docker compose up --build -d {service_name}
# e.g., docker compose up --build -d rest_server
# e.g., docker compose up --build -d frontend
```
6. **Wait** for the service to be ready (poll health endpoint)
7. **Re-test** the same scenario
8. **Screenshot** the fixed state: `agent-browser --session-name pr-test screenshot $RESULTS_DIR/{NN}-fixed-{description}.png`
9. **Remove the xfail/fixme marker** from the test written in step 2, and verify it passes
10. **Verify** the fix did not break other scenarios (run a quick smoke test)
11. **Commit and push** immediately:
```bash
cd $WORKTREE_PATH
git add -A
git commit -m "fix: {description of fix}"
git push
```
12. **Continue** to the next test scenario
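The "wait for the service to be ready" step in the protocol above can be sketched as a small retry loop. This is an illustration only: `wait_until` is a hypothetical helper, and the health URL in the commented example is an assumption, since the actual health endpoint is not named here.

```bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
wait_until() {
  local max_attempts="$1" delay="$2" attempt
  shift 2
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if "$@" >/dev/null 2>&1; then
      return 0                # command succeeded, service is ready
    fi
    sleep "$delay"
  done
  return 1                    # gave up after max_attempts tries
}

# Example (assumed URL): poll for up to about 60 seconds after a rebuild.
# wait_until 30 2 curl -sf "http://localhost:8006/health"
```

A non-zero return is a signal to check the container logs before re-testing rather than to proceed.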
### Fix loop (like pr-address)
```text
test scenario → find issue (bug OR UX problem) → screenshot broken state
→ fix code → rebuild affected service only → re-test → screenshot fixed state
→ verify no regressions → commit + push
→ repeat for next scenario
→ after ALL scenarios pass, run full re-test to verify everything together
```
**Key differences from non-fix mode:**
- UX issues count as bugs — fix them (bad alignment, confusing labels, missing loading states)
- Every fix MUST have a before/after screenshot pair proving it works
- Commit after EACH fix, not in a batch at the end
- The final re-test must produce a clean set of all-passing screenshots
## Known issues and workarounds
### Problem: "Database error finding user" on signup
**Cause:** Supabase auth service schema cache is stale after migration.
**Fix:** `docker restart supabase-auth && sleep 5` then retry signup.
### Problem: Copilot returns auth errors in subscription mode
**Cause:** `CHAT_USE_CLAUDE_CODE_SUBSCRIPTION=true` but `CLAUDE_CODE_OAUTH_TOKEN` is not set or expired.
**Fix:** Re-extract the OAuth token from macOS keychain (see step 3b, Option 1) and recreate the container (`docker compose up -d copilot_executor`). The backend auto-provisions `~/.claude/.credentials.json` from the env var on startup. No `npm install` or `claude login` needed — the SDK bundles its own CLI binary.
### Problem: agent-browser can't find chromium
**Cause:** The Dockerfile auto-provisions system chromium on all architectures (including ARM64). If your branch is behind `dev`, this may not be present yet.
**Fix:** Check if chromium exists: `which chromium || which chromium-browser`. If missing, install it: `apt-get install -y chromium` and set `AGENT_BROWSER_EXECUTABLE_PATH=/usr/bin/chromium` in the container environment.
### Problem: agent-browser selector matches multiple elements
**Cause:** `text=X` matches all elements containing that text.
**Fix:** Use `agent-browser --session-name pr-test snapshot` to get specific `ref=eNN` references, then use those: `agent-browser --session-name pr-test click eNN`.
### Problem: Frontend shows cookie banner blocking interaction
**Fix:** `agent-browser --session-name pr-test click 'text=Accept All'` before other interactions.
### Problem: Container loses npm packages after rebuild
**Cause:** `docker compose up --build` rebuilds the image, losing runtime installs.
**Fix:** Add packages to the Dockerfile instead of installing at runtime.
### Problem: Services not starting after `docker compose up`
**Fix:** Wait and check health: `docker compose ps`. Common cause: migration hasn't finished. Check: `docker logs autogpt_platform-migrate-1 2>&1 | tail -5`. If supabase-db isn't healthy: `docker restart supabase-db && sleep 10`.
### Problem: Docker uses cached layers with old code (PR changes not visible)
**Cause:** `docker compose up --build` reuses cached `COPY` layers from previous builds. If the PR branch changes Python files but the previous build already cached that layer from `dev`, the container runs `dev` code.
**Fix:** Always use `docker compose build --no-cache` for the first build of a PR branch. Subsequent rebuilds within the same branch can use `--build`.
### Problem: `agent-browser open` loses login session
**Cause:** Without session persistence, `agent-browser open` starts fresh.
**Fix:** Use `--session-name pr-test` on ALL agent-browser commands. This auto-saves/restores cookies and localStorage across navigations. Alternatively, use `agent-browser eval "window.location.href = '...'"` to navigate within the same context.
### Problem: Supabase auth returns "Database error querying schema"
**Cause:** The database schema changed (migration ran) but supabase-auth has a stale schema cache.
**Fix:** `docker restart supabase-db && sleep 10 && docker restart supabase-auth && sleep 8`. If user data was lost, re-signup.


@@ -0,0 +1,195 @@
---
name: setup-repo
description: Initialize a worktree-based repo layout for parallel development. Creates a main worktree, a reviews worktree for PR reviews, and N numbered work branches. Handles .env creation, dependency installation, and branchlet config. TRIGGER when user asks to set up the repo from scratch, initialize worktrees, bootstrap their dev environment, "setup repo", "setup worktrees", "initialize dev environment", "set up branches", or when a freshly cloned repo has no sibling worktrees.
user-invocable: true
args: "No arguments — interactive setup via prompts."
metadata:
author: autogpt-team
version: "1.0.0"
---
# Repository Setup
This skill sets up a worktree-based development layout from a freshly cloned repo. It creates:
- A **main** worktree (the primary checkout)
- A **reviews** worktree (for PR reviews)
- **N work branches** (branch1..branchN) for parallel development
## Step 1: Identify the repo
Determine the repo root and parent directory:
```bash
ROOT=$(git rev-parse --show-toplevel)
REPO_NAME=$(basename "$ROOT")
PARENT=$(dirname "$ROOT")
```
Detect if the repo is already inside a worktree layout by counting sibling worktrees (not just checking the directory name, which could be anything):
```bash
# Count worktrees under $PARENT; the count includes $ROOT itself, so more than 1 means siblings exist
SIBLING_COUNT=$(git worktree list --porcelain 2>/dev/null | grep "^worktree " | grep -c "$PARENT/" || true)
if [ "$SIBLING_COUNT" -gt 1 ]; then
echo "INFO: Existing worktree layout detected at $PARENT ($SIBLING_COUNT worktrees)"
# Use $ROOT as-is; skip renaming/restructuring
else
echo "INFO: Fresh clone detected, proceeding with setup"
fi
```
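A stricter variant of the sibling check parses the porcelain output line by line and counts only direct children of the parent directory. This is a sketch with a hypothetical function name, `count_worktrees_under`; it reads the output of `git worktree list --porcelain` on stdin so it can be exercised without a repository.

```bash
# Hypothetical helper: read `git worktree list --porcelain` on stdin and
# print how many worktree paths sit directly under the given parent directory.
count_worktrees_under() {
  local parent="${1%/}" line path count=0
  while IFS= read -r line; do
    case "$line" in
      "worktree "*)
        path=${line#worktree }
        # Direct child only: the directory part of the path must equal $parent.
        [ "${path%/*}" = "$parent" ] && count=$((count + 1))
        ;;
    esac
  done
  echo "$count"
}
```

Usage would be `git worktree list --porcelain | count_worktrees_under "$PARENT"`. As with the grep version, the result includes the main checkout, so a value above 1 indicates an existing layout.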
## Step 2: Ask the user questions
Use AskUserQuestion to gather setup preferences:
1. **How many parallel work branches do you need?** (Options: 4, 8, 16, or custom)
- These become `branch1` through `branchN`
2. **Which branch should be the base?** (Options: origin/master, origin/dev, or custom)
- All work branches and reviews will start from this
## Step 3: Fetch and set up branches
```bash
cd "$ROOT"
git fetch origin
# Set these from the Step 2 answers (example values shown)
BASE_BRANCH="origin/dev"
COUNT=4
# Create the reviews branch from base (skip if already exists)
if git show-ref --verify --quiet refs/heads/reviews; then
  echo "INFO: Branch 'reviews' already exists, skipping"
else
  git branch reviews "$BASE_BRANCH"
fi
# Create numbered work branches from base (skip if already exists)
for i in $(seq 1 "$COUNT"); do
  if git show-ref --verify --quiet "refs/heads/branch$i"; then
    echo "INFO: Branch 'branch$i' already exists, skipping"
  else
    git branch "branch$i" "$BASE_BRANCH"
  fi
done
```
## Step 4: Create worktrees
Create worktrees as siblings to the main checkout:
```bash
if [ -d "$PARENT/reviews" ]; then
echo "INFO: Worktree '$PARENT/reviews' already exists, skipping"
else
git worktree add "$PARENT/reviews" reviews
fi
for i in $(seq 1 "$COUNT"); do
if [ -d "$PARENT/branch$i" ]; then
echo "INFO: Worktree '$PARENT/branch$i' already exists, skipping"
else
git worktree add "$PARENT/branch$i" "branch$i"
fi
done
```
## Step 5: Set up environment files
**Do NOT assume .env files exist.** For each worktree (including main if needed):
1. Check if `.env` exists in the source worktree for each path
2. If `.env` exists, copy it
3. If only `.env.default` or `.env.example` exists, copy that as `.env`
4. If neither exists, warn the user and list which env files are missing
Env file locations to check (same as the `/worktree` skill — keep these in sync):
- `autogpt_platform/.env`
- `autogpt_platform/backend/.env`
- `autogpt_platform/frontend/.env`
> **Note:** This env copying logic intentionally mirrors the `/worktree` skill's approach. If you update the path list or fallback logic here, update `/worktree` as well.
```bash
SOURCE="$ROOT"
WORKTREES="reviews"
for i in $(seq 1 "$COUNT"); do WORKTREES="$WORKTREES branch$i"; done
FOUND_ANY_ENV=0
for wt in $WORKTREES; do
TARGET="$PARENT/$wt"
for envpath in autogpt_platform autogpt_platform/backend autogpt_platform/frontend; do
if [ -f "$SOURCE/$envpath/.env" ]; then
FOUND_ANY_ENV=1
cp "$SOURCE/$envpath/.env" "$TARGET/$envpath/.env"
elif [ -f "$SOURCE/$envpath/.env.default" ]; then
FOUND_ANY_ENV=1
cp "$SOURCE/$envpath/.env.default" "$TARGET/$envpath/.env"
echo "NOTE: $wt/$envpath/.env was created from .env.default — you may need to edit it"
elif [ -f "$SOURCE/$envpath/.env.example" ]; then
FOUND_ANY_ENV=1
cp "$SOURCE/$envpath/.env.example" "$TARGET/$envpath/.env"
echo "NOTE: $wt/$envpath/.env was created from .env.example — you may need to edit it"
else
echo "WARNING: No .env, .env.default, or .env.example found at $SOURCE/$envpath/"
fi
done
done
if [ "$FOUND_ANY_ENV" -eq 0 ]; then
echo "WARNING: No environment files or templates were found in the source worktree."
# Use AskUserQuestion to confirm: "Continue setup without env files?"
# If the user declines, stop here and let them set up .env files first.
fi
```
## Step 6: Copy branchlet config
Copy `.branchlet.json` from main to each worktree so branchlet can manage sub-worktrees:
```bash
if [ -f "$ROOT/.branchlet.json" ]; then
for wt in $WORKTREES; do
cp "$ROOT/.branchlet.json" "$PARENT/$wt/.branchlet.json"
done
fi
```
## Step 7: Install dependencies
Install deps in all worktrees. Run these sequentially per worktree:
```bash
for wt in $WORKTREES; do
TARGET="$PARENT/$wt"
echo "=== Installing deps for $wt ==="
(cd "$TARGET/autogpt_platform/autogpt_libs" && poetry install) &&
(cd "$TARGET/autogpt_platform/backend" && poetry install && poetry run prisma generate) &&
(cd "$TARGET/autogpt_platform/frontend" && pnpm install) &&
echo "=== Done: $wt ===" ||
echo "=== FAILED: $wt ==="
done
```
This is slow. Run in background if possible and notify when complete.
## Step 8: Verify and report
After setup, verify and report to the user:
```bash
git worktree list
```
Summarize:
- Number of worktrees created
- Which env files were copied vs created from defaults vs missing
- Any warnings or errors encountered
## Final directory layout
```text
parent/
main/ # Primary checkout (already exists)
reviews/ # PR review worktree
branch1/ # Work branch 1
branch2/ # Work branch 2
...
branchN/ # Work branch N
```


@@ -1,8 +1,12 @@
<!-- Clearly explain the need for these changes: -->
### Why / What / How
<!-- Why: Why does this PR exist? What problem does it solve, or what's broken/missing without it? -->
<!-- What: What does this PR change? Summarize the changes at a high level. -->
<!-- How: How does it work? Describe the approach, key implementation details, or architecture decisions. -->
### Changes 🏗️
<!-- Concisely describe all of the changes made in this pull request: -->
<!-- List the key changes. Keep it higher level than the diff but specific enough to highlight what's new/modified. -->
### Checklist 📋


@@ -1,6 +1,6 @@
# AutoGPT Platform Contribution Guide
This guide provides context for Codex when updating the **autogpt_platform** folder.
This guide provides context for coding agents when updating the **autogpt_platform** folder.
## Directory overview

1
CLAUDE.md Normal file

@@ -0,0 +1 @@
@AGENTS.md


@@ -83,13 +83,13 @@ The AutoGPT frontend is where users interact with our powerful AI automation pla
**Agent Builder:** For those who want to customize, our intuitive, low-code interface allows you to design and configure your own AI agents.
**Workflow Management:** Build, modify, and optimize your automation workflows with ease. You build your agent by connecting blocks, where each block performs a single action.
**Workflow Management:** Build, modify, and optimize your automation workflows with ease. You build your agent by connecting blocks, where each block performs a single action.
**Deployment Controls:** Manage the lifecycle of your agents, from testing to production.
**Ready-to-Use Agents:** Don't want to build? Simply select from our library of pre-configured agents and put them to work immediately.
**Agent Interaction:** Whether you've built your own or are using pre-configured agents, easily run and interact with them through our user-friendly interface.
**Agent Interaction:** Whether you've built your own or are using pre-configured agents, easily run and interact with them through our user-friendly interface.
**Monitoring and Analytics:** Keep track of your agents' performance and gain insights to continually improve your automation processes.

120
autogpt_platform/AGENTS.md Normal file

@@ -0,0 +1,120 @@
# AutoGPT Platform
This file provides guidance to coding agents when working with code in this repository.
## Repository Overview
AutoGPT Platform is a monorepo containing:
- **Backend** (`backend`): Python FastAPI server with async support
- **Frontend** (`frontend`): Next.js React application
- **Shared Libraries** (`autogpt_libs`): Common Python utilities
## Component Documentation
- **Backend**: See @backend/AGENTS.md for backend-specific commands, architecture, and development tasks
- **Frontend**: See @frontend/AGENTS.md for frontend-specific commands, architecture, and development patterns
## Key Concepts
1. **Agent Graphs**: Workflow definitions stored as JSON, executed by the backend
2. **Blocks**: Reusable components in `backend/backend/blocks/` that perform specific tasks
3. **Integrations**: OAuth and API connections stored per user
4. **Store**: Marketplace for sharing agent templates
5. **Virus Scanning**: ClamAV integration for file upload security
### Environment Configuration
#### Configuration Files
- **Backend**: `backend/.env.default` (defaults) → `backend/.env` (user overrides)
- **Frontend**: `frontend/.env.default` (defaults) → `frontend/.env` (user overrides)
- **Platform**: `.env.default` (Supabase/shared defaults) → `.env` (user overrides)
#### Docker Environment Loading Order
1. `.env.default` files provide base configuration (tracked in git)
2. `.env` files provide user-specific overrides (gitignored)
3. Docker Compose `environment:` sections provide service-specific overrides
4. Shell environment variables have highest precedence
#### Key Points
- All services use hardcoded defaults in docker-compose files (no `${VARIABLE}` substitutions)
- The `env_file` directive loads variables INTO containers at runtime
- Backend/Frontend services use YAML anchors for consistent configuration
- Supabase services (`db/docker/docker-compose.yml`) follow the same pattern
### Branching Strategy
- **`dev`** is the main development branch. All PRs should target `dev`.
- **`master`** is the production branch. Only used for production releases.
### Creating Pull Requests
- Create the PR against the `dev` branch of the repository.
- **Split PRs by concern** — each PR should have a single clear purpose. For example, "usage tracking" and "credit charging" should be separate PRs even if related. Combining multiple concerns makes it harder for reviewers to understand what belongs to what.
- Ensure the branch name is descriptive (e.g., `feature/add-new-block`)
- Use conventional commit messages (see below)
- **Structure the PR description with Why / What / How** — Why: the motivation (what problem it solves, what's broken/missing without it); What: high-level summary of changes; How: approach, key implementation details, or architecture decisions. Reviewers need all three to judge whether the approach fits the problem.
- Fill out the .github/PULL_REQUEST_TEMPLATE.md template as the PR description
- Always use `--body-file` to pass PR body — avoids shell interpretation of backticks and special characters:
```bash
PR_BODY=$(mktemp)
cat > "$PR_BODY" << 'PREOF'
## Summary
- use `backticks` freely here
PREOF
gh pr create --title "..." --body-file "$PR_BODY" --base dev
rm "$PR_BODY"
```
- Run the github pre-commit hooks to ensure code quality.
### Test-Driven Development (TDD)
When fixing a bug or adding a feature, follow a test-first approach:
1. **Write a failing test first** — create a test that reproduces the bug or validates the new behavior, marked with `@pytest.mark.xfail` (backend) or `.fixme` (Playwright). Run it to confirm it fails for the right reason.
2. **Implement the fix/feature** — write the minimal code to make the test pass.
3. **Remove the xfail marker** — once the test passes, remove the `xfail`/`.fixme` annotation and run the full test suite to confirm nothing else broke.
This ensures every change is covered by a test and that the test actually validates the intended behavior.
### Reviewing/Revising Pull Requests
Use `/pr-review` to review a PR or `/pr-address` to address comments.
When fetching comments manually:
- `gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/reviews --paginate` — top-level reviews
- `gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/comments --paginate` — inline review comments (always paginate to avoid missing comments beyond page 1)
- `gh api repos/Significant-Gravitas/AutoGPT/issues/{N}/comments` — PR conversation comments
### Conventional Commits
Use this format for commit messages and Pull Request titles:
**Conventional Commit Types:**
- `feat`: Introduces a new feature to the codebase
- `fix`: Patches a bug in the codebase
- `refactor`: Code change that neither fixes a bug nor adds a feature; also applies to removing features
- `ci`: Changes to CI configuration
- `docs`: Documentation-only changes
- `dx`: Improvements to the developer experience
**Recommended Base Scopes:**
- `platform`: Changes affecting both frontend and backend
- `frontend`
- `backend`
- `infra`
- `blocks`: Modifications/additions of individual blocks
**Subscope Examples:**
- `backend/executor`
- `backend/db`
- `frontend/builder` (includes changes to the block UI component)
- `infra/prod`
Use these scopes and subscopes for clarity and consistency in commit messages.
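As an illustration of the title shape these rules imply, `type(scope): description`, a title can be checked with a short Bash pattern match. This is a sketch, not an official hook: the function name `is_conventional_title` is hypothetical, the type list mirrors the one above, and treating the scope as optional is an assumption.

```bash
# Hypothetical check: does a title look like "type(scope): description"?
# Types come from the list above; the scope is optional in this sketch.
is_conventional_title() {
  local pattern='^(feat|fix|refactor|ci|docs|dx)(\([a-z0-9/_-]+\))?: .+'
  [[ "$1" =~ $pattern ]]
}
```

For example, `is_conventional_title "feat(platform): add generic ask_question copilot tool"` succeeds, while a title such as `"Added a thing"` is rejected.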


@@ -1,118 +1 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Repository Overview
AutoGPT Platform is a monorepo containing:
- **Backend** (`backend`): Python FastAPI server with async support
- **Frontend** (`frontend`): Next.js React application
- **Shared Libraries** (`autogpt_libs`): Common Python utilities
## Component Documentation
- **Backend**: See @backend/CLAUDE.md for backend-specific commands, architecture, and development tasks
- **Frontend**: See @frontend/CLAUDE.md for frontend-specific commands, architecture, and development patterns
## Key Concepts
1. **Agent Graphs**: Workflow definitions stored as JSON, executed by the backend
2. **Blocks**: Reusable components in `backend/backend/blocks/` that perform specific tasks
3. **Integrations**: OAuth and API connections stored per user
4. **Store**: Marketplace for sharing agent templates
5. **Virus Scanning**: ClamAV integration for file upload security
### Environment Configuration
#### Configuration Files
- **Backend**: `backend/.env.default` (defaults) → `backend/.env` (user overrides)
- **Frontend**: `frontend/.env.default` (defaults) → `frontend/.env` (user overrides)
- **Platform**: `.env.default` (Supabase/shared defaults) → `.env` (user overrides)
#### Docker Environment Loading Order
1. `.env.default` files provide base configuration (tracked in git)
2. `.env` files provide user-specific overrides (gitignored)
3. Docker Compose `environment:` sections provide service-specific overrides
4. Shell environment variables have highest precedence
#### Key Points
- All services use hardcoded defaults in docker-compose files (no `${VARIABLE}` substitutions)
- The `env_file` directive loads variables INTO containers at runtime
- Backend/Frontend services use YAML anchors for consistent configuration
- Supabase services (`db/docker/docker-compose.yml`) follow the same pattern
### Branching Strategy
- **`dev`** is the main development branch. All PRs should target `dev`.
- **`master`** is the production branch. Only used for production releases.
### Creating Pull Requests
- Create the PR against the `dev` branch of the repository.
- Ensure the branch name is descriptive (e.g., `feature/add-new-block`)
- Use conventional commit messages (see below)
- Fill out the .github/PULL_REQUEST_TEMPLATE.md template as the PR description
- Always use `--body-file` to pass PR body — avoids shell interpretation of backticks and special characters:
```bash
PR_BODY=$(mktemp)
cat > "$PR_BODY" << 'PREOF'
## Summary
- use `backticks` freely here
PREOF
gh pr create --title "..." --body-file "$PR_BODY" --base dev
rm "$PR_BODY"
```
- Run the github pre-commit hooks to ensure code quality.
### Test-Driven Development (TDD)
When fixing a bug or adding a feature, follow a test-first approach:
1. **Write a failing test first** — create a test that reproduces the bug or validates the new behavior, marked with `@pytest.mark.xfail` (backend) or `.fixme` (Playwright). Run it to confirm it fails for the right reason.
2. **Implement the fix/feature** — write the minimal code to make the test pass.
3. **Remove the xfail marker** — once the test passes, remove the `xfail`/`.fixme` annotation and run the full test suite to confirm nothing else broke.
This ensures every change is covered by a test and that the test actually validates the intended behavior.
### Reviewing/Revising Pull Requests
Use `/pr-review` to review a PR or `/pr-address` to address comments.
When fetching comments manually:
- `gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/reviews --paginate` — top-level reviews
- `gh api repos/Significant-Gravitas/AutoGPT/pulls/{N}/comments --paginate` — inline review comments (always paginate to avoid missing comments beyond page 1)
- `gh api repos/Significant-Gravitas/AutoGPT/issues/{N}/comments` — PR conversation comments
### Conventional Commits
Use this format for commit messages and Pull Request titles:
**Conventional Commit Types:**
- `feat`: Introduces a new feature to the codebase
- `fix`: Patches a bug in the codebase
- `refactor`: Code change that neither fixes a bug nor adds a feature; also applies to removing features
- `ci`: Changes to CI configuration
- `docs`: Documentation-only changes
- `dx`: Improvements to the developer experience
**Recommended Base Scopes:**
- `platform`: Changes affecting both frontend and backend
- `frontend`
- `backend`
- `infra`
- `blocks`: Modifications/additions of individual blocks
**Subscope Examples:**
- `backend/executor`
- `backend/db`
- `frontend/builder` (includes changes to the block UI component)
- `infra/prod`
Use these scopes and subscopes for clarity and consistency in commit messages.
@AGENTS.md


@@ -1,4 +1,4 @@
# This file is automatically @generated by Poetry 2.1.1 and should not be changed by hand.
# This file is automatically @generated by Poetry 2.2.1 and should not be changed by hand.
[[package]]
name = "annotated-doc"
@@ -67,7 +67,7 @@ description = "Backport of asyncio.Runner, a context manager that controls event
optional = false
python-versions = "<3.11,>=3.8"
groups = ["dev"]
markers = "python_version < \"3.11\""
markers = "python_version == \"3.10\""
files = [
{file = "backports_asyncio_runner-1.2.0-py3-none-any.whl", hash = "sha256:0da0a936a8aeb554eccb426dc55af3ba63bcdc69fa1a600b5bb305413a4477b5"},
{file = "backports_asyncio_runner-1.2.0.tar.gz", hash = "sha256:a5aa7b2b7d8f8bfcaa2b57313f70792df84e32a2a746f585213373f900b42162"},
@@ -541,7 +541,7 @@ description = "Backport of PEP 654 (exception groups)"
optional = false
python-versions = ">=3.7"
groups = ["main", "dev"]
markers = "python_version < \"3.11\""
markers = "python_version == \"3.10\""
files = [
{file = "exceptiongroup-1.3.0-py3-none-any.whl", hash = "sha256:4d111e6e0c13d0644cad6ddaa7ed0261a0b36971f6d23e7ec9b4b9097da78a10"},
{file = "exceptiongroup-1.3.0.tar.gz", hash = "sha256:b241f5885f560bc56a59ee63ca4c6a8bfa46ae4ad651af316d4e81817bb9fd88"},
@@ -2181,14 +2181,14 @@ testing = ["coverage (>=6.2)", "hypothesis (>=5.7.1)"]
[[package]]
name = "pytest-cov"
version = "7.0.0"
version = "7.1.0"
description = "Pytest plugin for measuring coverage."
optional = false
python-versions = ">=3.9"
groups = ["dev"]
files = [
{file = "pytest_cov-7.0.0-py3-none-any.whl", hash = "sha256:3b8e9558b16cc1479da72058bdecf8073661c7f57f7d3c5f22a1c23507f2d861"},
{file = "pytest_cov-7.0.0.tar.gz", hash = "sha256:33c97eda2e049a0c5298e91f519302a1334c26ac65c1a483d6206fd458361af1"},
{file = "pytest_cov-7.1.0-py3-none-any.whl", hash = "sha256:a0461110b7865f9a271aa1b51e516c9a95de9d696734a2f71e3e78f46e1d4678"},
{file = "pytest_cov-7.1.0.tar.gz", hash = "sha256:30674f2b5f6351aa09702a9c8c364f6a01c27aae0c1366ae8016160d1efc56b2"},
]
[package.dependencies]
@@ -2342,30 +2342,30 @@ pyasn1 = ">=0.1.3"
[[package]]
name = "ruff"
version = "0.15.0"
version = "0.15.7"
description = "An extremely fast Python linter and code formatter, written in Rust."
optional = false
python-versions = ">=3.7"
groups = ["dev"]
files = [
{file = "ruff-0.15.0-py3-none-linux_armv6l.whl", hash = "sha256:aac4ebaa612a82b23d45964586f24ae9bc23ca101919f5590bdb368d74ad5455"},
{file = "ruff-0.15.0-py3-none-macosx_10_12_x86_64.whl", hash = "sha256:dcd4be7cc75cfbbca24a98d04d0b9b36a270d0833241f776b788d59f4142b14d"},
{file = "ruff-0.15.0-py3-none-macosx_11_0_arm64.whl", hash = "sha256:d747e3319b2bce179c7c1eaad3d884dc0a199b5f4d5187620530adf9105268ce"},
{file = "ruff-0.15.0-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:650bd9c56ae03102c51a5e4b554d74d825ff3abe4db22b90fd32d816c2e90621"},
{file = "ruff-0.15.0-py3-none-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:a6664b7eac559e3048223a2da77769c2f92b43a6dfd4720cef42654299a599c9"},
{file = "ruff-0.15.0-py3-none-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:6f811f97b0f092b35320d1556f3353bf238763420ade5d9e62ebd2b73f2ff179"},
{file = "ruff-0.15.0-py3-none-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:761ec0a66680fab6454236635a39abaf14198818c8cdf691e036f4bc0f406b2d"},
{file = "ruff-0.15.0-py3-none-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:940f11c2604d317e797b289f4f9f3fa5555ffe4fb574b55ed006c3d9b6f0eb78"},
{file = "ruff-0.15.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bcbca3d40558789126da91d7ef9a7c87772ee107033db7191edefa34e2c7f1b4"},
{file = "ruff-0.15.0-py3-none-manylinux_2_31_riscv64.whl", hash = "sha256:9a121a96db1d75fa3eb39c4539e607f628920dd72ff1f7c5ee4f1b768ac62d6e"},
{file = "ruff-0.15.0-py3-none-musllinux_1_2_aarch64.whl", hash = "sha256:5298d518e493061f2eabd4abd067c7e4fb89e2f63291c94332e35631c07c3662"},
{file = "ruff-0.15.0-py3-none-musllinux_1_2_armv7l.whl", hash = "sha256:afb6e603d6375ff0d6b0cee563fa21ab570fd15e65c852cb24922cef25050cf1"},
{file = "ruff-0.15.0-py3-none-musllinux_1_2_i686.whl", hash = "sha256:77e515f6b15f828b94dc17d2b4ace334c9ddb7d9468c54b2f9ed2b9c1593ef16"},
{file = "ruff-0.15.0-py3-none-musllinux_1_2_x86_64.whl", hash = "sha256:6f6e80850a01eb13b3e42ee0ebdf6e4497151b48c35051aab51c101266d187a3"},
{file = "ruff-0.15.0-py3-none-win32.whl", hash = "sha256:238a717ef803e501b6d51e0bdd0d2c6e8513fe9eec14002445134d3907cd46c3"},
{file = "ruff-0.15.0-py3-none-win_amd64.whl", hash = "sha256:dd5e4d3301dc01de614da3cdffc33d4b1b96fb89e45721f1598e5532ccf78b18"},
{file = "ruff-0.15.0-py3-none-win_arm64.whl", hash = "sha256:c480d632cc0ca3f0727acac8b7d053542d9e114a462a145d0b00e7cd658c515a"},
{file = "ruff-0.15.0.tar.gz", hash = "sha256:6bdea47cdbea30d40f8f8d7d69c0854ba7c15420ec75a26f463290949d7f7e9a"},
{file = "ruff-0.15.7-py3-none-linux_armv6l.whl", hash = "sha256:a81cc5b6910fb7dfc7c32d20652e50fa05963f6e13ead3c5915c41ac5d16668e"},
{file = "ruff-0.15.7-py3-none-macosx_10_12_x86_64.whl", hash = "sha256:722d165bd52403f3bdabc0ce9e41fc47070ac56d7a91b4e0d097b516a53a3477"},
{file = "ruff-0.15.7-py3-none-macosx_11_0_arm64.whl", hash = "sha256:7fbc2448094262552146cbe1b9643a92f66559d3761f1ad0656d4991491af49e"},
{file = "ruff-0.15.7-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:6b39329b60eba44156d138275323cc726bbfbddcec3063da57caa8a8b1d50adf"},
{file = "ruff-0.15.7-py3-none-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:87768c151808505f2bfc93ae44e5f9e7c8518943e5074f76ac21558ef5627c85"},
{file = "ruff-0.15.7-py3-none-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:fb0511670002c6c529ec66c0e30641c976c8963de26a113f3a30456b702468b0"},
{file = "ruff-0.15.7-py3-none-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:e0d19644f801849229db8345180a71bee5407b429dd217f853ec515e968a6912"},
{file = "ruff-0.15.7-py3-none-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:4806d8e09ef5e84eb19ba833d0442f7e300b23fe3f0981cae159a248a10f0036"},
{file = "ruff-0.15.7-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:dce0896488562f09a27b9c91b1f58a097457143931f3c4d519690dea54e624c5"},
{file = "ruff-0.15.7-py3-none-manylinux_2_31_riscv64.whl", hash = "sha256:1852ce241d2bc89e5dc823e03cff4ce73d816b5c6cdadd27dbfe7b03217d2a12"},
{file = "ruff-0.15.7-py3-none-musllinux_1_2_aarch64.whl", hash = "sha256:5f3e4b221fb4bd293f79912fc5e93a9063ebd6d0dcbd528f91b89172a9b8436c"},
{file = "ruff-0.15.7-py3-none-musllinux_1_2_armv7l.whl", hash = "sha256:b15e48602c9c1d9bdc504b472e90b90c97dc7d46c7028011ae67f3861ceba7b4"},
{file = "ruff-0.15.7-py3-none-musllinux_1_2_i686.whl", hash = "sha256:1b4705e0e85cedc74b0a23cf6a179dbb3df184cb227761979cc76c0440b5ab0d"},
{file = "ruff-0.15.7-py3-none-musllinux_1_2_x86_64.whl", hash = "sha256:112c1fa316a558bb34319282c1200a8bf0495f1b735aeb78bfcb2991e6087580"},
{file = "ruff-0.15.7-py3-none-win32.whl", hash = "sha256:6d39e2d3505b082323352f733599f28169d12e891f7dd407f2d4f54b4c2886de"},
{file = "ruff-0.15.7-py3-none-win_amd64.whl", hash = "sha256:4d53d712ddebcd7dace1bc395367aec12c057aacfe9adbb6d832302575f4d3a1"},
{file = "ruff-0.15.7-py3-none-win_arm64.whl", hash = "sha256:18e8d73f1c3fdf27931497972250340f92e8c861722161a9caeb89a58ead6ed2"},
{file = "ruff-0.15.7.tar.gz", hash = "sha256:04f1ae61fc20fe0b148617c324d9d009b5f63412c0b16474f3d5f1a1a665f7ac"},
]
[[package]]
@@ -2564,7 +2564,7 @@ description = "A lil' TOML parser"
optional = false
python-versions = ">=3.8"
groups = ["dev"]
markers = "python_version < \"3.11\""
markers = "python_version == \"3.10\""
files = [
{file = "tomli-2.2.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:678e4fa69e4575eb77d103de3df8a895e1591b48e740211bd1067378c69e8249"},
{file = "tomli-2.2.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:023aa114dd824ade0100497eb2318602af309e5a55595f76b626d6d9f3b7b0a6"},
@@ -2912,4 +2912,4 @@ type = ["pytest-mypy"]
[metadata]
lock-version = "2.1"
python-versions = ">=3.10,<4.0"
content-hash = "9619cae908ad38fa2c48016a58bcf4241f6f5793aa0e6cc140276e91c433cbbb"
content-hash = "e0936a065565550afed18f6298b7e04e814b44100def7049f1a0d68662624a39"

View File

@@ -26,8 +26,8 @@ pyright = "^1.1.408"
pytest = "^8.4.1"
pytest-asyncio = "^1.3.0"
pytest-mock = "^3.15.1"
pytest-cov = "^7.0.0"
ruff = "^0.15.0"
pytest-cov = "^7.1.0"
ruff = "^0.15.7"
[build-system]
requires = ["poetry-core"]

View File

@@ -178,6 +178,7 @@ SMTP_USERNAME=
SMTP_PASSWORD=
# Business & Marketing Tools
AGENTMAIL_API_KEY=
APOLLO_API_KEY=
ENRICHLAYER_API_KEY=
AYRSHARE_API_KEY=

View File

@@ -0,0 +1,227 @@
# Backend
This file provides guidance to coding agents when working with the backend.
## Essential Commands
To run something with Python package dependencies you MUST use `poetry run ...`.
```bash
# Install dependencies
poetry install
# Run database migrations
poetry run prisma migrate dev
# Start all services (database, redis, rabbitmq, clamav)
docker compose up -d
# Run the backend as a whole
poetry run app
# Run tests
poetry run test
# Run specific test
poetry run pytest path/to/test_file.py::test_function_name
# Run block tests (tests that validate all blocks work correctly)
poetry run pytest backend/blocks/test/test_block.py -xvs
# Run tests for a specific block (e.g., GetCurrentTimeBlock)
poetry run pytest 'backend/blocks/test/test_block.py::test_available_blocks[GetCurrentTimeBlock]' -xvs
# Lint and format
# prefer format if you want to just "fix" it and only get the errors that can't be autofixed
poetry run format # Black + isort
poetry run lint # ruff
```
More details can be found in @TESTING.md
### Creating/Updating Snapshots
When you first write a test or when the expected output changes:
```bash
poetry run pytest path/to/test.py --snapshot-update
```
⚠️ **Important**: Always review snapshot changes before committing! Use `git diff` to verify the changes are expected.
## Architecture
- **API Layer**: FastAPI with REST and WebSocket endpoints
- **Database**: PostgreSQL with Prisma ORM, includes pgvector for embeddings
- **Queue System**: RabbitMQ for async task processing
- **Execution Engine**: Separate executor service processes agent workflows
- **Authentication**: JWT-based with Supabase integration
- **Security**: Cache protection middleware prevents sensitive data caching in browsers/proxies
## Code Style
- **Top-level imports only** — no local/inner imports (lazy imports only for heavy optional deps like `openpyxl`)
- **Absolute imports** — use `from backend.module import ...` for cross-package imports. Single-dot relative (`from .sibling import ...`) is acceptable for sibling modules within the same package (e.g., blocks). Avoid double-dot relative imports (`from ..parent import ...`) — use the absolute path instead
- **No duck typing** — no `hasattr`/`getattr`/`isinstance` for type dispatch; use typed interfaces/unions/protocols
- **Pydantic models** over dataclass/namedtuple/dict for structured data
- **No linter suppressors** — no `# type: ignore`, `# noqa`, `# pyright: ignore`; fix the type/code
- **List comprehensions** over manual loop-and-append
- **Early return** — guard clauses first, avoid deep nesting
- **f-strings vs printf syntax in log statements** — Use `%s` for deferred interpolation in `debug` statements, f-strings elsewhere for readability: `logger.debug("Processing %s items", count)`, `logger.info(f"Processing {count} items")`
- **Sanitize error paths** — `os.path.basename()` in error messages to avoid leaking directory structure
- **TOCTOU awareness** — avoid check-then-act patterns for file access and credit charging
- **`Security()` vs `Depends()`** — use `Security()` for auth deps to get proper OpenAPI security spec
- **Redis pipelines** — `transaction=True` for atomicity on multi-step operations
- **`max(0, value)` guards** — for computed values that should never be negative
- **SSE protocol** — `data:` lines for frontend-parsed events (must match Zod schema), `: comment` lines for heartbeats/status
- **File length** — keep files under ~300 lines; if a file grows beyond this, split by responsibility (e.g. extract helpers, models, or a sub-module into a new file). Never keep appending to a long file.
- **Function length** — keep functions under ~40 lines; extract named helpers when a function grows longer. Long functions are a sign of mixed concerns, not complexity.
- **Top-down ordering** — define the main/public function or class first, then the helpers it uses below. A reader should encounter high-level logic before implementation details.
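A short sketch of how several of the rules above might look together: early return, list comprehension, `%s` in `debug` logging versus f-strings elsewhere, `os.path.basename()` for path sanitizing, the `max(0, value)` guard, and top-down ordering. The function names and the credit-charging scenario are hypothetical and only serve as illustration. The sketch does not address TOCTOU concerns.

```python
import logging
import os

logger = logging.getLogger(__name__)


def charge_for_files(paths: list[str], balance: int, cost_per_file: int) -> int:
    """Charge for each non-empty path and return the remaining balance."""
    # Early return: guard clause first, no nesting for the empty case.
    if not paths:
        return balance

    # Deferred interpolation in debug statements.
    logger.debug("Charging for %s paths", len(paths))

    # List comprehension instead of loop-and-append.
    names = [display_name(path) for path in paths if path]

    # f-string for readability outside debug statements.
    logger.info(f"Charging for {len(names)} files")
    return remaining_credits(balance, cost_per_file * len(names))


# Helpers come after the public function that uses them (top-down ordering).


def display_name(path: str) -> str:
    """Return only the file name so messages do not leak directory structure."""
    return os.path.basename(path)


def remaining_credits(balance: int, cost: int) -> int:
    """Return the balance after a charge, never below zero."""
    return max(0, balance - cost)
```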
## Testing Approach
- Uses pytest with snapshot testing for API responses
- Test files are colocated with source files (`*_test.py`)
- Mock at boundaries — mock where the symbol is **used**, not where it's **defined**
- After refactoring, update mock targets to match new module paths
- Use `AsyncMock` for async functions (`from unittest.mock import AsyncMock`)
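The "mock where the symbol is used" rule can be sketched in a single file. The modules `provider` and `consumer` and the functions `fetch_price` and `total` are hypothetical. They are built in memory with `types.ModuleType` and `exec` only so that the example is self-contained; in real code they would be ordinary modules.

```python
import asyncio
import sys
import types
from unittest.mock import AsyncMock, patch

# Hypothetical module where the symbol is DEFINED.
provider = types.ModuleType("provider")


async def fetch_price(symbol: str) -> int:
    raise RuntimeError("network call; must not run in tests")


provider.fetch_price = fetch_price
sys.modules["provider"] = provider

# Hypothetical module where the symbol is USED. The `from ... import` copies
# the reference into consumer's own namespace.
consumer = types.ModuleType("consumer")
exec(
    "from provider import fetch_price\n"
    "async def total(symbols):\n"
    "    return sum([await fetch_price(s) for s in symbols])\n",
    consumer.__dict__,
)
sys.modules["consumer"] = consumer


def run_total_with_mock() -> tuple[int, AsyncMock]:
    """Patch the name in the module that uses it, then call the code under test."""
    with patch("consumer.fetch_price", new_callable=AsyncMock) as mock_fetch:
        mock_fetch.return_value = 10
        result = asyncio.run(consumer.total(["A", "B"]))
    return result, mock_fetch


def patching_definition_site_has_no_effect() -> bool:
    """Patching provider.fetch_price leaves consumer's copied reference untouched."""
    with patch("provider.fetch_price", new_callable=AsyncMock):
        try:
            asyncio.run(consumer.total(["A"]))
        except RuntimeError:
            return True
    return False
```

Patching `provider.fetch_price` fails to intercept the call because `consumer` already holds its own reference to the original function. This is also why mock targets need updating after a refactor moves the using module.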
### Test-Driven Development (TDD)
When fixing a bug or adding a feature, write the test **before** the implementation:
```python
# 1. Write a failing test marked xfail
@pytest.mark.xfail(reason="Bug #1234: widget crashes on empty input")
def test_widget_handles_empty_input():
    result = widget.process("")
    assert result == Widget.EMPTY_RESULT
# 2. Run it — confirm it fails (XFAIL)
# poetry run pytest path/to/test.py::test_widget_handles_empty_input -xvs
# 3. Implement the fix
# 4. Remove xfail, run again — confirm it passes
def test_widget_handles_empty_input():
    result = widget.process("")
    assert result == Widget.EMPTY_RESULT
```
This catches regressions and proves the fix actually works. **Every bug fix should include a test that would have caught it.**
## Database Schema
Key models (defined in `schema.prisma`):
- `User`: Authentication and profile data
- `AgentGraph`: Workflow definitions with version control
- `AgentGraphExecution`: Execution history and results
- `AgentNode`: Individual nodes in a workflow
- `StoreListing`: Marketplace listings for sharing agents
## Environment Configuration
- **Backend**: `.env.default` (defaults) → `.env` (user overrides)
## Common Development Tasks
### Adding a new block
Follow the comprehensive [Block SDK Guide](@../../docs/platform/block-sdk-guide.md) which covers:
- Provider configuration with `ProviderBuilder`
- Block schema definition
- Authentication (API keys, OAuth, webhooks)
- Testing and validation
- File organization
Quick steps:
1. Create new file in `backend/blocks/`
2. Configure provider using `ProviderBuilder` in `_config.py`
3. Inherit from `Block` base class
4. Define input/output schemas using `BlockSchema`
5. Implement async `run` method
6. Generate unique block ID using `uuid.uuid4()`
7. Test with `poetry run pytest backend/blocks/test/test_block.py`
Note: when creating many new blocks, analyze the interface of each block and consider whether the blocks would work well together in a graph-based editor or would struggle to connect productively.
For example, do the inputs and outputs tie together well?
If you get any pushback or hit complex block conditions, check the new_blocks guide in the docs.
#### Handling files in blocks with `store_media_file()`
When blocks need to work with files (images, videos, documents), use `store_media_file()` from `backend.util.file`. The `return_format` parameter determines what you get back:
| Format | Use When | Returns |
|--------|----------|---------|
| `"for_local_processing"` | Processing with local tools (ffmpeg, MoviePy, PIL) | Local file path (e.g., `"image.png"`) |
| `"for_external_api"` | Sending content to external APIs (Replicate, OpenAI) | Data URI (e.g., `"data:image/png;base64,..."`) |
| `"for_block_output"` | Returning output from your block | Smart: `workspace://` in CoPilot, data URI in graphs |
**Examples:**
```python
# INPUT: Need to process file locally with ffmpeg
local_path = await store_media_file(
    file=input_data.video,
    execution_context=execution_context,
    return_format="for_local_processing",
)
# local_path = "video.mp4" - use with Path/ffmpeg/etc

# INPUT: Need to send to external API like Replicate
image_b64 = await store_media_file(
    file=input_data.image,
    execution_context=execution_context,
    return_format="for_external_api",
)
# image_b64 = "data:image/png;base64,iVBORw0..." - send to API

# OUTPUT: Returning result from block
result_url = await store_media_file(
    file=generated_image_url,
    execution_context=execution_context,
    return_format="for_block_output",
)
yield "image_url", result_url
# In CoPilot: result_url = "workspace://abc123"
# In graphs: result_url = "data:image/png;base64,..."
```
**Key points:**
- `for_block_output` is the ONLY format that auto-adapts to execution context
- Always use `for_block_output` for block outputs unless you have a specific reason not to
- Never hardcode workspace checks - let `for_block_output` handle it
### Modifying the API
1. Update route in `backend/api/features/`
2. Add/update Pydantic models in same directory
3. Write tests alongside the route file
4. Run `poetry run test` to verify
## Workspace & Media Files
**Read [Workspace & Media Architecture](../../docs/platform/workspace-media-architecture.md) when:**
- Working on CoPilot file upload/download features
- Building blocks that handle `MediaFileType` inputs/outputs
- Modifying `WorkspaceManager` or `store_media_file()`
- Debugging file persistence or virus scanning issues
Covers: `WorkspaceManager` (persistent storage with session scoping), `store_media_file()` (media normalization pipeline), and responsibility boundaries for virus scanning and persistence.
## Security Implementation
### Cache Protection Middleware
- Located in `backend/api/middleware/security.py`
- Default behavior: Disables caching for ALL endpoints with `Cache-Control: no-store, no-cache, must-revalidate, private`
- Uses an allow list approach - only explicitly permitted paths can be cached
- Cacheable paths include: static assets (`static/*`, `_next/static/*`), health checks, public store pages, documentation
- Prevents sensitive data (auth tokens, API keys, user data) from being cached by browsers/proxies
- To allow caching for a new endpoint, add it to `CACHEABLE_PATHS` in the middleware
- Applied to both main API server and external API applications

View File

@@ -1,226 +1 @@
# CLAUDE.md - Backend
This file provides guidance to Claude Code when working with the backend.
## Essential Commands
To run something with Python package dependencies you MUST use `poetry run ...`.
```bash
# Install dependencies
poetry install
# Run database migrations
poetry run prisma migrate dev
# Start all services (database, redis, rabbitmq, clamav)
docker compose up -d
# Run the backend as a whole
poetry run app
# Run tests
poetry run test
# Run specific test
poetry run pytest path/to/test_file.py::test_function_name
# Run block tests (tests that validate all blocks work correctly)
poetry run pytest backend/blocks/test/test_block.py -xvs
# Run tests for a specific block (e.g., GetCurrentTimeBlock)
poetry run pytest 'backend/blocks/test/test_block.py::test_available_blocks[GetCurrentTimeBlock]' -xvs
# Lint and format
# prefer format if you want to just "fix" it and only get the errors that can't be autofixed
poetry run format # Black + isort
poetry run lint # ruff
```
More details can be found in @TESTING.md
### Creating/Updating Snapshots
When you first write a test or when the expected output changes:
```bash
poetry run pytest path/to/test.py --snapshot-update
```
⚠️ **Important**: Always review snapshot changes before committing! Use `git diff` to verify the changes are expected.
## Architecture
- **API Layer**: FastAPI with REST and WebSocket endpoints
- **Database**: PostgreSQL with Prisma ORM, includes pgvector for embeddings
- **Queue System**: RabbitMQ for async task processing
- **Execution Engine**: Separate executor service processes agent workflows
- **Authentication**: JWT-based with Supabase integration
- **Security**: Cache protection middleware prevents sensitive data caching in browsers/proxies
## Code Style
- **Top-level imports only** — no local/inner imports (lazy imports only for heavy optional deps like `openpyxl`)
- **No duck typing** — no `hasattr`/`getattr`/`isinstance` for type dispatch; use typed interfaces/unions/protocols
- **Pydantic models** over dataclass/namedtuple/dict for structured data
- **No linter suppressors** — no `# type: ignore`, `# noqa`, `# pyright: ignore`; fix the type/code
- **List comprehensions** over manual loop-and-append
- **Early return** — guard clauses first, avoid deep nesting
- **f-strings vs printf syntax in log statements** — Use `%s` for deferred interpolation in `debug` statements, f-strings elsewhere for readability: `logger.debug("Processing %s items", count)`, `logger.info(f"Processing {count} items")`
- **Sanitize error paths** — `os.path.basename()` in error messages to avoid leaking directory structure
- **TOCTOU awareness** — avoid check-then-act patterns for file access and credit charging
- **`Security()` vs `Depends()`** — use `Security()` for auth deps to get proper OpenAPI security spec
- **Redis pipelines** — `transaction=True` for atomicity on multi-step operations
- **`max(0, value)` guards** — for computed values that should never be negative
- **SSE protocol** — `data:` lines for frontend-parsed events (must match Zod schema), `: comment` lines for heartbeats/status
- **File length** — keep files under ~300 lines; if a file grows beyond this, split by responsibility (e.g. extract helpers, models, or a sub-module into a new file). Never keep appending to a long file.
- **Function length** — keep functions under ~40 lines; extract named helpers when a function grows longer. Long functions are a sign of mixed concerns, not complexity.
- **Top-down ordering** — define the main/public function or class first, then the helpers it uses below. A reader should encounter high-level logic before implementation details.
## Testing Approach
- Uses pytest with snapshot testing for API responses
- Test files are colocated with source files (`*_test.py`)
- Mock at boundaries — mock where the symbol is **used**, not where it's **defined**
- After refactoring, update mock targets to match new module paths
- Use `AsyncMock` for async functions (`from unittest.mock import AsyncMock`)
### Test-Driven Development (TDD)
When fixing a bug or adding a feature, write the test **before** the implementation:
```python
# 1. Write a failing test marked xfail
@pytest.mark.xfail(reason="Bug #1234: widget crashes on empty input")
def test_widget_handles_empty_input():
    result = widget.process("")
    assert result == Widget.EMPTY_RESULT
# 2. Run it — confirm it fails (XFAIL)
# poetry run pytest path/to/test.py::test_widget_handles_empty_input -xvs
# 3. Implement the fix
# 4. Remove xfail, run again — confirm it passes
def test_widget_handles_empty_input():
    result = widget.process("")
    assert result == Widget.EMPTY_RESULT
```
This catches regressions and proves the fix actually works. **Every bug fix should include a test that would have caught it.**
## Database Schema
Key models (defined in `schema.prisma`):
- `User`: Authentication and profile data
- `AgentGraph`: Workflow definitions with version control
- `AgentGraphExecution`: Execution history and results
- `AgentNode`: Individual nodes in a workflow
- `StoreListing`: Marketplace listings for sharing agents
## Environment Configuration
- **Backend**: `.env.default` (defaults) → `.env` (user overrides)
## Common Development Tasks
### Adding a new block
Follow the comprehensive [Block SDK Guide](@../../docs/content/platform/block-sdk-guide.md) which covers:
- Provider configuration with `ProviderBuilder`
- Block schema definition
- Authentication (API keys, OAuth, webhooks)
- Testing and validation
- File organization
Quick steps:
1. Create new file in `backend/blocks/`
2. Configure provider using `ProviderBuilder` in `_config.py`
3. Inherit from `Block` base class
4. Define input/output schemas using `BlockSchema`
5. Implement async `run` method
6. Generate unique block ID using `uuid.uuid4()`
7. Test with `poetry run pytest backend/blocks/test/test_block.py`
Note: when making many new blocks analyze the interfaces for each of these blocks and picture if they would go well together in a graph-based editor or would they struggle to connect productively?
ex: do the inputs and outputs tie well together?
If you get any pushback or hit complex block conditions check the new_blocks guide in the docs.
#### Handling files in blocks with `store_media_file()`
When blocks need to work with files (images, videos, documents), use `store_media_file()` from `backend.util.file`. The `return_format` parameter determines what you get back:
| Format | Use When | Returns |
|--------|----------|---------|
| `"for_local_processing"` | Processing with local tools (ffmpeg, MoviePy, PIL) | Local file path (e.g., `"image.png"`) |
| `"for_external_api"` | Sending content to external APIs (Replicate, OpenAI) | Data URI (e.g., `"data:image/png;base64,..."`) |
| `"for_block_output"` | Returning output from your block | Smart: `workspace://` in CoPilot, data URI in graphs |
**Examples:**
```python
# INPUT: Need to process file locally with ffmpeg
local_path = await store_media_file(
    file=input_data.video,
    execution_context=execution_context,
    return_format="for_local_processing",
)
# local_path = "video.mp4" - use with Path/ffmpeg/etc

# INPUT: Need to send to external API like Replicate
image_b64 = await store_media_file(
    file=input_data.image,
    execution_context=execution_context,
    return_format="for_external_api",
)
# image_b64 = "data:image/png;base64,iVBORw0..." - send to API

# OUTPUT: Returning result from block
result_url = await store_media_file(
    file=generated_image_url,
    execution_context=execution_context,
    return_format="for_block_output",
)
yield "image_url", result_url
# In CoPilot: result_url = "workspace://abc123"
# In graphs: result_url = "data:image/png;base64,..."
```
**Key points:**
- `for_block_output` is the ONLY format that auto-adapts to execution context
- Always use `for_block_output` for block outputs unless you have a specific reason not to
- Never hardcode workspace checks - let `for_block_output` handle it
### Modifying the API
1. Update route in `backend/api/features/`
2. Add/update Pydantic models in same directory
3. Write tests alongside the route file
4. Run `poetry run test` to verify
## Workspace & Media Files
**Read [Workspace & Media Architecture](../../docs/platform/workspace-media-architecture.md) when:**
- Working on CoPilot file upload/download features
- Building blocks that handle `MediaFileType` inputs/outputs
- Modifying `WorkspaceManager` or `store_media_file()`
- Debugging file persistence or virus scanning issues
Covers: `WorkspaceManager` (persistent storage with session scoping), `store_media_file()` (media normalization pipeline), and responsibility boundaries for virus scanning and persistence.
## Security Implementation
### Cache Protection Middleware
- Located in `backend/api/middleware/security.py`
- Default behavior: Disables caching for ALL endpoints with `Cache-Control: no-store, no-cache, must-revalidate, private`
- Uses an allow list approach - only explicitly permitted paths can be cached
- Cacheable paths include: static assets (`static/*`, `_next/static/*`), health checks, public store pages, documentation
- Prevents sensitive data (auth tokens, API keys, user data) from being cached by browsers/proxies
- To allow caching for a new endpoint, add it to `CACHEABLE_PATHS` in the middleware
- Applied to both main API server and external API applications
@AGENTS.md

View File

@@ -121,36 +121,20 @@ RUN ln -s ../lib/node_modules/npm/bin/npm-cli.js /usr/bin/npm \
&& ln -s ../lib/node_modules/npm/bin/npx-cli.js /usr/bin/npx
COPY --from=builder /root/.cache/prisma-python/binaries /root/.cache/prisma-python/binaries
# Install agent-browser (Copilot browser tool) + Chromium.
# On amd64: install runtime libs + run `agent-browser install` to download
# Chrome for Testing (pinned version, tested with Playwright).
# On arm64: install system chromium package — Chrome for Testing has no ARM64
# binary. AGENT_BROWSER_EXECUTABLE_PATH is set at runtime by the entrypoint
# script (below) to redirect agent-browser to the system binary.
ARG TARGETARCH
# Install agent-browser (Copilot browser tool) using the system chromium package.
# Chrome for Testing (the binary agent-browser downloads via `agent-browser install`)
# has no ARM64 builds, so we use the distro-packaged chromium instead — verified to
# work with agent-browser via Docker tests on arm64; amd64 is validated in CI.
# Note: system chromium tracks the Debian package schedule rather than a pinned
# Chrome for Testing release. If agent-browser requires a specific Chrome version,
# verify compatibility against the chromium package version in the base image.
RUN apt-get update \
&& if [ "$TARGETARCH" = "arm64" ]; then \
apt-get install -y --no-install-recommends chromium fonts-liberation; \
else \
apt-get install -y --no-install-recommends \
libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 \
libdbus-1-3 libxkbcommon0 libatspi2.0-0t64 libxcomposite1 libxdamage1 \
libxfixes3 libxrandr2 libgbm1 libasound2t64 libpango-1.0-0 libcairo2 \
libx11-6 libx11-xcb1 libxcb1 libxext6 libglib2.0-0t64 \
fonts-liberation libfontconfig1; \
fi \
&& apt-get install -y --no-install-recommends chromium fonts-liberation \
&& rm -rf /var/lib/apt/lists/* \
&& npm install -g agent-browser \
&& ([ "$TARGETARCH" = "arm64" ] || agent-browser install) \
&& rm -rf /tmp/* /root/.npm
# On arm64 the system chromium is at /usr/bin/chromium; set
# AGENT_BROWSER_EXECUTABLE_PATH so agent-browser's daemon uses it instead of
# Chrome for Testing (which has no ARM64 binary). On amd64 the variable is left
# unset so agent-browser uses the Chrome for Testing binary it downloaded above.
RUN printf '#!/bin/sh\n[ -x /usr/bin/chromium ] && export AGENT_BROWSER_EXECUTABLE_PATH=/usr/bin/chromium\nexec "$@"\n' \
> /usr/local/bin/entrypoint.sh \
&& chmod +x /usr/local/bin/entrypoint.sh
ENV AGENT_BROWSER_EXECUTABLE_PATH=/usr/bin/chromium
WORKDIR /app/autogpt_platform/backend
@@ -173,5 +157,4 @@ RUN POETRY_VIRTUALENVS_CREATE=true POETRY_VIRTUALENVS_IN_PROJECT=true \
ENV PORT=8000
ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
CMD ["rest"]

View File

@@ -18,14 +18,22 @@ from pydantic import BaseModel, Field, SecretStr
from backend.api.external.middleware import require_permission
from backend.api.features.integrations.models import get_all_provider_names
from backend.api.features.integrations.router import (
CredentialsMetaResponse,
to_meta_response,
)
from backend.data.auth.base import APIAuthorizationInfo
from backend.data.model import (
APIKeyCredentials,
Credentials,
CredentialsType,
HostScopedCredentials,
OAuth2Credentials,
UserPasswordCredentials,
is_sdk_default,
)
from backend.integrations.credentials_store import (
is_system_credential,
provider_matches,
)
from backend.integrations.creds_manager import IntegrationCredentialsManager
from backend.integrations.oauth import CREDENTIALS_BY_PROVIDER, HANDLERS_BY_NAME
@@ -91,18 +99,6 @@ class OAuthCompleteResponse(BaseModel):
)
class CredentialSummary(BaseModel):
"""Summary of a credential without sensitive data."""
id: str
provider: str
type: CredentialsType
title: Optional[str] = None
scopes: Optional[list[str]] = None
username: Optional[str] = None
host: Optional[str] = None
class ProviderInfo(BaseModel):
"""Information about an integration provider."""
@@ -473,12 +469,12 @@ async def complete_oauth(
)
@integrations_router.get("/credentials", response_model=list[CredentialSummary])
@integrations_router.get("/credentials", response_model=list[CredentialsMetaResponse])
async def list_credentials(
auth: APIAuthorizationInfo = Security(
require_permission(APIKeyPermission.READ_INTEGRATIONS)
),
) -> list[CredentialSummary]:
) -> list[CredentialsMetaResponse]:
"""
List all credentials for the authenticated user.
@@ -486,28 +482,19 @@ async def list_credentials(
"""
credentials = await creds_manager.store.get_all_creds(auth.user_id)
return [
-CredentialSummary(
-id=cred.id,
-provider=cred.provider,
-type=cred.type,
-title=cred.title,
-scopes=cred.scopes if isinstance(cred, OAuth2Credentials) else None,
-username=cred.username if isinstance(cred, OAuth2Credentials) else None,
-host=cred.host if isinstance(cred, HostScopedCredentials) else None,
-)
-for cred in credentials
+to_meta_response(cred) for cred in credentials if not is_sdk_default(cred.id)
]
@integrations_router.get(
-"/{provider}/credentials", response_model=list[CredentialSummary]
+"/{provider}/credentials", response_model=list[CredentialsMetaResponse]
)
async def list_credentials_by_provider(
provider: Annotated[str, Path(title="The provider to list credentials for")],
auth: APIAuthorizationInfo = Security(
require_permission(APIKeyPermission.READ_INTEGRATIONS)
),
-) -> list[CredentialSummary]:
+) -> list[CredentialsMetaResponse]:
"""
List credentials for a specific provider.
"""
@@ -515,16 +502,7 @@ async def list_credentials_by_provider(
auth.user_id, provider
)
return [
-CredentialSummary(
-id=cred.id,
-provider=cred.provider,
-type=cred.type,
-title=cred.title,
-scopes=cred.scopes if isinstance(cred, OAuth2Credentials) else None,
-username=cred.username if isinstance(cred, OAuth2Credentials) else None,
-host=cred.host if isinstance(cred, HostScopedCredentials) else None,
-)
-for cred in credentials
+to_meta_response(cred) for cred in credentials if not is_sdk_default(cred.id)
]
@@ -597,11 +575,11 @@ async def create_credential(
# Store credentials
try:
await creds_manager.create(auth.user_id, credentials)
-except Exception as e:
-logger.error(f"Failed to store credentials: {e}")
+except Exception:
+logger.exception("Failed to store credentials")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-detail=f"Failed to store credentials: {str(e)}",
+detail="Failed to store credentials",
)
logger.info(f"Created {request.type} credentials for provider {provider}")
@@ -639,15 +617,23 @@ async def delete_credential(
use the main API's delete endpoint which handles webhook cleanup and
token revocation.
"""
if is_sdk_default(cred_id):
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND, detail="Credentials not found"
)
if is_system_credential(cred_id):
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="System-managed credentials cannot be deleted",
)
creds = await creds_manager.store.get_creds_by_id(auth.user_id, cred_id)
if not creds:
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND, detail="Credentials not found"
)
-if creds.provider != provider:
+if not provider_matches(creds.provider, provider):
raise HTTPException(
-status_code=status.HTTP_404_NOT_FOUND,
-detail="Credentials do not match the specified provider",
+status_code=status.HTTP_404_NOT_FOUND, detail="Credentials not found"
)
await creds_manager.delete(auth.user_id, cred_id)
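
The guard order in `delete_credential` above can be read as a small decision table. The following is a minimal sketch of that ordering, not the project's implementation: the ID sets, the `Cred` dataclass, the plain equality check standing in for `provider_matches`, and the `200` success code are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical stand-ins for is_sdk_default / is_system_credential.
SDK_DEFAULT_IDS = {"sdk-default-1"}
SYSTEM_IDS = {"system-1"}


@dataclass
class Cred:
    id: str
    provider: str


def delete_status(cred_id: str, provider: str, store: Dict[str, Cred]) -> int:
    """Return the HTTP status the guard sequence would produce."""
    if cred_id in SDK_DEFAULT_IDS:
        return 404  # SDK defaults are reported as if they did not exist
    if cred_id in SYSTEM_IDS:
        return 403  # system-managed credentials are visible but protected
    cred = store.get(cred_id)
    if cred is None:
        return 404
    if cred.provider != provider:  # assumption: equality instead of provider_matches
        return 404  # same answer as "not found", so a mismatch reveals nothing
    return 200  # assumption: placeholder for the real success response


store = {"c1": Cred(id="c1", provider="github")}
```

Returning the same 404 for a provider mismatch as for a missing credential means a caller cannot use the endpoint to learn which provider an ID belongs to.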

View File

@@ -72,7 +72,7 @@ class RunAgentRequest(BaseModel):
def _create_ephemeral_session(user_id: str) -> ChatSession:
"""Create an ephemeral session for stateless API requests."""
-return ChatSession.new(user_id)
+return ChatSession.new(user_id, dry_run=False)
@tools_router.post(

View File

@@ -0,0 +1,146 @@
"""Admin endpoints for checking and resetting user CoPilot rate limit usage."""
import logging
from typing import Optional
from autogpt_libs.auth import get_user_id, requires_admin_user
from fastapi import APIRouter, Body, HTTPException, Security
from pydantic import BaseModel
from backend.copilot.config import ChatConfig
from backend.copilot.rate_limit import (
get_global_rate_limits,
get_usage_status,
reset_user_usage,
)
from backend.data.user import get_user_by_email, get_user_email_by_id
logger = logging.getLogger(__name__)
config = ChatConfig()
router = APIRouter(
prefix="/admin",
tags=["copilot", "admin"],
dependencies=[Security(requires_admin_user)],
)
class UserRateLimitResponse(BaseModel):
user_id: str
user_email: Optional[str] = None
daily_token_limit: int
weekly_token_limit: int
daily_tokens_used: int
weekly_tokens_used: int
async def _resolve_user_id(
user_id: Optional[str], email: Optional[str]
) -> tuple[str, Optional[str]]:
"""Resolve a user_id and email from the provided parameters.
Returns (user_id, email). Accepts either user_id or email; at least one
must be provided. When both are provided, ``email`` takes precedence.
"""
if email:
user = await get_user_by_email(email)
if not user:
raise HTTPException(
status_code=404, detail="No user found with the provided email."
)
return user.id, email
if not user_id:
raise HTTPException(
status_code=400,
detail="Either user_id or email query parameter is required.",
)
# We have a user_id; try to look up their email for display purposes.
# This is non-critical -- a failure should not block the response.
try:
resolved_email = await get_user_email_by_id(user_id)
except Exception:
logger.warning("Failed to resolve email for user %s", user_id, exc_info=True)
resolved_email = None
return user_id, resolved_email
@router.get(
"/rate_limit",
response_model=UserRateLimitResponse,
summary="Get User Rate Limit",
)
async def get_user_rate_limit(
user_id: Optional[str] = None,
email: Optional[str] = None,
admin_user_id: str = Security(get_user_id),
) -> UserRateLimitResponse:
"""Get a user's current usage and effective rate limits. Admin-only.
Accepts either ``user_id`` or ``email`` as a query parameter.
When ``email`` is provided the user is looked up by email first.
"""
resolved_id, resolved_email = await _resolve_user_id(user_id, email)
logger.info("Admin %s checking rate limit for user %s", admin_user_id, resolved_id)
daily_limit, weekly_limit = await get_global_rate_limits(
resolved_id, config.daily_token_limit, config.weekly_token_limit
)
usage = await get_usage_status(resolved_id, daily_limit, weekly_limit)
return UserRateLimitResponse(
user_id=resolved_id,
user_email=resolved_email,
daily_token_limit=daily_limit,
weekly_token_limit=weekly_limit,
daily_tokens_used=usage.daily.used,
weekly_tokens_used=usage.weekly.used,
)
@router.post(
"/rate_limit/reset",
response_model=UserRateLimitResponse,
summary="Reset User Rate Limit Usage",
)
async def reset_user_rate_limit(
user_id: str = Body(embed=True),
reset_weekly: bool = Body(False, embed=True),
admin_user_id: str = Security(get_user_id),
) -> UserRateLimitResponse:
"""Reset a user's daily usage counter (and optionally weekly). Admin-only."""
logger.info(
"Admin %s resetting rate limit for user %s (reset_weekly=%s)",
admin_user_id,
user_id,
reset_weekly,
)
try:
await reset_user_usage(user_id, reset_weekly=reset_weekly)
except Exception as e:
logger.exception("Failed to reset user usage")
raise HTTPException(status_code=500, detail="Failed to reset usage") from e
daily_limit, weekly_limit = await get_global_rate_limits(
user_id, config.daily_token_limit, config.weekly_token_limit
)
usage = await get_usage_status(user_id, daily_limit, weekly_limit)
try:
resolved_email = await get_user_email_by_id(user_id)
except Exception:
logger.warning("Failed to resolve email for user %s", user_id, exc_info=True)
resolved_email = None
return UserRateLimitResponse(
user_id=user_id,
user_email=resolved_email,
daily_token_limit=daily_limit,
weekly_token_limit=weekly_limit,
daily_tokens_used=usage.daily.used,
weekly_tokens_used=usage.weekly.used,
)
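
The lookup precedence documented in `_resolve_user_id` (email wins when both parameters are given, and a missing email for a known `user_id` is not an error) can be sketched in isolation. This is an illustrative sketch only: the in-memory tables and the two exception classes are hypothetical and replace the real database helpers and `HTTPException`.

```python
import asyncio
from typing import Optional, Tuple

# Hypothetical in-memory tables standing in for the user database.
USERS_BY_EMAIL = {"a@example.com": "user-a"}
EMAILS_BY_ID = {"user-a": "a@example.com", "user-b": "b@example.com"}


class UserNotFound(Exception):
    """Plays the role of the 404 response."""


class MissingIdentifier(Exception):
    """Plays the role of the 400 response."""


async def resolve(user_id: Optional[str], email: Optional[str]) -> Tuple[str, Optional[str]]:
    if email:
        # Email takes precedence, even if a user_id was also supplied.
        found = USERS_BY_EMAIL.get(email)
        if found is None:
            raise UserNotFound(email)
        return found, email
    if not user_id:
        raise MissingIdentifier()
    # The email is display-only, so an unknown ID still resolves (with None).
    return user_id, EMAILS_BY_ID.get(user_id)


def resolve_sync(user_id: Optional[str], email: Optional[str]) -> Tuple[str, Optional[str]]:
    return asyncio.run(resolve(user_id, email))
```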

View File

@@ -0,0 +1,263 @@
import json
from types import SimpleNamespace
from unittest.mock import AsyncMock
import fastapi
import fastapi.testclient
import pytest
import pytest_mock
from autogpt_libs.auth.jwt_utils import get_jwt_payload
from pytest_snapshot.plugin import Snapshot
from backend.copilot.rate_limit import CoPilotUsageStatus, UsageWindow
from .rate_limit_admin_routes import router as rate_limit_admin_router
app = fastapi.FastAPI()
app.include_router(rate_limit_admin_router)
client = fastapi.testclient.TestClient(app)
_MOCK_MODULE = "backend.api.features.admin.rate_limit_admin_routes"
_TARGET_EMAIL = "target@example.com"
@pytest.fixture(autouse=True)
def setup_app_admin_auth(mock_jwt_admin):
"""Setup admin auth overrides for all tests in this module"""
app.dependency_overrides[get_jwt_payload] = mock_jwt_admin["get_jwt_payload"]
yield
app.dependency_overrides.clear()
def _mock_usage_status(
daily_used: int = 500_000, weekly_used: int = 3_000_000
) -> CoPilotUsageStatus:
from datetime import UTC, datetime, timedelta
now = datetime.now(UTC)
return CoPilotUsageStatus(
daily=UsageWindow(
used=daily_used, limit=2_500_000, resets_at=now + timedelta(hours=6)
),
weekly=UsageWindow(
used=weekly_used, limit=12_500_000, resets_at=now + timedelta(days=3)
),
)
def _patch_rate_limit_deps(
mocker: pytest_mock.MockerFixture,
target_user_id: str,
daily_used: int = 500_000,
weekly_used: int = 3_000_000,
):
"""Patch the common rate-limit + user-lookup dependencies."""
mocker.patch(
f"{_MOCK_MODULE}.get_global_rate_limits",
new_callable=AsyncMock,
return_value=(2_500_000, 12_500_000),
)
mocker.patch(
f"{_MOCK_MODULE}.get_usage_status",
new_callable=AsyncMock,
return_value=_mock_usage_status(daily_used=daily_used, weekly_used=weekly_used),
)
mocker.patch(
f"{_MOCK_MODULE}.get_user_email_by_id",
new_callable=AsyncMock,
return_value=_TARGET_EMAIL,
)
def test_get_rate_limit(
mocker: pytest_mock.MockerFixture,
configured_snapshot: Snapshot,
target_user_id: str,
) -> None:
"""Test getting rate limit and usage for a user."""
_patch_rate_limit_deps(mocker, target_user_id)
response = client.get("/admin/rate_limit", params={"user_id": target_user_id})
assert response.status_code == 200
data = response.json()
assert data["user_id"] == target_user_id
assert data["user_email"] == _TARGET_EMAIL
assert data["daily_token_limit"] == 2_500_000
assert data["weekly_token_limit"] == 12_500_000
assert data["daily_tokens_used"] == 500_000
assert data["weekly_tokens_used"] == 3_000_000
configured_snapshot.assert_match(
json.dumps(data, indent=2, sort_keys=True) + "\n",
"get_rate_limit",
)
def test_get_rate_limit_by_email(
mocker: pytest_mock.MockerFixture,
target_user_id: str,
) -> None:
"""Test looking up rate limits via email instead of user_id."""
_patch_rate_limit_deps(mocker, target_user_id)
mock_user = SimpleNamespace(id=target_user_id, email=_TARGET_EMAIL)
mocker.patch(
f"{_MOCK_MODULE}.get_user_by_email",
new_callable=AsyncMock,
return_value=mock_user,
)
response = client.get("/admin/rate_limit", params={"email": _TARGET_EMAIL})
assert response.status_code == 200
data = response.json()
assert data["user_id"] == target_user_id
assert data["user_email"] == _TARGET_EMAIL
assert data["daily_token_limit"] == 2_500_000
def test_get_rate_limit_by_email_not_found(
mocker: pytest_mock.MockerFixture,
) -> None:
"""Test that looking up a non-existent email returns 404."""
mocker.patch(
f"{_MOCK_MODULE}.get_user_by_email",
new_callable=AsyncMock,
return_value=None,
)
response = client.get("/admin/rate_limit", params={"email": "nobody@example.com"})
assert response.status_code == 404
def test_get_rate_limit_no_params() -> None:
"""Test that omitting both user_id and email returns 400."""
response = client.get("/admin/rate_limit")
assert response.status_code == 400
def test_reset_user_usage_daily_only(
mocker: pytest_mock.MockerFixture,
configured_snapshot: Snapshot,
target_user_id: str,
) -> None:
"""Test resetting only daily usage (default behaviour)."""
mock_reset = mocker.patch(
f"{_MOCK_MODULE}.reset_user_usage",
new_callable=AsyncMock,
)
_patch_rate_limit_deps(mocker, target_user_id, daily_used=0, weekly_used=3_000_000)
response = client.post(
"/admin/rate_limit/reset",
json={"user_id": target_user_id},
)
assert response.status_code == 200
data = response.json()
assert data["daily_tokens_used"] == 0
# Weekly is untouched
assert data["weekly_tokens_used"] == 3_000_000
mock_reset.assert_awaited_once_with(target_user_id, reset_weekly=False)
configured_snapshot.assert_match(
json.dumps(data, indent=2, sort_keys=True) + "\n",
"reset_user_usage_daily_only",
)
def test_reset_user_usage_daily_and_weekly(
mocker: pytest_mock.MockerFixture,
configured_snapshot: Snapshot,
target_user_id: str,
) -> None:
"""Test resetting both daily and weekly usage."""
mock_reset = mocker.patch(
f"{_MOCK_MODULE}.reset_user_usage",
new_callable=AsyncMock,
)
_patch_rate_limit_deps(mocker, target_user_id, daily_used=0, weekly_used=0)
response = client.post(
"/admin/rate_limit/reset",
json={"user_id": target_user_id, "reset_weekly": True},
)
assert response.status_code == 200
data = response.json()
assert data["daily_tokens_used"] == 0
assert data["weekly_tokens_used"] == 0
mock_reset.assert_awaited_once_with(target_user_id, reset_weekly=True)
configured_snapshot.assert_match(
json.dumps(data, indent=2, sort_keys=True) + "\n",
"reset_user_usage_daily_and_weekly",
)
def test_reset_user_usage_redis_failure(
mocker: pytest_mock.MockerFixture,
target_user_id: str,
) -> None:
"""Test that Redis failure on reset returns 500."""
mocker.patch(
f"{_MOCK_MODULE}.reset_user_usage",
new_callable=AsyncMock,
side_effect=Exception("Redis connection refused"),
)
response = client.post(
"/admin/rate_limit/reset",
json={"user_id": target_user_id},
)
assert response.status_code == 500
def test_get_rate_limit_email_lookup_failure(
mocker: pytest_mock.MockerFixture,
target_user_id: str,
) -> None:
"""Test that failing to resolve a user email degrades gracefully."""
mocker.patch(
f"{_MOCK_MODULE}.get_global_rate_limits",
new_callable=AsyncMock,
return_value=(2_500_000, 12_500_000),
)
mocker.patch(
f"{_MOCK_MODULE}.get_usage_status",
new_callable=AsyncMock,
return_value=_mock_usage_status(),
)
mocker.patch(
f"{_MOCK_MODULE}.get_user_email_by_id",
new_callable=AsyncMock,
side_effect=Exception("DB connection lost"),
)
response = client.get("/admin/rate_limit", params={"user_id": target_user_id})
assert response.status_code == 200
data = response.json()
assert data["user_id"] == target_user_id
assert data["user_email"] is None
def test_admin_endpoints_require_admin_role(mock_jwt_user) -> None:
"""Test that rate limit admin endpoints require admin role."""
app.dependency_overrides[get_jwt_payload] = mock_jwt_user["get_jwt_payload"]
response = client.get("/admin/rate_limit", params={"user_id": "test"})
assert response.status_code == 403
response = client.post(
"/admin/rate_limit/reset",
json={"user_id": "test"},
)
assert response.status_code == 403
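
The tests above lean on two `unittest.mock.AsyncMock` features: `assert_awaited_once_with` to pin the exact call, and `side_effect` to simulate a backend failure. A minimal standalone sketch of that pattern follows; `reset_endpoint` is a hypothetical handler invented here, not the route under test.

```python
import asyncio
from unittest.mock import AsyncMock


async def reset_endpoint(reset_fn, user_id: str, reset_weekly: bool = False) -> int:
    """Hypothetical handler: map a failing dependency to an HTTP-like status."""
    try:
        await reset_fn(user_id, reset_weekly=reset_weekly)
    except Exception:
        return 500
    return 200


# Happy path: the mock records the await so the call shape can be asserted.
ok_mock = AsyncMock()
ok_status = asyncio.run(reset_endpoint(ok_mock, "user-1"))
ok_mock.assert_awaited_once_with("user-1", reset_weekly=False)

# Failure path: side_effect makes the awaited mock raise.
failing_mock = AsyncMock(side_effect=Exception("Redis connection refused"))
fail_status = asyncio.run(reset_endpoint(failing_mock, "user-1"))
```

In the real tests the mock is installed with `mocker.patch(..., new_callable=AsyncMock)` on the route module, so the handler picks it up without any injection parameter.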

View File

@@ -7,6 +7,8 @@ import fastapi
import fastapi.responses
import prisma.enums
import backend.api.features.library.db as library_db
import backend.api.features.library.model as library_model
import backend.api.features.store.cache as store_cache
import backend.api.features.store.db as store_db
import backend.api.features.store.model as store_model
@@ -132,3 +134,40 @@ async def admin_download_agent_file(
return fastapi.responses.FileResponse(
tmp_file.name, filename=file_name, media_type="application/json"
)
@router.get(
"/submissions/{store_listing_version_id}/preview",
summary="Admin Preview Submission Listing",
)
async def admin_preview_submission(
store_listing_version_id: str,
) -> store_model.StoreAgentDetails:
"""
Preview a marketplace submission as it would appear on the listing page.
Bypasses the APPROVED-only StoreAgent view so admins can preview pending
submissions before approving.
"""
return await store_db.get_store_agent_details_as_admin(store_listing_version_id)
@router.post(
"/submissions/{store_listing_version_id}/add-to-library",
summary="Admin Add Pending Agent to Library",
status_code=201,
)
async def admin_add_agent_to_library(
store_listing_version_id: str,
user_id: str = fastapi.Security(autogpt_libs.auth.get_user_id),
) -> library_model.LibraryAgent:
"""
Add a pending marketplace agent to the admin's library for review.
Uses admin-level access to bypass marketplace APPROVED-only checks.
The builder can load the graph because get_graph() checks library
membership as a fallback: "you added it, you keep it."
"""
return await library_db.add_store_agent_to_library_as_admin(
store_listing_version_id=store_listing_version_id,
user_id=user_id,
)

View File

@@ -0,0 +1,335 @@
"""Tests for admin store routes and the bypass logic they depend on.
Tests are organized by what they protect:
- SECRT-2162: get_graph_as_admin bypasses ownership/marketplace checks
- SECRT-2167 security: admin endpoints reject non-admin users
- SECRT-2167 bypass: preview queries StoreListingVersion (not StoreAgent view),
and add-to-library uses get_graph_as_admin (not get_graph)
"""
from datetime import datetime, timezone
from unittest.mock import AsyncMock, MagicMock, patch
import fastapi
import fastapi.responses
import fastapi.testclient
import pytest
import pytest_mock
from autogpt_libs.auth.jwt_utils import get_jwt_payload
from backend.data.graph import get_graph_as_admin
from backend.util.exceptions import NotFoundError
from .store_admin_routes import router as store_admin_router
# Shared constants
ADMIN_USER_ID = "admin-user-id"
CREATOR_USER_ID = "other-creator-id"
GRAPH_ID = "test-graph-id"
GRAPH_VERSION = 3
SLV_ID = "test-store-listing-version-id"
def _make_mock_graph(user_id: str = CREATOR_USER_ID) -> MagicMock:
graph = MagicMock()
graph.userId = user_id
graph.id = GRAPH_ID
graph.version = GRAPH_VERSION
graph.Nodes = []
return graph
# ---- SECRT-2162: get_graph_as_admin bypasses ownership checks ---- #
@pytest.mark.asyncio
async def test_admin_can_access_pending_agent_not_owned() -> None:
"""get_graph_as_admin must return a graph even when the admin doesn't own
it and it's not APPROVED in the marketplace."""
mock_graph = _make_mock_graph()
mock_graph_model = MagicMock(name="GraphModel")
with (
patch("backend.data.graph.AgentGraph.prisma") as mock_prisma,
patch(
"backend.data.graph.GraphModel.from_db",
return_value=mock_graph_model,
),
):
mock_prisma.return_value.find_first = AsyncMock(return_value=mock_graph)
result = await get_graph_as_admin(
graph_id=GRAPH_ID,
version=GRAPH_VERSION,
user_id=ADMIN_USER_ID,
for_export=False,
)
assert result is mock_graph_model
@pytest.mark.asyncio
async def test_admin_download_pending_agent_with_subagents() -> None:
"""get_graph_as_admin with for_export=True must call get_sub_graphs
and pass sub_graphs to GraphModel.from_db."""
mock_graph = _make_mock_graph()
mock_sub_graph = MagicMock(name="SubGraph")
mock_graph_model = MagicMock(name="GraphModel")
with (
patch("backend.data.graph.AgentGraph.prisma") as mock_prisma,
patch(
"backend.data.graph.get_sub_graphs",
new_callable=AsyncMock,
return_value=[mock_sub_graph],
) as mock_get_sub,
patch(
"backend.data.graph.GraphModel.from_db",
return_value=mock_graph_model,
) as mock_from_db,
):
mock_prisma.return_value.find_first = AsyncMock(return_value=mock_graph)
result = await get_graph_as_admin(
graph_id=GRAPH_ID,
version=GRAPH_VERSION,
user_id=ADMIN_USER_ID,
for_export=True,
)
assert result is mock_graph_model
mock_get_sub.assert_awaited_once_with(mock_graph)
mock_from_db.assert_called_once_with(
graph=mock_graph,
sub_graphs=[mock_sub_graph],
for_export=True,
)
# ---- SECRT-2167 security: admin endpoints reject non-admin users ---- #
app = fastapi.FastAPI()
app.include_router(store_admin_router)
@app.exception_handler(NotFoundError)
async def _not_found_handler(
request: fastapi.Request, exc: NotFoundError
) -> fastapi.responses.JSONResponse:
return fastapi.responses.JSONResponse(status_code=404, content={"detail": str(exc)})
client = fastapi.testclient.TestClient(app)
@pytest.fixture(autouse=True)
def setup_app_admin_auth(mock_jwt_admin):
"""Setup admin auth overrides for all route tests in this module."""
app.dependency_overrides[get_jwt_payload] = mock_jwt_admin["get_jwt_payload"]
yield
app.dependency_overrides.clear()
def test_preview_requires_admin(mock_jwt_user) -> None:
"""Non-admin users must get 403 on the preview endpoint."""
app.dependency_overrides[get_jwt_payload] = mock_jwt_user["get_jwt_payload"]
response = client.get(f"/admin/submissions/{SLV_ID}/preview")
assert response.status_code == 403
def test_add_to_library_requires_admin(mock_jwt_user) -> None:
"""Non-admin users must get 403 on the add-to-library endpoint."""
app.dependency_overrides[get_jwt_payload] = mock_jwt_user["get_jwt_payload"]
response = client.post(f"/admin/submissions/{SLV_ID}/add-to-library")
assert response.status_code == 403
def test_preview_nonexistent_submission(
mocker: pytest_mock.MockerFixture,
) -> None:
"""Preview of a nonexistent submission returns 404."""
mocker.patch(
"backend.api.features.admin.store_admin_routes.store_db"
".get_store_agent_details_as_admin",
side_effect=NotFoundError("not found"),
)
response = client.get(f"/admin/submissions/{SLV_ID}/preview")
assert response.status_code == 404
# ---- SECRT-2167 bypass: verify the right data sources are used ---- #
@pytest.mark.asyncio
async def test_preview_queries_store_listing_version_not_store_agent() -> None:
"""get_store_agent_details_as_admin must query StoreListingVersion
directly (not the APPROVED-only StoreAgent view). This is THE test that
prevents the bypass from being accidentally reverted."""
from backend.api.features.store.db import get_store_agent_details_as_admin
mock_slv = MagicMock()
mock_slv.id = SLV_ID
mock_slv.name = "Test Agent"
mock_slv.subHeading = "Short desc"
mock_slv.description = "Long desc"
mock_slv.videoUrl = None
mock_slv.agentOutputDemoUrl = None
mock_slv.imageUrls = ["https://example.com/img.png"]
mock_slv.instructions = None
mock_slv.categories = ["productivity"]
mock_slv.version = 1
mock_slv.agentGraphId = GRAPH_ID
mock_slv.agentGraphVersion = GRAPH_VERSION
mock_slv.updatedAt = datetime(2026, 3, 24, tzinfo=timezone.utc)
mock_slv.recommendedScheduleCron = "0 9 * * *"
mock_listing = MagicMock()
mock_listing.id = "listing-id"
mock_listing.slug = "test-agent"
mock_listing.activeVersionId = SLV_ID
mock_listing.hasApprovedVersion = False
mock_listing.CreatorProfile = MagicMock(username="creator", avatarUrl="")
mock_slv.StoreListing = mock_listing
with (
patch(
"backend.api.features.store.db.prisma.models" ".StoreListingVersion.prisma",
) as mock_slv_prisma,
patch(
"backend.api.features.store.db.prisma.models.StoreAgent.prisma",
) as mock_store_agent_prisma,
):
mock_slv_prisma.return_value.find_unique = AsyncMock(return_value=mock_slv)
result = await get_store_agent_details_as_admin(SLV_ID)
# Verify it queried StoreListingVersion (not the APPROVED-only StoreAgent)
mock_slv_prisma.return_value.find_unique.assert_awaited_once()
await_args = mock_slv_prisma.return_value.find_unique.await_args
assert await_args is not None
assert await_args.kwargs["where"] == {"id": SLV_ID}
# Verify the APPROVED-only StoreAgent view was NOT touched
mock_store_agent_prisma.assert_not_called()
# Verify the result has the right data
assert result.agent_name == "Test Agent"
assert result.agent_image == ["https://example.com/img.png"]
assert result.has_approved_version is False
assert result.runs == 0
assert result.rating == 0.0
@pytest.mark.asyncio
async def test_resolve_graph_admin_uses_get_graph_as_admin() -> None:
"""resolve_graph_for_library(admin=True) must call get_graph_as_admin,
not get_graph. This is THE test that prevents the add-to-library bypass
from being accidentally reverted."""
from backend.api.features.library._add_to_library import resolve_graph_for_library
mock_slv = MagicMock()
mock_slv.AgentGraph = MagicMock(id=GRAPH_ID, version=GRAPH_VERSION)
mock_graph_model = MagicMock(name="GraphModel")
with (
patch(
"backend.api.features.library._add_to_library.prisma.models"
".StoreListingVersion.prisma",
) as mock_prisma,
patch(
"backend.api.features.library._add_to_library.graph_db"
".get_graph_as_admin",
new_callable=AsyncMock,
return_value=mock_graph_model,
) as mock_admin,
patch(
"backend.api.features.library._add_to_library.graph_db.get_graph",
new_callable=AsyncMock,
) as mock_regular,
):
mock_prisma.return_value.find_unique = AsyncMock(return_value=mock_slv)
result = await resolve_graph_for_library(SLV_ID, ADMIN_USER_ID, admin=True)
assert result is mock_graph_model
mock_admin.assert_awaited_once_with(
graph_id=GRAPH_ID, version=GRAPH_VERSION, user_id=ADMIN_USER_ID
)
mock_regular.assert_not_awaited()
@pytest.mark.asyncio
async def test_resolve_graph_regular_uses_get_graph() -> None:
"""resolve_graph_for_library(admin=False) must call get_graph,
not get_graph_as_admin. Ensures the non-admin path is preserved."""
from backend.api.features.library._add_to_library import resolve_graph_for_library
mock_slv = MagicMock()
mock_slv.AgentGraph = MagicMock(id=GRAPH_ID, version=GRAPH_VERSION)
mock_graph_model = MagicMock(name="GraphModel")
with (
patch(
"backend.api.features.library._add_to_library.prisma.models"
".StoreListingVersion.prisma",
) as mock_prisma,
patch(
"backend.api.features.library._add_to_library.graph_db"
".get_graph_as_admin",
new_callable=AsyncMock,
) as mock_admin,
patch(
"backend.api.features.library._add_to_library.graph_db.get_graph",
new_callable=AsyncMock,
return_value=mock_graph_model,
) as mock_regular,
):
mock_prisma.return_value.find_unique = AsyncMock(return_value=mock_slv)
result = await resolve_graph_for_library(SLV_ID, "regular-user-id", admin=False)
assert result is mock_graph_model
mock_regular.assert_awaited_once_with(
graph_id=GRAPH_ID, version=GRAPH_VERSION, user_id="regular-user-id"
)
mock_admin.assert_not_awaited()
# ---- Library membership grants graph access (product decision) ---- #
@pytest.mark.asyncio
async def test_library_member_can_view_pending_agent_in_builder() -> None:
"""After adding a pending agent to their library, the user should be
able to load the graph in the builder via get_graph()."""
mock_graph = _make_mock_graph()
mock_graph_model = MagicMock(name="GraphModel")
mock_library_agent = MagicMock()
mock_library_agent.AgentGraph = mock_graph
with (
patch("backend.data.graph.AgentGraph.prisma") as mock_ag_prisma,
patch(
"backend.data.graph.StoreListingVersion.prisma",
) as mock_slv_prisma,
patch("backend.data.graph.LibraryAgent.prisma") as mock_lib_prisma,
patch(
"backend.data.graph.GraphModel.from_db",
return_value=mock_graph_model,
),
):
mock_ag_prisma.return_value.find_first = AsyncMock(return_value=None)
mock_slv_prisma.return_value.find_first = AsyncMock(return_value=None)
mock_lib_prisma.return_value.find_first = AsyncMock(
return_value=mock_library_agent
)
from backend.data.graph import get_graph
result = await get_graph(
graph_id=GRAPH_ID,
version=GRAPH_VERSION,
user_id=ADMIN_USER_ID,
)
assert result is mock_graph_model, "Library membership should grant graph access"
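
The last test implies a fallback chain in `get_graph()`: ownership, then an approved marketplace listing, then library membership. The sketch below illustrates such a chain under stated assumptions: the order is inferred from the mocks in the test (owner and listing lookups return `None`, the library lookup succeeds), and the set-based lookups and the `resolve_access` name are hypothetical.

```python
from typing import Optional, Set, Tuple


def resolve_access(
    graph_id: str,
    user_id: str,
    owned: Set[Tuple[str, str]],
    approved: Set[str],
    library: Set[Tuple[str, str]],
) -> Optional[str]:
    """Return the reason access is granted, or None if it is denied."""
    if (user_id, graph_id) in owned:
        return "owner"
    if graph_id in approved:
        return "marketplace"
    if (user_id, graph_id) in library:
        return "library"  # "you added it, you keep it"
    return None
```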

View File

@@ -11,7 +11,7 @@ from autogpt_libs import auth
from fastapi import APIRouter, HTTPException, Query, Response, Security
from fastapi.responses import StreamingResponse
from prisma.models import UserWorkspaceFile
-from pydantic import BaseModel, Field, field_validator
+from pydantic import BaseModel, ConfigDict, Field, field_validator
from backend.copilot import service as chat_service
from backend.copilot import stream_registry
@@ -20,6 +20,7 @@ from backend.copilot.executor.utils import enqueue_cancel_task, enqueue_copilot_
from backend.copilot.model import (
ChatMessage,
ChatSession,
ChatSessionMetadata,
append_and_save_message,
create_chat_session,
delete_chat_session,
@@ -30,8 +31,14 @@ from backend.copilot.model import (
from backend.copilot.rate_limit import (
CoPilotUsageStatus,
RateLimitExceeded,
acquire_reset_lock,
check_rate_limit,
get_daily_reset_count,
get_global_rate_limits,
get_usage_status,
increment_daily_reset_count,
release_reset_lock,
reset_daily_usage,
)
from backend.copilot.response_model import StreamError, StreamFinish, StreamHeartbeat
from backend.copilot.tools.e2b_sandbox import kill_sandbox
@@ -59,9 +66,16 @@ from backend.copilot.tools.models import (
UnderstandingUpdatedResponse,
)
from backend.copilot.tracking import track_user_message
from backend.data.credit import UsageTransactionMetadata, get_user_credit_model
from backend.data.redis_client import get_redis_async
from backend.data.understanding import get_business_understanding
from backend.data.workspace import get_or_create_workspace
-from backend.util.exceptions import NotFoundError
+from backend.util.exceptions import InsufficientBalanceError, NotFoundError
from backend.util.settings import Settings
settings = Settings()
logger = logging.getLogger(__name__)
config = ChatConfig()
@@ -69,8 +83,6 @@ _UUID_RE = re.compile(
r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)
-logger = logging.getLogger(__name__)
async def _validate_and_get_session(
session_id: str,
@@ -101,12 +113,25 @@ class StreamChatRequest(BaseModel):
) # Workspace file IDs attached to this message
class CreateSessionRequest(BaseModel):
"""Request model for creating a new chat session.
``dry_run`` is a **top-level** field — do not nest it inside ``metadata``.
Extra/unknown fields are rejected (422) to prevent silent mis-use.
"""
model_config = ConfigDict(extra="forbid")
dry_run: bool = False
class CreateSessionResponse(BaseModel):
"""Response model containing information on a newly created chat session."""
id: str
created_at: str
user_id: str | None
metadata: ChatSessionMetadata = ChatSessionMetadata()
class ActiveStreamInfo(BaseModel):
@@ -127,6 +152,7 @@ class SessionDetailResponse(BaseModel):
active_stream: ActiveStreamInfo | None = None # Present if stream is still active
total_prompt_tokens: int = 0
total_completion_tokens: int = 0
metadata: ChatSessionMetadata = ChatSessionMetadata()
class SessionSummaryResponse(BaseModel):
@@ -237,6 +263,7 @@ async def list_sessions(
)
async def create_session(
user_id: Annotated[str, Security(auth.get_user_id)],
request: CreateSessionRequest | None = None,
) -> CreateSessionResponse:
"""
Create a new chat session.
@@ -245,22 +272,28 @@ async def create_session(
Args:
user_id: The authenticated user ID parsed from the JWT (required).
request: Optional request body. When provided, ``dry_run=True``
forces run_block and run_agent calls to use dry-run simulation.
Returns:
CreateSessionResponse: Details of the created session.
"""
dry_run = request.dry_run if request else False
logger.info(
f"Creating session with user_id: "
f"...{user_id[-8:] if len(user_id) > 8 else '<redacted>'}"
f"{', dry_run=True' if dry_run else ''}"
)
-session = await create_chat_session(user_id)
+session = await create_chat_session(user_id, dry_run=dry_run)
return CreateSessionResponse(
id=session.session_id,
created_at=session.started_at.isoformat(),
user_id=session.user_id,
metadata=session.metadata,
)
@@ -409,6 +442,7 @@ async def get_session(
active_stream=active_stream_info,
total_prompt_tokens=total_prompt,
total_completion_tokens=total_completion,
metadata=session.metadata,
)
@@ -421,11 +455,187 @@ async def get_copilot_usage(
"""Get CoPilot usage status for the authenticated user.
Returns current token usage vs limits for daily and weekly windows.
Global defaults sourced from LaunchDarkly (falling back to config).
"""
daily_limit, weekly_limit = await get_global_rate_limits(
user_id, config.daily_token_limit, config.weekly_token_limit
)
return await get_usage_status(
user_id=user_id,
-daily_token_limit=config.daily_token_limit,
-weekly_token_limit=config.weekly_token_limit,
+daily_token_limit=daily_limit,
+weekly_token_limit=weekly_limit,
+rate_limit_reset_cost=config.rate_limit_reset_cost,
)
class RateLimitResetResponse(BaseModel):
"""Response from resetting the daily rate limit."""
success: bool
credits_charged: int = Field(description="Credits charged (in cents)")
remaining_balance: int = Field(description="Credit balance after charge (in cents)")
usage: CoPilotUsageStatus = Field(description="Updated usage status after reset")
@router.post(
"/usage/reset",
status_code=200,
responses={
400: {
"description": "Bad Request (feature disabled or daily limit not reached)"
},
402: {"description": "Payment Required (insufficient credits)"},
429: {
"description": "Too Many Requests (max daily resets exceeded or reset in progress)"
},
503: {
"description": "Service Unavailable (Redis reset failed; credits refunded or support needed)"
},
},
)
async def reset_copilot_usage(
user_id: Annotated[str, Security(auth.get_user_id)],
) -> RateLimitResetResponse:
"""Reset the daily CoPilot rate limit by spending credits.
Allows users who have hit their daily token limit to spend credits
to reset their daily usage counter and continue working.
Returns 400 if the feature is disabled or the user is not over the limit.
Returns 402 if the user has insufficient credits.
"""
cost = config.rate_limit_reset_cost
if cost <= 0:
raise HTTPException(
status_code=400,
detail="Rate limit reset is not available.",
)
if not settings.config.enable_credit:
raise HTTPException(
status_code=400,
detail="Rate limit reset is not available (credit system is disabled).",
)
daily_limit, weekly_limit = await get_global_rate_limits(
user_id, config.daily_token_limit, config.weekly_token_limit
)
if daily_limit <= 0:
raise HTTPException(
status_code=400,
detail="No daily limit is configured — nothing to reset.",
)
# Check max daily resets. get_daily_reset_count returns None when Redis
# is unavailable; reject the reset in that case to prevent unlimited
# free resets when the counter store is down.
reset_count = await get_daily_reset_count(user_id)
if reset_count is None:
raise HTTPException(
status_code=503,
detail="Unable to verify reset eligibility — please try again later.",
)
if config.max_daily_resets > 0 and reset_count >= config.max_daily_resets:
raise HTTPException(
status_code=429,
detail=f"You've used all {config.max_daily_resets} resets for today.",
)
# Acquire a per-user lock to prevent TOCTOU races (concurrent resets).
if not await acquire_reset_lock(user_id):
raise HTTPException(
status_code=429,
detail="A reset is already in progress. Please try again.",
)
try:
# Verify the user is actually at or over their daily limit.
usage_status = await get_usage_status(
user_id=user_id,
daily_token_limit=daily_limit,
weekly_token_limit=weekly_limit,
)
if daily_limit > 0 and usage_status.daily.used < daily_limit:
raise HTTPException(
status_code=400,
detail="You have not reached your daily limit yet.",
)
# If the weekly limit is also exhausted, resetting the daily counter
# won't help — the user would still be blocked by the weekly limit.
if weekly_limit > 0 and usage_status.weekly.used >= weekly_limit:
raise HTTPException(
status_code=400,
detail="Your weekly limit is also reached. Resetting the daily limit won't help.",
)
# Charge credits.
credit_model = await get_user_credit_model(user_id)
try:
remaining = await credit_model.spend_credits(
user_id=user_id,
cost=cost,
metadata=UsageTransactionMetadata(
reason="CoPilot daily rate limit reset",
),
)
except InsufficientBalanceError as e:
raise HTTPException(
status_code=402,
detail="Insufficient credits to reset your rate limit.",
) from e
# Reset daily usage in Redis. If this fails, refund the credits
# so the user is not charged for a service they did not receive.
if not await reset_daily_usage(user_id, daily_token_limit=daily_limit):
# Compensate: refund the charged credits.
refunded = False
try:
await credit_model.top_up_credits(user_id, cost)
refunded = True
logger.warning(
"Refunded %d credits to user %s after Redis reset failure",
cost,
user_id[:8],
)
except Exception:
logger.error(
"CRITICAL: Failed to refund %d credits to user %s "
"after Redis reset failure — manual intervention required",
cost,
user_id[:8],
exc_info=True,
)
if refunded:
raise HTTPException(
status_code=503,
detail="Rate limit reset failed — please try again later. "
"Your credits have not been charged.",
)
raise HTTPException(
status_code=503,
detail="Rate limit reset failed and the automatic refund "
"also failed. Please contact support for assistance.",
)
# Track the reset count for daily cap enforcement.
await increment_daily_reset_count(user_id)
finally:
await release_reset_lock(user_id)
# Return updated usage status.
updated_usage = await get_usage_status(
user_id=user_id,
daily_token_limit=daily_limit,
weekly_token_limit=weekly_limit,
rate_limit_reset_cost=config.rate_limit_reset_cost,
)
return RateLimitResetResponse(
success=True,
credits_charged=cost,
remaining_balance=remaining,
usage=updated_usage,
)
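Review note: the charge-then-reset flow above relies on a compensating refund when the Redis reset fails. A minimal, dependency-free sketch of that ordering follows. `FakeWallet`, `InsufficientBalance`, `paid_reset`, and the returned outcome strings are hypothetical stand-ins used only for illustration; they are not the project's credit model or the route's actual return values.

```python
import asyncio
from typing import Awaitable, Callable


class InsufficientBalance(Exception):
    """Hypothetical stand-in for the route's InsufficientBalanceError."""


class FakeWallet:
    """In-memory stand-in for the credit model (illustrative only)."""

    def __init__(self, balance: int) -> None:
        self.balance = balance

    async def spend(self, cost: int) -> int:
        if cost > self.balance:
            raise InsufficientBalance
        self.balance -= cost
        return self.balance

    async def refund(self, cost: int) -> None:
        self.balance += cost


async def paid_reset(
    wallet: FakeWallet,
    cost: int,
    reset: Callable[[], Awaitable[bool]],
) -> str:
    """Charge first, then reset; refund if the reset reports failure."""
    try:
        await wallet.spend(cost)
    except InsufficientBalance:
        return "insufficient_credits"  # the route answers 402 here

    if await reset():
        return "ok"

    # Compensation step: the reset did not happen, so give the credits back.
    try:
        await wallet.refund(cost)
    except Exception:
        return "reset_failed_refund_failed"  # the route answers 503, support needed
    return "reset_failed_refunded"  # the route answers 503, balance restored
```

Charging before the reset means a failure can leave the user temporarily out of pocket, which is why the refund path and the "manual intervention" log line matter; doing the reset first would instead risk handing out free resets.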
@@ -526,12 +736,16 @@ async def stream_chat_post(
# Pre-turn rate limit check (token-based).
# check_rate_limit short-circuits internally when both limits are 0.
# Global defaults sourced from LaunchDarkly, falling back to config.
if user_id:
try:
daily_limit, weekly_limit = await get_global_rate_limits(
user_id, config.daily_token_limit, config.weekly_token_limit
)
await check_rate_limit(
user_id=user_id,
daily_token_limit=config.daily_token_limit,
weekly_token_limit=config.weekly_token_limit,
daily_token_limit=daily_limit,
weekly_token_limit=weekly_limit,
)
except RateLimitExceeded as e:
raise HTTPException(status_code=429, detail=str(e)) from e
@@ -894,6 +1108,47 @@ async def session_assign_user(
return {"status": "ok"}
# ========== Suggested Prompts ==========
class SuggestedTheme(BaseModel):
"""A themed group of suggested prompts."""
name: str
prompts: list[str]
class SuggestedPromptsResponse(BaseModel):
"""Response model for user-specific suggested prompts grouped by theme."""
themes: list[SuggestedTheme]
@router.get(
"/suggested-prompts",
dependencies=[Security(auth.requires_user)],
)
async def get_suggested_prompts(
user_id: Annotated[str, Security(auth.get_user_id)],
) -> SuggestedPromptsResponse:
"""
Get LLM-generated suggested prompts grouped by theme.
Returns personalized quick-action prompts based on the user's
business understanding. Returns empty themes list if no custom
prompts are available.
"""
understanding = await get_business_understanding(user_id)
if understanding is None or not understanding.suggested_prompts:
return SuggestedPromptsResponse(themes=[])
themes = [
SuggestedTheme(name=name, prompts=prompts)
for name, prompts in understanding.suggested_prompts.items()
]
return SuggestedPromptsResponse(themes=themes)
# ========== Configuration ==========
@@ -942,7 +1197,7 @@ async def health_check() -> dict:
)
# Create and retrieve session to verify full data layer
session = await create_chat_session(health_check_user_id)
session = await create_chat_session(health_check_user_id, dry_run=False)
await get_chat_session(session.session_id, health_check_user_id)
return {


@@ -1,7 +1,7 @@
"""Tests for chat API routes: session title update, file attachment validation, usage, and rate limiting."""
from datetime import UTC, datetime, timedelta
from unittest.mock import AsyncMock
from unittest.mock import AsyncMock, MagicMock
import fastapi
import fastapi.testclient
@@ -368,6 +368,7 @@ def test_usage_returns_daily_and_weekly(
user_id=test_user_id,
daily_token_limit=10000,
weekly_token_limit=50000,
rate_limit_reset_cost=chat_routes.config.rate_limit_reset_cost,
)
@@ -380,6 +381,7 @@ def test_usage_uses_config_limits(
mocker.patch.object(chat_routes.config, "daily_token_limit", 99999)
mocker.patch.object(chat_routes.config, "weekly_token_limit", 77777)
mocker.patch.object(chat_routes.config, "rate_limit_reset_cost", 500)
response = client.get("/usage")
@@ -388,6 +390,7 @@ def test_usage_uses_config_limits(
user_id=test_user_id,
daily_token_limit=99999,
weekly_token_limit=77777,
rate_limit_reset_cost=500,
)
@@ -400,3 +403,126 @@ def test_usage_rejects_unauthenticated_request() -> None:
response = unauthenticated_client.get("/usage")
assert response.status_code == 401
# ─── Suggested prompts endpoint ──────────────────────────────────────
def _mock_get_business_understanding(
mocker: pytest_mock.MockerFixture,
*,
return_value=None,
):
"""Mock get_business_understanding."""
return mocker.patch(
"backend.api.features.chat.routes.get_business_understanding",
new_callable=AsyncMock,
return_value=return_value,
)
def test_suggested_prompts_returns_themes(
mocker: pytest_mock.MockerFixture,
test_user_id: str,
) -> None:
"""User with themed prompts gets them back as themes list."""
mock_understanding = MagicMock()
mock_understanding.suggested_prompts = {
"Learn": ["L1", "L2"],
"Create": ["C1"],
}
_mock_get_business_understanding(mocker, return_value=mock_understanding)
response = client.get("/suggested-prompts")
assert response.status_code == 200
data = response.json()
assert "themes" in data
themes_by_name = {t["name"]: t["prompts"] for t in data["themes"]}
assert themes_by_name["Learn"] == ["L1", "L2"]
assert themes_by_name["Create"] == ["C1"]
def test_suggested_prompts_no_understanding(
mocker: pytest_mock.MockerFixture,
test_user_id: str,
) -> None:
"""User with no understanding gets empty themes list."""
_mock_get_business_understanding(mocker, return_value=None)
response = client.get("/suggested-prompts")
assert response.status_code == 200
assert response.json() == {"themes": []}
def test_suggested_prompts_empty_prompts(
mocker: pytest_mock.MockerFixture,
test_user_id: str,
) -> None:
"""User with understanding but empty prompts gets empty themes list."""
mock_understanding = MagicMock()
mock_understanding.suggested_prompts = {}
_mock_get_business_understanding(mocker, return_value=mock_understanding)
response = client.get("/suggested-prompts")
assert response.status_code == 200
assert response.json() == {"themes": []}
# ─── Create session: dry_run contract ─────────────────────────────────
def _mock_create_chat_session(mocker: pytest_mock.MockerFixture):
"""Mock create_chat_session to return a fake session."""
from backend.copilot.model import ChatSession
async def _fake_create(user_id: str, *, dry_run: bool):
return ChatSession.new(user_id, dry_run=dry_run)
return mocker.patch(
"backend.api.features.chat.routes.create_chat_session",
new_callable=AsyncMock,
side_effect=_fake_create,
)
def test_create_session_dry_run_true(
mocker: pytest_mock.MockerFixture,
test_user_id: str,
) -> None:
"""Sending ``{"dry_run": true}`` sets metadata.dry_run to True."""
_mock_create_chat_session(mocker)
response = client.post("/sessions", json={"dry_run": True})
assert response.status_code == 200
assert response.json()["metadata"]["dry_run"] is True
def test_create_session_dry_run_default_false(
mocker: pytest_mock.MockerFixture,
test_user_id: str,
) -> None:
"""Empty body defaults dry_run to False."""
_mock_create_chat_session(mocker)
response = client.post("/sessions")
assert response.status_code == 200
assert response.json()["metadata"]["dry_run"] is False
def test_create_session_rejects_nested_metadata(
test_user_id: str,
) -> None:
"""Sending ``{"metadata": {"dry_run": true}}`` must return 422, not silently
default to ``dry_run=False``. This guards against the common mistake of
nesting dry_run inside metadata instead of providing it at the top level."""
response = client.post(
"/sessions",
json={"metadata": {"dry_run": True}},
)
assert response.status_code == 422
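Review note: the 422 in the last test presumably comes from the request model rejecting unknown top-level fields (with Pydantic that would typically be an `extra="forbid"` setting, but the model is not shown in this diff, so that is an assumption). A dependency-free sketch of the same contract, with `ALLOWED_KEYS` and `parse_create_session` as hypothetical names:

```python
from typing import Any, Dict, Optional

# Assumed: dry_run is the only top-level field, as exercised by the tests above.
ALLOWED_KEYS = {"dry_run"}


def parse_create_session(body: Optional[Dict[str, Any]]) -> bool:
    """Return dry_run, rejecting unknown keys instead of silently ignoring them."""
    body = body or {}
    unknown = set(body) - ALLOWED_KEYS
    if unknown:
        # In the real route this would surface as a 422 validation error.
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    dry_run = body.get("dry_run", False)
    if not isinstance(dry_run, bool):
        raise ValueError("dry_run must be a boolean")
    return dry_run
```

Rejecting the nested form loudly is the point: a lenient parser would accept `{"metadata": {"dry_run": true}}` and quietly run with `dry_run=False`.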


@@ -0,0 +1,13 @@
"""Override session-scoped fixtures so unit tests run without the server."""
import pytest
@pytest.fixture(scope="session")
def server():
yield None
@pytest.fixture(scope="session", autouse=True)
def graph_cleanup():
yield


@@ -34,16 +34,21 @@ from backend.data.model import (
HostScopedCredentials,
OAuth2Credentials,
UserIntegrations,
is_sdk_default,
)
from backend.data.onboarding import OnboardingStep, complete_onboarding_step
from backend.data.user import get_user_integrations
from backend.executor.utils import add_graph_execution
from backend.integrations.ayrshare import AyrshareClient, SocialPlatform
from backend.integrations.credentials_store import provider_matches
from backend.integrations.credentials_store import (
is_system_credential,
provider_matches,
)
from backend.integrations.creds_manager import (
IntegrationCredentialsManager,
create_mcp_oauth_handler,
)
from backend.integrations.managed_credentials import ensure_managed_credentials
from backend.integrations.oauth import CREDENTIALS_BY_PROVIDER, HANDLERS_BY_NAME
from backend.integrations.providers import ProviderName
from backend.integrations.webhooks import get_webhook_manager
@@ -109,6 +114,7 @@ class CredentialsMetaResponse(BaseModel):
default=None,
description="Host pattern for host-scoped or MCP server URL for MCP credentials",
)
is_managed: bool = False
@model_validator(mode="before")
@classmethod
@@ -138,6 +144,19 @@ class CredentialsMetaResponse(BaseModel):
return None
def to_meta_response(cred: Credentials) -> CredentialsMetaResponse:
return CredentialsMetaResponse(
id=cred.id,
provider=cred.provider,
type=cred.type,
title=cred.title,
scopes=cred.scopes if isinstance(cred, OAuth2Credentials) else None,
username=cred.username if isinstance(cred, OAuth2Credentials) else None,
host=CredentialsMetaResponse.get_host(cred),
is_managed=cred.is_managed,
)
@router.post("/{provider}/callback", summary="Exchange OAuth code for tokens")
async def callback(
provider: Annotated[
@@ -204,34 +223,20 @@ async def callback(
f"and provider {provider.value}"
)
return CredentialsMetaResponse(
id=credentials.id,
provider=credentials.provider,
type=credentials.type,
title=credentials.title,
scopes=credentials.scopes,
username=credentials.username,
host=(CredentialsMetaResponse.get_host(credentials)),
)
return to_meta_response(credentials)
@router.get("/credentials", summary="List Credentials")
async def list_credentials(
user_id: Annotated[str, Security(get_user_id)],
) -> list[CredentialsMetaResponse]:
# Fire-and-forget: provision missing managed credentials in the background.
# The credential appears on the next page load; listing is never blocked.
asyncio.create_task(ensure_managed_credentials(user_id, creds_manager.store))
credentials = await creds_manager.store.get_all_creds(user_id)
return [
CredentialsMetaResponse(
id=cred.id,
provider=cred.provider,
type=cred.type,
title=cred.title,
scopes=cred.scopes if isinstance(cred, OAuth2Credentials) else None,
username=cred.username if isinstance(cred, OAuth2Credentials) else None,
host=CredentialsMetaResponse.get_host(cred),
)
for cred in credentials
to_meta_response(cred) for cred in credentials if not is_sdk_default(cred.id)
]
@@ -242,19 +247,11 @@ async def list_credentials_by_provider(
],
user_id: Annotated[str, Security(get_user_id)],
) -> list[CredentialsMetaResponse]:
asyncio.create_task(ensure_managed_credentials(user_id, creds_manager.store))
credentials = await creds_manager.store.get_creds_by_provider(user_id, provider)
return [
CredentialsMetaResponse(
id=cred.id,
provider=cred.provider,
type=cred.type,
title=cred.title,
scopes=cred.scopes if isinstance(cred, OAuth2Credentials) else None,
username=cred.username if isinstance(cred, OAuth2Credentials) else None,
host=CredentialsMetaResponse.get_host(cred),
)
for cred in credentials
to_meta_response(cred) for cred in credentials if not is_sdk_default(cred.id)
]
@@ -267,18 +264,21 @@ async def get_credential(
],
cred_id: Annotated[str, Path(title="The ID of the credentials to retrieve")],
user_id: Annotated[str, Security(get_user_id)],
) -> Credentials:
) -> CredentialsMetaResponse:
if is_sdk_default(cred_id):
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND, detail="Credentials not found"
)
credential = await creds_manager.get(user_id, cred_id)
if not credential:
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND, detail="Credentials not found"
)
if credential.provider != provider:
if not provider_matches(credential.provider, provider):
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail="Credentials do not match the specified provider",
status_code=status.HTTP_404_NOT_FOUND, detail="Credentials not found"
)
return credential
return to_meta_response(credential)
@router.post("/{provider}/credentials", status_code=201, summary="Create Credentials")
@@ -288,16 +288,22 @@ async def create_credentials(
ProviderName, Path(title="The provider to create credentials for")
],
credentials: Credentials,
) -> Credentials:
) -> CredentialsMetaResponse:
if is_sdk_default(credentials.id):
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="Cannot create credentials with a reserved ID",
)
credentials.provider = provider
try:
await creds_manager.create(user_id, credentials)
except Exception as e:
except Exception:
logger.exception("Failed to store credentials")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=f"Failed to store credentials: {str(e)}",
detail="Failed to store credentials",
)
return credentials
return to_meta_response(credentials)
class CredentialsDeletionResponse(BaseModel):
@@ -332,15 +338,29 @@ async def delete_credentials(
bool, Query(title="Whether to proceed if any linked webhooks are still in use")
] = False,
) -> CredentialsDeletionResponse | CredentialsDeletionNeedsConfirmationResponse:
if is_sdk_default(cred_id):
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND, detail="Credentials not found"
)
if is_system_credential(cred_id):
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="System-managed credentials cannot be deleted",
)
creds = await creds_manager.store.get_creds_by_id(user_id, cred_id)
if not creds:
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND, detail="Credentials not found"
)
if creds.provider != provider:
if not provider_matches(creds.provider, provider):
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail="Credentials do not match the specified provider",
detail="Credentials not found",
)
if creds.is_managed:
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="AutoGPT-managed credentials cannot be deleted",
)
try:


@@ -0,0 +1,570 @@
"""Tests for credentials API security: no secret leakage, SDK defaults filtered."""
from contextlib import asynccontextmanager
from unittest.mock import AsyncMock, MagicMock, patch
import fastapi
import fastapi.testclient
import pytest
from pydantic import SecretStr
from backend.api.features.integrations.router import router
from backend.data.model import (
APIKeyCredentials,
HostScopedCredentials,
OAuth2Credentials,
UserPasswordCredentials,
)
app = fastapi.FastAPI()
app.include_router(router)
client = fastapi.testclient.TestClient(app)
TEST_USER_ID = "test-user-id"
def _make_api_key_cred(cred_id: str = "cred-123", provider: str = "openai"):
return APIKeyCredentials(
id=cred_id,
provider=provider,
title="My API Key",
api_key=SecretStr("sk-secret-key-value"),
)
def _make_oauth2_cred(cred_id: str = "cred-456", provider: str = "github"):
return OAuth2Credentials(
id=cred_id,
provider=provider,
title="My OAuth",
access_token=SecretStr("ghp_secret_token"),
refresh_token=SecretStr("ghp_refresh_secret"),
scopes=["repo", "user"],
username="testuser",
)
def _make_user_password_cred(cred_id: str = "cred-789", provider: str = "openai"):
return UserPasswordCredentials(
id=cred_id,
provider=provider,
title="My Login",
username=SecretStr("admin"),
password=SecretStr("s3cret-pass"),
)
def _make_host_scoped_cred(cred_id: str = "cred-host", provider: str = "openai"):
return HostScopedCredentials(
id=cred_id,
provider=provider,
title="Host Cred",
host="https://api.example.com",
headers={"Authorization": SecretStr("Bearer top-secret")},
)
def _make_sdk_default_cred(provider: str = "openai"):
return APIKeyCredentials(
id=f"{provider}-default",
provider=provider,
title=f"{provider} (default)",
api_key=SecretStr("sk-platform-secret-key"),
)
@pytest.fixture(autouse=True)
def setup_auth(mock_jwt_user):
from autogpt_libs.auth.jwt_utils import get_jwt_payload
app.dependency_overrides[get_jwt_payload] = mock_jwt_user["get_jwt_payload"]
yield
app.dependency_overrides.clear()
class TestGetCredentialReturnsMetaOnly:
"""GET /{provider}/credentials/{cred_id} must not return secrets."""
def test_api_key_credential_no_secret(self):
cred = _make_api_key_cred()
with (
patch.object(router, "dependencies", []),
patch("backend.api.features.integrations.router.creds_manager") as mock_mgr,
):
mock_mgr.get = AsyncMock(return_value=cred)
resp = client.get("/openai/credentials/cred-123")
assert resp.status_code == 200
data = resp.json()
assert data["id"] == "cred-123"
assert data["provider"] == "openai"
assert data["type"] == "api_key"
assert "api_key" not in data
assert "sk-secret-key-value" not in str(data)
def test_oauth2_credential_no_secret(self):
cred = _make_oauth2_cred()
with patch(
"backend.api.features.integrations.router.creds_manager"
) as mock_mgr:
mock_mgr.get = AsyncMock(return_value=cred)
resp = client.get("/github/credentials/cred-456")
assert resp.status_code == 200
data = resp.json()
assert data["id"] == "cred-456"
assert data["scopes"] == ["repo", "user"]
assert data["username"] == "testuser"
assert "access_token" not in data
assert "refresh_token" not in data
assert "ghp_" not in str(data)
def test_user_password_credential_no_secret(self):
cred = _make_user_password_cred()
with patch(
"backend.api.features.integrations.router.creds_manager"
) as mock_mgr:
mock_mgr.get = AsyncMock(return_value=cred)
resp = client.get("/openai/credentials/cred-789")
assert resp.status_code == 200
data = resp.json()
assert data["id"] == "cred-789"
assert "password" not in data
assert "username" not in data or data["username"] is None
assert "s3cret-pass" not in str(data)
assert "admin" not in str(data)
def test_host_scoped_credential_no_secret(self):
cred = _make_host_scoped_cred()
with patch(
"backend.api.features.integrations.router.creds_manager"
) as mock_mgr:
mock_mgr.get = AsyncMock(return_value=cred)
resp = client.get("/openai/credentials/cred-host")
assert resp.status_code == 200
data = resp.json()
assert data["id"] == "cred-host"
assert data["host"] == "https://api.example.com"
assert "headers" not in data
assert "top-secret" not in str(data)
def test_get_credential_wrong_provider_returns_404(self):
"""Provider mismatch should return generic 404, not leak credential existence."""
cred = _make_api_key_cred(provider="openai")
with patch(
"backend.api.features.integrations.router.creds_manager"
) as mock_mgr:
mock_mgr.get = AsyncMock(return_value=cred)
resp = client.get("/github/credentials/cred-123")
assert resp.status_code == 404
assert resp.json()["detail"] == "Credentials not found"
def test_list_credentials_no_secrets(self):
"""List endpoint must not leak secrets in any credential."""
creds = [_make_api_key_cred(), _make_oauth2_cred()]
with patch(
"backend.api.features.integrations.router.creds_manager"
) as mock_mgr:
mock_mgr.store.get_all_creds = AsyncMock(return_value=creds)
resp = client.get("/credentials")
assert resp.status_code == 200
raw = str(resp.json())
assert "sk-secret-key-value" not in raw
assert "ghp_secret_token" not in raw
assert "ghp_refresh_secret" not in raw
class TestSdkDefaultCredentialsNotAccessible:
"""SDK default credentials (ID ending in '-default') must be hidden."""
def test_get_sdk_default_returns_404(self):
with patch(
"backend.api.features.integrations.router.creds_manager"
) as mock_mgr:
mock_mgr.get = AsyncMock()
resp = client.get("/openai/credentials/openai-default")
assert resp.status_code == 404
mock_mgr.get.assert_not_called()
def test_list_credentials_excludes_sdk_defaults(self):
user_cred = _make_api_key_cred()
sdk_cred = _make_sdk_default_cred("openai")
with patch(
"backend.api.features.integrations.router.creds_manager"
) as mock_mgr:
mock_mgr.store.get_all_creds = AsyncMock(return_value=[user_cred, sdk_cred])
resp = client.get("/credentials")
assert resp.status_code == 200
data = resp.json()
ids = [c["id"] for c in data]
assert "cred-123" in ids
assert "openai-default" not in ids
def test_list_by_provider_excludes_sdk_defaults(self):
user_cred = _make_api_key_cred()
sdk_cred = _make_sdk_default_cred("openai")
with patch(
"backend.api.features.integrations.router.creds_manager"
) as mock_mgr:
mock_mgr.store.get_creds_by_provider = AsyncMock(
return_value=[user_cred, sdk_cred]
)
resp = client.get("/openai/credentials")
assert resp.status_code == 200
data = resp.json()
ids = [c["id"] for c in data]
assert "cred-123" in ids
assert "openai-default" not in ids
def test_delete_sdk_default_returns_404(self):
with patch(
"backend.api.features.integrations.router.creds_manager"
) as mock_mgr:
mock_mgr.store.get_creds_by_id = AsyncMock()
resp = client.request("DELETE", "/openai/credentials/openai-default")
assert resp.status_code == 404
mock_mgr.store.get_creds_by_id.assert_not_called()
class TestCreateCredentialNoSecretInResponse:
"""POST /{provider}/credentials must not return secrets."""
def test_create_api_key_no_secret_in_response(self):
with patch(
"backend.api.features.integrations.router.creds_manager"
) as mock_mgr:
mock_mgr.create = AsyncMock()
resp = client.post(
"/openai/credentials",
json={
"id": "new-cred",
"provider": "openai",
"type": "api_key",
"title": "New Key",
"api_key": "sk-newsecret",
},
)
assert resp.status_code == 201
data = resp.json()
assert data["id"] == "new-cred"
assert "api_key" not in data
assert "sk-newsecret" not in str(data)
def test_create_with_sdk_default_id_rejected(self):
with patch(
"backend.api.features.integrations.router.creds_manager"
) as mock_mgr:
mock_mgr.create = AsyncMock()
resp = client.post(
"/openai/credentials",
json={
"id": "openai-default",
"provider": "openai",
"type": "api_key",
"title": "Sneaky",
"api_key": "sk-evil",
},
)
assert resp.status_code == 403
mock_mgr.create.assert_not_called()
class TestManagedCredentials:
"""AutoGPT-managed credentials cannot be deleted by users."""
def test_delete_is_managed_returns_403(self):
cred = APIKeyCredentials(
id="managed-cred-1",
provider="agent_mail",
title="AgentMail (managed by AutoGPT)",
api_key=SecretStr("sk-managed-key"),
is_managed=True,
)
with patch(
"backend.api.features.integrations.router.creds_manager"
) as mock_mgr:
mock_mgr.store.get_creds_by_id = AsyncMock(return_value=cred)
resp = client.request("DELETE", "/agent_mail/credentials/managed-cred-1")
assert resp.status_code == 403
assert "AutoGPT-managed" in resp.json()["detail"]
def test_list_credentials_includes_is_managed_field(self):
managed = APIKeyCredentials(
id="managed-1",
provider="agent_mail",
title="AgentMail (managed)",
api_key=SecretStr("sk-key"),
is_managed=True,
)
regular = APIKeyCredentials(
id="regular-1",
provider="openai",
title="My Key",
api_key=SecretStr("sk-key"),
)
with patch(
"backend.api.features.integrations.router.creds_manager"
) as mock_mgr:
mock_mgr.store.get_all_creds = AsyncMock(return_value=[managed, regular])
resp = client.get("/credentials")
assert resp.status_code == 200
data = resp.json()
managed_cred = next(c for c in data if c["id"] == "managed-1")
regular_cred = next(c for c in data if c["id"] == "regular-1")
assert managed_cred["is_managed"] is True
assert regular_cred["is_managed"] is False
# ---------------------------------------------------------------------------
# Managed credential provisioning infrastructure
# ---------------------------------------------------------------------------
def _make_managed_cred(
provider: str = "agent_mail", pod_id: str = "pod-abc"
) -> APIKeyCredentials:
return APIKeyCredentials(
id="managed-auto",
provider=provider,
title="AgentMail (managed by AutoGPT)",
api_key=SecretStr("sk-pod-key"),
is_managed=True,
metadata={"pod_id": pod_id},
)
def _make_store_mock(**kwargs) -> MagicMock:
"""Create a store mock with a working async ``locks()`` context manager."""
@asynccontextmanager
async def _noop_locked(key):
yield
locks_obj = MagicMock()
locks_obj.locked = _noop_locked
store = MagicMock(**kwargs)
store.locks = AsyncMock(return_value=locks_obj)
return store
class TestEnsureManagedCredentials:
"""Unit tests for the ensure/cleanup helpers in managed_credentials.py."""
@pytest.mark.asyncio
async def test_provisions_when_missing(self):
"""Provider.provision() is called when no managed credential exists."""
from backend.integrations.managed_credentials import (
_PROVIDERS,
_provisioned_users,
ensure_managed_credentials,
)
cred = _make_managed_cred()
provider = MagicMock()
provider.provider_name = "test_provider"
provider.is_available = AsyncMock(return_value=True)
provider.provision = AsyncMock(return_value=cred)
store = _make_store_mock()
store.has_managed_credential = AsyncMock(return_value=False)
store.add_managed_credential = AsyncMock()
saved = dict(_PROVIDERS)
_PROVIDERS.clear()
_PROVIDERS["test_provider"] = provider
_provisioned_users.pop("user-1", None)
try:
await ensure_managed_credentials("user-1", store)
finally:
_PROVIDERS.clear()
_PROVIDERS.update(saved)
_provisioned_users.pop("user-1", None)
provider.provision.assert_awaited_once_with("user-1")
store.add_managed_credential.assert_awaited_once_with("user-1", cred)
@pytest.mark.asyncio
async def test_skips_when_already_exists(self):
"""Provider.provision() is NOT called when managed credential exists."""
from backend.integrations.managed_credentials import (
_PROVIDERS,
_provisioned_users,
ensure_managed_credentials,
)
provider = MagicMock()
provider.provider_name = "test_provider"
provider.is_available = AsyncMock(return_value=True)
provider.provision = AsyncMock()
store = _make_store_mock()
store.has_managed_credential = AsyncMock(return_value=True)
saved = dict(_PROVIDERS)
_PROVIDERS.clear()
_PROVIDERS["test_provider"] = provider
_provisioned_users.pop("user-1", None)
try:
await ensure_managed_credentials("user-1", store)
finally:
_PROVIDERS.clear()
_PROVIDERS.update(saved)
_provisioned_users.pop("user-1", None)
provider.provision.assert_not_awaited()
@pytest.mark.asyncio
async def test_skips_when_unavailable(self):
"""Provider.provision() is NOT called when provider is not available."""
from backend.integrations.managed_credentials import (
_PROVIDERS,
_provisioned_users,
ensure_managed_credentials,
)
provider = MagicMock()
provider.provider_name = "test_provider"
provider.is_available = AsyncMock(return_value=False)
provider.provision = AsyncMock()
store = _make_store_mock()
store.has_managed_credential = AsyncMock()
saved = dict(_PROVIDERS)
_PROVIDERS.clear()
_PROVIDERS["test_provider"] = provider
_provisioned_users.pop("user-1", None)
try:
await ensure_managed_credentials("user-1", store)
finally:
_PROVIDERS.clear()
_PROVIDERS.update(saved)
_provisioned_users.pop("user-1", None)
provider.provision.assert_not_awaited()
store.has_managed_credential.assert_not_awaited()
@pytest.mark.asyncio
async def test_provision_failure_does_not_propagate(self):
"""A failed provision is logged but does not raise."""
from backend.integrations.managed_credentials import (
_PROVIDERS,
_provisioned_users,
ensure_managed_credentials,
)
provider = MagicMock()
provider.provider_name = "test_provider"
provider.is_available = AsyncMock(return_value=True)
provider.provision = AsyncMock(side_effect=RuntimeError("boom"))
store = _make_store_mock()
store.has_managed_credential = AsyncMock(return_value=False)
saved = dict(_PROVIDERS)
_PROVIDERS.clear()
_PROVIDERS["test_provider"] = provider
_provisioned_users.pop("user-1", None)
try:
await ensure_managed_credentials("user-1", store)
finally:
_PROVIDERS.clear()
_PROVIDERS.update(saved)
_provisioned_users.pop("user-1", None)
# No exception raised — provisioning failure is swallowed.
class TestCleanupManagedCredentials:
"""Unit tests for cleanup_managed_credentials."""
@pytest.mark.asyncio
async def test_calls_deprovision_for_managed_creds(self):
from backend.integrations.managed_credentials import (
_PROVIDERS,
cleanup_managed_credentials,
)
cred = _make_managed_cred()
provider = MagicMock()
provider.provider_name = "agent_mail"
provider.deprovision = AsyncMock()
store = MagicMock()
store.get_all_creds = AsyncMock(return_value=[cred])
saved = dict(_PROVIDERS)
_PROVIDERS.clear()
_PROVIDERS["agent_mail"] = provider
try:
await cleanup_managed_credentials("user-1", store)
finally:
_PROVIDERS.clear()
_PROVIDERS.update(saved)
provider.deprovision.assert_awaited_once_with("user-1", cred)
@pytest.mark.asyncio
async def test_skips_non_managed_creds(self):
from backend.integrations.managed_credentials import (
_PROVIDERS,
cleanup_managed_credentials,
)
regular = _make_api_key_cred()
provider = MagicMock()
provider.provider_name = "openai"
provider.deprovision = AsyncMock()
store = MagicMock()
store.get_all_creds = AsyncMock(return_value=[regular])
saved = dict(_PROVIDERS)
_PROVIDERS.clear()
_PROVIDERS["openai"] = provider
try:
await cleanup_managed_credentials("user-1", store)
finally:
_PROVIDERS.clear()
_PROVIDERS.update(saved)
provider.deprovision.assert_not_awaited()
@pytest.mark.asyncio
async def test_deprovision_failure_does_not_propagate(self):
from backend.integrations.managed_credentials import (
_PROVIDERS,
cleanup_managed_credentials,
)
cred = _make_managed_cred()
provider = MagicMock()
provider.provider_name = "agent_mail"
provider.deprovision = AsyncMock(side_effect=RuntimeError("boom"))
store = MagicMock()
store.get_all_creds = AsyncMock(return_value=[cred])
saved = dict(_PROVIDERS)
_PROVIDERS.clear()
_PROVIDERS["agent_mail"] = provider
try:
await cleanup_managed_credentials("user-1", store)
finally:
_PROVIDERS.clear()
_PROVIDERS.update(saved)
# No exception raised — cleanup failure is swallowed.
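The tests above repeat the same save / clear / restore sequence around `_PROVIDERS`. A minimal sketch of how that sequence could be factored into a context manager, assuming the registry is a plain module-level dict. `REGISTRY` and `temporary_registry` are hypothetical names for illustration and do not appear in the diff.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List

# Hypothetical stand-in for the module-level _PROVIDERS dict used in the tests.
REGISTRY: Dict[str, Any] = {"real_provider": object()}


@contextmanager
def temporary_registry(
    registry: Dict[str, Any], **overrides: Any
) -> Iterator[Dict[str, Any]]:
    """Swap the registry contents for the duration of the block, then restore."""
    saved = dict(registry)
    registry.clear()
    registry.update(overrides)
    try:
        yield registry
    finally:
        # Runs even if the body raises, like the try/finally in the tests.
        registry.clear()
        registry.update(saved)


def demo() -> List[List[str]]:
    """Return the registry keys seen inside the block and after it."""
    with temporary_registry(REGISTRY, test_provider="fake") as reg:
        inside = sorted(reg)
    after = sorted(REGISTRY)
    return [inside, after]
```

Mutating the dict in place (rather than rebinding it) matters here, since the code under test holds a reference to the same object.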

View File

@@ -0,0 +1,120 @@
"""Shared logic for adding store agents to a user's library.
Both `add_store_agent_to_library` and `add_store_agent_to_library_as_admin`
delegate to these helpers so the duplication-prone create/restore/dedup
logic lives in exactly one place.
"""
import logging
import prisma.errors
import prisma.models
import backend.api.features.library.model as library_model
import backend.data.graph as graph_db
from backend.data.graph import GraphModel, GraphSettings
from backend.data.includes import library_agent_include
from backend.util.exceptions import NotFoundError
from backend.util.json import SafeJson
logger = logging.getLogger(__name__)
async def resolve_graph_for_library(
store_listing_version_id: str,
user_id: str,
*,
admin: bool,
) -> GraphModel:
"""Look up a StoreListingVersion and resolve its graph.
When ``admin=True``, uses ``get_graph_as_admin`` to bypass the marketplace
APPROVED-only check. Otherwise uses the regular ``get_graph``.
"""
slv = await prisma.models.StoreListingVersion.prisma().find_unique(
where={"id": store_listing_version_id}, include={"AgentGraph": True}
)
if not slv or not slv.AgentGraph:
raise NotFoundError(
f"Store listing version {store_listing_version_id} not found or invalid"
)
ag = slv.AgentGraph
if admin:
graph_model = await graph_db.get_graph_as_admin(
graph_id=ag.id, version=ag.version, user_id=user_id
)
else:
graph_model = await graph_db.get_graph(
graph_id=ag.id, version=ag.version, user_id=user_id
)
if not graph_model:
raise NotFoundError(f"Graph #{ag.id} v{ag.version} not found or accessible")
return graph_model
async def add_graph_to_library(
store_listing_version_id: str,
graph_model: GraphModel,
user_id: str,
) -> library_model.LibraryAgent:
"""Check existing / restore soft-deleted / create new LibraryAgent.
Uses a create-then-catch-UniqueViolationError-then-update pattern on
the (userId, agentGraphId, agentGraphVersion) composite unique constraint.
This is more robust than ``upsert`` because Prisma's upsert atomicity
guarantees are not well-documented for all versions.
"""
settings_json = SafeJson(GraphSettings.from_graph(graph_model).model_dump())
_include = library_agent_include(
user_id, include_nodes=False, include_executions=False
)
try:
added_agent = await prisma.models.LibraryAgent.prisma().create(
data={
"User": {"connect": {"id": user_id}},
"AgentGraph": {
"connect": {
"graphVersionId": {
"id": graph_model.id,
"version": graph_model.version,
}
}
},
"isCreatedByUser": False,
"useGraphIsActiveVersion": False,
"settings": settings_json,
},
include=_include,
)
except prisma.errors.UniqueViolationError:
# Already exists — update to restore if previously soft-deleted/archived
added_agent = await prisma.models.LibraryAgent.prisma().update(
where={
"userId_agentGraphId_agentGraphVersion": {
"userId": user_id,
"agentGraphId": graph_model.id,
"agentGraphVersion": graph_model.version,
}
},
data={
"isDeleted": False,
"isArchived": False,
"settings": settings_json,
},
include=_include,
)
if added_agent is None:
raise NotFoundError(
f"LibraryAgent for graph #{graph_model.id} "
f"v{graph_model.version} not found after UniqueViolationError"
)
logger.debug(
f"Added graph #{graph_model.id} v{graph_model.version} "
f"for store listing version #{store_listing_version_id} "
f"to library for user #{user_id}"
)
return library_model.LibraryAgent.from_db(added_agent)
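The create-then-catch-then-update flow described in the `add_graph_to_library` docstring can be illustrated without Prisma. The sketch below uses the standard-library `sqlite3` module in place of the Prisma client; the table, column names, and the `add_to_library` helper are assumptions made for illustration, not the project's schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory table with a composite unique key (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE library_agent (
            user_id TEXT NOT NULL,
            graph_id TEXT NOT NULL,
            graph_version INTEGER NOT NULL,
            is_deleted INTEGER NOT NULL DEFAULT 0,
            is_archived INTEGER NOT NULL DEFAULT 0,
            UNIQUE (user_id, graph_id, graph_version)
        )
        """
    )
    conn.commit()
    return conn


def add_to_library(
    conn: sqlite3.Connection, user_id: str, graph_id: str, graph_version: int
) -> str:
    """Try to insert; on a unique violation, restore the existing row instead."""
    key = (user_id, graph_id, graph_version)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO library_agent (user_id, graph_id, graph_version) "
                "VALUES (?, ?, ?)",
                key,
            )
        return "created"
    except sqlite3.IntegrityError:
        # Row already exists: clear the soft-delete and archive flags.
        with conn:
            conn.execute(
                "UPDATE library_agent SET is_deleted = 0, is_archived = 0 "
                "WHERE user_id = ? AND graph_id = ? AND graph_version = ?",
                key,
            )
        return "restored"
```

The unique constraint, not a prior read, decides which branch runs, so two concurrent callers cannot both take the insert path.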

View File

@@ -0,0 +1,80 @@
from unittest.mock import AsyncMock, MagicMock, patch
import prisma.errors
import pytest
from ._add_to_library import add_graph_to_library
@pytest.mark.asyncio
async def test_add_graph_to_library_create_new_agent() -> None:
"""When no matching LibraryAgent exists, create inserts a new one."""
graph_model = MagicMock(id="graph-id", version=2, nodes=[])
created_agent = MagicMock(name="CreatedLibraryAgent")
converted_agent = MagicMock(name="ConvertedLibraryAgent")
with (
patch(
"backend.api.features.library._add_to_library.prisma.models.LibraryAgent.prisma"
) as mock_prisma,
patch(
"backend.api.features.library._add_to_library.library_model.LibraryAgent.from_db",
return_value=converted_agent,
) as mock_from_db,
):
mock_prisma.return_value.create = AsyncMock(return_value=created_agent)
result = await add_graph_to_library("slv-id", graph_model, "user-id")
assert result is converted_agent
mock_from_db.assert_called_once_with(created_agent)
# Verify create was called with correct data
create_call = mock_prisma.return_value.create.call_args
create_data = create_call.kwargs["data"]
assert create_data["User"] == {"connect": {"id": "user-id"}}
assert create_data["AgentGraph"] == {
"connect": {"graphVersionId": {"id": "graph-id", "version": 2}}
}
assert create_data["isCreatedByUser"] is False
assert create_data["useGraphIsActiveVersion"] is False
@pytest.mark.asyncio
async def test_add_graph_to_library_unique_violation_updates_existing() -> None:
"""UniqueViolationError on create falls back to update."""
graph_model = MagicMock(id="graph-id", version=2, nodes=[])
updated_agent = MagicMock(name="UpdatedLibraryAgent")
converted_agent = MagicMock(name="ConvertedLibraryAgent")
with (
patch(
"backend.api.features.library._add_to_library.prisma.models.LibraryAgent.prisma"
) as mock_prisma,
patch(
"backend.api.features.library._add_to_library.library_model.LibraryAgent.from_db",
return_value=converted_agent,
) as mock_from_db,
):
mock_prisma.return_value.create = AsyncMock(
side_effect=prisma.errors.UniqueViolationError(
MagicMock(), message="unique constraint"
)
)
mock_prisma.return_value.update = AsyncMock(return_value=updated_agent)
result = await add_graph_to_library("slv-id", graph_model, "user-id")
assert result is converted_agent
mock_from_db.assert_called_once_with(updated_agent)
# Verify update was called with correct where and data
update_call = mock_prisma.return_value.update.call_args
assert update_call.kwargs["where"] == {
"userId_agentGraphId_agentGraphVersion": {
"userId": "user-id",
"agentGraphId": "graph-id",
"agentGraphVersion": 2,
}
}
update_data = update_call.kwargs["data"]
assert update_data["isDeleted"] is False
assert update_data["isArchived"] is False

View File

@@ -336,12 +336,15 @@ async def get_library_agent_by_graph_id(
user_id: str,
graph_id: str,
graph_version: Optional[int] = None,
include_archived: bool = False,
) -> library_model.LibraryAgent | None:
filter: prisma.types.LibraryAgentWhereInput = {
"agentGraphId": graph_id,
"userId": user_id,
"isDeleted": False,
}
if not include_archived:
filter["isArchived"] = False
if graph_version is not None:
filter["agentGraphVersion"] = graph_version
@@ -433,32 +436,53 @@ async def create_library_agent(
async with transaction() as tx:
library_agents = await asyncio.gather(
*(
prisma.models.LibraryAgent.prisma(tx).create(
data=prisma.types.LibraryAgentCreateInput(
isCreatedByUser=(user_id == user_id),
useGraphIsActiveVersion=True,
User={"connect": {"id": user_id}},
AgentGraph={
"connect": {
"graphVersionId": {
"id": graph_entry.id,
"version": graph_entry.version,
prisma.models.LibraryAgent.prisma(tx).upsert(
where={
"userId_agentGraphId_agentGraphVersion": {
"userId": user_id,
"agentGraphId": graph_entry.id,
"agentGraphVersion": graph_entry.version,
}
},
data={
"create": prisma.types.LibraryAgentCreateInput(
isCreatedByUser=(user_id == graph.user_id),
useGraphIsActiveVersion=True,
User={"connect": {"id": user_id}},
AgentGraph={
"connect": {
"graphVersionId": {
"id": graph_entry.id,
"version": graph_entry.version,
}
}
}
},
settings=SafeJson(
GraphSettings.from_graph(
graph_entry,
hitl_safe_mode=hitl_safe_mode,
sensitive_action_safe_mode=sensitive_action_safe_mode,
).model_dump()
),
**(
{"Folder": {"connect": {"id": folder_id}}}
if folder_id and graph_entry is graph
else {}
),
),
"update": {
"isDeleted": False,
"isArchived": False,
"useGraphIsActiveVersion": True,
"settings": SafeJson(
GraphSettings.from_graph(
graph_entry,
hitl_safe_mode=hitl_safe_mode,
sensitive_action_safe_mode=sensitive_action_safe_mode,
).model_dump()
),
},
settings=SafeJson(
GraphSettings.from_graph(
graph_entry,
hitl_safe_mode=hitl_safe_mode,
sensitive_action_safe_mode=sensitive_action_safe_mode,
).model_dump()
),
**(
{"Folder": {"connect": {"id": folder_id}}}
if folder_id and graph_entry is graph
else {}
),
),
},
include=library_agent_include(
user_id, include_nodes=False, include_executions=False
),
@@ -582,7 +606,9 @@ async def update_graph_in_library(
created_graph = await graph_db.create_graph(graph_model, user_id)
library_agent = await get_library_agent_by_graph_id(user_id, created_graph.id)
library_agent = await get_library_agent_by_graph_id(
user_id, created_graph.id, include_archived=True
)
if not library_agent:
raise NotFoundError(f"Library agent not found for graph {created_graph.id}")
@@ -818,92 +844,38 @@ async def delete_library_agent_by_graph_id(graph_id: str, user_id: str) -> None:
async def add_store_agent_to_library(
store_listing_version_id: str, user_id: str
) -> library_model.LibraryAgent:
"""Adds a marketplace agent to the user's library.
See also: `add_store_agent_to_library_as_admin()` which uses
`get_graph_as_admin` to bypass marketplace status checks for admin review.
"""
Adds an agent from a store listing version to the user's library if they don't already have it.
from ._add_to_library import add_graph_to_library, resolve_graph_for_library
Args:
store_listing_version_id: The ID of the store listing version containing the agent.
user_id: The users library to which the agent is being added.
Returns:
The newly created LibraryAgent if successfully added, the existing corresponding one if any.
Raises:
NotFoundError: If the store listing or associated agent is not found.
DatabaseError: If there's an issue creating the LibraryAgent record.
"""
logger.debug(
f"Adding agent from store listing version #{store_listing_version_id} "
f"to library for user #{user_id}"
)
store_listing_version = (
await prisma.models.StoreListingVersion.prisma().find_unique(
where={"id": store_listing_version_id}, include={"AgentGraph": True}
)
graph_model = await resolve_graph_for_library(
store_listing_version_id, user_id, admin=False
)
if not store_listing_version or not store_listing_version.AgentGraph:
logger.warning(f"Store listing version not found: {store_listing_version_id}")
raise NotFoundError(
f"Store listing version {store_listing_version_id} not found or invalid"
)
return await add_graph_to_library(store_listing_version_id, graph_model, user_id)
graph = store_listing_version.AgentGraph
# Convert to GraphModel to check for HITL blocks
graph_model = await graph_db.get_graph(
graph_id=graph.id,
version=graph.version,
user_id=user_id,
include_subgraphs=False,
async def add_store_agent_to_library_as_admin(
store_listing_version_id: str, user_id: str
) -> library_model.LibraryAgent:
"""Admin variant that uses `get_graph_as_admin` to bypass marketplace
APPROVED-only checks, allowing admins to add pending agents for review."""
from ._add_to_library import add_graph_to_library, resolve_graph_for_library
logger.warning(
f"ADMIN adding agent from store listing version "
f"#{store_listing_version_id} to library for user #{user_id}"
)
if not graph_model:
raise NotFoundError(
f"Graph #{graph.id} v{graph.version} not found or accessible"
)
# Check if user already has this agent (non-deleted)
if existing := await get_library_agent_by_graph_id(
user_id, graph.id, graph.version
):
return existing
# Check for soft-deleted version and restore it
deleted_agent = await prisma.models.LibraryAgent.prisma().find_unique(
where={
"userId_agentGraphId_agentGraphVersion": {
"userId": user_id,
"agentGraphId": graph.id,
"agentGraphVersion": graph.version,
}
},
graph_model = await resolve_graph_for_library(
store_listing_version_id, user_id, admin=True
)
if deleted_agent and deleted_agent.isDeleted:
return await update_library_agent(deleted_agent.id, user_id, is_deleted=False)
# Create LibraryAgent entry
added_agent = await prisma.models.LibraryAgent.prisma().create(
data={
"User": {"connect": {"id": user_id}},
"AgentGraph": {
"connect": {
"graphVersionId": {"id": graph.id, "version": graph.version}
}
},
"isCreatedByUser": False,
"useGraphIsActiveVersion": False,
"settings": SafeJson(GraphSettings.from_graph(graph_model).model_dump()),
},
include=library_agent_include(
user_id, include_nodes=False, include_executions=False
),
)
logger.debug(
f"Added graph #{graph.id} v{graph.version}"
f"for store listing version #{store_listing_version.id} "
f"to library for user #{user_id}"
)
return library_model.LibraryAgent.from_db(added_agent)
return await add_graph_to_library(store_listing_version_id, graph_model, user_id)
##############################################

View File

@@ -1,4 +1,6 @@
from contextlib import asynccontextmanager
from datetime import datetime
from unittest.mock import AsyncMock, MagicMock, patch
import prisma.enums
import prisma.models
@@ -85,10 +87,6 @@ async def test_get_library_agents(mocker):
async def test_add_agent_to_library(mocker):
await connect()
# Mock the transaction context
mock_transaction = mocker.patch("backend.api.features.library.db.transaction")
mock_transaction.return_value.__aenter__ = mocker.AsyncMock(return_value=None)
mock_transaction.return_value.__aexit__ = mocker.AsyncMock(return_value=None)
# Mock data
mock_store_listing_data = prisma.models.StoreListingVersion(
id="version123",
@@ -143,15 +141,18 @@ async def test_add_agent_to_library(mocker):
)
mock_library_agent = mocker.patch("prisma.models.LibraryAgent.prisma")
mock_library_agent.return_value.find_first = mocker.AsyncMock(return_value=None)
mock_library_agent.return_value.find_unique = mocker.AsyncMock(return_value=None)
mock_library_agent.return_value.create = mocker.AsyncMock(
return_value=mock_library_agent_data
)
# Mock graph_db.get_graph function that's called to check for HITL blocks
mock_graph_db = mocker.patch("backend.api.features.library.db.graph_db")
# Mock graph_db.get_graph function that's called in resolve_graph_for_library
# (lives in _add_to_library.py after refactor, not db.py)
mock_graph_db = mocker.patch(
"backend.api.features.library._add_to_library.graph_db"
)
mock_graph_model = mocker.Mock()
mock_graph_model.id = "agent1"
mock_graph_model.version = 1
mock_graph_model.nodes = (
[]
) # Empty list so _has_human_in_the_loop_blocks returns False
@@ -170,37 +171,27 @@ async def test_add_agent_to_library(mocker):
mock_store_listing_version.return_value.find_unique.assert_called_once_with(
where={"id": "version123"}, include={"AgentGraph": True}
)
mock_library_agent.return_value.find_unique.assert_called_once_with(
where={
"userId_agentGraphId_agentGraphVersion": {
"userId": "test-user",
"agentGraphId": "agent1",
"agentGraphVersion": 1,
}
},
)
# Check that create was called with the expected data including settings
create_call_args = mock_library_agent.return_value.create.call_args
assert create_call_args is not None
# Verify the main structure
expected_data = {
# Verify the create data structure
create_data = create_call_args.kwargs["data"]
expected_create = {
"User": {"connect": {"id": "test-user"}},
"AgentGraph": {"connect": {"graphVersionId": {"id": "agent1", "version": 1}}},
"isCreatedByUser": False,
"useGraphIsActiveVersion": False,
}
actual_data = create_call_args[1]["data"]
# Check that all expected fields are present
for key, value in expected_data.items():
assert actual_data[key] == value
for key, value in expected_create.items():
assert create_data[key] == value
# Check that settings field is present and is a SafeJson object
assert "settings" in actual_data
assert hasattr(actual_data["settings"], "__class__") # Should be a SafeJson object
assert "settings" in create_data
assert hasattr(create_data["settings"], "__class__") # Should be a SafeJson object
# Check include parameter
assert create_call_args[1]["include"] == library_agent_include(
assert create_call_args.kwargs["include"] == library_agent_include(
"test-user", include_nodes=False, include_executions=False
)
@@ -224,3 +215,141 @@ async def test_add_agent_to_library_not_found(mocker):
mock_store_listing_version.return_value.find_unique.assert_called_once_with(
where={"id": "version123"}, include={"AgentGraph": True}
)
@pytest.mark.asyncio
async def test_get_library_agent_by_graph_id_excludes_archived(mocker):
mock_library_agent = mocker.patch("prisma.models.LibraryAgent.prisma")
mock_library_agent.return_value.find_first = mocker.AsyncMock(return_value=None)
result = await db.get_library_agent_by_graph_id("test-user", "agent1", 7)
assert result is None
mock_library_agent.return_value.find_first.assert_called_once()
where = mock_library_agent.return_value.find_first.call_args.kwargs["where"]
assert where == {
"agentGraphId": "agent1",
"userId": "test-user",
"isDeleted": False,
"isArchived": False,
"agentGraphVersion": 7,
}
@pytest.mark.asyncio
async def test_get_library_agent_by_graph_id_can_include_archived(mocker):
mock_library_agent = mocker.patch("prisma.models.LibraryAgent.prisma")
mock_library_agent.return_value.find_first = mocker.AsyncMock(return_value=None)
result = await db.get_library_agent_by_graph_id(
"test-user",
"agent1",
7,
include_archived=True,
)
assert result is None
mock_library_agent.return_value.find_first.assert_called_once()
where = mock_library_agent.return_value.find_first.call_args.kwargs["where"]
assert where == {
"agentGraphId": "agent1",
"userId": "test-user",
"isDeleted": False,
"agentGraphVersion": 7,
}
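The two tests above pin down how the `where` filter changes with `include_archived`. A minimal sketch of that filter construction as a pure function, assuming it mirrors the logic in `get_library_agent_by_graph_id`; the function name `build_library_agent_filter` is hypothetical.

```python
from typing import Any, Dict, Optional


def build_library_agent_filter(
    user_id: str,
    graph_id: str,
    graph_version: Optional[int] = None,
    include_archived: bool = False,
) -> Dict[str, Any]:
    """Build the lookup filter; deleted rows are always excluded."""
    where: Dict[str, Any] = {
        "agentGraphId": graph_id,
        "userId": user_id,
        "isDeleted": False,
    }
    if not include_archived:
        # Default behaviour: archived agents are hidden as well.
        where["isArchived"] = False
    if graph_version is not None:
        where["agentGraphVersion"] = graph_version
    return where
```

Omitting the `isArchived` key (rather than setting it to `True`) means the query matches both archived and non-archived rows.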
@pytest.mark.asyncio
async def test_update_graph_in_library_allows_archived_library_agent(mocker):
graph = mocker.Mock(id="graph-id")
existing_version = mocker.Mock(version=1, is_active=True)
graph_model = mocker.Mock()
created_graph = mocker.Mock(id="graph-id", version=2, is_active=False)
current_library_agent = mocker.Mock()
updated_library_agent = mocker.Mock()
mocker.patch(
"backend.api.features.library.db.graph_db.get_graph_all_versions",
new=mocker.AsyncMock(return_value=[existing_version]),
)
mocker.patch(
"backend.api.features.library.db.graph_db.make_graph_model",
return_value=graph_model,
)
mocker.patch(
"backend.api.features.library.db.graph_db.create_graph",
new=mocker.AsyncMock(return_value=created_graph),
)
mock_get_library_agent = mocker.patch(
"backend.api.features.library.db.get_library_agent_by_graph_id",
new=mocker.AsyncMock(return_value=current_library_agent),
)
mock_update_library_agent = mocker.patch(
"backend.api.features.library.db.update_library_agent_version_and_settings",
new=mocker.AsyncMock(return_value=updated_library_agent),
)
result_graph, result_library_agent = await db.update_graph_in_library(
graph,
"test-user",
)
assert result_graph is created_graph
assert result_library_agent is updated_library_agent
assert graph.version == 2
graph_model.reassign_ids.assert_called_once_with(
user_id="test-user", reassign_graph_id=False
)
mock_get_library_agent.assert_awaited_once_with(
"test-user",
"graph-id",
include_archived=True,
)
mock_update_library_agent.assert_awaited_once_with("test-user", created_graph)
@pytest.mark.asyncio
async def test_create_library_agent_uses_upsert():
"""create_library_agent should use upsert (not create) to handle duplicates."""
mock_graph = MagicMock()
mock_graph.id = "graph-1"
mock_graph.version = 1
mock_graph.user_id = "user-1"
mock_graph.nodes = []
mock_graph.sub_graphs = []
mock_upserted = MagicMock(name="UpsertedLibraryAgent")
@asynccontextmanager
async def fake_tx():
yield None
with (
patch("backend.api.features.library.db.transaction", fake_tx),
patch("prisma.models.LibraryAgent.prisma") as mock_prisma,
patch(
"backend.api.features.library.db.add_generated_agent_image",
new=AsyncMock(),
),
patch(
"backend.api.features.library.model.LibraryAgent.from_db",
return_value=MagicMock(),
),
):
mock_prisma.return_value.upsert = AsyncMock(return_value=mock_upserted)
result = await db.create_library_agent(mock_graph, "user-1")
assert len(result) == 1
upsert_call = mock_prisma.return_value.upsert.call_args
assert upsert_call is not None
# Verify the upsert where clause uses the composite unique key
where = upsert_call.kwargs["where"]
assert "userId_agentGraphId_agentGraphVersion" in where
# Verify the upsert data has both create and update branches
data = upsert_call.kwargs["data"]
assert "create" in data
assert "update" in data
# Verify update branch restores soft-deleted/archived agents
assert data["update"]["isDeleted"] is False
assert data["update"]["isArchived"] is False

View File

@@ -12,6 +12,7 @@ Tests cover:
5. Complete OAuth flow end-to-end
"""
import asyncio
import base64
import hashlib
import secrets
@@ -58,14 +59,27 @@ async def test_user(server, test_user_id: str):
yield test_user_id
# Cleanup - delete in correct order due to foreign key constraints
await PrismaOAuthAccessToken.prisma().delete_many(where={"userId": test_user_id})
await PrismaOAuthRefreshToken.prisma().delete_many(where={"userId": test_user_id})
await PrismaOAuthAuthorizationCode.prisma().delete_many(
where={"userId": test_user_id}
)
await PrismaOAuthApplication.prisma().delete_many(where={"ownerId": test_user_id})
await PrismaUser.prisma().delete(where={"id": test_user_id})
# Cleanup - delete in correct order due to foreign key constraints.
# Wrap in try/except because the event loop or Prisma engine may already
# be closed during session teardown on Python 3.12+.
try:
await asyncio.gather(
PrismaOAuthAccessToken.prisma().delete_many(where={"userId": test_user_id}),
PrismaOAuthRefreshToken.prisma().delete_many(
where={"userId": test_user_id}
),
PrismaOAuthAuthorizationCode.prisma().delete_many(
where={"userId": test_user_id}
),
)
await asyncio.gather(
PrismaOAuthApplication.prisma().delete_many(
where={"ownerId": test_user_id}
),
PrismaUser.prisma().delete(where={"id": test_user_id}),
)
except RuntimeError:
pass
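The reworked fixture cleanup runs deletes in two stages: dependent rows first, then their parents. A small sketch of that staging, assuming nothing about Prisma itself; `delete_rows` is a hypothetical stand-in for a `delete_many` call that only records completion order.

```python
import asyncio
from typing import List


async def delete_rows(table: str, log: List[str]) -> None:
    """Hypothetical stand-in for a delete call; records when it completes."""
    await asyncio.sleep(0)
    log.append(table)


async def cleanup(log: List[str]) -> None:
    """Delete child rows concurrently, then the parents."""
    try:
        # Stage 1: rows that reference the application and the user.
        await asyncio.gather(
            delete_rows("access_token", log),
            delete_rows("refresh_token", log),
            delete_rows("authorization_code", log),
        )
        # Stage 2: starts only after every stage-1 delete has finished.
        await asyncio.gather(
            delete_rows("application", log),
            delete_rows("user", log),
        )
    except RuntimeError:
        # Mirrors the fixture: teardown errors from a closed loop are ignored.
        pass


def run_cleanup() -> List[str]:
    """Run the cleanup on a fresh event loop and return the completion log."""
    log: List[str] = []
    asyncio.run(cleanup(log))
    return log
```

Ordering is only guaranteed between stages; within a single `gather` call the deletes may complete in any order.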
@pytest_asyncio.fixture

View File

@@ -391,6 +391,11 @@ async def get_available_graph(
async def get_store_agent_by_version_id(
store_listing_version_id: str,
) -> store_model.StoreAgentDetails:
"""Get agent details from the StoreAgent view (APPROVED agents only).
See also: `get_store_agent_details_as_admin()` which bypasses the
APPROVED-only StoreAgent view for admin preview of pending submissions.
"""
logger.debug(f"Getting store agent details for {store_listing_version_id}")
try:
@@ -411,6 +416,57 @@ async def get_store_agent_by_version_id(
raise DatabaseError("Failed to fetch agent details") from e
async def get_store_agent_details_as_admin(
store_listing_version_id: str,
) -> store_model.StoreAgentDetails:
"""Get agent details for admin preview, bypassing the APPROVED-only
StoreAgent view. Queries StoreListingVersion directly so pending
submissions are visible."""
slv = await prisma.models.StoreListingVersion.prisma().find_unique(
where={"id": store_listing_version_id},
include={
"StoreListing": {"include": {"CreatorProfile": True}},
},
)
if not slv or not slv.StoreListing:
raise NotFoundError(
f"Store listing version {store_listing_version_id} not found"
)
listing = slv.StoreListing
# CreatorProfile is a required FK relation — should always exist.
# If it's None, the DB is in a bad state.
profile = listing.CreatorProfile
if not profile:
raise DatabaseError(
f"StoreListing {listing.id} has no CreatorProfile — FK violated"
)
return store_model.StoreAgentDetails(
store_listing_version_id=slv.id,
slug=listing.slug,
agent_name=slv.name,
agent_video=slv.videoUrl or "",
agent_output_demo=slv.agentOutputDemoUrl or "",
agent_image=slv.imageUrls,
creator=profile.username,
creator_avatar=profile.avatarUrl or "",
sub_heading=slv.subHeading,
description=slv.description,
instructions=slv.instructions,
categories=slv.categories,
runs=0,
rating=0.0,
versions=[str(slv.version)],
graph_id=slv.agentGraphId,
graph_versions=[str(slv.agentGraphVersion)],
last_updated=slv.updatedAt,
recommended_schedule_cron=slv.recommendedScheduleCron,
active_version_id=listing.activeVersionId or slv.id,
has_approved_version=listing.hasApprovedVersion,
)
class StoreCreatorsSortOptions(Enum):
# NOTE: values correspond 1:1 to columns of the Creator view
AGENT_RATING = "agent_rating"

View File

@@ -592,6 +592,11 @@ async def fulfill_checkout(user_id: Annotated[str, Security(get_user_id)]):
async def configure_user_auto_top_up(
request: AutoTopUpConfig, user_id: Annotated[str, Security(get_user_id)]
) -> str:
"""Configure auto top-up settings and perform an immediate top-up if needed.
Raises HTTPException(422) if the request parameters are invalid or if
the credit top-up fails.
"""
if request.threshold < 0:
raise HTTPException(status_code=422, detail="Threshold must not be negative")
if request.amount < 500 and request.amount != 0:
@@ -606,10 +611,20 @@ async def configure_user_auto_top_up(
user_credit_model = await get_user_credit_model(user_id)
current_balance = await user_credit_model.get_credits(user_id)
if current_balance < request.threshold:
await user_credit_model.top_up_credits(user_id, request.amount)
else:
await user_credit_model.top_up_credits(user_id, 0)
try:
if current_balance < request.threshold:
await user_credit_model.top_up_credits(user_id, request.amount)
else:
await user_credit_model.top_up_credits(user_id, 0)
except ValueError as e:
known_messages = (
"must not be negative",
"already exists for user",
"No payment method found",
)
if any(msg in str(e) for msg in known_messages):
raise HTTPException(status_code=422, detail=str(e))
raise
await set_auto_top_up(
user_id, AutoTopUpConfig(threshold=request.threshold, amount=request.amount)
@@ -965,14 +980,16 @@ async def execute_graph(
source: Annotated[GraphExecutionSource | None, Body(embed=True)] = None,
graph_version: Optional[int] = None,
preset_id: Optional[str] = None,
dry_run: Annotated[bool, Body(embed=True)] = False,
) -> execution_db.GraphExecutionMeta:
user_credit_model = await get_user_credit_model(user_id)
current_balance = await user_credit_model.get_credits(user_id)
if current_balance <= 0:
raise HTTPException(
status_code=402,
detail="Insufficient balance to execute the agent. Please top up your account.",
)
if not dry_run:
user_credit_model = await get_user_credit_model(user_id)
current_balance = await user_credit_model.get_credits(user_id)
if current_balance <= 0:
raise HTTPException(
status_code=402,
detail="Insufficient balance to execute the agent. Please top up your account.",
)
try:
result = await execution_utils.add_graph_execution(
@@ -982,6 +999,7 @@ async def execute_graph(
preset_id=preset_id,
graph_version=graph_version,
graph_credentials_inputs=credentials_inputs,
dry_run=dry_run,
)
# Record successful graph execution
record_graph_execution(graph_id=graph_id, status="success", user_id=user_id)
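The auto top-up hunk above converts only recognised `ValueError` messages into a 422 response and re-raises everything else. A dependency-free sketch of that decision, assuming substring matching as in the diff; `ClientError` is a hypothetical stand-in for `fastapi.HTTPException`, and `translate_top_up_error` is an illustrative name.

```python
KNOWN_TOP_UP_ERRORS = (
    "must not be negative",
    "already exists for user",
    "No payment method found",
)


class ClientError(Exception):
    """Hypothetical stand-in for fastapi.HTTPException."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def translate_top_up_error(exc: ValueError) -> Exception:
    """Return a 422 ClientError for known messages, else the original exception."""
    message = str(exc)
    if any(known in message for known in KNOWN_TOP_UP_ERRORS):
        return ClientError(422, message)
    # Unknown ValueErrors are passed through so they surface as server errors.
    return exc
```

Matching on message text is brittle if the wording changes upstream; dedicated exception subclasses would be a sturdier alternative, though the diff does not go that far.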

View File

@@ -188,6 +188,7 @@ async def upload_file(
user_id: Annotated[str, fastapi.Security(get_user_id)],
file: UploadFile,
session_id: str | None = Query(default=None),
overwrite: bool = Query(default=False),
) -> UploadFileResponse:
"""
Upload a file to the user's workspace.
@@ -248,7 +249,9 @@ async def upload_file(
# Write file via WorkspaceManager
manager = WorkspaceManager(user_id, workspace.id, session_id)
try:
workspace_file = await manager.write_file(content, filename)
workspace_file = await manager.write_file(
content, filename, overwrite=overwrite
)
except ValueError as e:
raise fastapi.HTTPException(status_code=409, detail=str(e)) from e

View File

@@ -18,6 +18,7 @@ from prisma.errors import PrismaError
import backend.api.features.admin.credit_admin_routes
import backend.api.features.admin.execution_analytics_routes
import backend.api.features.admin.rate_limit_admin_routes
import backend.api.features.admin.store_admin_routes
import backend.api.features.builder
import backend.api.features.builder.routes
@@ -117,6 +118,11 @@ async def lifespan_context(app: fastapi.FastAPI):
AutoRegistry.patch_integrations()
# Register managed credential providers (e.g. AgentMail)
from backend.integrations.managed_providers import register_all
register_all()
await backend.data.block.initialize_blocks()
await backend.data.user.migrate_and_encrypt_user_integrations()
@@ -210,13 +216,22 @@ instrument_fastapi(
def handle_internal_http_error(status_code: int = 500, log_error: bool = True):
def handler(request: fastapi.Request, exc: Exception):
if log_error:
logger.exception(
"%s %s failed. Investigate and resolve the underlying issue: %s",
request.method,
request.url.path,
exc,
exc_info=exc,
)
if status_code >= 500:
logger.exception(
"%s %s failed. Investigate and resolve the underlying issue: %s",
request.method,
request.url.path,
exc,
exc_info=exc,
)
else:
logger.warning(
"%s %s failed with %d: %s",
request.method,
request.url.path,
status_code,
exc,
)
hint = (
"Adjust the request and retry."
@@ -266,12 +281,10 @@ async def validation_error_handler(
app.add_exception_handler(PrismaError, handle_internal_http_error(500))
app.add_exception_handler(
FolderAlreadyExistsError, handle_internal_http_error(409, False)
)
app.add_exception_handler(FolderValidationError, handle_internal_http_error(400, False))
app.add_exception_handler(NotFoundError, handle_internal_http_error(404, False))
app.add_exception_handler(NotAuthorizedError, handle_internal_http_error(403, False))
app.add_exception_handler(FolderAlreadyExistsError, handle_internal_http_error(409))
app.add_exception_handler(FolderValidationError, handle_internal_http_error(400))
app.add_exception_handler(NotFoundError, handle_internal_http_error(404))
app.add_exception_handler(NotAuthorizedError, handle_internal_http_error(403))
app.add_exception_handler(RequestValidationError, validation_error_handler)
app.add_exception_handler(pydantic.ValidationError, validation_error_handler)
app.add_exception_handler(MissingConfigError, handle_internal_http_error(503))
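With `log_error` now left at its default for the 4xx handlers, the log level is chosen from the status code inside the handler. A minimal sketch of that selection using only the standard `logging` module; `log_level_for_status` and `log_failure` are hypothetical helper names, not functions from the diff.

```python
import logging


def log_level_for_status(status_code: int) -> int:
    """Server errors log at ERROR; client errors log at WARNING."""
    return logging.ERROR if status_code >= 500 else logging.WARNING


def log_failure(
    logger: logging.Logger,
    method: str,
    path: str,
    status_code: int,
    exc: Exception,
) -> int:
    """Log a failed request and return the level that was used."""
    level = log_level_for_status(status_code)
    logger.log(
        level,
        "%s %s failed with %d: %s",
        method,
        path,
        status_code,
        exc,
        # Attach the traceback only for server-side failures.
        exc_info=exc if level >= logging.ERROR else None,
    )
    return level
```

This keeps expected client errors such as 404 and 409 visible in the logs without the traceback noise reserved for 5xx responses.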
@@ -311,6 +324,11 @@ app.include_router(
tags=["v2", "admin"],
prefix="/api/executions",
)
app.include_router(
backend.api.features.admin.rate_limit_admin_routes.router,
tags=["v2", "admin"],
prefix="/api/copilot",
)
app.include_router(
backend.api.features.executions.review.routes.router,
tags=["v2", "executions", "review"],
@@ -521,8 +539,11 @@ class AgentServer(backend.util.service.AppProcess):
user_id: str,
provider: ProviderName,
credentials: Credentials,
) -> Credentials:
from .features.integrations.router import create_credentials, get_credential
):
from backend.api.features.integrations.router import (
create_credentials,
get_credential,
)
try:
return await create_credentials(

View File

@@ -1,3 +1,4 @@
import re
from typing import Any
from backend.blocks._base import (
@@ -19,6 +20,33 @@ from backend.blocks.llm import (
)
from backend.data.model import APIKeyCredentials, NodeExecutionStats, SchemaField
# Minimum max_output_tokens accepted by OpenAI-compatible APIs.
# A true/false answer fits comfortably within this budget.
MIN_LLM_OUTPUT_TOKENS = 16
def _parse_boolean_response(response_text: str) -> tuple[bool, str | None]:
"""Parse an LLM response into a boolean result.
Returns a ``(result, error)`` tuple. *error* is ``None`` when the
response is unambiguous; otherwise it contains a diagnostic message
and *result* defaults to ``False``.
"""
text = response_text.strip().lower()
if text == "true":
return True, None
if text == "false":
return False, None
# Fuzzy match: use word boundaries to avoid false positives like "untrue".
tokens = set(re.findall(r"\b(true|false|yes|no|1|0)\b", text))
if tokens == {"true"} or tokens == {"yes"} or tokens == {"1"}:
return True, None
if tokens == {"false"} or tokens == {"no"} or tokens == {"0"}:
return False, None
return False, f"Unclear AI response: '{response_text}'"
class AIConditionBlock(AIBlockBase):
"""
@@ -162,54 +190,26 @@ class AIConditionBlock(AIBlockBase):
]
# Call the LLM
try:
response = await self.llm_call(
credentials=credentials,
llm_model=input_data.model,
prompt=prompt,
max_tokens=10, # We only expect a true/false response
response = await self.llm_call(
credentials=credentials,
llm_model=input_data.model,
prompt=prompt,
max_tokens=MIN_LLM_OUTPUT_TOKENS,
)
# Extract the boolean result from the response
result, error = _parse_boolean_response(response.response)
if error:
yield "error", error
# Update internal stats
self.merge_stats(
NodeExecutionStats(
input_token_count=response.prompt_tokens,
output_token_count=response.completion_tokens,
)
# Extract the boolean result from the response
response_text = response.response.strip().lower()
if response_text == "true":
result = True
elif response_text == "false":
result = False
else:
# If the response is not clear, try to interpret it using word boundaries
import re
# Use word boundaries to avoid false positives like 'untrue' or '10'
tokens = set(re.findall(r"\b(true|false|yes|no|1|0)\b", response_text))
if tokens == {"true"} or tokens == {"yes"} or tokens == {"1"}:
result = True
elif tokens == {"false"} or tokens == {"no"} or tokens == {"0"}:
result = False
else:
# Unclear or conflicting response - default to False and yield error
result = False
yield "error", f"Unclear AI response: '{response.response}'"
# Update internal stats
self.merge_stats(
NodeExecutionStats(
input_token_count=response.prompt_tokens,
output_token_count=response.completion_tokens,
)
)
self.prompt = response.prompt
except Exception as e:
# In case of any error, default to False to be safe
result = False
# Log the error but don't fail the block execution
import logging
logger = logging.getLogger(__name__)
logger.error(f"AI condition evaluation failed: {str(e)}")
yield "error", f"AI evaluation failed: {str(e)}"
)
self.prompt = response.prompt
# Yield results
yield "result", result
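The parsing rule introduced in this hunk can be tried in isolation. The following is a minimal sketch that copies the logic of `_parse_boolean_response` into a standalone function; the names `parse_boolean_response`, `_TOKEN_RE`, `samples`, and `results` are illustrative and not part of the change.

```python
from __future__ import annotations

import re

# Standalone copy of the parsing rule from the diff, for experimentation only.
_TOKEN_RE = re.compile(r"\b(true|false|yes|no|1|0)\b")


def parse_boolean_response(response_text: str) -> tuple[bool, str | None]:
    """Return (result, error); error is None when the answer is unambiguous."""
    text = response_text.strip().lower()
    if text == "true":
        return True, None
    if text == "false":
        return False, None
    # Collect the distinct boolean-like words; exactly one kind must be present.
    tokens = set(_TOKEN_RE.findall(text))
    if tokens in ({"true"}, {"yes"}, {"1"}):
        return True, None
    if tokens in ({"false"}, {"no"}, {"0"}):
        return False, None
    return False, f"Unclear AI response: '{response_text}'"


# Hypothetical sample answers an LLM might return.
samples = ["True", "The answer is yes.", "untrue", "10", "yes, true"]
results = {s: parse_boolean_response(s) for s in samples}
```

Because the rule requires exactly one distinct token, a mixed answer such as "yes, true" is reported as unclear rather than true, just like "true and false".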


@@ -0,0 +1,147 @@
"""Tests for AIConditionBlock: regression coverage for max_tokens and error propagation."""
from __future__ import annotations
from typing import cast
import pytest
from backend.blocks.ai_condition import (
MIN_LLM_OUTPUT_TOKENS,
AIConditionBlock,
_parse_boolean_response,
)
from backend.blocks.llm import (
DEFAULT_LLM_MODEL,
TEST_CREDENTIALS,
TEST_CREDENTIALS_INPUT,
AICredentials,
LLMResponse,
)
_TEST_AI_CREDENTIALS = cast(AICredentials, TEST_CREDENTIALS_INPUT)
# ---------------------------------------------------------------------------
# Helper to collect all yields from the async generator
# ---------------------------------------------------------------------------
async def _collect_outputs(block: AIConditionBlock, input_data, credentials):
outputs: dict[str, object] = {}
async for name, value in block.run(input_data, credentials=credentials):
outputs[name] = value
return outputs
def _make_input(**overrides) -> AIConditionBlock.Input:
defaults: dict = {
"input_value": "hello@example.com",
"condition": "the input is an email address",
"yes_value": "yes!",
"no_value": "no!",
"model": DEFAULT_LLM_MODEL,
"credentials": TEST_CREDENTIALS_INPUT,
}
defaults.update(overrides)
return AIConditionBlock.Input(**defaults)
def _mock_llm_response(response_text: str) -> LLMResponse:
return LLMResponse(
raw_response="",
prompt=[],
response=response_text,
tool_calls=None,
prompt_tokens=10,
completion_tokens=5,
reasoning=None,
)
# ---------------------------------------------------------------------------
# _parse_boolean_response unit tests
# ---------------------------------------------------------------------------
class TestParseBooleanResponse:
def test_true_exact(self):
assert _parse_boolean_response("true") == (True, None)
def test_false_exact(self):
assert _parse_boolean_response("false") == (False, None)
def test_true_with_whitespace(self):
assert _parse_boolean_response(" True ") == (True, None)
def test_yes_fuzzy(self):
assert _parse_boolean_response("Yes") == (True, None)
def test_no_fuzzy(self):
assert _parse_boolean_response("no") == (False, None)
def test_one_fuzzy(self):
assert _parse_boolean_response("1") == (True, None)
def test_zero_fuzzy(self):
assert _parse_boolean_response("0") == (False, None)
def test_unclear_response(self):
result, error = _parse_boolean_response("I'm not sure")
assert result is False
assert error is not None
assert "Unclear" in error
def test_conflicting_tokens(self):
result, error = _parse_boolean_response("true and false")
assert result is False
assert error is not None
# ---------------------------------------------------------------------------
# Regression: max_tokens is set to MIN_LLM_OUTPUT_TOKENS
# ---------------------------------------------------------------------------
class TestMaxTokensRegression:
@pytest.mark.asyncio
async def test_llm_call_receives_min_output_tokens(self):
"""max_tokens must be MIN_LLM_OUTPUT_TOKENS (16); the previous value
of 10 was too low and caused OpenAI to reject the request."""
block = AIConditionBlock()
captured_kwargs: dict = {}
async def spy_llm_call(**kwargs):
captured_kwargs.update(kwargs)
return _mock_llm_response("true")
block.llm_call = spy_llm_call # type: ignore[assignment]
input_data = _make_input()
await _collect_outputs(block, input_data, credentials=TEST_CREDENTIALS)
assert captured_kwargs["max_tokens"] == MIN_LLM_OUTPUT_TOKENS
assert captured_kwargs["max_tokens"] == 16
# ---------------------------------------------------------------------------
# Regression: exceptions from llm_call must propagate
# ---------------------------------------------------------------------------
class TestExceptionPropagation:
@pytest.mark.asyncio
async def test_llm_call_exception_propagates(self):
"""If llm_call raises, the exception must NOT be swallowed.
Previously the block caught all exceptions and silently returned
result=False."""
block = AIConditionBlock()
async def boom(**kwargs):
raise RuntimeError("LLM provider error")
block.llm_call = boom # type: ignore[assignment]
input_data = _make_input()
with pytest.raises(RuntimeError, match="LLM provider error"):
await _collect_outputs(block, input_data, credentials=TEST_CREDENTIALS)
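The propagation behaviour these tests pin down can be seen without the real block. Below is a minimal sketch, assuming a toy async generator `run_block` in place of `AIConditionBlock.run`; every name in it is hypothetical. With no `try/except` around the awaited call, the error reaches whoever iterates the generator.

```python
import asyncio


async def run_block(llm_call):
    """Toy stand-in for a block's run(): one awaited call, then one output."""
    response = await llm_call()  # no try/except: failures surface to the caller
    yield "result", response == "true"


async def collect(llm_call):
    """Gather every (name, value) pair the generator yields into a dict."""
    outputs = {}
    async for name, value in run_block(llm_call):
        outputs[name] = value
    return outputs


async def ok():
    return "true"


async def boom():
    raise RuntimeError("LLM provider error")


def demo():
    """Run the happy path, then the failing path, and report both outcomes."""
    good = asyncio.run(collect(ok))
    try:
        asyncio.run(collect(boom))
        raised = None
    except RuntimeError as exc:
        raised = str(exc)
    return good, raised
```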


@@ -15,6 +15,12 @@ from backend.blocks._base import (
BlockSchemaInput,
BlockSchemaOutput,
)
from backend.copilot.permissions import (
CopilotPermissions,
ToolName,
all_known_tool_names,
validate_block_identifiers,
)
from backend.data.model import SchemaField
if TYPE_CHECKING:
@@ -96,6 +102,65 @@ class AutoPilotBlock(Block):
advanced=True,
)
tools: list[ToolName] = SchemaField(
description=(
"Tool names to filter. Works with tools_exclude to form an "
"allow-list or deny-list. "
"Leave empty to apply no tool filter."
),
default=[],
advanced=True,
)
tools_exclude: bool = SchemaField(
description=(
"Controls how the 'tools' list is interpreted. "
"True (default): 'tools' is a deny-list — listed tools are blocked, "
"all others are allowed. An empty 'tools' list means allow everything. "
"False: 'tools' is an allow-list — only listed tools are permitted."
),
default=True,
advanced=True,
)
blocks: list[str] = SchemaField(
description=(
"Block identifiers to filter when the copilot uses run_block. "
"Each entry can be: a block name (e.g. 'HTTP Request'), "
"a full block UUID, or the first 8 hex characters of the UUID "
"(e.g. 'c069dc6b'). Works with blocks_exclude. "
"Leave empty to apply no block filter."
),
default=[],
advanced=True,
)
blocks_exclude: bool = SchemaField(
description=(
"Controls how the 'blocks' list is interpreted. "
"True (default): 'blocks' is a deny-list — listed blocks are blocked, "
"all others are allowed. An empty 'blocks' list means allow everything. "
"False: 'blocks' is an allow-list — only listed blocks are permitted."
),
default=True,
advanced=True,
)
dry_run: bool = SchemaField(
description=(
"When enabled, run_block and run_agent tool calls in this "
"autopilot session are forced to use dry-run simulation mode. "
"No real API calls, side effects, or credits are consumed "
"by those tools. Useful for testing agent wiring and "
"previewing outputs. "
"Only applies when creating a new session (session_id is empty). "
"When reusing an existing session_id, the session's original "
"dry_run setting is preserved."
),
default=False,
advanced=True,
)
# timeout_seconds removed: the SDK manages its own heartbeat-based
# timeouts internally; wrapping with asyncio.timeout corrupts the
# SDK's internal stream (see service.py CRITICAL comment).
@@ -182,11 +247,11 @@ class AutoPilotBlock(Block):
},
)
async def create_session(self, user_id: str) -> str:
async def create_session(self, user_id: str, *, dry_run: bool) -> str:
"""Create a new chat session and return its ID (mockable for tests)."""
from backend.copilot.model import create_chat_session
from backend.copilot.model import create_chat_session # avoid circular import
session = await create_chat_session(user_id)
session = await create_chat_session(user_id, dry_run=dry_run)
return session.session_id
async def execute_copilot(
@@ -196,6 +261,7 @@ class AutoPilotBlock(Block):
session_id: str,
max_recursion_depth: int,
user_id: str,
permissions: "CopilotPermissions | None" = None,
) -> tuple[str, list[ToolCallEntry], str, str, TokenUsage]:
"""Invoke the copilot and collect all stream results.
@@ -209,14 +275,21 @@ class AutoPilotBlock(Block):
session_id: Chat session to use.
max_recursion_depth: Maximum allowed recursion nesting.
user_id: Authenticated user ID.
permissions: Optional capability filter restricting tools/blocks.
Returns:
A tuple of (response_text, tool_calls, history_json, session_id, usage).
"""
from backend.copilot.sdk.collect import collect_copilot_response
from backend.copilot.sdk.collect import (
collect_copilot_response, # avoid circular import
)
tokens = _check_recursion(max_recursion_depth)
perm_token = None
try:
effective_permissions, perm_token = _merge_inherited_permissions(
permissions
)
effective_prompt = prompt
if system_context:
effective_prompt = f"[System Context: {system_context}]\n\n{prompt}"
@@ -225,6 +298,7 @@ class AutoPilotBlock(Block):
session_id=session_id,
message=effective_prompt,
user_id=user_id,
permissions=effective_permissions,
)
# Build a lightweight conversation summary from streamed data.
@@ -271,6 +345,8 @@ class AutoPilotBlock(Block):
)
finally:
_reset_recursion(tokens)
if perm_token is not None:
_inherited_permissions.reset(perm_token)
async def run(
self,
@@ -295,11 +371,20 @@ class AutoPilotBlock(Block):
yield "error", "max_recursion_depth must be at least 1."
return
# Validate and build permissions eagerly — fail before creating a session.
permissions = await _build_and_validate_permissions(input_data)
if isinstance(permissions, str):
# Validation error returned as a string message.
yield "error", permissions
return
# Create session eagerly so the user always gets the session_id,
# even if the downstream stream fails (avoids orphaned sessions).
sid = input_data.session_id
if not sid:
sid = await self.create_session(execution_context.user_id)
sid = await self.create_session(
execution_context.user_id, dry_run=input_data.dry_run
)
# NOTE: No asyncio.timeout() here — the SDK manages its own
# heartbeat-based timeouts internally. Wrapping with asyncio.timeout
@@ -312,6 +397,7 @@ class AutoPilotBlock(Block):
session_id=sid,
max_recursion_depth=input_data.max_recursion_depth,
user_id=execution_context.user_id,
permissions=permissions,
)
yield "response", response
@@ -374,3 +460,78 @@ def _reset_recursion(
"""Restore recursion depth and limit to their previous values."""
_autopilot_recursion_depth.reset(tokens[0])
_autopilot_recursion_limit.reset(tokens[1])
# ---------------------------------------------------------------------------
# Permission helpers
# ---------------------------------------------------------------------------
# Inherited permissions from a parent AutoPilotBlock execution.
# This acts as a ceiling: child executions can only be more restrictive.
_inherited_permissions: contextvars.ContextVar["CopilotPermissions | None"] = (
contextvars.ContextVar("_inherited_permissions", default=None)
)
async def _build_and_validate_permissions(
input_data: "AutoPilotBlock.Input",
) -> "CopilotPermissions | str":
"""Build a :class:`CopilotPermissions` from block input and validate it.
Returns a :class:`CopilotPermissions` on success or a human-readable
error string if validation fails.
"""
# Tool names are validated by Pydantic via the ToolName Literal type
# at model construction time — no runtime check needed here.
# Validate block identifiers against live block registry.
if input_data.blocks:
invalid_blocks = await validate_block_identifiers(input_data.blocks)
if invalid_blocks:
return (
f"Unknown block identifier(s) in 'blocks': {invalid_blocks}. "
"Use find_block to discover valid block names and IDs. "
"You may also use the first 8 characters of a block UUID."
)
return CopilotPermissions(
tools=list(input_data.tools),
tools_exclude=input_data.tools_exclude,
blocks=input_data.blocks,
blocks_exclude=input_data.blocks_exclude,
)
def _merge_inherited_permissions(
permissions: "CopilotPermissions | None",
) -> "tuple[CopilotPermissions | None, contextvars.Token[CopilotPermissions | None] | None]":
"""Merge *permissions* with any inherited parent permissions.
The merged result is stored back into the contextvar so that any nested
AutoPilotBlock invocation (sub-agent) inherits the merged ceiling.
Returns a tuple of (merged_permissions, reset_token). The caller MUST
reset the contextvar via ``_inherited_permissions.reset(token)`` in a
``finally`` block when ``reset_token`` is not None — this prevents
permission leakage between sequential independent executions in the same
asyncio task.
"""
parent = _inherited_permissions.get()
if permissions is None and parent is None:
return None, None
all_tools = all_known_tool_names()
if permissions is None:
permissions = CopilotPermissions() # allow-all; will be narrowed by parent
merged = (
permissions.merged_with_parent(parent, all_tools)
if parent is not None
else permissions
)
# Store merged permissions as the new inherited ceiling for nested calls.
# Return the token so the caller can restore the previous value in finally.
token = _inherited_permissions.set(merged)
return merged, token


@@ -0,0 +1,265 @@
"""Tests for AutoPilotBlock permission fields and validation."""
from __future__ import annotations
from unittest.mock import AsyncMock, MagicMock, patch
import pytest
from pydantic import ValidationError
from backend.blocks.autopilot import (
AutoPilotBlock,
_build_and_validate_permissions,
_inherited_permissions,
_merge_inherited_permissions,
)
from backend.copilot.permissions import CopilotPermissions, all_known_tool_names
from backend.data.execution import ExecutionContext
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _make_input(**kwargs) -> AutoPilotBlock.Input:
defaults = {
"prompt": "Do something",
"system_context": "",
"session_id": "",
"max_recursion_depth": 3,
"tools": [],
"tools_exclude": True,
"blocks": [],
"blocks_exclude": True,
}
defaults.update(kwargs)
return AutoPilotBlock.Input(**defaults)
# ---------------------------------------------------------------------------
# _build_and_validate_permissions
# ---------------------------------------------------------------------------
@pytest.mark.asyncio
class TestBuildAndValidatePermissions:
async def test_empty_inputs_returns_empty_permissions(self):
inp = _make_input()
result = await _build_and_validate_permissions(inp)
assert isinstance(result, CopilotPermissions)
assert result.is_empty()
async def test_valid_tool_names_accepted(self):
inp = _make_input(tools=["run_block", "web_fetch"], tools_exclude=True)
result = await _build_and_validate_permissions(inp)
assert isinstance(result, CopilotPermissions)
assert result.tools == ["run_block", "web_fetch"]
assert result.tools_exclude is True
async def test_invalid_tool_rejected_by_pydantic(self):
"""Invalid tool names are now caught at Pydantic validation time
(Literal type), before ``_build_and_validate_permissions`` is called."""
with pytest.raises(ValidationError, match="not_a_real_tool"):
_make_input(tools=["not_a_real_tool"])
async def test_valid_block_name_accepted(self):
mock_block_cls = MagicMock()
mock_block_cls.return_value.name = "HTTP Request"
with patch(
"backend.blocks.get_blocks",
return_value={"c069dc6b-c3ed-4c12-b6e5-d47361e64ce6": mock_block_cls},
):
inp = _make_input(blocks=["HTTP Request"], blocks_exclude=True)
result = await _build_and_validate_permissions(inp)
assert isinstance(result, CopilotPermissions)
assert result.blocks == ["HTTP Request"]
async def test_valid_partial_uuid_accepted(self):
mock_block_cls = MagicMock()
mock_block_cls.return_value.name = "HTTP Request"
with patch(
"backend.blocks.get_blocks",
return_value={"c069dc6b-c3ed-4c12-b6e5-d47361e64ce6": mock_block_cls},
):
inp = _make_input(blocks=["c069dc6b"], blocks_exclude=False)
result = await _build_and_validate_permissions(inp)
assert isinstance(result, CopilotPermissions)
async def test_invalid_block_identifier_returns_error(self):
mock_block_cls = MagicMock()
mock_block_cls.return_value.name = "HTTP Request"
with patch(
"backend.blocks.get_blocks",
return_value={"c069dc6b-c3ed-4c12-b6e5-d47361e64ce6": mock_block_cls},
):
inp = _make_input(blocks=["totally_fake_block"])
result = await _build_and_validate_permissions(inp)
assert isinstance(result, str)
assert "totally_fake_block" in result
assert "Unknown block identifier" in result
async def test_sdk_builtin_tool_names_accepted(self):
inp = _make_input(tools=["Read", "Task", "WebSearch"], tools_exclude=False)
result = await _build_and_validate_permissions(inp)
assert isinstance(result, CopilotPermissions)
assert not result.tools_exclude
async def test_empty_blocks_skips_validation(self):
# Should not call validate_block_identifiers at all when blocks=[].
with patch(
"backend.copilot.permissions.validate_block_identifiers"
) as mock_validate:
inp = _make_input(blocks=[])
await _build_and_validate_permissions(inp)
mock_validate.assert_not_called()
# ---------------------------------------------------------------------------
# _merge_inherited_permissions
# ---------------------------------------------------------------------------
class TestMergeInheritedPermissions:
def test_no_permissions_no_parent_returns_none(self):
merged, token = _merge_inherited_permissions(None)
assert merged is None
assert token is None
def test_permissions_no_parent_returned_unchanged(self):
perms = CopilotPermissions(tools=["bash_exec"], tools_exclude=True)
merged, token = _merge_inherited_permissions(perms)
try:
assert merged is perms
assert token is not None
finally:
if token is not None:
_inherited_permissions.reset(token)
def test_child_narrows_parent(self):
parent = CopilotPermissions(tools=["bash_exec"], tools_exclude=True)
# Set parent as inherited
outer_token = _inherited_permissions.set(parent)
try:
child = CopilotPermissions(tools=["web_fetch"], tools_exclude=True)
merged, inner_token = _merge_inherited_permissions(child)
try:
assert merged is not None
all_t = all_known_tool_names()
effective = merged.effective_allowed_tools(all_t)
assert "bash_exec" not in effective
assert "web_fetch" not in effective
finally:
if inner_token is not None:
_inherited_permissions.reset(inner_token)
finally:
_inherited_permissions.reset(outer_token)
def test_none_permissions_with_parent_uses_parent(self):
parent = CopilotPermissions(tools=["bash_exec"], tools_exclude=True)
outer_token = _inherited_permissions.set(parent)
try:
merged, inner_token = _merge_inherited_permissions(None)
try:
assert merged is not None
# Merged should have parent's restrictions
effective = merged.effective_allowed_tools(all_known_tool_names())
assert "bash_exec" not in effective
finally:
if inner_token is not None:
_inherited_permissions.reset(inner_token)
finally:
_inherited_permissions.reset(outer_token)
def test_child_cannot_expand_parent_whitelist(self):
parent = CopilotPermissions(tools=["run_block"], tools_exclude=False)
outer_token = _inherited_permissions.set(parent)
try:
# Child tries to allow more tools
child = CopilotPermissions(
tools=["run_block", "bash_exec"], tools_exclude=False
)
merged, inner_token = _merge_inherited_permissions(child)
try:
assert merged is not None
effective = merged.effective_allowed_tools(all_known_tool_names())
assert "bash_exec" not in effective
assert "run_block" in effective
finally:
if inner_token is not None:
_inherited_permissions.reset(inner_token)
finally:
_inherited_permissions.reset(outer_token)
# ---------------------------------------------------------------------------
# AutoPilotBlock.run — validation integration
# ---------------------------------------------------------------------------
@pytest.mark.asyncio
class TestAutoPilotBlockRunPermissions:
async def _collect_outputs(self, block, input_data, user_id="test-user"):
"""Helper to collect all yields from block.run()."""
ctx = ExecutionContext(
user_id=user_id,
graph_id="g1",
graph_exec_id="ge1",
node_exec_id="ne1",
node_id="n1",
)
outputs = {}
async for key, val in block.run(input_data, execution_context=ctx):
outputs[key] = val
return outputs
async def test_invalid_tool_rejected_by_pydantic(self):
"""Invalid tool names are caught at Pydantic validation (Literal type)."""
with pytest.raises(ValidationError, match="not_a_tool"):
_make_input(tools=["not_a_tool"])
async def test_invalid_block_yields_error(self):
mock_block_cls = MagicMock()
mock_block_cls.return_value.name = "HTTP Request"
with patch(
"backend.blocks.get_blocks",
return_value={"c069dc6b-c3ed-4c12-b6e5-d47361e64ce6": mock_block_cls},
):
block = AutoPilotBlock()
inp = _make_input(blocks=["nonexistent_block"])
outputs = await self._collect_outputs(block, inp)
assert "error" in outputs
assert "nonexistent_block" in outputs["error"]
async def test_empty_prompt_yields_error_before_permission_check(self):
block = AutoPilotBlock()
inp = _make_input(prompt=" ", tools=["run_block"])
outputs = await self._collect_outputs(block, inp)
assert "error" in outputs
assert "Prompt cannot be empty" in outputs["error"]
async def test_valid_permissions_passed_to_execute(self):
"""Permissions are forwarded to execute_copilot when valid."""
block = AutoPilotBlock()
captured: dict = {}
async def fake_execute_copilot(self_inner, **kwargs):
captured["permissions"] = kwargs.get("permissions")
return (
"ok",
[],
'[{"role":"user","content":"hi"}]',
"test-sid",
{"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
)
with patch.object(
AutoPilotBlock, "create_session", new=AsyncMock(return_value="test-sid")
), patch.object(AutoPilotBlock, "execute_copilot", new=fake_execute_copilot):
inp = _make_input(tools=["run_block"], tools_exclude=False)
outputs = await self._collect_outputs(block, inp)
assert "error" not in outputs
perms = captured.get("permissions")
assert isinstance(perms, CopilotPermissions)
assert perms.tools == ["run_block"]
assert perms.tools_exclude is False
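The allow-list and deny-list semantics described in the `tools` / `tools_exclude` field descriptions can be written down as a small pure function. This is a sketch of one reading of those descriptions, not the `CopilotPermissions` implementation: `effective_allowed` and `ALL_TOOLS` are hypothetical, and treating an empty list as "no filter" in both modes is an assumption based on the "Leave empty to apply no tool filter" wording.

```python
# Hypothetical universe of tool names, used only for this illustration.
ALL_TOOLS = frozenset({"run_block", "bash_exec", "web_fetch"})


def effective_allowed(listed, exclude, all_names=ALL_TOOLS):
    """Resolve a list plus an exclude flag into the set of usable names."""
    listed_set = set(listed)
    if not listed_set:
        # Assumption: an empty list applies no filter in either mode.
        return set(all_names)
    if exclude:
        # Deny-list: everything except the listed names.
        return set(all_names) - listed_set
    # Allow-list: only listed names that actually exist.
    return set(all_names) & listed_set
```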


@@ -73,7 +73,7 @@ class ReadDiscordMessagesBlock(Block):
id="df06086a-d5ac-4abb-9996-2ad0acb2eff7",
input_schema=ReadDiscordMessagesBlock.Input, # Assign input schema
output_schema=ReadDiscordMessagesBlock.Output, # Assign output schema
description="Reads messages from a Discord channel using a bot token.",
description="Reads new messages from a Discord channel using a bot token and triggers when a new message is posted",
categories={BlockCategory.SOCIAL},
test_input={
"continuous_read": False,


@@ -1,5 +1,6 @@
import asyncio
import base64
import re
from abc import ABC
from email import encoders
from email.mime.base import MIMEBase
@@ -8,7 +9,7 @@ from email.mime.text import MIMEText
from email.policy import SMTP
from email.utils import getaddresses, parseaddr
from pathlib import Path
from typing import List, Literal, Optional
from typing import List, Literal, Optional, Protocol, runtime_checkable
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
@@ -42,8 +43,52 @@ NO_WRAP_POLICY = SMTP.clone(max_line_length=0)
def serialize_email_recipients(recipients: list[str]) -> str:
"""Serialize recipients list to comma-separated string."""
return ", ".join(recipients)
"""Serialize recipients list to comma-separated string.
Strips leading/trailing whitespace from each address to keep MIME
headers clean (mirrors the strip done in ``validate_email_recipients``).
"""
return ", ".join(addr.strip() for addr in recipients)
# RFC 5322 simplified pattern: local@domain where domain has at least one dot
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
def validate_email_recipients(recipients: list[str], field_name: str = "to") -> None:
"""Validate that all recipients are plausible email addresses.
Raises ``ValueError`` with a user-friendly message listing every
invalid entry so the caller (or LLM) can correct them in one pass.
"""
invalid = [addr for addr in recipients if not _EMAIL_RE.match(addr.strip())]
if invalid:
formatted = ", ".join(f"'{a}'" for a in invalid)
raise ValueError(
f"Invalid email address(es) in '{field_name}': {formatted}. "
f"Each entry must be a valid email address (e.g. user@example.com)."
)
@runtime_checkable
class HasRecipients(Protocol):
to: list[str]
cc: list[str]
bcc: list[str]
def validate_all_recipients(input_data: HasRecipients) -> None:
"""Validate to/cc/bcc recipient fields on an input namespace.
Calls ``validate_email_recipients`` for ``to`` (required) and
``cc``/``bcc`` (when non-empty), raising ``ValueError`` on the
first field that contains an invalid address.
"""
validate_email_recipients(input_data.to, "to")
if input_data.cc:
validate_email_recipients(input_data.cc, "cc")
if input_data.bcc:
validate_email_recipients(input_data.bcc, "bcc")
def _make_mime_text(
@@ -100,14 +145,16 @@ async def create_mime_message(
) -> str:
"""Create a MIME message with attachments and return base64-encoded raw message."""
validate_all_recipients(input_data)
message = MIMEMultipart()
message["to"] = serialize_email_recipients(input_data.to)
message["subject"] = input_data.subject
if input_data.cc:
message["cc"] = ", ".join(input_data.cc)
message["cc"] = serialize_email_recipients(input_data.cc)
if input_data.bcc:
message["bcc"] = ", ".join(input_data.bcc)
message["bcc"] = serialize_email_recipients(input_data.bcc)
# Use the new helper function with content_type if available
content_type = getattr(input_data, "content_type", None)
@@ -1167,13 +1214,15 @@ async def _build_reply_message(
references.append(headers["message-id"])
# Create MIME message
validate_all_recipients(input_data)
msg = MIMEMultipart()
if input_data.to:
msg["To"] = ", ".join(input_data.to)
msg["To"] = serialize_email_recipients(input_data.to)
if input_data.cc:
msg["Cc"] = ", ".join(input_data.cc)
msg["Cc"] = serialize_email_recipients(input_data.cc)
if input_data.bcc:
msg["Bcc"] = ", ".join(input_data.bcc)
msg["Bcc"] = serialize_email_recipients(input_data.bcc)
msg["Subject"] = subject
if headers.get("message-id"):
msg["In-Reply-To"] = headers["message-id"]
@@ -1685,13 +1734,16 @@ To: {original_to}
else:
body = f"{forward_header}\n\n{original_body}"
# Validate all recipient lists before building the MIME message
validate_all_recipients(input_data)
# Create MIME message
msg = MIMEMultipart()
msg["To"] = ", ".join(input_data.to)
msg["To"] = serialize_email_recipients(input_data.to)
if input_data.cc:
msg["Cc"] = ", ".join(input_data.cc)
msg["Cc"] = serialize_email_recipients(input_data.cc)
if input_data.bcc:
msg["Bcc"] = ", ".join(input_data.bcc)
msg["Bcc"] = serialize_email_recipients(input_data.bcc)
msg["Subject"] = subject
# Add body with proper content type
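To see which addresses the simplified pattern accepts, the regex and the strip-then-join behaviour from this diff can be exercised on their own. This is a minimal sketch; `find_invalid`, `serialize`, and the sample addresses are illustrative only.

```python
import re

# Same simplified pattern as in the diff: local@domain with a dot in the domain.
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def find_invalid(recipients):
    """Return the entries that do not look like an email address."""
    return [addr for addr in recipients if not _EMAIL_RE.match(addr.strip())]


def serialize(recipients):
    """Join recipients into a header value, trimming stray whitespace."""
    return ", ".join(addr.strip() for addr in recipients)


checked = find_invalid(
    [" alice@example.com ", "bob@localhost", "carol", "a@b@c.com"]
)
header = serialize([" alice@example.com ", "dave@example.org"])
```

Note that a dotless host such as `bob@localhost` is rejected by this pattern, which is stricter than RFC 5322 itself.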


@@ -2,6 +2,8 @@ import copy
from datetime import date, time
from typing import Any, Optional
from pydantic import AliasChoices, Field
from backend.blocks._base import (
Block,
BlockCategory,
@@ -28,9 +30,9 @@ class AgentInputBlock(Block):
"""
This block is used to provide input to the graph.
It takes in a value, name, description, default values list and bool to limit selection to default values.
It takes in a value, name, and description.
It Outputs the value passed as input.
It outputs the value passed as input.
"""
class Input(BlockSchemaInput):
@@ -47,12 +49,6 @@ class AgentInputBlock(Block):
default=None,
advanced=True,
)
placeholder_values: list = SchemaField(
description="The placeholder values to be passed as input.",
default_factory=list,
advanced=True,
hidden=True,
)
advanced: bool = SchemaField(
description="Whether to show the input in the advanced section, if the field is not required.",
default=False,
@@ -65,10 +61,7 @@ class AgentInputBlock(Block):
)
def generate_schema(self):
schema = copy.deepcopy(self.get_field_schema("value"))
if possible_values := self.placeholder_values:
schema["enum"] = possible_values
return schema
return copy.deepcopy(self.get_field_schema("value"))
class Output(BlockSchema):
# Use BlockSchema to avoid automatic error field for interface definition
@@ -86,18 +79,16 @@ class AgentInputBlock(Block):
"value": "Hello, World!",
"name": "input_1",
"description": "Example test input.",
"placeholder_values": [],
},
{
"value": "Hello, World!",
"value": 42,
"name": "input_2",
"description": "Example test input with placeholders.",
"placeholder_values": ["Hello, World!"],
"description": "Example numeric input.",
},
],
"test_output": [
("result", "Hello, World!"),
("result", "Hello, World!"),
("result", 42),
],
"categories": {BlockCategory.INPUT, BlockCategory.BASIC},
"block_type": BlockType.INPUT,
@@ -245,13 +236,11 @@ class AgentShortTextInputBlock(AgentInputBlock):
"value": "Hello",
"name": "short_text_1",
"description": "Short text example 1",
"placeholder_values": [],
},
{
"value": "Quick test",
"name": "short_text_2",
"description": "Short text example 2",
"placeholder_values": ["Quick test", "Another option"],
},
],
test_output=[
@@ -285,13 +274,11 @@ class AgentLongTextInputBlock(AgentInputBlock):
"value": "Lorem ipsum dolor sit amet...",
"name": "long_text_1",
"description": "Long text example 1",
"placeholder_values": [],
},
{
"value": "Another multiline text input.",
"name": "long_text_2",
"description": "Long text example 2",
"placeholder_values": ["Another multiline text input."],
},
],
test_output=[
@@ -325,13 +312,11 @@ class AgentNumberInputBlock(AgentInputBlock):
"value": 42,
"name": "number_input_1",
"description": "Number example 1",
"placeholder_values": [],
},
{
"value": 314,
"name": "number_input_2",
"description": "Number example 2",
"placeholder_values": [314, 2718],
},
],
test_output=[
@@ -484,7 +469,8 @@ class AgentFileInputBlock(AgentInputBlock):
class AgentDropdownInputBlock(AgentInputBlock):
"""
A specialized text input block that relies on placeholder_values to present a dropdown.
A specialized text input block that presents a dropdown selector
restricted to a fixed set of values.
"""
class Input(AgentInputBlock.Input):
@@ -494,13 +480,26 @@ class AgentDropdownInputBlock(AgentInputBlock):
advanced=False,
title="Default Value",
)
placeholder_values: list = SchemaField(
description="Possible values for the dropdown.",
# Use Field() directly (not SchemaField) to pass validation_alias,
# which handles backward compat for legacy "placeholder_values" across
# all construction paths (model_construct, __init__, model_validate).
options: list = Field(
default_factory=list,
advanced=False,
title="Dropdown Options",
description=(
"If provided, renders the input as a dropdown selector "
"restricted to these values. Leave empty for free-text input."
),
validation_alias=AliasChoices("options", "placeholder_values"),
json_schema_extra={"advanced": False, "secret": False},
)
def generate_schema(self):
schema = super().generate_schema()
if possible_values := self.options:
schema["enum"] = possible_values
return schema
class Output(AgentInputBlock.Output):
result: str = SchemaField(description="Selected dropdown value.")
@@ -515,13 +514,13 @@ class AgentDropdownInputBlock(AgentInputBlock):
{
"value": "Option A",
"name": "dropdown_1",
"placeholder_values": ["Option A", "Option B", "Option C"],
"options": ["Option A", "Option B", "Option C"],
"description": "Dropdown example 1",
},
{
"value": "Option C",
"name": "dropdown_2",
"placeholder_values": ["Option A", "Option B", "Option C"],
"options": ["Option A", "Option B", "Option C"],
"description": "Dropdown example 2",
},
],
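The rename from `placeholder_values` to `options` above keeps old agent JSON working by accepting either key and emitting `enum` only when values are present. The sketch below mimics that behavior with plain dictionaries so it runs without pydantic. `ALIAS_ORDER`, `resolve_options`, and the standalone `generate_schema` are hypothetical names for illustration, not part of the changed file, and the base schema shape is an assumption.

```python
from typing import Any

# Hypothetical stand-in for AliasChoices("options", "placeholder_values"):
# the canonical key is tried first, then the legacy one.
ALIAS_ORDER = ("options", "placeholder_values")


def resolve_options(raw: dict[str, Any]) -> list:
    """Return the dropdown options, preferring the canonical key."""
    for key in ALIAS_ORDER:
        if key in raw:
            return list(raw[key])
    return []


def generate_schema(raw: dict[str, Any]) -> dict[str, Any]:
    """Build a minimal schema; add "enum" only when options exist."""
    schema: dict[str, Any] = {"type": "string", "title": raw.get("name")}
    if possible_values := resolve_options(raw):
        schema["enum"] = possible_values
    return schema
```

With this shape, a legacy payload and a canonical payload yield the same `enum`, while an input with no options stays free-text.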


@@ -49,6 +49,9 @@ settings = Settings()
logger = TruncatedLogger(logging.getLogger(__name__), "[LLM-Block]")
fmt = TextFormatter(autoescape=False)
# HTTP status codes for user-caused errors that should not be reported to Sentry.
USER_ERROR_STATUS_CODES = (401, 403, 429)
LLMProviderName = Literal[
ProviderName.AIML_API,
ProviderName.ANTHROPIC,
@@ -101,6 +104,18 @@ class LlmModelMeta(EnumMeta):
class LlmModel(str, Enum, metaclass=LlmModelMeta):
@classmethod
def _missing_(cls, value: object) -> "LlmModel | None":
"""Handle provider-prefixed model names like 'anthropic/claude-sonnet-4-6'."""
if isinstance(value, str) and "/" in value:
stripped = value.split("/", 1)[1]
try:
return cls(stripped)
except ValueError:
return None
return None
# OpenAI models
O3_MINI = "o3-mini"
O3 = "o3-2025-04-16"
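The `_missing_` hook above can be tried in isolation. This is a minimal sketch, assuming a two-member enum; `Model`, `SONNET`, and `GEMINI` are hypothetical names, and the real `LlmModel` additionally uses a custom metaclass that is omitted here.

```python
from enum import Enum


class Model(str, Enum):
    @classmethod
    def _missing_(cls, value: object) -> "Model | None":
        # Only called when the direct value lookup fails.
        if isinstance(value, str) and "/" in value:
            stripped = value.split("/", 1)[1]  # drop the first prefix only
            try:
                return cls(stripped)
            except ValueError:
                return None
        return None

    # Hypothetical members for illustration.
    SONNET = "claude-sonnet-4-6"
    GEMINI = "google/gemini-2.5-pro"  # value that itself contains "/"
```

Values that already contain a slash resolve by direct lookup and never reach `_missing_`. A doubly prefixed value is stripped once, and the remainder is then found directly.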
@@ -709,6 +724,9 @@ def convert_openai_tool_fmt_to_anthropic(
def extract_openai_reasoning(response) -> str | None:
"""Extract reasoning from OpenAI-compatible response if available."""
"""Note: This will likely not work since the reasoning is not present in another Response API"""
if not response.choices:
logger.warning("LLM response has empty choices in extract_openai_reasoning")
return None
reasoning = None
choice = response.choices[0]
if hasattr(choice, "reasoning") and getattr(choice, "reasoning", None):
@@ -724,6 +742,9 @@ def extract_openai_reasoning(response) -> str | None:
def extract_openai_tool_calls(response) -> list[ToolContentBlock] | None:
"""Extract tool calls from OpenAI-compatible response."""
if not response.choices:
logger.warning("LLM response has empty choices in extract_openai_tool_calls")
return None
if response.choices[0].message.tool_calls:
return [
ToolContentBlock(
@@ -796,6 +817,19 @@ async def llm_call(
)
prompt = result.messages
# Sanitize unpaired surrogates in message content to prevent
# UnicodeEncodeError when httpx encodes the JSON request body.
for msg in prompt:
content = msg.get("content")
if isinstance(content, str):
try:
content.encode("utf-8")
except UnicodeEncodeError:
logger.warning("Sanitized unpaired surrogates in LLM prompt content")
msg["content"] = content.encode("utf-8", errors="surrogatepass").decode(
"utf-8", errors="replace"
)
# Calculate available tokens based on context window and input length
estimated_input_tokens = estimate_token_count(prompt)
model_max_output = llm_model.max_output_tokens or int(2**15)
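The surrogate sanitization step in this hunk can be exercised on its own. The helper name `sanitize_surrogates` is hypothetical; the encode and decode calls mirror the ones in the diff.

```python
def sanitize_surrogates(text: str) -> str:
    """Replace unpaired surrogates so the text can be encoded as UTF-8."""
    try:
        text.encode("utf-8")
        return text  # already clean, return unchanged
    except UnicodeEncodeError:
        # surrogatepass lets the lone surrogate through as raw bytes;
        # decoding with errors="replace" turns those bytes into U+FFFD.
        return text.encode("utf-8", errors="surrogatepass").decode(
            "utf-8", errors="replace"
        )
```

The result always encodes cleanly, at the cost of losing the original (already invalid) code units.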
@@ -878,65 +912,60 @@ async def llm_call(
client = anthropic.AsyncAnthropic(
api_key=credentials.api_key.get_secret_value()
)
try:
resp = await client.messages.create(
model=llm_model.value,
system=sysprompt,
messages=messages,
max_tokens=max_tokens,
tools=an_tools,
timeout=600,
)
resp = await client.messages.create(
model=llm_model.value,
system=sysprompt,
messages=messages,
max_tokens=max_tokens,
tools=an_tools,
timeout=600,
)
if not resp.content:
raise ValueError("No content returned from Anthropic.")
if not resp.content:
raise ValueError("No content returned from Anthropic.")
tool_calls = None
for content_block in resp.content:
# Anthropic is different to openai, need to iterate through
# the content blocks to find the tool calls
if content_block.type == "tool_use":
if tool_calls is None:
tool_calls = []
tool_calls.append(
ToolContentBlock(
id=content_block.id,
type=content_block.type,
function=ToolCall(
name=content_block.name,
arguments=json.dumps(content_block.input),
),
)
tool_calls = None
for content_block in resp.content:
# Anthropic is different to openai, need to iterate through
# the content blocks to find the tool calls
if content_block.type == "tool_use":
if tool_calls is None:
tool_calls = []
tool_calls.append(
ToolContentBlock(
id=content_block.id,
type=content_block.type,
function=ToolCall(
name=content_block.name,
arguments=json.dumps(content_block.input),
),
)
if not tool_calls and resp.stop_reason == "tool_use":
logger.warning(
f"Tool use stop reason but no tool calls found in content. {resp}"
)
reasoning = None
for content_block in resp.content:
if hasattr(content_block, "type") and content_block.type == "thinking":
reasoning = content_block.thinking
break
return LLMResponse(
raw_response=resp,
prompt=prompt,
response=(
resp.content[0].name
if isinstance(resp.content[0], anthropic.types.ToolUseBlock)
else getattr(resp.content[0], "text", "")
),
tool_calls=tool_calls,
prompt_tokens=resp.usage.input_tokens,
completion_tokens=resp.usage.output_tokens,
reasoning=reasoning,
if not tool_calls and resp.stop_reason == "tool_use":
logger.warning(
f"Tool use stop reason but no tool calls found in content. {resp}"
)
except anthropic.APIError as e:
error_message = f"Anthropic API error: {str(e)}"
logger.error(error_message)
raise ValueError(error_message)
reasoning = None
for content_block in resp.content:
if hasattr(content_block, "type") and content_block.type == "thinking":
reasoning = content_block.thinking
break
return LLMResponse(
raw_response=resp,
prompt=prompt,
response=(
resp.content[0].name
if isinstance(resp.content[0], anthropic.types.ToolUseBlock)
else getattr(resp.content[0], "text", "")
),
tool_calls=tool_calls,
prompt_tokens=resp.usage.input_tokens,
completion_tokens=resp.usage.output_tokens,
reasoning=reasoning,
)
elif provider == "groq":
if tools:
raise ValueError("Groq does not support tools.")
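The Anthropic branch above walks the response content blocks twice: once to gather tool calls and once to find the first thinking block. A rough sketch of that traversal follows, using a hypothetical `Block` dataclass in place of the SDK's content block types and plain dictionaries in place of `ToolContentBlock`.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Block:
    """Hypothetical stand-in for an Anthropic response content block."""
    type: str
    id: str = ""
    name: str = ""
    input: dict[str, Any] = field(default_factory=dict)
    text: str = ""
    thinking: str = ""


def collect_tool_calls(content: list[Block]) -> Optional[list[dict[str, str]]]:
    """Return tool calls found in the blocks, or None if there are none."""
    tool_calls = None
    for block in content:
        if block.type == "tool_use":
            if tool_calls is None:
                tool_calls = []
            tool_calls.append(
                {
                    "id": block.id,
                    "name": block.name,
                    "arguments": json.dumps(block.input),
                }
            )
    return tool_calls


def first_thinking(content: list[Block]) -> Optional[str]:
    """Return the text of the first thinking block, if any."""
    for block in content:
        if block.type == "thinking":
            return block.thinking
    return None
```

Returning `None` instead of an empty list keeps the "no tool calls" case distinguishable, which is what the warning on `stop_reason == "tool_use"` relies on.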
@@ -949,6 +978,8 @@ async def llm_call(
response_format=response_format, # type: ignore
max_tokens=max_tokens,
)
if not response.choices:
raise ValueError("Groq returned empty choices in response")
return LLMResponse(
raw_response=response.choices[0].message,
prompt=prompt,
@@ -1008,12 +1039,8 @@ async def llm_call(
parallel_tool_calls=parallel_tool_calls_param,
)
# If there's no response, raise an error
if not response.choices:
if response:
raise ValueError(f"OpenRouter error: {response}")
else:
raise ValueError("No response from OpenRouter.")
raise ValueError(f"OpenRouter returned empty choices: {response}")
tool_calls = extract_openai_tool_calls(response)
reasoning = extract_openai_reasoning(response)
@@ -1050,12 +1077,8 @@ async def llm_call(
parallel_tool_calls=parallel_tool_calls_param,
)
# If there's no response, raise an error
if not response.choices:
if response:
raise ValueError(f"Llama API error: {response}")
else:
raise ValueError("No response from Llama API.")
raise ValueError(f"Llama API returned empty choices: {response}")
tool_calls = extract_openai_tool_calls(response)
reasoning = extract_openai_reasoning(response)
@@ -1085,6 +1108,8 @@ async def llm_call(
messages=prompt, # type: ignore
max_tokens=max_tokens,
)
if not completion.choices:
raise ValueError("AI/ML API returned empty choices in response")
return LLMResponse(
raw_response=completion.choices[0].message,
@@ -1121,6 +1146,9 @@ async def llm_call(
parallel_tool_calls=parallel_tool_calls_param,
)
if not response.choices:
raise ValueError(f"v0 API returned empty choices: {response}")
tool_calls = extract_openai_tool_calls(response)
reasoning = extract_openai_reasoning(response)
@@ -1449,7 +1477,16 @@ class AIStructuredResponseGeneratorBlock(AIBlockBase):
yield "prompt", self.prompt
return
except Exception as e:
logger.exception(f"Error calling LLM: {e}")
is_user_error = (
isinstance(e, (anthropic.APIStatusError, openai.APIStatusError))
and e.status_code in USER_ERROR_STATUS_CODES
)
if is_user_error:
logger.warning(f"Error calling LLM: {e}")
error_feedback_message = f"Error calling LLM: {e}"
break
else:
logger.exception(f"Error calling LLM: {e}")
if (
"maximum context length" in str(e).lower()
or "token limit" in str(e).lower()
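The retry change in this hunk treats 401, 403, and 429 responses as user-caused: they stop the loop at once instead of being retried. A simplified, synchronous sketch of that control flow is shown here. `StatusError` and `run_with_retries` are hypothetical; the real code checks `anthropic.APIStatusError` and `openai.APIStatusError` inside an async block.

```python
USER_ERROR_STATUS_CODES = (401, 403, 429)


class StatusError(Exception):
    """Hypothetical stand-in for the SDKs' APIStatusError classes."""

    def __init__(self, status_code: int):
        super().__init__(f"Error code: {status_code}")
        self.status_code = status_code


def is_user_error(exc: Exception) -> bool:
    """True for errors the user must fix (bad key, no access, rate limit)."""
    return isinstance(exc, StatusError) and exc.status_code in USER_ERROR_STATUS_CODES


def run_with_retries(call, retries: int):
    """Call `call` up to `retries` times, stopping early on user errors."""
    last_error = None
    for _ in range(retries):
        try:
            return call()
        except Exception as exc:
            last_error = exc
            if is_user_error(exc):
                break  # retrying cannot help
    raise RuntimeError(f"Error calling LLM: {last_error}")
```

A 500 response still consumes every attempt, which matches the expectation in the server-error test later in this diff.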
@@ -1979,6 +2016,19 @@ class AIConversationBlock(AIBlockBase):
async def run(
self, input_data: Input, *, credentials: APIKeyCredentials, **kwargs
) -> BlockOutput:
has_messages = any(
isinstance(m, dict)
and isinstance(m.get("content"), str)
and bool(m["content"].strip())
for m in (input_data.messages or [])
)
has_prompt = bool(input_data.prompt and input_data.prompt.strip())
if not has_messages and not has_prompt:
raise ValueError(
"Cannot call LLM with no messages and no prompt. "
"Provide at least one message or a non-empty prompt."
)
response = await self.llm_call(
AIStructuredResponseGeneratorBlock.Input(
prompt=input_data.prompt,
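The input guard added to `AIConversationBlock.run` can be read as a single predicate. The function name `has_usable_input` is hypothetical; the expressions are taken from the hunk above.

```python
from typing import Any, Optional


def has_usable_input(messages: Optional[list[Any]], prompt: Optional[str]) -> bool:
    """True if at least one message or the prompt has non-blank text."""
    has_messages = any(
        isinstance(m, dict)
        and isinstance(m.get("content"), str)
        and bool(m["content"].strip())
        for m in (messages or [])
    )
    has_prompt = bool(prompt and prompt.strip())
    return has_messages or has_prompt
```

The `isinstance` checks come first so that `None` entries, empty dicts, and `content=None` are rejected without raising `AttributeError`.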


@@ -1,13 +1,8 @@
import logging
import signal
import threading
import warnings
from contextlib import contextmanager
from enum import Enum
# Monkey patch Stagehands to prevent signal handling in worker threads
import stagehand.main
from stagehand import Stagehand
from stagehand import AsyncStagehand
from stagehand.types.session_act_params import Options as ActOptions
from backend.blocks.llm import (
MODEL_METADATA,
@@ -28,46 +23,6 @@ from backend.sdk import (
SchemaField,
)
# Suppress false positive cleanup warning of litellm (a dependency of stagehand)
warnings.filterwarnings("ignore", module="litellm.llms.custom_httpx")
# Store the original method
original_register_signal_handlers = stagehand.main.Stagehand._register_signal_handlers
def safe_register_signal_handlers(self):
"""Only register signal handlers in the main thread"""
if threading.current_thread() is threading.main_thread():
original_register_signal_handlers(self)
else:
# Skip signal handling in worker threads
pass
# Replace the method
stagehand.main.Stagehand._register_signal_handlers = safe_register_signal_handlers
@contextmanager
def disable_signal_handling():
"""Context manager to temporarily disable signal handling"""
if threading.current_thread() is not threading.main_thread():
# In worker threads, temporarily replace signal.signal with a no-op
original_signal = signal.signal
def noop_signal(*args, **kwargs):
pass
signal.signal = noop_signal
try:
yield
finally:
signal.signal = original_signal
else:
# In main thread, don't modify anything
yield
logger = logging.getLogger(__name__)
@@ -148,13 +103,10 @@ class StagehandObserveBlock(Block):
instruction: str = SchemaField(
description="Natural language description of elements or actions to discover.",
)
iframes: bool = SchemaField(
description="Whether to search within iframes. If True, Stagehand will search for actions within iframes.",
default=True,
)
domSettleTimeoutMs: int = SchemaField(
description="Timeout in milliseconds for DOM settlement.Wait longer for dynamic content",
default=45000,
dom_settle_timeout_ms: int = SchemaField(
description="Timeout in ms to wait for the DOM to settle after navigation.",
default=30000,
advanced=True,
)
class Output(BlockSchemaOutput):
@@ -185,32 +137,28 @@ class StagehandObserveBlock(Block):
logger.debug(f"OBSERVE: Using model provider {model_credentials.provider}")
with disable_signal_handling():
stagehand = Stagehand(
api_key=stagehand_credentials.api_key.get_secret_value(),
project_id=input_data.browserbase_project_id,
async with AsyncStagehand(
browserbase_api_key=stagehand_credentials.api_key.get_secret_value(),
browserbase_project_id=input_data.browserbase_project_id,
model_api_key=model_credentials.api_key.get_secret_value(),
) as client:
session = await client.sessions.start(
model_name=input_data.model.provider_name,
model_api_key=model_credentials.api_key.get_secret_value(),
dom_settle_timeout_ms=input_data.dom_settle_timeout_ms,
)
try:
await session.navigate(url=input_data.url)
await stagehand.init()
page = stagehand.page
assert page is not None, "Stagehand page is not initialized"
await page.goto(input_data.url)
observe_results = await page.observe(
input_data.instruction,
iframes=input_data.iframes,
domSettleTimeoutMs=input_data.domSettleTimeoutMs,
)
for result in observe_results:
yield "selector", result.selector
yield "description", result.description
yield "method", result.method
yield "arguments", result.arguments
observe_response = await session.observe(
instruction=input_data.instruction,
)
for result in observe_response.data.result:
yield "selector", result.selector
yield "description", result.description
yield "method", result.method
yield "arguments", result.arguments
finally:
await session.end()
class StagehandActBlock(Block):
@@ -242,24 +190,22 @@ class StagehandActBlock(Block):
description="Variables to use in the action. Variables contains data you want the action to use.",
default_factory=dict,
)
iframes: bool = SchemaField(
description="Whether to search within iframes. If True, Stagehand will search for actions within iframes.",
default=True,
dom_settle_timeout_ms: int = SchemaField(
description="Timeout in ms to wait for the DOM to settle after navigation.",
default=30000,
advanced=True,
)
domSettleTimeoutMs: int = SchemaField(
description="Timeout in milliseconds for DOM settlement.Wait longer for dynamic content",
default=45000,
)
timeoutMs: int = SchemaField(
description="Timeout in milliseconds for DOM ready. Extended timeout for slow-loading forms",
default=60000,
timeout_ms: int = SchemaField(
description="Timeout in ms for each action.",
default=30000,
advanced=True,
)
class Output(BlockSchemaOutput):
success: bool = SchemaField(
description="Whether the action was completed successfully"
)
message: str = SchemaField(description="Details about the actions execution.")
message: str = SchemaField(description="Details about the action's execution.")
action: str = SchemaField(description="Action performed")
def __init__(self):
@@ -282,32 +228,33 @@ class StagehandActBlock(Block):
logger.debug(f"ACT: Using model provider {model_credentials.provider}")
with disable_signal_handling():
stagehand = Stagehand(
api_key=stagehand_credentials.api_key.get_secret_value(),
project_id=input_data.browserbase_project_id,
async with AsyncStagehand(
browserbase_api_key=stagehand_credentials.api_key.get_secret_value(),
browserbase_project_id=input_data.browserbase_project_id,
model_api_key=model_credentials.api_key.get_secret_value(),
) as client:
session = await client.sessions.start(
model_name=input_data.model.provider_name,
model_api_key=model_credentials.api_key.get_secret_value(),
dom_settle_timeout_ms=input_data.dom_settle_timeout_ms,
)
try:
await session.navigate(url=input_data.url)
await stagehand.init()
page = stagehand.page
assert page is not None, "Stagehand page is not initialized"
await page.goto(input_data.url)
for action in input_data.action:
action_results = await page.act(
action,
variables=input_data.variables,
iframes=input_data.iframes,
domSettleTimeoutMs=input_data.domSettleTimeoutMs,
timeoutMs=input_data.timeoutMs,
)
yield "success", action_results.success
yield "message", action_results.message
yield "action", action_results.action
for action in input_data.action:
act_options = ActOptions(
variables={k: v for k, v in input_data.variables.items()},
timeout=input_data.timeout_ms,
)
act_response = await session.act(
input=action,
options=act_options,
)
result = act_response.data.result
yield "success", result.success
yield "message", result.message
yield "action", result.action_description
finally:
await session.end()
class StagehandExtractBlock(Block):
@@ -335,13 +282,10 @@ class StagehandExtractBlock(Block):
instruction: str = SchemaField(
description="Natural language description of elements or actions to discover.",
)
iframes: bool = SchemaField(
description="Whether to search within iframes. If True, Stagehand will search for actions within iframes.",
default=True,
)
domSettleTimeoutMs: int = SchemaField(
description="Timeout in milliseconds for DOM settlement.Wait longer for dynamic content",
default=45000,
dom_settle_timeout_ms: int = SchemaField(
description="Timeout in ms to wait for the DOM to settle after navigation.",
default=30000,
advanced=True,
)
class Output(BlockSchemaOutput):
@@ -367,24 +311,21 @@ class StagehandExtractBlock(Block):
logger.debug(f"EXTRACT: Using model provider {model_credentials.provider}")
with disable_signal_handling():
stagehand = Stagehand(
api_key=stagehand_credentials.api_key.get_secret_value(),
project_id=input_data.browserbase_project_id,
async with AsyncStagehand(
browserbase_api_key=stagehand_credentials.api_key.get_secret_value(),
browserbase_project_id=input_data.browserbase_project_id,
model_api_key=model_credentials.api_key.get_secret_value(),
) as client:
session = await client.sessions.start(
model_name=input_data.model.provider_name,
model_api_key=model_credentials.api_key.get_secret_value(),
dom_settle_timeout_ms=input_data.dom_settle_timeout_ms,
)
try:
await session.navigate(url=input_data.url)
await stagehand.init()
page = stagehand.page
assert page is not None, "Stagehand page is not initialized"
await page.goto(input_data.url)
extraction = await page.extract(
input_data.instruction,
iframes=input_data.iframes,
domSettleTimeoutMs=input_data.domSettleTimeoutMs,
)
yield "extraction", str(extraction.model_dump()["extraction"])
extract_response = await session.extract(
instruction=input_data.instruction,
)
yield "extraction", str(extract_response.data.result)
finally:
await session.end()
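All three Stagehand blocks now share one shape: open the client with `async with`, start a session, and end it in a `finally` clause. The sketch below shows why the `finally` matters, using hypothetical `FakeClient` and `FakeSession` classes rather than the real Stagehand SDK, whose API is not reproduced here.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a remote browser session."""

    def __init__(self) -> None:
        self.ended = False

    async def navigate(self, url: str) -> None:
        if not url.startswith("http"):
            raise ValueError(f"Cannot navigate to {url!r}")

    async def end(self) -> None:
        self.ended = True


class FakeClient:
    """Hypothetical async client that hands out sessions."""

    def __init__(self) -> None:
        self.session = None

    async def __aenter__(self) -> "FakeClient":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        return False  # never swallow exceptions

    async def start(self) -> FakeSession:
        self.session = FakeSession()
        return self.session


async def visit(client: FakeClient, url: str) -> None:
    async with client:
        session = await client.start()
        try:
            await session.navigate(url)
        finally:
            # Runs on success and on failure, so the session is not leaked.
            await session.end()
```

Calling `asyncio.run(visit(FakeClient(), "https://example.com"))` ends the session normally; a failing `navigate` still ends it before the exception propagates.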


@@ -4,6 +4,8 @@ import pytest
from backend.blocks import get_blocks
from backend.blocks._base import Block, BlockSchemaInput
from backend.blocks.io import AgentDropdownInputBlock, AgentInputBlock
from backend.data.graph import BaseGraph
from backend.data.model import SchemaField
from backend.util.test import execute_block_test
@@ -279,3 +281,113 @@ class TestAutoCredentialsFieldsValidation:
assert "Duplicate auto_credentials kwarg_name 'credentials'" in str(
exc_info.value
)
def test_agent_input_block_ignores_legacy_placeholder_values():
"""Verify AgentInputBlock.Input.model_construct tolerates extra placeholder_values
for backward compatibility with existing agent JSON."""
legacy_data = {
"name": "url",
"value": "",
"description": "Enter a URL",
"placeholder_values": ["https://example.com"],
}
instance = AgentInputBlock.Input.model_construct(**legacy_data)
schema = instance.generate_schema()
assert (
"enum" not in schema
), "AgentInputBlock should not produce enum from legacy placeholder_values"
def test_dropdown_input_block_produces_enum():
"""Verify AgentDropdownInputBlock.Input.generate_schema() produces enum
using the canonical 'options' field name."""
opts = ["Option A", "Option B"]
instance = AgentDropdownInputBlock.Input.model_construct(
name="choice", value=None, options=opts
)
schema = instance.generate_schema()
assert schema.get("enum") == opts
def test_dropdown_input_block_legacy_placeholder_values_produces_enum():
"""Verify backward compat: passing legacy 'placeholder_values' to
AgentDropdownInputBlock still produces enum via model_construct remap."""
opts = ["Option A", "Option B"]
instance = AgentDropdownInputBlock.Input.model_construct(
name="choice", value=None, placeholder_values=opts
)
schema = instance.generate_schema()
assert (
schema.get("enum") == opts
), "Legacy placeholder_values should be remapped to options"
def test_generate_schema_integration_legacy_placeholder_values():
"""Test the full Graph._generate_schema path with legacy placeholder_values
on AgentInputBlock — verifies no enum leaks through the graph loading path."""
legacy_input_default = {
"name": "url",
"value": "",
"description": "Enter a URL",
"placeholder_values": ["https://example.com"],
}
result = BaseGraph._generate_schema(
(AgentInputBlock.Input, legacy_input_default),
)
url_props = result["properties"]["url"]
assert (
"enum" not in url_props
), "Graph schema should not contain enum from AgentInputBlock placeholder_values"
def test_generate_schema_integration_dropdown_produces_enum():
"""Test the full Graph._generate_schema path with AgentDropdownInputBlock
— verifies enum IS produced for dropdown blocks using canonical field name."""
dropdown_input_default = {
"name": "color",
"value": None,
"options": ["Red", "Green", "Blue"],
}
result = BaseGraph._generate_schema(
(AgentDropdownInputBlock.Input, dropdown_input_default),
)
color_props = result["properties"]["color"]
assert color_props.get("enum") == [
"Red",
"Green",
"Blue",
], "Graph schema should contain enum from AgentDropdownInputBlock"
def test_generate_schema_integration_dropdown_legacy_placeholder_values():
"""Test the full Graph._generate_schema path with AgentDropdownInputBlock
using legacy 'placeholder_values' — verifies backward compat produces enum."""
legacy_dropdown_input_default = {
"name": "color",
"value": None,
"placeholder_values": ["Red", "Green", "Blue"],
}
result = BaseGraph._generate_schema(
(AgentDropdownInputBlock.Input, legacy_dropdown_input_default),
)
color_props = result["properties"]["color"]
assert color_props.get("enum") == [
"Red",
"Green",
"Blue",
], "Legacy placeholder_values should still produce enum via model_construct remap"
def test_dropdown_input_block_init_legacy_placeholder_values():
"""Verify backward compat: constructing AgentDropdownInputBlock.Input via
model_validate with legacy 'placeholder_values' correctly maps to 'options'."""
opts = ["Option A", "Option B"]
instance = AgentDropdownInputBlock.Input.model_validate(
{"name": "choice", "value": None, "placeholder_values": opts}
)
assert (
instance.options == opts
), "Legacy placeholder_values should be remapped to options via model_validate"
schema = instance.generate_schema()
assert schema.get("enum") == opts


@@ -207,6 +207,51 @@ class TestXMLParserBlockSecurity:
pass
class TestXMLParserBlockSyntaxErrors:
"""XML syntax errors should raise ValueError (not SyntaxError).
This ensures the base Block.execute() wraps them as BlockExecutionError
(expected / user-caused) instead of BlockUnknownError (unexpected / alerts
Sentry).
"""
async def test_unclosed_tag_raises_value_error(self):
"""Unclosed tags should raise ValueError, not SyntaxError."""
block = XMLParserBlock()
bad_xml = "<root><unclosed>"
with pytest.raises(ValueError, match="Unclosed tag"):
async for _ in block.run(XMLParserBlock.Input(input_xml=bad_xml)):
pass
async def test_unexpected_closing_tag_raises_value_error(self):
"""Extra closing tags should raise ValueError, not SyntaxError."""
block = XMLParserBlock()
bad_xml = "</unexpected>"
with pytest.raises(ValueError):
async for _ in block.run(XMLParserBlock.Input(input_xml=bad_xml)):
pass
async def test_empty_xml_raises_value_error(self):
"""Empty XML input should raise ValueError."""
block = XMLParserBlock()
with pytest.raises(ValueError, match="XML input is empty"):
async for _ in block.run(XMLParserBlock.Input(input_xml="")):
pass
async def test_syntax_error_from_parser_becomes_value_error(self):
"""SyntaxErrors from gravitasml library become ValueError (BlockExecutionError)."""
block = XMLParserBlock()
# Malformed XML that might trigger a SyntaxError from the parser
bad_xml = "<root><child>no closing"
with pytest.raises(ValueError):
async for _ in block.run(XMLParserBlock.Input(input_xml=bad_xml)):
pass
class TestStoreMediaFileSecurity:
"""Test file storage security limits."""
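The `TestXMLParserBlockSyntaxErrors` cases above expect parser `SyntaxError`s to surface as `ValueError`. The block under test is not shown in this diff, so the following is only a sketch of the conversion. It assumes the standard library parser as a stand-in for gravitasml, and `parse_xml` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def parse_xml(input_xml: str) -> ET.Element:
    """Parse XML, reporting malformed input as ValueError."""
    if not input_xml.strip():
        raise ValueError("XML input is empty")
    try:
        return ET.fromstring(input_xml)
    except SyntaxError as exc:
        # ET.ParseError subclasses SyntaxError, so this catches parser errors.
        raise ValueError(f"Invalid XML: {exc}") from exc
```

Chaining with `from exc` keeps the original parser message available in the traceback while changing the exception type that callers see.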


@@ -1,9 +1,18 @@
from typing import cast
from unittest.mock import AsyncMock, MagicMock, patch
import anthropic
import httpx
import openai
import pytest
import backend.blocks.llm as llm
from backend.data.model import NodeExecutionStats
# TEST_CREDENTIALS_INPUT is a plain dict that satisfies AICredentials at runtime
# but not at the type level. Cast once here to avoid per-test suppressors.
_TEST_AI_CREDENTIALS = cast(llm.AICredentials, llm.TEST_CREDENTIALS_INPUT)
class TestLLMStatsTracking:
"""Test that LLM blocks correctly track token usage statistics."""
@@ -479,6 +488,154 @@ class TestLLMStatsTracking:
assert outputs["response"] == {"result": "test"}
class TestAIConversationBlockValidation:
"""Test that AIConversationBlock validates inputs before calling the LLM."""
@pytest.mark.asyncio
async def test_empty_messages_and_empty_prompt_raises_error(self):
"""Empty messages with no prompt should raise ValueError, not a cryptic API error."""
block = llm.AIConversationBlock()
input_data = llm.AIConversationBlock.Input(
messages=[],
prompt="",
model=llm.DEFAULT_LLM_MODEL,
credentials=_TEST_AI_CREDENTIALS,
)
with pytest.raises(ValueError, match="no messages and no prompt"):
async for _ in block.run(input_data, credentials=llm.TEST_CREDENTIALS):
pass
@pytest.mark.asyncio
async def test_empty_messages_with_prompt_succeeds(self):
"""Empty messages but a non-empty prompt should proceed without error."""
block = llm.AIConversationBlock()
async def mock_llm_call(input_data, credentials):
return {"response": "OK"}
with patch.object(block, "llm_call", new=AsyncMock(side_effect=mock_llm_call)):
input_data = llm.AIConversationBlock.Input(
messages=[],
prompt="Hello, how are you?",
model=llm.DEFAULT_LLM_MODEL,
credentials=_TEST_AI_CREDENTIALS,
)
outputs = {}
async for name, data in block.run(
input_data, credentials=llm.TEST_CREDENTIALS
):
outputs[name] = data
assert outputs["response"] == "OK"
@pytest.mark.asyncio
async def test_nonempty_messages_with_empty_prompt_succeeds(self):
"""Non-empty messages with no prompt should proceed without error."""
block = llm.AIConversationBlock()
async def mock_llm_call(input_data, credentials):
return {"response": "response from conversation"}
with patch.object(block, "llm_call", new=AsyncMock(side_effect=mock_llm_call)):
input_data = llm.AIConversationBlock.Input(
messages=[{"role": "user", "content": "Hello"}],
prompt="",
model=llm.DEFAULT_LLM_MODEL,
credentials=_TEST_AI_CREDENTIALS,
)
outputs = {}
async for name, data in block.run(
input_data, credentials=llm.TEST_CREDENTIALS
):
outputs[name] = data
assert outputs["response"] == "response from conversation"
@pytest.mark.asyncio
async def test_messages_with_empty_content_raises_error(self):
"""Messages with empty content strings should be treated as no messages."""
block = llm.AIConversationBlock()
input_data = llm.AIConversationBlock.Input(
messages=[{"role": "user", "content": ""}],
prompt="",
model=llm.DEFAULT_LLM_MODEL,
credentials=_TEST_AI_CREDENTIALS,
)
with pytest.raises(ValueError, match="no messages and no prompt"):
async for _ in block.run(input_data, credentials=llm.TEST_CREDENTIALS):
pass
@pytest.mark.asyncio
async def test_messages_with_whitespace_content_raises_error(self):
"""Messages with whitespace-only content should be treated as no messages."""
block = llm.AIConversationBlock()
input_data = llm.AIConversationBlock.Input(
messages=[{"role": "user", "content": " "}],
prompt="",
model=llm.DEFAULT_LLM_MODEL,
credentials=_TEST_AI_CREDENTIALS,
)
with pytest.raises(ValueError, match="no messages and no prompt"):
async for _ in block.run(input_data, credentials=llm.TEST_CREDENTIALS):
pass
@pytest.mark.asyncio
async def test_messages_with_none_entry_raises_error(self):
"""Messages list containing None should be treated as no messages."""
block = llm.AIConversationBlock()
input_data = llm.AIConversationBlock.Input(
messages=[None],
prompt="",
model=llm.DEFAULT_LLM_MODEL,
credentials=_TEST_AI_CREDENTIALS,
)
with pytest.raises(ValueError, match="no messages and no prompt"):
async for _ in block.run(input_data, credentials=llm.TEST_CREDENTIALS):
pass
@pytest.mark.asyncio
async def test_messages_with_empty_dict_raises_error(self):
"""Messages list containing empty dict should be treated as no messages."""
block = llm.AIConversationBlock()
input_data = llm.AIConversationBlock.Input(
messages=[{}],
prompt="",
model=llm.DEFAULT_LLM_MODEL,
credentials=_TEST_AI_CREDENTIALS,
)
with pytest.raises(ValueError, match="no messages and no prompt"):
async for _ in block.run(input_data, credentials=llm.TEST_CREDENTIALS):
pass
@pytest.mark.asyncio
async def test_messages_with_none_content_raises_error(self):
"""Messages with content=None should not crash with AttributeError."""
block = llm.AIConversationBlock()
input_data = llm.AIConversationBlock.Input(
messages=[{"role": "user", "content": None}],
prompt="",
model=llm.DEFAULT_LLM_MODEL,
credentials=_TEST_AI_CREDENTIALS,
)
with pytest.raises(ValueError, match="no messages and no prompt"):
async for _ in block.run(input_data, credentials=llm.TEST_CREDENTIALS):
pass
class TestAITextSummarizerValidation:
"""Test that AITextSummarizerBlock validates LLM responses are strings."""
@@ -655,3 +812,178 @@ class TestAITextSummarizerValidation:
error_message = str(exc_info.value)
assert "Expected a string summary" in error_message
assert "received dict" in error_message
def _make_anthropic_status_error(status_code: int) -> anthropic.APIStatusError:
"""Create an anthropic.APIStatusError with the given status code."""
request = httpx.Request("POST", "https://api.anthropic.com/v1/messages")
response = httpx.Response(status_code, request=request)
return anthropic.APIStatusError(
f"Error code: {status_code}", response=response, body=None
)
def _make_openai_status_error(status_code: int) -> openai.APIStatusError:
"""Create an openai.APIStatusError with the given status code."""
response = httpx.Response(
status_code, request=httpx.Request("POST", "https://api.openai.com/v1/chat")
)
return openai.APIStatusError(
f"Error code: {status_code}", response=response, body=None
)
class TestUserErrorStatusCodeHandling:
"""Test that user-caused LLM API errors (401/403/429) break the retry loop
and are logged as warnings, while server errors (500) trigger retries."""
@pytest.mark.asyncio
@pytest.mark.parametrize("status_code", [401, 403, 429])
async def test_anthropic_user_error_breaks_retry_loop(self, status_code: int):
"""401/403/429 Anthropic errors should break immediately, not retry."""
import backend.blocks.llm as llm
block = llm.AIStructuredResponseGeneratorBlock()
call_count = 0
async def mock_llm_call(*args, **kwargs):
nonlocal call_count
call_count += 1
raise _make_anthropic_status_error(status_code)
with patch.object(block, "llm_call", new=AsyncMock(side_effect=mock_llm_call)):
input_data = llm.AIStructuredResponseGeneratorBlock.Input(
prompt="Test",
expected_format={"key": "desc"},
model=llm.DEFAULT_LLM_MODEL,
credentials=_TEST_AI_CREDENTIALS,
retry=3,
)
with pytest.raises(RuntimeError):
async for _ in block.run(input_data, credentials=llm.TEST_CREDENTIALS):
pass
assert (
call_count == 1
), f"Expected exactly 1 call for status {status_code}, got {call_count}"
@pytest.mark.asyncio
@pytest.mark.parametrize("status_code", [401, 403, 429])
async def test_openai_user_error_breaks_retry_loop(self, status_code: int):
"""401/403/429 OpenAI errors should break immediately, not retry."""
import backend.blocks.llm as llm
block = llm.AIStructuredResponseGeneratorBlock()
call_count = 0
async def mock_llm_call(*args, **kwargs):
nonlocal call_count
call_count += 1
raise _make_openai_status_error(status_code)
with patch.object(block, "llm_call", new=AsyncMock(side_effect=mock_llm_call)):
input_data = llm.AIStructuredResponseGeneratorBlock.Input(
prompt="Test",
expected_format={"key": "desc"},
model=llm.DEFAULT_LLM_MODEL,
credentials=_TEST_AI_CREDENTIALS,
retry=3,
)
with pytest.raises(RuntimeError):
async for _ in block.run(input_data, credentials=llm.TEST_CREDENTIALS):
pass
assert (
call_count == 1
), f"Expected exactly 1 call for status {status_code}, got {call_count}"
@pytest.mark.asyncio
async def test_server_error_retries(self):
"""500 errors should be retried (not break immediately)."""
import backend.blocks.llm as llm
block = llm.AIStructuredResponseGeneratorBlock()
call_count = 0
async def mock_llm_call(*args, **kwargs):
nonlocal call_count
call_count += 1
raise _make_anthropic_status_error(500)
with patch.object(block, "llm_call", new=AsyncMock(side_effect=mock_llm_call)):
input_data = llm.AIStructuredResponseGeneratorBlock.Input(
prompt="Test",
expected_format={"key": "desc"},
model=llm.DEFAULT_LLM_MODEL,
credentials=_TEST_AI_CREDENTIALS,
retry=3,
)
with pytest.raises(RuntimeError):
async for _ in block.run(input_data, credentials=llm.TEST_CREDENTIALS):
pass
assert (
call_count > 1
), f"Expected multiple retry attempts for 500, got {call_count}"

    @pytest.mark.asyncio
    async def test_user_error_logs_warning_not_exception(self):
        """User-caused errors should log with logger.warning, not logger.exception."""
        import backend.blocks.llm as llm

        block = llm.AIStructuredResponseGeneratorBlock()

        async def mock_llm_call(*args, **kwargs):
            raise _make_anthropic_status_error(401)

        with patch.object(block, "llm_call", new=AsyncMock(side_effect=mock_llm_call)):
            input_data = llm.AIStructuredResponseGeneratorBlock.Input(
                prompt="Test",
                expected_format={"key": "desc"},
                model=llm.DEFAULT_LLM_MODEL,
                credentials=_TEST_AI_CREDENTIALS,
            )

            with (
                patch.object(llm.logger, "warning") as mock_warning,
                patch.object(llm.logger, "exception") as mock_exception,
                pytest.raises(RuntimeError),
            ):
                async for _ in block.run(input_data, credentials=llm.TEST_CREDENTIALS):
                    pass

            mock_warning.assert_called_once()
            mock_exception.assert_not_called()
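
The retry tests above pin down three behaviours: 401/403/429 stop after one call, a 500 is retried, and user errors are logged with `warning` rather than `exception`. A minimal sketch of a loop with that shape follows. It is synchronous for brevity, and `StatusError`, `call_with_retry`, and `make_failing_call` are hypothetical names, not the block's actual implementation, which this diff does not show.

```python
import logging

logger = logging.getLogger("retry_sketch")

# Status codes the tests treat as non-retryable (taken from the test docstrings).
USER_ERROR_STATUS_CODES = frozenset({401, 403, 429})


class StatusError(Exception):
    """Hypothetical stand-in for a provider SDK error carrying an HTTP status."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def call_with_retry(call, retries: int = 3):
    """Call `call()` up to `retries` times, giving up early on user errors."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return call()
        except StatusError as exc:
            last_error = exc
            if exc.status_code in USER_ERROR_STATUS_CODES:
                # Non-retryable: log without a traceback and stop looping.
                logger.warning("User error on attempt %d: %s", attempt, exc)
                break
            # Anything else (for example a 500) may be transient, so retry.
            logger.exception("Attempt %d failed: %s", attempt, exc)
    raise RuntimeError(f"LLM call failed: {last_error}") from last_error


def make_failing_call(status_code: int, calls: list):
    """Build a callable that records each invocation and always fails."""

    def failing_call():
        calls.append(status_code)
        raise StatusError(status_code)

    return failing_call
```

With this sketch, a callable that always raises a 401 is invoked once, while one that always raises a 500 is invoked `retries` times before the final `RuntimeError`.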


class TestLlmModelMissing:
    """Test that LlmModel handles provider-prefixed model names."""

    def test_provider_prefixed_model_resolves(self):
        """Provider-prefixed model string should resolve to the correct enum member."""
        assert (
            llm.LlmModel("anthropic/claude-sonnet-4-6")
            == llm.LlmModel.CLAUDE_4_6_SONNET
        )

    def test_bare_model_still_works(self):
        """Bare (non-prefixed) model string should still resolve correctly."""
        assert llm.LlmModel("claude-sonnet-4-6") == llm.LlmModel.CLAUDE_4_6_SONNET

    def test_invalid_prefixed_model_raises(self):
        """Unknown provider-prefixed model string should raise ValueError."""
        with pytest.raises(ValueError):
            llm.LlmModel("invalid/nonexistent-model")

    def test_slash_containing_value_direct_lookup(self):
        """Enum values with '/' (e.g., OpenRouter models) should resolve via direct lookup, not _missing_."""
        assert llm.LlmModel("google/gemini-2.5-pro") == llm.LlmModel.GEMINI_2_5_PRO

    def test_double_prefixed_slash_model(self):
        """Double-prefixed value should still resolve by stripping first prefix."""
        assert (
            llm.LlmModel("extra/google/gemini-2.5-pro") == llm.LlmModel.GEMINI_2_5_PRO
        )
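
For readers unfamiliar with the `_missing_` hook these tests exercise, here is a small sketch of how an enum could resolve provider-prefixed values. `DemoModel` and its members are illustrative assumptions; the real `LlmModel._missing_` is not shown in this diff and may differ.

```python
from enum import Enum


class DemoModel(str, Enum):
    """Hypothetical stand-in for LlmModel; member names and values are illustrative."""

    CLAUDE_SONNET = "claude-sonnet-4-6"
    GEMINI_PRO = "google/gemini-2.5-pro"

    @classmethod
    def _missing_(cls, value):
        # Enum calls this hook only after the direct value lookup has failed.
        if isinstance(value, str) and "/" in value:
            # Drop the first "provider/" segment and retry the lookup.
            _, _, remainder = value.partition("/")
            try:
                return cls(remainder)
            except ValueError:
                return None
        # Returning None makes Enum raise ValueError for the original value.
        return None
```

Because the hook runs only on a failed lookup, a value such as `"google/gemini-2.5-pro"` that is itself a member value resolves directly, and `"extra/google/gemini-2.5-pro"` resolves after one prefix is stripped.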


@@ -0,0 +1,87 @@
"""Tests for empty-choices guard in extract_openai_tool_calls() and extract_openai_reasoning()."""

from unittest.mock import MagicMock

from backend.blocks.llm import extract_openai_reasoning, extract_openai_tool_calls


class TestExtractOpenaiToolCallsEmptyChoices:
    """extract_openai_tool_calls() must return None when choices is empty."""

    def test_returns_none_for_empty_choices(self):
        response = MagicMock()
        response.choices = []
        assert extract_openai_tool_calls(response) is None

    def test_returns_none_for_none_choices(self):
        response = MagicMock()
        response.choices = None
        assert extract_openai_tool_calls(response) is None

    def test_returns_tool_calls_when_choices_present(self):
        tool = MagicMock()
        tool.id = "call_1"
        tool.type = "function"
        tool.function.name = "my_func"
        tool.function.arguments = '{"a": 1}'

        message = MagicMock()
        message.tool_calls = [tool]

        choice = MagicMock()
        choice.message = message

        response = MagicMock()
        response.choices = [choice]

        result = extract_openai_tool_calls(response)
        assert result is not None
        assert len(result) == 1
        assert result[0].function.name == "my_func"

    def test_returns_none_when_no_tool_calls(self):
        message = MagicMock()
        message.tool_calls = None

        choice = MagicMock()
        choice.message = message

        response = MagicMock()
        response.choices = [choice]

        assert extract_openai_tool_calls(response) is None


class TestExtractOpenaiReasoningEmptyChoices:
    """extract_openai_reasoning() must return None when choices is empty."""

    def test_returns_none_for_empty_choices(self):
        response = MagicMock()
        response.choices = []
        assert extract_openai_reasoning(response) is None

    def test_returns_none_for_none_choices(self):
        response = MagicMock()
        response.choices = None
        assert extract_openai_reasoning(response) is None

    def test_returns_reasoning_from_choice(self):
        choice = MagicMock()
        choice.reasoning = "Step-by-step reasoning"
        choice.message = MagicMock(spec=[])  # no 'reasoning' attr on message
        response = MagicMock(spec=[])  # no 'reasoning' attr on response
        response.choices = [choice]

        result = extract_openai_reasoning(response)
        assert result == "Step-by-step reasoning"

    def test_returns_none_when_no_reasoning(self):
        choice = MagicMock(spec=[])  # no 'reasoning' attr
        choice.message = MagicMock(spec=[])  # no 'reasoning' attr
        response = MagicMock(spec=[])  # no 'reasoning' attr
        response.choices = [choice]

        result = extract_openai_reasoning(response)
        assert result is None
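
The guard these tests describe can be pictured with a short sketch. `extract_tool_calls_sketch` is a hypothetical function written for illustration; it mirrors the tested contract (return `None` for empty or missing `choices`, and when the first message has no tool calls) but is not the body of `extract_openai_tool_calls()`.

```python
from types import SimpleNamespace


def extract_tool_calls_sketch(response):
    """Return the first choice's tool calls, or None when there is nothing to read."""
    choices = getattr(response, "choices", None)
    if not choices:
        # Covers both `choices == []` and `choices is None`.
        return None
    tool_calls = getattr(choices[0].message, "tool_calls", None)
    # Normalise "no tool calls" (None or an empty list) to None.
    return tool_calls or None


# Minimal response-like objects for trying the function out.
empty_response = SimpleNamespace(choices=[])
tool_call = SimpleNamespace(id="call_1", type="function")
full_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(tool_calls=[tool_call]))]
)
```

Checking `if not choices` before indexing is what prevents an `IndexError` (or a `TypeError` on `None`) when a provider returns a response without choices.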


@@ -57,7 +57,7 @@ async def execute_graph(
@pytest.mark.asyncio(loop_scope="session")
async def test_graph_validation_with_tool_nodes_correct(server: SpinTestServer):
from backend.blocks.agent import AgentExecutorBlock
from backend.blocks.smart_decision_maker import SmartDecisionMakerBlock
from backend.blocks.orchestrator import OrchestratorBlock
from backend.data import graph
test_user = await create_test_user()
@@ -66,7 +66,7 @@ async def test_graph_validation_with_tool_nodes_correct(server: SpinTestServer):
nodes = [
graph.Node(
block_id=SmartDecisionMakerBlock().id,
block_id=OrchestratorBlock().id,
input_default={
"prompt": "Hello, World!",
"credentials": creds,
@@ -108,10 +108,10 @@ async def test_graph_validation_with_tool_nodes_correct(server: SpinTestServer):
@pytest.mark.asyncio(loop_scope="session")
async def test_smart_decision_maker_function_signature(server: SpinTestServer):
async def test_orchestrator_function_signature(server: SpinTestServer):
from backend.blocks.agent import AgentExecutorBlock
from backend.blocks.basic import StoreValueBlock
from backend.blocks.smart_decision_maker import SmartDecisionMakerBlock
from backend.blocks.orchestrator import OrchestratorBlock
from backend.data import graph
test_user = await create_test_user()
@@ -120,7 +120,7 @@ async def test_smart_decision_maker_function_signature(server: SpinTestServer):
nodes = [
graph.Node(
block_id=SmartDecisionMakerBlock().id,
block_id=OrchestratorBlock().id,
input_default={
"prompt": "Hello, World!",
"credentials": creds,
@@ -169,7 +169,7 @@ async def test_smart_decision_maker_function_signature(server: SpinTestServer):
)
test_graph = await create_graph(server, test_graph, test_user)
tool_functions = await SmartDecisionMakerBlock._create_tool_node_signatures(
tool_functions = await OrchestratorBlock._create_tool_node_signatures(
test_graph.nodes[0].id
)
assert tool_functions is not None, "Tool functions should not be None"
@@ -198,12 +198,12 @@ async def test_smart_decision_maker_function_signature(server: SpinTestServer):
@pytest.mark.asyncio
async def test_smart_decision_maker_tracks_llm_stats():
"""Test that SmartDecisionMakerBlock correctly tracks LLM usage stats."""
async def test_orchestrator_tracks_llm_stats():
"""Test that OrchestratorBlock correctly tracks LLM usage stats."""
import backend.blocks.llm as llm_module
from backend.blocks.smart_decision_maker import SmartDecisionMakerBlock
from backend.blocks.orchestrator import OrchestratorBlock
block = SmartDecisionMakerBlock()
block = OrchestratorBlock()
# Mock the llm.llm_call function to return controlled data
mock_response = MagicMock()
@@ -224,14 +224,14 @@ async def test_smart_decision_maker_tracks_llm_stats():
new_callable=AsyncMock,
return_value=mock_response,
), patch.object(
SmartDecisionMakerBlock,
OrchestratorBlock,
"_create_tool_node_signatures",
new_callable=AsyncMock,
return_value=[],
):
# Create test input
input_data = SmartDecisionMakerBlock.Input(
input_data = OrchestratorBlock.Input(
prompt="Should I continue with this task?",
model=llm_module.DEFAULT_LLM_MODEL,
credentials=llm_module.TEST_CREDENTIALS_INPUT, # type: ignore
@@ -274,12 +274,12 @@ async def test_smart_decision_maker_tracks_llm_stats():
@pytest.mark.asyncio
async def test_smart_decision_maker_parameter_validation():
"""Test that SmartDecisionMakerBlock correctly validates tool call parameters."""
async def test_orchestrator_parameter_validation():
"""Test that OrchestratorBlock correctly validates tool call parameters."""
import backend.blocks.llm as llm_module
from backend.blocks.smart_decision_maker import SmartDecisionMakerBlock
from backend.blocks.orchestrator import OrchestratorBlock
block = SmartDecisionMakerBlock()
block = OrchestratorBlock()
# Mock tool functions with specific parameter schema
mock_tool_functions = [
@@ -327,13 +327,13 @@ async def test_smart_decision_maker_parameter_validation():
new_callable=AsyncMock,
return_value=mock_response_with_typo,
) as mock_llm_call, patch.object(
SmartDecisionMakerBlock,
OrchestratorBlock,
"_create_tool_node_signatures",
new_callable=AsyncMock,
return_value=mock_tool_functions,
):
input_data = SmartDecisionMakerBlock.Input(
input_data = OrchestratorBlock.Input(
prompt="Search for keywords",
model=llm_module.DEFAULT_LLM_MODEL,
credentials=llm_module.TEST_CREDENTIALS_INPUT, # type: ignore
@@ -394,13 +394,13 @@ async def test_smart_decision_maker_parameter_validation():
new_callable=AsyncMock,
return_value=mock_response_missing_required,
), patch.object(
SmartDecisionMakerBlock,
OrchestratorBlock,
"_create_tool_node_signatures",
new_callable=AsyncMock,
return_value=mock_tool_functions,
):
input_data = SmartDecisionMakerBlock.Input(
input_data = OrchestratorBlock.Input(
prompt="Search for keywords",
model=llm_module.DEFAULT_LLM_MODEL,
credentials=llm_module.TEST_CREDENTIALS_INPUT, # type: ignore
@@ -454,13 +454,13 @@ async def test_smart_decision_maker_parameter_validation():
new_callable=AsyncMock,
return_value=mock_response_valid,
), patch.object(
SmartDecisionMakerBlock,
OrchestratorBlock,
"_create_tool_node_signatures",
new_callable=AsyncMock,
return_value=mock_tool_functions,
):
input_data = SmartDecisionMakerBlock.Input(
input_data = OrchestratorBlock.Input(
prompt="Search for keywords",
model=llm_module.DEFAULT_LLM_MODEL,
credentials=llm_module.TEST_CREDENTIALS_INPUT, # type: ignore
@@ -518,13 +518,13 @@ async def test_smart_decision_maker_parameter_validation():
new_callable=AsyncMock,
return_value=mock_response_all_params,
), patch.object(
SmartDecisionMakerBlock,
OrchestratorBlock,
"_create_tool_node_signatures",
new_callable=AsyncMock,
return_value=mock_tool_functions,
):
input_data = SmartDecisionMakerBlock.Input(
input_data = OrchestratorBlock.Input(
prompt="Search for keywords",
model=llm_module.DEFAULT_LLM_MODEL,
credentials=llm_module.TEST_CREDENTIALS_INPUT, # type: ignore
@@ -562,12 +562,12 @@ async def test_smart_decision_maker_parameter_validation():
@pytest.mark.asyncio
async def test_smart_decision_maker_raw_response_conversion():
"""Test that SmartDecisionMaker correctly handles different raw_response types with retry mechanism."""
async def test_orchestrator_raw_response_conversion():
"""Test that Orchestrator correctly handles different raw_response types with retry mechanism."""
import backend.blocks.llm as llm_module
from backend.blocks.smart_decision_maker import SmartDecisionMakerBlock
from backend.blocks.orchestrator import OrchestratorBlock
block = SmartDecisionMakerBlock()
block = OrchestratorBlock()
# Mock tool functions
mock_tool_functions = [
@@ -637,7 +637,7 @@ async def test_smart_decision_maker_raw_response_conversion():
with patch(
"backend.blocks.llm.llm_call", new_callable=AsyncMock
) as mock_llm_call, patch.object(
SmartDecisionMakerBlock,
OrchestratorBlock,
"_create_tool_node_signatures",
new_callable=AsyncMock,
return_value=mock_tool_functions,
@@ -646,7 +646,7 @@ async def test_smart_decision_maker_raw_response_conversion():
# Second call returns successful response
mock_llm_call.side_effect = [mock_response_retry, mock_response_success]
input_data = SmartDecisionMakerBlock.Input(
input_data = OrchestratorBlock.Input(
prompt="Test prompt",
model=llm_module.DEFAULT_LLM_MODEL,
credentials=llm_module.TEST_CREDENTIALS_INPUT, # type: ignore
@@ -715,12 +715,12 @@ async def test_smart_decision_maker_raw_response_conversion():
new_callable=AsyncMock,
return_value=mock_response_ollama,
), patch.object(
SmartDecisionMakerBlock,
OrchestratorBlock,
"_create_tool_node_signatures",
new_callable=AsyncMock,
return_value=[], # No tools for this test
):
input_data = SmartDecisionMakerBlock.Input(
input_data = OrchestratorBlock.Input(
prompt="Simple prompt",
model=llm_module.DEFAULT_LLM_MODEL,
credentials=llm_module.TEST_CREDENTIALS_INPUT, # type: ignore
@@ -771,12 +771,12 @@ async def test_smart_decision_maker_raw_response_conversion():
new_callable=AsyncMock,
return_value=mock_response_dict,
), patch.object(
SmartDecisionMakerBlock,
OrchestratorBlock,
"_create_tool_node_signatures",
new_callable=AsyncMock,
return_value=[],
):
input_data = SmartDecisionMakerBlock.Input(
input_data = OrchestratorBlock.Input(
prompt="Another test",
model=llm_module.DEFAULT_LLM_MODEL,
credentials=llm_module.TEST_CREDENTIALS_INPUT, # type: ignore
@@ -811,12 +811,12 @@ async def test_smart_decision_maker_raw_response_conversion():
@pytest.mark.asyncio
async def test_smart_decision_maker_agent_mode():
async def test_orchestrator_agent_mode():
"""Test that agent mode executes tools directly and loops until finished."""
import backend.blocks.llm as llm_module
from backend.blocks.smart_decision_maker import SmartDecisionMakerBlock
from backend.blocks.orchestrator import OrchestratorBlock
block = SmartDecisionMakerBlock()
block = OrchestratorBlock()
# Mock tool call that requires multiple iterations
mock_tool_call_1 = MagicMock()
@@ -893,7 +893,7 @@ async def test_smart_decision_maker_agent_mode():
with patch("backend.blocks.llm.llm_call", llm_call_mock), patch.object(
block, "_create_tool_node_signatures", return_value=mock_tool_signatures
), patch(
"backend.blocks.smart_decision_maker.get_database_manager_async_client",
"backend.blocks.orchestrator.get_database_manager_async_client",
return_value=mock_db_client,
), patch(
"backend.executor.manager.async_update_node_execution_status",
@@ -929,7 +929,7 @@ async def test_smart_decision_maker_agent_mode():
}
# Test agent mode with max_iterations = 3
input_data = SmartDecisionMakerBlock.Input(
input_data = OrchestratorBlock.Input(
prompt="Complete this task using tools",
model=llm_module.DEFAULT_LLM_MODEL,
credentials=llm_module.TEST_CREDENTIALS_INPUT, # type: ignore
@@ -969,12 +969,12 @@ async def test_smart_decision_maker_agent_mode():
@pytest.mark.asyncio
async def test_smart_decision_maker_traditional_mode_default():
async def test_orchestrator_traditional_mode_default():
"""Test that default behavior (agent_mode_max_iterations=0) works as traditional mode."""
import backend.blocks.llm as llm_module
from backend.blocks.smart_decision_maker import SmartDecisionMakerBlock
from backend.blocks.orchestrator import OrchestratorBlock
block = SmartDecisionMakerBlock()
block = OrchestratorBlock()
# Mock tool call
mock_tool_call = MagicMock()
@@ -1018,7 +1018,7 @@ async def test_smart_decision_maker_traditional_mode_default():
):
# Test default behavior (traditional mode)
input_data = SmartDecisionMakerBlock.Input(
input_data = OrchestratorBlock.Input(
prompt="Test prompt",
model=llm_module.DEFAULT_LLM_MODEL,
credentials=llm_module.TEST_CREDENTIALS_INPUT, # type: ignore
@@ -1060,12 +1060,12 @@ async def test_smart_decision_maker_traditional_mode_default():
@pytest.mark.asyncio
async def test_smart_decision_maker_uses_customized_name_for_blocks():
"""Test that SmartDecisionMakerBlock uses customized_name from node metadata for tool names."""
async def test_orchestrator_uses_customized_name_for_blocks():
"""Test that OrchestratorBlock uses customized_name from node metadata for tool names."""
from unittest.mock import MagicMock
from backend.blocks.basic import StoreValueBlock
from backend.blocks.smart_decision_maker import SmartDecisionMakerBlock
from backend.blocks.orchestrator import OrchestratorBlock
from backend.data.graph import Link, Node
# Create a mock node with customized_name in metadata
@@ -1074,13 +1074,14 @@ async def test_smart_decision_maker_uses_customized_name_for_blocks():
mock_node.block_id = StoreValueBlock().id
mock_node.metadata = {"customized_name": "My Custom Tool Name"}
mock_node.block = StoreValueBlock()
mock_node.input_default = {}
# Create a mock link
mock_link = MagicMock(spec=Link)
mock_link.sink_name = "input"
# Call the function directly
result = await SmartDecisionMakerBlock._create_block_function_signature(
result = await OrchestratorBlock._create_block_function_signature(
mock_node, [mock_link]
)
@@ -1091,12 +1092,12 @@ async def test_smart_decision_maker_uses_customized_name_for_blocks():
@pytest.mark.asyncio
async def test_smart_decision_maker_falls_back_to_block_name():
"""Test that SmartDecisionMakerBlock falls back to block.name when no customized_name."""
async def test_orchestrator_falls_back_to_block_name():
"""Test that OrchestratorBlock falls back to block.name when no customized_name."""
from unittest.mock import MagicMock
from backend.blocks.basic import StoreValueBlock
from backend.blocks.smart_decision_maker import SmartDecisionMakerBlock
from backend.blocks.orchestrator import OrchestratorBlock
from backend.data.graph import Link, Node
# Create a mock node without customized_name
@@ -1105,13 +1106,14 @@ async def test_smart_decision_maker_falls_back_to_block_name():
mock_node.block_id = StoreValueBlock().id
mock_node.metadata = {} # No customized_name
mock_node.block = StoreValueBlock()
mock_node.input_default = {}
# Create a mock link
mock_link = MagicMock(spec=Link)
mock_link.sink_name = "input"
# Call the function directly
result = await SmartDecisionMakerBlock._create_block_function_signature(
result = await OrchestratorBlock._create_block_function_signature(
mock_node, [mock_link]
)
@@ -1122,11 +1124,11 @@ async def test_smart_decision_maker_falls_back_to_block_name():
@pytest.mark.asyncio
async def test_smart_decision_maker_uses_customized_name_for_agents():
"""Test that SmartDecisionMakerBlock uses customized_name from metadata for agent nodes."""
async def test_orchestrator_uses_customized_name_for_agents():
"""Test that OrchestratorBlock uses customized_name from metadata for agent nodes."""
from unittest.mock import AsyncMock, MagicMock, patch
from backend.blocks.smart_decision_maker import SmartDecisionMakerBlock
from backend.blocks.orchestrator import OrchestratorBlock
from backend.data.graph import Link, Node
# Create a mock node with customized_name in metadata
@@ -1152,10 +1154,10 @@ async def test_smart_decision_maker_uses_customized_name_for_agents():
mock_db_client.get_graph_metadata.return_value = mock_graph_meta
with patch(
"backend.blocks.smart_decision_maker.get_database_manager_async_client",
"backend.blocks.orchestrator.get_database_manager_async_client",
return_value=mock_db_client,
):
result = await SmartDecisionMakerBlock._create_agent_function_signature(
result = await OrchestratorBlock._create_agent_function_signature(
mock_node, [mock_link]
)
@@ -1166,11 +1168,11 @@ async def test_smart_decision_maker_uses_customized_name_for_agents():
@pytest.mark.asyncio
async def test_smart_decision_maker_agent_falls_back_to_graph_name():
async def test_orchestrator_agent_falls_back_to_graph_name():
"""Test that agent node falls back to graph name when no customized_name."""
from unittest.mock import AsyncMock, MagicMock, patch
from backend.blocks.smart_decision_maker import SmartDecisionMakerBlock
from backend.blocks.orchestrator import OrchestratorBlock
from backend.data.graph import Link, Node
# Create a mock node without customized_name
@@ -1196,10 +1198,10 @@ async def test_smart_decision_maker_agent_falls_back_to_graph_name():
mock_db_client.get_graph_metadata.return_value = mock_graph_meta
with patch(
"backend.blocks.smart_decision_maker.get_database_manager_async_client",
"backend.blocks.orchestrator.get_database_manager_async_client",
return_value=mock_db_client,
):
result = await SmartDecisionMakerBlock._create_agent_function_signature(
result = await OrchestratorBlock._create_agent_function_signature(
mock_node, [mock_link]
)
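
The four naming tests in this file describe one fallback rule: prefer `customized_name` from the node metadata, otherwise use the block (or graph) name. A minimal sketch of that rule, assuming a hypothetical helper `resolve_tool_name` and ignoring any further name clean-up the real signature builders may apply:

```python
from types import SimpleNamespace


def resolve_tool_name(node, default_name: str) -> str:
    """Prefer a user-supplied customized_name; otherwise use the default name."""
    metadata = getattr(node, "metadata", None) or {}
    custom = metadata.get("customized_name")
    return custom if custom else default_name


# Hypothetical nodes mirroring the two metadata shapes used in the tests.
named_node = SimpleNamespace(metadata={"customized_name": "My Custom Tool Name"})
plain_node = SimpleNamespace(metadata={})
```

For a block node the default would be the block's name, and for an agent node the graph's name, matching the two fallback tests above.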


@@ -3,12 +3,12 @@ from unittest.mock import Mock
import pytest
from backend.blocks.data_manipulation import AddToListBlock, CreateDictionaryBlock
from backend.blocks.smart_decision_maker import SmartDecisionMakerBlock
from backend.blocks.orchestrator import OrchestratorBlock
@pytest.mark.asyncio
async def test_smart_decision_maker_handles_dynamic_dict_fields():
"""Test Smart Decision Maker can handle dynamic dictionary fields (_#_) for any block"""
async def test_orchestrator_handles_dynamic_dict_fields():
"""Test Orchestrator can handle dynamic dictionary fields (_#_) for any block"""
# Create a mock node for CreateDictionaryBlock
mock_node = Mock()
@@ -23,24 +23,24 @@ async def test_smart_decision_maker_handles_dynamic_dict_fields():
source_name="tools_^_create_dict_~_name",
sink_name="values_#_name", # Dynamic dict field
sink_id="dict_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
Mock(
source_name="tools_^_create_dict_~_age",
sink_name="values_#_age", # Dynamic dict field
sink_id="dict_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
Mock(
source_name="tools_^_create_dict_~_city",
sink_name="values_#_city", # Dynamic dict field
sink_id="dict_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
]
# Generate function signature
signature = await SmartDecisionMakerBlock._create_block_function_signature(
signature = await OrchestratorBlock._create_block_function_signature(
mock_node, mock_links # type: ignore
)
@@ -70,8 +70,8 @@ async def test_smart_decision_maker_handles_dynamic_dict_fields():
@pytest.mark.asyncio
async def test_smart_decision_maker_handles_dynamic_list_fields():
"""Test Smart Decision Maker can handle dynamic list fields (_$_) for any block"""
async def test_orchestrator_handles_dynamic_list_fields():
"""Test Orchestrator can handle dynamic list fields (_$_) for any block"""
# Create a mock node for AddToListBlock
mock_node = Mock()
@@ -86,18 +86,18 @@ async def test_smart_decision_maker_handles_dynamic_list_fields():
source_name="tools_^_add_to_list_~_0",
sink_name="entries_$_0", # Dynamic list field
sink_id="list_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
Mock(
source_name="tools_^_add_to_list_~_1",
sink_name="entries_$_1", # Dynamic list field
sink_id="list_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
]
# Generate function signature
signature = await SmartDecisionMakerBlock._create_block_function_signature(
signature = await OrchestratorBlock._create_block_function_signature(
mock_node, mock_links # type: ignore
)
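
The sink names in these tests use three delimiters, annotated in the comments as dict (`_#_`), list (`_$_`), and object (`_@_`) dynamic fields. The sketch below shows one way such a name could be split into its base field and key. `split_dynamic_field` is a hypothetical helper for illustration only and is not taken from `backend.data.dynamic_fields`.

```python
# Delimiter-to-kind mapping, as annotated in the test comments.
DYNAMIC_DELIMITERS = {"_#_": "dict", "_$_": "list", "_@_": "object"}


def split_dynamic_field(sink_name: str):
    """Split a sink name into (base_field, kind, key); kind is None for regular fields."""
    for delimiter, kind in DYNAMIC_DELIMITERS.items():
        base, found, key = sink_name.partition(delimiter)
        if found:
            return base, kind, key
    return sink_name, None, None
```

Under this sketch, `"values_#_name"` splits into the base field `values`, kind `dict`, and key `name`, while a regular field such as `"regular_field"` is returned unchanged with no kind.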


@@ -1,4 +1,4 @@
"""Comprehensive tests for SmartDecisionMakerBlock dynamic field handling."""
"""Comprehensive tests for OrchestratorBlock dynamic field handling."""
import json
from unittest.mock import AsyncMock, MagicMock, Mock, patch
@@ -6,7 +6,7 @@ from unittest.mock import AsyncMock, MagicMock, Mock, patch
import pytest
from backend.blocks.data_manipulation import AddToListBlock, CreateDictionaryBlock
from backend.blocks.smart_decision_maker import SmartDecisionMakerBlock
from backend.blocks.orchestrator import OrchestratorBlock
from backend.blocks.text import MatchTextPatternBlock
from backend.data.dynamic_fields import get_dynamic_field_description
@@ -37,7 +37,7 @@ async def test_dynamic_field_description_generation():
@pytest.mark.asyncio
async def test_create_block_function_signature_with_dict_fields():
"""Test that function signatures are created correctly for dictionary dynamic fields."""
block = SmartDecisionMakerBlock()
block = OrchestratorBlock()
# Create a mock node for CreateDictionaryBlock
mock_node = Mock()
@@ -52,19 +52,19 @@ async def test_create_block_function_signature_with_dict_fields():
source_name="tools_^_create_dict_~_values___name", # Sanitized source
sink_name="values_#_name", # Original sink
sink_id="dict_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
Mock(
source_name="tools_^_create_dict_~_values___age", # Sanitized source
sink_name="values_#_age", # Original sink
sink_id="dict_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
Mock(
source_name="tools_^_create_dict_~_values___email", # Sanitized source
sink_name="values_#_email", # Original sink
sink_id="dict_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
]
@@ -100,7 +100,7 @@ async def test_create_block_function_signature_with_dict_fields():
@pytest.mark.asyncio
async def test_create_block_function_signature_with_list_fields():
"""Test that function signatures are created correctly for list dynamic fields."""
block = SmartDecisionMakerBlock()
block = OrchestratorBlock()
# Create a mock node for AddToListBlock
mock_node = Mock()
@@ -115,19 +115,19 @@ async def test_create_block_function_signature_with_list_fields():
source_name="tools_^_add_list_~_0",
sink_name="entries_$_0", # Dynamic list field
sink_id="list_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
Mock(
source_name="tools_^_add_list_~_1",
sink_name="entries_$_1", # Dynamic list field
sink_id="list_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
Mock(
source_name="tools_^_add_list_~_2",
sink_name="entries_$_2", # Dynamic list field
sink_id="list_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
]
@@ -154,7 +154,7 @@ async def test_create_block_function_signature_with_list_fields():
@pytest.mark.asyncio
async def test_create_block_function_signature_with_object_fields():
"""Test that function signatures are created correctly for object dynamic fields."""
block = SmartDecisionMakerBlock()
block = OrchestratorBlock()
# Create a mock node for MatchTextPatternBlock (simulating object fields)
mock_node = Mock()
@@ -169,13 +169,13 @@ async def test_create_block_function_signature_with_object_fields():
source_name="tools_^_extract_~_user_name",
sink_name="data_@_user_name", # Dynamic object field
sink_id="extract_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
Mock(
source_name="tools_^_extract_~_user_email",
sink_name="data_@_user_email", # Dynamic object field
sink_id="extract_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
]
@@ -197,11 +197,11 @@ async def test_create_block_function_signature_with_object_fields():
@pytest.mark.asyncio
async def test_create_tool_node_signatures():
"""Test that the mapping between sanitized and original field names is built correctly."""
block = SmartDecisionMakerBlock()
block = OrchestratorBlock()
# Mock the database client and connected nodes
with patch(
"backend.blocks.smart_decision_maker.get_database_manager_async_client"
"backend.blocks.orchestrator.get_database_manager_async_client"
) as mock_db:
mock_client = AsyncMock()
mock_db.return_value = mock_client
@@ -281,7 +281,7 @@ async def test_create_tool_node_signatures():
@pytest.mark.asyncio
async def test_output_yielding_with_dynamic_fields():
"""Test that outputs are yielded correctly with dynamic field names mapped back."""
block = SmartDecisionMakerBlock()
block = OrchestratorBlock()
# No more sanitized mapping needed since we removed sanitization
@@ -309,13 +309,13 @@ async def test_output_yielding_with_dynamic_fields():
# Mock the LLM call
with patch(
"backend.blocks.smart_decision_maker.llm.llm_call", new_callable=AsyncMock
"backend.blocks.orchestrator.llm.llm_call", new_callable=AsyncMock
) as mock_llm:
mock_llm.return_value = mock_response
# Mock the database manager to avoid HTTP calls during tool execution
with patch(
"backend.blocks.smart_decision_maker.get_database_manager_async_client"
"backend.blocks.orchestrator.get_database_manager_async_client"
) as mock_db_manager, patch.object(
block, "_create_tool_node_signatures", new_callable=AsyncMock
) as mock_sig:
@@ -420,7 +420,7 @@ async def test_output_yielding_with_dynamic_fields():
@pytest.mark.asyncio
async def test_mixed_regular_and_dynamic_fields():
"""Test handling of blocks with both regular and dynamic fields."""
block = SmartDecisionMakerBlock()
block = OrchestratorBlock()
# Create a mock node
mock_node = Mock()
@@ -450,19 +450,19 @@ async def test_mixed_regular_and_dynamic_fields():
source_name="tools_^_test_~_regular",
sink_name="regular_field", # Regular field
sink_id="test_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
Mock(
source_name="tools_^_test_~_dict_key",
sink_name="values_#_key1", # Dynamic dict field
sink_id="test_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
Mock(
source_name="tools_^_test_~_dict_key2",
sink_name="values_#_key2", # Dynamic dict field
sink_id="test_node_id",
source_id="smart_decision_node_id",
source_id="orchestrator_node_id",
),
]
@@ -488,7 +488,7 @@ async def test_mixed_regular_and_dynamic_fields():
@pytest.mark.asyncio
async def test_validation_errors_dont_pollute_conversation():
"""Test that validation errors are only used during retries and don't pollute the conversation."""
block = SmartDecisionMakerBlock()
block = OrchestratorBlock()
# Track conversation history changes
conversation_snapshots = []
@@ -535,7 +535,7 @@ async def test_validation_errors_dont_pollute_conversation():
# Mock the LLM call
with patch(
"backend.blocks.smart_decision_maker.llm.llm_call", new_callable=AsyncMock
"backend.blocks.orchestrator.llm.llm_call", new_callable=AsyncMock
) as mock_llm:
mock_llm.side_effect = mock_llm_call
@@ -565,7 +565,7 @@ async def test_validation_errors_dont_pollute_conversation():
# Mock the database manager to avoid HTTP calls during tool execution
with patch(
"backend.blocks.smart_decision_maker.get_database_manager_async_client"
"backend.blocks.orchestrator.get_database_manager_async_client"
) as mock_db_manager:
# Set up the mock database manager for agent mode
mock_db_client = AsyncMock()


@@ -0,0 +1,202 @@
"""Tests for ExecutionMode enum and provider validation in the orchestrator.
Covers:
- ExecutionMode enum members exist and have stable values
- EXTENDED_THINKING provider validation (anthropic/open_router allowed, others rejected)
- EXTENDED_THINKING model-name validation (must start with "claude")
"""

from unittest.mock import AsyncMock, MagicMock, patch

import pytest

from backend.blocks.llm import LlmModel
from backend.blocks.orchestrator import ExecutionMode, OrchestratorBlock

# ---------------------------------------------------------------------------
# ExecutionMode enum integrity
# ---------------------------------------------------------------------------


class TestExecutionModeEnum:
    """Guard against accidental renames or removals of enum members."""

    def test_built_in_exists(self):
        assert hasattr(ExecutionMode, "BUILT_IN")
        assert ExecutionMode.BUILT_IN.value == "built_in"

    def test_extended_thinking_exists(self):
        assert hasattr(ExecutionMode, "EXTENDED_THINKING")
        assert ExecutionMode.EXTENDED_THINKING.value == "extended_thinking"

    def test_exactly_two_members(self):
        """If a new mode is added, this test should be updated intentionally."""
        assert set(ExecutionMode.__members__.keys()) == {
            "BUILT_IN",
            "EXTENDED_THINKING",
        }

    def test_string_enum(self):
        """ExecutionMode is a str enum so it serialises cleanly to JSON."""
        assert isinstance(ExecutionMode.BUILT_IN, str)
        assert isinstance(ExecutionMode.EXTENDED_THINKING, str)

    def test_round_trip_from_value(self):
        """Constructing from the string value should return the same member."""
        assert ExecutionMode("built_in") is ExecutionMode.BUILT_IN
        assert ExecutionMode("extended_thinking") is ExecutionMode.EXTENDED_THINKING
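
The "serialises cleanly to JSON" claim in `test_string_enum` can be illustrated with a stand-alone sketch. `DemoExecutionMode`, `to_json`, and `from_json` are hypothetical names; the enum copies only the two values asserted in the tests above.

```python
import json
from enum import Enum


class DemoExecutionMode(str, Enum):
    """Hypothetical copy of the two members asserted in the tests."""

    BUILT_IN = "built_in"
    EXTENDED_THINKING = "extended_thinking"


def to_json(mode: DemoExecutionMode) -> str:
    # A str-based enum member is encoded as its plain string value.
    return json.dumps({"mode": mode})


def from_json(payload: str) -> DemoExecutionMode:
    # Constructing from the stored value returns the original member.
    return DemoExecutionMode(json.loads(payload)["mode"])
```

Without the `str` mixin, `json.dumps` would raise `TypeError` for the enum member, which is why the test guards the base type.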
# ---------------------------------------------------------------------------
# Provider validation (inline in OrchestratorBlock.run)
# ---------------------------------------------------------------------------
def _make_model_stub(provider: str, value: str):
"""Create a lightweight stub that behaves like LlmModel for validation."""
metadata = MagicMock()
metadata.provider = provider
stub = MagicMock()
stub.metadata = metadata
stub.value = value
return stub
class TestExtendedThinkingProviderValidation:
"""The orchestrator rejects EXTENDED_THINKING for providers other than anthropic/open_router."""
def test_anthropic_provider_accepted(self):
"""provider='anthropic' + claude model should not raise."""
model = _make_model_stub("anthropic", "claude-opus-4-6")
provider = model.metadata.provider
model_name = model.value
assert provider in ("anthropic", "open_router")
assert model_name.startswith("claude")
def test_open_router_provider_accepted(self):
"""provider='open_router' + claude model should not raise."""
model = _make_model_stub("open_router", "claude-sonnet-4-6")
provider = model.metadata.provider
model_name = model.value
assert provider in ("anthropic", "open_router")
assert model_name.startswith("claude")
def test_openai_provider_rejected(self):
"""provider='openai' should be rejected for EXTENDED_THINKING."""
model = _make_model_stub("openai", "gpt-4o")
provider = model.metadata.provider
assert provider not in ("anthropic", "open_router")
def test_groq_provider_rejected(self):
model = _make_model_stub("groq", "llama-3.3-70b-versatile")
provider = model.metadata.provider
assert provider not in ("anthropic", "open_router")
def test_non_claude_model_rejected_even_if_anthropic_provider(self):
"""A hypothetical non-Claude model with provider='anthropic' is rejected."""
model = _make_model_stub("anthropic", "not-a-claude-model")
model_name = model.value
assert not model_name.startswith("claude")
def test_real_gpt4o_model_rejected(self):
"""Verify a real LlmModel enum member (GPT4O) fails the provider check."""
model = LlmModel.GPT4O
provider = model.metadata.provider
assert provider not in ("anthropic", "open_router")
def test_real_claude_model_passes(self):
"""Verify a real LlmModel enum member (CLAUDE_4_6_SONNET) passes."""
model = LlmModel.CLAUDE_4_6_SONNET
provider = model.metadata.provider
model_name = model.value
assert provider in ("anthropic", "open_router")
assert model_name.startswith("claude")
# ---------------------------------------------------------------------------
# Integration-style: exercise the validation branch via OrchestratorBlock.run
# ---------------------------------------------------------------------------
def _make_input_data(model, execution_mode=ExecutionMode.EXTENDED_THINKING):
"""Build a minimal MagicMock that satisfies OrchestratorBlock.run's early path."""
inp = MagicMock()
inp.execution_mode = execution_mode
inp.model = model
inp.prompt = "test"
inp.sys_prompt = ""
inp.conversation_history = []
inp.last_tool_output = None
inp.prompt_values = {}
return inp
async def _collect_run_outputs(block, input_data, **kwargs):
"""Exhaust the OrchestratorBlock.run async generator, collecting outputs."""
outputs = []
async for item in block.run(input_data, **kwargs):
outputs.append(item)
return outputs
class TestExtendedThinkingValidationRaisesInBlock:
"""Call OrchestratorBlock.run far enough to trigger the ValueError."""
@pytest.mark.asyncio
async def test_non_anthropic_provider_raises_valueerror(self):
"""EXTENDED_THINKING + openai provider raises ValueError."""
block = OrchestratorBlock()
input_data = _make_input_data(model=LlmModel.GPT4O)
with (
patch.object(
block,
"_create_tool_node_signatures",
new_callable=AsyncMock,
return_value=[],
),
pytest.raises(ValueError, match="Anthropic-compatible"),
):
await _collect_run_outputs(
block,
input_data,
credentials=MagicMock(),
graph_id="g",
node_id="n",
graph_exec_id="ge",
node_exec_id="ne",
user_id="u",
graph_version=1,
execution_context=MagicMock(),
execution_processor=MagicMock(),
)
@pytest.mark.asyncio
async def test_non_claude_model_with_anthropic_provider_raises(self):
"""A model with anthropic provider but non-claude name raises ValueError."""
block = OrchestratorBlock()
fake_model = _make_model_stub("anthropic", "not-a-claude-model")
input_data = _make_input_data(model=fake_model)
with (
patch.object(
block,
"_create_tool_node_signatures",
new_callable=AsyncMock,
return_value=[],
),
pytest.raises(ValueError, match="only supports Claude models"),
):
await _collect_run_outputs(
block,
input_data,
credentials=MagicMock(),
graph_id="g",
node_id="n",
graph_exec_id="ge",
node_exec_id="ne",
user_id="u",
graph_version=1,
execution_context=MagicMock(),
execution_processor=MagicMock(),
)
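
The provider and model-name rules that these tests assert could be factored into one small helper. The sketch below is only an illustration: `validate_extended_thinking` is a hypothetical name, and the error wording is assumed, chosen only so that it contains the fragments the tests match on ("Anthropic-compatible" and "only supports Claude models"). The real check lives inside `OrchestratorBlock.run` and is not shown in this diff.

```python
ALLOWED_PROVIDERS = ("anthropic", "open_router")  # values taken from the tests above


def validate_extended_thinking(provider: str, model_name: str) -> None:
    """Hypothetical helper: raise ValueError if the pair cannot use EXTENDED_THINKING."""
    if provider not in ALLOWED_PROVIDERS:
        # Wording is an assumption; the tests only match "Anthropic-compatible".
        raise ValueError(
            "EXTENDED_THINKING requires an Anthropic-compatible provider, "
            f"got {provider!r}"
        )
    if not model_name.startswith("claude"):
        # Wording is an assumption; the tests only match "only supports Claude models".
        raise ValueError(
            f"EXTENDED_THINKING only supports Claude models, got {model_name!r}"
        )
```

With a helper like this, the stub-based tests could call the function directly instead of re-stating the membership and prefix checks inline.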

@@ -1,6 +1,6 @@
-"""Tests for SmartDecisionMakerBlock compatibility with the OpenAI Responses API.
+"""Tests for OrchestratorBlock compatibility with the OpenAI Responses API.
-The SmartDecisionMakerBlock manages conversation history in the Chat Completions
+The OrchestratorBlock manages conversation history in the Chat Completions
format, but OpenAI models now use the Responses API which has a fundamentally
different conversation structure. These tests document:
@@ -27,8 +27,8 @@ from unittest.mock import AsyncMock, MagicMock, patch
import pytest
-from backend.blocks.smart_decision_maker import (
-SmartDecisionMakerBlock,
+from backend.blocks.orchestrator import (
+OrchestratorBlock,
_combine_tool_responses,
_convert_raw_response_to_dict,
_create_tool_response,
@@ -733,7 +733,7 @@ class TestUpdateConversation:
def test_dict_raw_response_no_reasoning_no_tools(self):
"""Dict raw_response, no reasoning → appends assistant dict."""
-block = SmartDecisionMakerBlock()
+block = OrchestratorBlock()
prompt: list[dict] = []
resp = self._make_response({"role": "assistant", "content": "hi"})
block._update_conversation(prompt, resp)
@@ -741,7 +741,7 @@ class TestUpdateConversation:
def test_dict_raw_response_with_reasoning_no_tool_calls(self):
"""Reasoning present, no tool calls → reasoning prepended."""
-block = SmartDecisionMakerBlock()
+block = OrchestratorBlock()
prompt: list[dict] = []
resp = self._make_response(
{"role": "assistant", "content": "answer"},
@@ -757,7 +757,7 @@ class TestUpdateConversation:
def test_dict_raw_response_with_reasoning_and_anthropic_tool_calls(self):
"""Reasoning + Anthropic tool_use in content → reasoning skipped."""
-block = SmartDecisionMakerBlock()
+block = OrchestratorBlock()
prompt: list[dict] = []
raw = {
"role": "assistant",
@@ -772,7 +772,7 @@ class TestUpdateConversation:
def test_with_tool_outputs(self):
"""Tool outputs → extended onto prompt."""
-block = SmartDecisionMakerBlock()
+block = OrchestratorBlock()
prompt: list[dict] = []
resp = self._make_response({"role": "assistant", "content": None})
outputs = [{"role": "tool", "tool_call_id": "call_1", "content": "r"}]
@@ -782,7 +782,7 @@ class TestUpdateConversation:
def test_without_tool_outputs(self):
"""No tool outputs → only assistant message appended."""
-block = SmartDecisionMakerBlock()
+block = OrchestratorBlock()
prompt: list[dict] = []
resp = self._make_response({"role": "assistant", "content": "done"})
block._update_conversation(prompt, resp, None)
@@ -790,7 +790,7 @@ class TestUpdateConversation:
def test_string_raw_response(self):
"""Ollama string → wrapped as assistant dict."""
-block = SmartDecisionMakerBlock()
+block = OrchestratorBlock()
prompt: list[dict] = []
resp = self._make_response("hello from ollama")
block._update_conversation(prompt, resp)
@@ -800,7 +800,7 @@ class TestUpdateConversation:
def test_responses_api_text_response_produces_valid_items(self):
"""Responses API text response → conversation items must have valid role."""
-block = SmartDecisionMakerBlock()
+block = OrchestratorBlock()
prompt: list[dict] = [
{"role": "system", "content": "sys"},
{"role": "user", "content": "user"},
@@ -820,7 +820,7 @@ class TestUpdateConversation:
def test_responses_api_function_call_produces_valid_items(self):
"""Responses API function_call → conversation items must have valid type."""
-block = SmartDecisionMakerBlock()
+block = OrchestratorBlock()
prompt: list[dict] = []
resp = self._make_response(
_MockResponse(output=[_MockFunctionCall("tool", "{}", call_id="call_1")])
@@ -856,7 +856,7 @@ async def test_agent_mode_conversation_valid_for_responses_api():
"""
import backend.blocks.llm as llm_module
-block = SmartDecisionMakerBlock()
+block = OrchestratorBlock()
# First response: tool call
mock_tc = MagicMock()
@@ -936,7 +936,7 @@ async def test_agent_mode_conversation_valid_for_responses_api():
with patch("backend.blocks.llm.llm_call", llm_mock), patch.object(
block, "_create_tool_node_signatures", return_value=tool_sigs
), patch(
-"backend.blocks.smart_decision_maker.get_database_manager_async_client",
+"backend.blocks.orchestrator.get_database_manager_async_client",
return_value=mock_db,
), patch(
"backend.executor.manager.async_update_node_execution_status",
@@ -945,7 +945,7 @@ async def test_agent_mode_conversation_valid_for_responses_api():
"backend.integrations.creds_manager.IntegrationCredentialsManager"
):
-inp = SmartDecisionMakerBlock.Input(
+inp = OrchestratorBlock.Input(
prompt="Improve this",
model=llm_module.DEFAULT_LLM_MODEL,
credentials=llm_module.TEST_CREDENTIALS_INPUT, # type: ignore
@@ -992,7 +992,7 @@ async def test_traditional_mode_conversation_valid_for_responses_api():
"""Traditional mode: the yielded conversation must contain only valid items."""
import backend.blocks.llm as llm_module
-block = SmartDecisionMakerBlock()
+block = OrchestratorBlock()
mock_tc = MagicMock()
mock_tc.function.name = "my_tool"
@@ -1028,7 +1028,7 @@ async def test_traditional_mode_conversation_valid_for_responses_api():
"backend.blocks.llm.llm_call", new_callable=AsyncMock, return_value=resp
), patch.object(block, "_create_tool_node_signatures", return_value=tool_sigs):
-inp = SmartDecisionMakerBlock.Input(
+inp = OrchestratorBlock.Input(
prompt="Do it",
model=llm_module.DEFAULT_LLM_MODEL,
credentials=llm_module.TEST_CREDENTIALS_INPUT, # type: ignore

File diff suppressed because it is too large.

@@ -44,7 +44,7 @@ class XMLParserBlock(Block):
elif token.type == "TAG_CLOSE":
depth -= 1
if depth < 0:
-raise SyntaxError("Unexpected closing tag in XML input.")
+raise ValueError("Unexpected closing tag in XML input.")
elif token.type in {"TEXT", "ESCAPE"}:
if depth == 0 and token.value:
raise ValueError(
@@ -53,7 +53,7 @@ class XMLParserBlock(Block):
)
if depth != 0:
-raise SyntaxError("Unclosed tag detected in XML input.")
+raise ValueError("Unclosed tag detected in XML input.")
if not root_seen:
raise ValueError("XML must include a root element.")
@@ -76,4 +76,7 @@ class XMLParserBlock(Block):
except ValueError as val_e:
raise ValueError(f"Validation error for dict:{val_e}") from val_e
except SyntaxError as syn_e:
-raise SyntaxError(f"Error in input xml syntax: {syn_e}") from syn_e
+# Raise as ValueError so the base Block.execute() wraps it as
+# BlockExecutionError (expected user-caused failure) instead of
+# BlockUnknownError (unexpected platform error that alerts Sentry).
+raise ValueError(f"Error in input xml syntax: {syn_e}") from syn_e
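
The comment added in this hunk relies on the base class treating `ValueError` differently from other exceptions. A minimal sketch of that kind of classification is shown here, purely as an assumption about the shape of the logic: the two exception classes are local stand-ins named after the ones in the comment, and `run_and_classify` is a hypothetical wrapper, not the actual `Block.execute()`.

```python
class BlockExecutionError(Exception):
    """Stand-in for an expected, user-caused failure."""


class BlockUnknownError(Exception):
    """Stand-in for an unexpected platform failure."""


def run_and_classify(run):
    """Hypothetical wrapper: map ValueError to the 'expected' error type."""
    try:
        return run()
    except ValueError as exc:
        raise BlockExecutionError(str(exc)) from exc
    except Exception as exc:
        # SyntaxError is not a subclass of ValueError, so it would land here.
        raise BlockUnknownError(str(exc)) from exc
```

Under this assumption, a parser that raises `SyntaxError` for malformed user input would be reported as a platform fault, which is the situation the change to `ValueError` avoids.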

@@ -9,12 +9,16 @@ shared tool registry as the SDK path.
import asyncio
import logging
import uuid
-from collections.abc import AsyncGenerator
-from typing import Any
+from collections.abc import AsyncGenerator, Sequence
+from dataclasses import dataclass, field
+from functools import partial
+from typing import Any, cast
import orjson
from langfuse import propagate_attributes
from openai.types.chat import ChatCompletionMessageParam, ChatCompletionToolParam
from backend.copilot.context import set_execution_context
from backend.copilot.model import (
ChatMessage,
ChatSession,
@@ -48,7 +52,17 @@ from backend.copilot.token_tracking import persist_and_record_usage
from backend.copilot.tools import execute_tool, get_available_tools
from backend.copilot.tracking import track_user_message
from backend.util.exceptions import NotFoundError
-from backend.util.prompt import compress_context
+from backend.util.prompt import (
+compress_context,
+estimate_token_count,
+estimate_token_count_str,
+)
+from backend.util.tool_call_loop import (
+LLMLoopResponse,
+LLMToolCall,
+ToolCallResult,
+tool_call_loop,
+)
logger = logging.getLogger(__name__)
@@ -59,6 +73,247 @@ _background_tasks: set[asyncio.Task[Any]] = set()
_MAX_TOOL_ROUNDS = 30
@dataclass
class _BaselineStreamState:
"""Mutable state shared between the tool-call loop callbacks.
Extracted from ``stream_chat_completion_baseline`` so that the callbacks
can be module-level functions instead of deeply nested closures.
"""
pending_events: list[StreamBaseResponse] = field(default_factory=list)
assistant_text: str = ""
text_block_id: str = field(default_factory=lambda: str(uuid.uuid4()))
text_started: bool = False
turn_prompt_tokens: int = 0
turn_completion_tokens: int = 0
async def _baseline_llm_caller(
messages: list[dict[str, Any]],
tools: Sequence[Any],
*,
state: _BaselineStreamState,
) -> LLMLoopResponse:
"""Stream an OpenAI-compatible response and collect results.
Extracted from ``stream_chat_completion_baseline`` for readability.
"""
state.pending_events.append(StreamStartStep())
round_text = ""
try:
client = _get_openai_client()
typed_messages = cast(list[ChatCompletionMessageParam], messages)
if tools:
typed_tools = cast(list[ChatCompletionToolParam], tools)
response = await client.chat.completions.create(
model=config.model,
messages=typed_messages,
tools=typed_tools,
stream=True,
stream_options={"include_usage": True},
)
else:
response = await client.chat.completions.create(
model=config.model,
messages=typed_messages,
stream=True,
stream_options={"include_usage": True},
)
tool_calls_by_index: dict[int, dict[str, str]] = {}
async for chunk in response:
if chunk.usage:
state.turn_prompt_tokens += chunk.usage.prompt_tokens or 0
state.turn_completion_tokens += chunk.usage.completion_tokens or 0
delta = chunk.choices[0].delta if chunk.choices else None
if not delta:
continue
if delta.content:
if not state.text_started:
state.pending_events.append(StreamTextStart(id=state.text_block_id))
state.text_started = True
round_text += delta.content
state.pending_events.append(
StreamTextDelta(id=state.text_block_id, delta=delta.content)
)
if delta.tool_calls:
for tc in delta.tool_calls:
idx = tc.index
if idx not in tool_calls_by_index:
tool_calls_by_index[idx] = {
"id": "",
"name": "",
"arguments": "",
}
entry = tool_calls_by_index[idx]
if tc.id:
entry["id"] = tc.id
if tc.function and tc.function.name:
entry["name"] = tc.function.name
if tc.function and tc.function.arguments:
entry["arguments"] += tc.function.arguments
# Close text block
if state.text_started:
state.pending_events.append(StreamTextEnd(id=state.text_block_id))
state.text_started = False
state.text_block_id = str(uuid.uuid4())
finally:
# Always persist partial text so the session history stays consistent,
# even when the stream is interrupted by an exception.
state.assistant_text += round_text
# Always emit StreamFinishStep to match the StreamStartStep,
# even if an exception occurred during streaming.
state.pending_events.append(StreamFinishStep())
# Convert to shared format
llm_tool_calls = [
LLMToolCall(
id=tc["id"],
name=tc["name"],
arguments=tc["arguments"] or "{}",
)
for tc in tool_calls_by_index.values()
]
return LLMLoopResponse(
response_text=round_text or None,
tool_calls=llm_tool_calls,
raw_response=None, # Not needed for baseline conversation updater
prompt_tokens=0, # Tracked via state accumulators
completion_tokens=0,
)
async def _baseline_tool_executor(
tool_call: LLMToolCall,
tools: Sequence[Any],
*,
state: _BaselineStreamState,
user_id: str | None,
session: ChatSession,
) -> ToolCallResult:
"""Execute a tool via the copilot tool registry.
Extracted from ``stream_chat_completion_baseline`` for readability.
"""
tool_call_id = tool_call.id
tool_name = tool_call.name
raw_args = tool_call.arguments or "{}"
try:
tool_args = orjson.loads(raw_args)
except orjson.JSONDecodeError as parse_err:
parse_error = f"Invalid JSON arguments for tool '{tool_name}': {parse_err}"
logger.warning("[Baseline] %s", parse_error)
state.pending_events.append(
StreamToolOutputAvailable(
toolCallId=tool_call_id,
toolName=tool_name,
output=parse_error,
success=False,
)
)
return ToolCallResult(
tool_call_id=tool_call_id,
tool_name=tool_name,
content=parse_error,
is_error=True,
)
state.pending_events.append(
StreamToolInputStart(toolCallId=tool_call_id, toolName=tool_name)
)
state.pending_events.append(
StreamToolInputAvailable(
toolCallId=tool_call_id,
toolName=tool_name,
input=tool_args,
)
)
try:
result: StreamToolOutputAvailable = await execute_tool(
tool_name=tool_name,
parameters=tool_args,
user_id=user_id,
session=session,
tool_call_id=tool_call_id,
)
state.pending_events.append(result)
tool_output = (
result.output if isinstance(result.output, str) else str(result.output)
)
return ToolCallResult(
tool_call_id=tool_call_id,
tool_name=tool_name,
content=tool_output,
)
except Exception as e:
error_output = f"Tool execution error: {e}"
logger.error(
"[Baseline] Tool %s failed: %s",
tool_name,
error_output,
exc_info=True,
)
state.pending_events.append(
StreamToolOutputAvailable(
toolCallId=tool_call_id,
toolName=tool_name,
output=error_output,
success=False,
)
)
return ToolCallResult(
tool_call_id=tool_call_id,
tool_name=tool_name,
content=error_output,
is_error=True,
)
def _baseline_conversation_updater(
messages: list[dict[str, Any]],
response: LLMLoopResponse,
tool_results: list[ToolCallResult] | None = None,
) -> None:
"""Update OpenAI message list with assistant response + tool results.
Extracted from ``stream_chat_completion_baseline`` for readability.
"""
if tool_results:
# Build assistant message with tool_calls
assistant_msg: dict[str, Any] = {"role": "assistant"}
if response.response_text:
assistant_msg["content"] = response.response_text
assistant_msg["tool_calls"] = [
{
"id": tc.id,
"type": "function",
"function": {"name": tc.name, "arguments": tc.arguments},
}
for tc in response.tool_calls
]
messages.append(assistant_msg)
for tr in tool_results:
messages.append(
{
"role": "tool",
"tool_call_id": tr.tool_call_id,
"content": tr.content,
}
)
else:
if response.response_text:
messages.append({"role": "assistant", "content": response.response_text})
async def _update_title_async(
session_id: str, message: str, user_id: str | None
) -> None:
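
The streaming caller in the hunk above assembles tool calls from incremental fragments keyed by index: the id and name arrive once, while the argument string arrives in pieces and is concatenated. A minimal sketch of that accumulation follows; it assumes fragments are plain `(index, id, name, arguments)` tuples rather than SDK delta objects, and `accumulate_tool_calls` is a hypothetical name.

```python
from typing import Iterable, Optional

Fragment = tuple[int, Optional[str], Optional[str], Optional[str]]


def accumulate_tool_calls(fragments: Iterable[Fragment]) -> list[dict[str, str]]:
    """Merge streamed (index, id, name, arguments) fragments into whole calls."""
    by_index: dict[int, dict[str, str]] = {}
    for index, call_id, name, arguments in fragments:
        entry = by_index.setdefault(index, {"id": "", "name": "", "arguments": ""})
        if call_id:
            entry["id"] = call_id
        if name:
            entry["name"] = name
        if arguments:
            # Argument JSON is streamed in pieces, so append rather than overwrite.
            entry["arguments"] += arguments
    # An empty argument string is replaced by "{}" so it still parses as JSON.
    return [{**e, "arguments": e["arguments"] or "{}"} for e in by_index.values()]
```

Keying by index rather than by id matters because later fragments of the same call typically carry no id.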
@@ -203,6 +458,9 @@ async def stream_chat_completion_baseline(
tools = get_available_tools()
# Propagate execution context so tool handlers can read session-level flags.
set_execution_context(user_id, session)
yield StreamStart(messageId=message_id, sessionId=session_id)
# Propagate user/session context to Langfuse so all LLM calls within
@@ -219,191 +477,32 @@ async def stream_chat_completion_baseline(
except Exception:
logger.warning("[Baseline] Langfuse trace context setup failed")
assistant_text = ""
text_block_id = str(uuid.uuid4())
text_started = False
step_open = False
# Token usage accumulators — populated from streaming chunks
turn_prompt_tokens = 0
turn_completion_tokens = 0
_stream_error = False # Track whether an error occurred during streaming
state = _BaselineStreamState()
# Bind extracted module-level callbacks to this request's state/session
# using functools.partial so they satisfy the Protocol signatures.
_bound_llm_caller = partial(_baseline_llm_caller, state=state)
_bound_tool_executor = partial(
_baseline_tool_executor, state=state, user_id=user_id, session=session
)
try:
for _round in range(_MAX_TOOL_ROUNDS):
# Open a new step for each LLM round
yield StreamStartStep()
step_open = True
loop_result = None
async for loop_result in tool_call_loop(
messages=openai_messages,
tools=tools,
llm_call=_bound_llm_caller,
execute_tool=_bound_tool_executor,
update_conversation=_baseline_conversation_updater,
max_iterations=_MAX_TOOL_ROUNDS,
):
# Drain buffered events after each iteration (real-time streaming)
for evt in state.pending_events:
yield evt
state.pending_events.clear()
# Stream a response from the model
create_kwargs: dict[str, Any] = dict(
model=config.model,
messages=openai_messages,
stream=True,
stream_options={"include_usage": True},
)
if tools:
create_kwargs["tools"] = tools
response = await _get_openai_client().chat.completions.create(**create_kwargs) # type: ignore[arg-type] # dynamic kwargs
# Accumulate streamed response (text + tool calls)
round_text = ""
tool_calls_by_index: dict[int, dict[str, str]] = {}
async for chunk in response:
# Capture token usage from the streaming chunk.
# OpenRouter normalises all providers into OpenAI format
# where prompt_tokens already includes cached tokens
# (unlike Anthropic's native API). Use += to sum all
# tool-call rounds since each API call is independent.
# NOTE: stream_options={"include_usage": True} is not
# universally supported — some providers (Mistral, Llama
# via OpenRouter) always return chunk.usage=None. When
# that happens, tokens stay 0 and the tiktoken fallback
# below activates. Fail-open: one round is estimated.
if chunk.usage:
turn_prompt_tokens += chunk.usage.prompt_tokens or 0
turn_completion_tokens += chunk.usage.completion_tokens or 0
delta = chunk.choices[0].delta if chunk.choices else None
if not delta:
continue
# Text content
if delta.content:
if not text_started:
yield StreamTextStart(id=text_block_id)
text_started = True
round_text += delta.content
yield StreamTextDelta(id=text_block_id, delta=delta.content)
# Tool call fragments (streamed incrementally)
if delta.tool_calls:
for tc in delta.tool_calls:
idx = tc.index
if idx not in tool_calls_by_index:
tool_calls_by_index[idx] = {
"id": "",
"name": "",
"arguments": "",
}
entry = tool_calls_by_index[idx]
if tc.id:
entry["id"] = tc.id
if tc.function and tc.function.name:
entry["name"] = tc.function.name
if tc.function and tc.function.arguments:
entry["arguments"] += tc.function.arguments
# Close text block if we had one this round
if text_started:
yield StreamTextEnd(id=text_block_id)
text_started = False
text_block_id = str(uuid.uuid4())
# Accumulate text for session persistence
assistant_text += round_text
# No tool calls -> model is done
if not tool_calls_by_index:
yield StreamFinishStep()
step_open = False
break
# Close step before tool execution
yield StreamFinishStep()
step_open = False
# Append the assistant message with tool_calls to context.
assistant_msg: dict[str, Any] = {"role": "assistant"}
if round_text:
assistant_msg["content"] = round_text
assistant_msg["tool_calls"] = [
{
"id": tc["id"],
"type": "function",
"function": {
"name": tc["name"],
"arguments": tc["arguments"] or "{}",
},
}
for tc in tool_calls_by_index.values()
]
openai_messages.append(assistant_msg)
# Execute each tool call and stream events
for tc in tool_calls_by_index.values():
tool_call_id = tc["id"]
tool_name = tc["name"]
raw_args = tc["arguments"] or "{}"
try:
tool_args = orjson.loads(raw_args)
except orjson.JSONDecodeError as parse_err:
parse_error = (
f"Invalid JSON arguments for tool '{tool_name}': {parse_err}"
)
logger.warning("[Baseline] %s", parse_error)
yield StreamToolOutputAvailable(
toolCallId=tool_call_id,
toolName=tool_name,
output=parse_error,
success=False,
)
openai_messages.append(
{
"role": "tool",
"tool_call_id": tool_call_id,
"content": parse_error,
}
)
continue
yield StreamToolInputStart(toolCallId=tool_call_id, toolName=tool_name)
yield StreamToolInputAvailable(
toolCallId=tool_call_id,
toolName=tool_name,
input=tool_args,
)
# Execute via shared tool registry
try:
result: StreamToolOutputAvailable = await execute_tool(
tool_name=tool_name,
parameters=tool_args,
user_id=user_id,
session=session,
tool_call_id=tool_call_id,
)
yield result
tool_output = (
result.output
if isinstance(result.output, str)
else str(result.output)
)
except Exception as e:
error_output = f"Tool execution error: {e}"
logger.error(
"[Baseline] Tool %s failed: %s",
tool_name,
error_output,
exc_info=True,
)
yield StreamToolOutputAvailable(
toolCallId=tool_call_id,
toolName=tool_name,
output=error_output,
success=False,
)
tool_output = error_output
# Append tool result to context for next round
openai_messages.append(
{
"role": "tool",
"tool_call_id": tool_call_id,
"content": tool_output,
}
)
else:
# for-loop exhausted without break -> tool-round limit hit
if loop_result and not loop_result.finished_naturally:
limit_msg = (
f"Exceeded {_MAX_TOOL_ROUNDS} tool-call rounds "
"without a final response."
@@ -418,11 +517,28 @@ async def stream_chat_completion_baseline(
_stream_error = True
error_msg = str(e) or type(e).__name__
logger.error("[Baseline] Streaming error: %s", error_msg, exc_info=True)
-# Close any open text/step before emitting error
-if text_started:
-yield StreamTextEnd(id=text_block_id)
-if step_open:
-yield StreamFinishStep()
+# Close any open text block. The llm_caller's finally block
+# already appended StreamFinishStep to pending_events, so we must
+# insert StreamTextEnd *before* StreamFinishStep to preserve the
+# protocol ordering:
+# StreamStartStep -> StreamTextStart -> ...deltas... ->
+# StreamTextEnd -> StreamFinishStep
+# Appending (or yielding directly) would place it after
+# StreamFinishStep, violating the protocol.
+if state.text_started:
+# Find the last StreamFinishStep and insert before it.
+insert_pos = len(state.pending_events)
+for i in range(len(state.pending_events) - 1, -1, -1):
+if isinstance(state.pending_events[i], StreamFinishStep):
+insert_pos = i
+break
+state.pending_events.insert(
+insert_pos, StreamTextEnd(id=state.text_block_id)
+)
+# Drain pending events in correct order
+for evt in state.pending_events:
+yield evt
+state.pending_events.clear()
yield StreamError(errorText=error_msg, code="baseline_error")
# Still persist whatever we got
finally:
@@ -442,26 +558,21 @@ async def stream_chat_completion_baseline(
# Skip fallback when an error occurred and no output was produced —
# charging rate-limit tokens for completely failed requests is unfair.
if (
-turn_prompt_tokens == 0
-and turn_completion_tokens == 0
-and not (_stream_error and not assistant_text)
+state.turn_prompt_tokens == 0
+and state.turn_completion_tokens == 0
+and not (_stream_error and not state.assistant_text)
):
-from backend.util.prompt import (
-estimate_token_count,
-estimate_token_count_str,
-)
-turn_prompt_tokens = max(
+state.turn_prompt_tokens = max(
estimate_token_count(openai_messages, model=config.model), 1
)
-turn_completion_tokens = estimate_token_count_str(
-assistant_text, model=config.model
+state.turn_completion_tokens = estimate_token_count_str(
+state.assistant_text, model=config.model
)
logger.info(
"[Baseline] No streaming usage reported; estimated tokens: "
"prompt=%d, completion=%d",
-turn_prompt_tokens,
-turn_completion_tokens,
+state.turn_prompt_tokens,
+state.turn_completion_tokens,
)
# Persist token usage to session and record for rate limiting.
@@ -471,15 +582,15 @@ async def stream_chat_completion_baseline(
await persist_and_record_usage(
session=session,
user_id=user_id,
-prompt_tokens=turn_prompt_tokens,
-completion_tokens=turn_completion_tokens,
+prompt_tokens=state.turn_prompt_tokens,
+completion_tokens=state.turn_completion_tokens,
log_prefix="[Baseline]",
)
# Persist assistant response
-if assistant_text:
+if state.assistant_text:
session.messages.append(
-ChatMessage(role="assistant", content=assistant_text)
+ChatMessage(role="assistant", content=state.assistant_text)
)
try:
await upsert_chat_session(session)
@@ -491,11 +602,11 @@ async def stream_chat_completion_baseline(
# aclose() — doing so raises RuntimeError on client disconnect.
# On GeneratorExit the client is already gone, so unreachable yields
# are harmless; on normal completion they reach the SSE stream.
-if turn_prompt_tokens > 0 or turn_completion_tokens > 0:
+if state.turn_prompt_tokens > 0 or state.turn_completion_tokens > 0:
yield StreamUsage(
-prompt_tokens=turn_prompt_tokens,
-completion_tokens=turn_completion_tokens,
-total_tokens=turn_prompt_tokens + turn_completion_tokens,
+prompt_tokens=state.turn_prompt_tokens,
+completion_tokens=state.turn_completion_tokens,
+total_tokens=state.turn_prompt_tokens + state.turn_completion_tokens,
)
yield StreamFinish()
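
The error path in this diff has to place the text-end event before the step-finish event that was already buffered, so that the step still closes last. The sketch below isolates that "insert before the last marker" step; the event classes are simplified stand-ins for the real `Stream*` types, and `insert_before_last` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class StartStep:  # stand-in for StreamStartStep
    pass


@dataclass
class TextStart:  # stand-in for StreamTextStart
    pass


@dataclass
class TextEnd:  # stand-in for StreamTextEnd
    pass


@dataclass
class FinishStep:  # stand-in for StreamFinishStep
    pass


def insert_before_last(events: list, new_event: object, marker_type: type) -> None:
    """Insert new_event before the last marker_type instance, else append it."""
    insert_pos = len(events)
    # Scan backwards so the most recently buffered marker is the one we find.
    for i in range(len(events) - 1, -1, -1):
        if isinstance(events[i], marker_type):
            insert_pos = i
            break
    events.insert(insert_pos, new_event)
```

If no marker is buffered, the event is simply appended, which matches the fallback value of `insert_pos` in the diff.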

@@ -31,7 +31,7 @@ async def test_baseline_multi_turn(setup_test_user, test_user_id):
if not api_key:
return pytest.skip("OPEN_ROUTER_API_KEY is not set, skipping test")
-session = await create_chat_session(test_user_id)
+session = await create_chat_session(test_user_id, dry_run=False)
session = await upsert_chat_session(session)
# --- Turn 1: send a message with a unique keyword ---

@@ -91,6 +91,20 @@ class ChatConfig(BaseSettings):
description="Max tokens per week, resets Monday 00:00 UTC (0 = unlimited)",
)
# Cost (in credits / cents) to reset the daily rate limit using credits.
# When a user hits their daily limit, they can spend this amount to reset
# the daily counter and keep working. Set to 0 to disable the feature.
rate_limit_reset_cost: int = Field(
default=500,
ge=0,
description="Credit cost (in cents) for resetting the daily rate limit. 0 = disabled.",
)
max_daily_resets: int = Field(
default=5,
ge=0,
description="Maximum number of credit-based rate limit resets per user per day. 0 = unlimited.",
)
# Claude Agent SDK Configuration
use_claude_agent_sdk: bool = Field(
default=True,
@@ -164,7 +178,7 @@ class ChatConfig(BaseSettings):
Single source of truth for "will the SDK route through OpenRouter?".
Checks the flag *and* that ``api_key`` + a valid ``base_url`` are
-present — mirrors the fallback logic in ``_build_sdk_env``.
+present — mirrors the fallback logic in ``build_sdk_env``.
"""
if not self.use_openrouter:
return False

@@ -17,6 +17,9 @@ from backend.util.workspace import WorkspaceManager
if TYPE_CHECKING:
from e2b import AsyncSandbox
from backend.copilot.permissions import CopilotPermissions
# Allowed base directory for the Read tool. Public so service.py can use it
# for sweep operations without depending on a private implementation detail.
# Respects CLAUDE_CONFIG_DIR env var, consistent with transcript.py's
@@ -43,6 +46,12 @@ _current_sandbox: ContextVar["AsyncSandbox | None"] = ContextVar(
)
_current_sdk_cwd: ContextVar[str] = ContextVar("_current_sdk_cwd", default="")
# Current execution's capability filter. None means "no restrictions".
# Set by set_execution_context(); read by run_block and service.py.
_current_permissions: "ContextVar[CopilotPermissions | None]" = ContextVar(
"_current_permissions", default=None
)
def encode_cwd_for_cli(cwd: str) -> str:
"""Encode a working directory path the same way the Claude CLI does.
@@ -63,6 +72,7 @@ def set_execution_context(
session: ChatSession,
sandbox: "AsyncSandbox | None" = None,
sdk_cwd: str | None = None,
permissions: "CopilotPermissions | None" = None,
) -> None:
"""Set per-turn context variables used by file-resolution tool handlers."""
_current_user_id.set(user_id)
@@ -70,6 +80,7 @@ def set_execution_context(
_current_sandbox.set(sandbox)
_current_sdk_cwd.set(sdk_cwd or "")
_current_project_dir.set(_encode_cwd_for_cli(sdk_cwd) if sdk_cwd else "")
_current_permissions.set(permissions)
def get_execution_context() -> tuple[str | None, ChatSession | None]:
@@ -77,6 +88,11 @@ def get_execution_context() -> tuple[str | None, ChatSession | None]:
return _current_user_id.get(), _current_session.get()
def get_current_permissions() -> "CopilotPermissions | None":
"""Return the capability filter for the current execution, or None if unrestricted."""
return _current_permissions.get()
def get_current_sandbox() -> "AsyncSandbox | None":
"""Return the E2B sandbox for the current session, or None if not active."""
return _current_sandbox.get()
@@ -88,17 +104,32 @@ def get_sdk_cwd() -> str:
E2B_WORKDIR = "/home/user"
E2B_ALLOWED_DIRS: tuple[str, ...] = (E2B_WORKDIR, "/tmp")
E2B_ALLOWED_DIRS_STR: str = " or ".join(E2B_ALLOWED_DIRS)
def is_within_allowed_dirs(path: str) -> bool:
"""Return True if *path* is within one of the allowed sandbox directories."""
for allowed in E2B_ALLOWED_DIRS:
if path == allowed or path.startswith(allowed + "/"):
return True
return False
def resolve_sandbox_path(path: str) -> str:
-"""Normalise *path* to an absolute sandbox path under ``/home/user``.
+"""Normalise *path* to an absolute sandbox path under an allowed directory.
+Allowed directories: ``/home/user`` and ``/tmp``.
Relative paths are resolved against ``/home/user``.
Raises :class:`ValueError` if the resolved path escapes the sandbox.
"""
candidate = path if os.path.isabs(path) else os.path.join(E2B_WORKDIR, path)
normalized = os.path.normpath(candidate)
if normalized != E2B_WORKDIR and not normalized.startswith(E2B_WORKDIR + "/"):
raise ValueError(f"Path must be within {E2B_WORKDIR}: {path}")
if not is_within_allowed_dirs(normalized):
raise ValueError(
f"Path must be within {E2B_ALLOWED_DIRS_STR}: {os.path.basename(path)}"
)
return normalized
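
The widened path check in the hunk above can be tried in isolation. The following is a minimal sketch, not the project's code: it uses `posixpath` instead of `os.path` so the behaviour is the same on any host OS, and the constant and helper names (`WORKDIR`, `ALLOWED_DIRS`, `is_rejected`) are illustrative assumptions.

```python
import posixpath

# Assumed constants, mirroring the values shown in the diff.
WORKDIR = "/home/user"
ALLOWED_DIRS = (WORKDIR, "/tmp")


def is_within_allowed_dirs(path: str) -> bool:
    """Return True if path equals an allowed directory or sits below one."""
    # Appending "/" before the prefix test is what rejects "/tmp_evil".
    return any(path == d or path.startswith(d + "/") for d in ALLOWED_DIRS)


def resolve_sandbox_path(path: str) -> str:
    """Normalise path and reject anything outside the allowed directories."""
    candidate = path if posixpath.isabs(path) else posixpath.join(WORKDIR, path)
    normalized = posixpath.normpath(candidate)  # collapses "." and ".."
    if not is_within_allowed_dirs(normalized):
        raise ValueError(f"Path must be within {' or '.join(ALLOWED_DIRS)}")
    return normalized


def is_rejected(path: str) -> bool:
    """Hypothetical helper: True when resolve_sandbox_path raises."""
    try:
        resolve_sandbox_path(path)
    except ValueError:
        return True
    return False
```

Normalisation here is purely lexical, so a check of this shape does not follow symlinks; it only guards against `..` traversal and prefix collisions such as `/tmp_evil`.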


@@ -11,6 +11,7 @@ import pytest
from backend.copilot.context import (
SDK_PROJECTS_DIR,
_current_project_dir,
get_current_permissions,
get_current_sandbox,
get_execution_context,
get_sdk_cwd,
@@ -18,6 +19,7 @@ from backend.copilot.context import (
resolve_sandbox_path,
set_execution_context,
)
from backend.copilot.permissions import CopilotPermissions
def _make_session() -> MagicMock:
@@ -61,6 +63,19 @@ def test_get_current_sandbox_returns_set_value():
assert get_current_sandbox() is mock_sandbox
def test_set_and_get_current_permissions():
"""set_execution_context stores permissions; get_current_permissions returns it."""
perms = CopilotPermissions(tools=["run_block"], tools_exclude=False)
set_execution_context("u1", _make_session(), permissions=perms)
assert get_current_permissions() is perms
def test_get_current_permissions_defaults_to_none():
"""get_current_permissions returns None when no permissions have been set."""
set_execution_context("u1", _make_session())
assert get_current_permissions() is None
def test_get_sdk_cwd_empty_when_not_set():
"""get_sdk_cwd returns empty string when sdk_cwd is not set."""
set_execution_context("u1", _make_session(), sdk_cwd=None)
@@ -183,10 +198,32 @@ def test_resolve_sandbox_path_normalizes_dots():
def test_resolve_sandbox_path_escape_raises():
with pytest.raises(ValueError, match="/home/user"):
with pytest.raises(ValueError, match="must be within"):
resolve_sandbox_path("/home/user/../../etc/passwd")
def test_resolve_sandbox_path_absolute_outside_raises():
with pytest.raises(ValueError, match="/home/user"):
with pytest.raises(ValueError):
resolve_sandbox_path("/etc/passwd")
def test_resolve_sandbox_path_tmp_allowed():
assert resolve_sandbox_path("/tmp/data.txt") == "/tmp/data.txt"
def test_resolve_sandbox_path_tmp_nested():
assert resolve_sandbox_path("/tmp/a/b/c.txt") == "/tmp/a/b/c.txt"
def test_resolve_sandbox_path_tmp_itself():
assert resolve_sandbox_path("/tmp") == "/tmp"
def test_resolve_sandbox_path_tmp_escape_raises():
with pytest.raises(ValueError):
resolve_sandbox_path("/tmp/../etc/passwd")
def test_resolve_sandbox_path_tmp_prefix_collision_raises():
with pytest.raises(ValueError):
resolve_sandbox_path("/tmp_evil/malicious.txt")
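
The permission tests above rely on the value being stored per execution rather than globally. A minimal sketch of that pattern with the standard `contextvars` module follows; the plain `dict` stands in for the real permissions object, and `set_permissions`, `get_permissions` and `run_isolated` are hypothetical names chosen for illustration.

```python
import contextvars
from typing import Optional

# Hypothetical stand-in for the real per-execution permissions variable.
_current_permissions: contextvars.ContextVar[Optional[dict]] = contextvars.ContextVar(
    "current_permissions", default=None
)


def set_permissions(perms: Optional[dict]) -> None:
    """Store the permissions for the current context only."""
    _current_permissions.set(perms)


def get_permissions() -> Optional[dict]:
    """Return the permissions for the current context, or None if unset."""
    return _current_permissions.get()


def run_isolated(perms: Optional[dict]) -> Optional[dict]:
    """Set and read permissions inside a copy of the current context."""

    def body() -> Optional[dict]:
        set_permissions(perms)
        return get_permissions()

    # Changes made inside the copied context do not leak to the caller.
    return contextvars.copy_context().run(body)
```

Because the value is set inside a copied context, the caller's own context still reports `None` afterwards, which is the same default the second test asserts.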


@@ -18,7 +18,13 @@ from prisma.types import (
from backend.data import db
from backend.util.json import SafeJson, sanitize_string
from .model import ChatMessage, ChatSession, ChatSessionInfo
from .model import (
ChatMessage,
ChatSession,
ChatSessionInfo,
ChatSessionMetadata,
invalidate_session_cache,
)
logger = logging.getLogger(__name__)
@@ -35,6 +41,7 @@ async def get_chat_session(session_id: str) -> ChatSession | None:
async def create_chat_session(
session_id: str,
user_id: str,
metadata: ChatSessionMetadata | None = None,
) -> ChatSessionInfo:
"""Create a new chat session in the database."""
data = ChatSessionCreateInput(
@@ -43,6 +50,7 @@ async def create_chat_session(
credentials=SafeJson({}),
successfulAgentRuns=SafeJson({}),
successfulAgentSchedules=SafeJson({}),
metadata=SafeJson((metadata or ChatSessionMetadata()).model_dump()),
)
prisma_session = await PrismaChatSession.prisma().create(data=data)
return ChatSessionInfo.from_db(prisma_session)
@@ -57,7 +65,12 @@ async def update_chat_session(
total_completion_tokens: int | None = None,
title: str | None = None,
) -> ChatSession | None:
"""Update a chat session's metadata."""
"""Update a chat session's mutable fields.
Note: ``metadata`` (which includes ``dry_run``) is intentionally omitted —
it is set once at creation time and treated as immutable for the lifetime
of the session.
"""
data: ChatSessionUpdateInput = {"updatedAt": datetime.now(UTC)}
if credentials is not None:
@@ -217,6 +230,9 @@ async def add_chat_messages_batch(
if msg.get("function_call") is not None:
data["functionCall"] = SafeJson(msg["function_call"])
if msg.get("duration_ms") is not None:
data["durationMs"] = msg["duration_ms"]
messages_data.append(data)
# Run create_many and session update in parallel within transaction
@@ -359,3 +375,22 @@ async def update_tool_message_content(
f"tool_call_id {tool_call_id}: {e}"
)
return False
async def set_turn_duration(session_id: str, duration_ms: int) -> None:
"""Set durationMs on the last assistant message in a session.
Also invalidates the Redis session cache so the next GET returns
the updated duration.
"""
last_msg = await PrismaChatMessage.prisma().find_first(
where={"sessionId": session_id, "role": "assistant"},
order={"sequence": "desc"},
)
if last_msg:
await PrismaChatMessage.prisma().update(
where={"id": last_msg.id},
data={"durationMs": duration_ms},
)
# Invalidate cache so the session is re-fetched from DB with durationMs
await invalidate_session_cache(session_id)


@@ -14,7 +14,7 @@ import time
from backend.copilot import stream_registry
from backend.copilot.baseline import stream_chat_completion_baseline
from backend.copilot.config import ChatConfig
from backend.copilot.response_model import StreamFinish
from backend.copilot.response_model import StreamError
from backend.copilot.sdk import service as sdk_service
from backend.copilot.sdk.dummy import stream_chat_completion_dummy
from backend.executor.cluster_lock import ClusterLock
@@ -23,6 +23,7 @@ from backend.util.feature_flag import Flag, is_feature_enabled
from backend.util.logging import TruncatedLogger, configure_logging
from backend.util.process import set_service_name
from backend.util.retry import func_retry
from backend.util.workspace_storage import shutdown_workspace_storage
from .utils import CoPilotExecutionEntry, CoPilotLogMetadata
@@ -153,8 +154,6 @@ class CoPilotProcessor:
worker's event loop, ensuring ``aiohttp.ClientSession.close()``
runs on the same loop that created the session.
"""
from backend.util.workspace_storage import shutdown_workspace_storage
coro = shutdown_workspace_storage()
try:
future = asyncio.run_coroutine_threadsafe(coro, self.execution_loop)
@@ -268,35 +267,37 @@ class CoPilotProcessor:
log.info(f"Using {'SDK' if use_sdk else 'baseline'} service")
# Stream chat completion and publish chunks to Redis.
async for chunk in stream_fn(
# stream_and_publish wraps the raw stream with registry
# publishing (shared with collect_copilot_response).
raw_stream = stream_fn(
session_id=entry.session_id,
message=entry.message if entry.message else None,
is_user_message=entry.is_user_message,
user_id=entry.user_id,
context=entry.context,
file_ids=entry.file_ids,
)
async for chunk in stream_registry.stream_and_publish(
session_id=entry.session_id,
turn_id=entry.turn_id,
stream=raw_stream,
):
if cancel.is_set():
log.info("Cancel requested, breaking stream")
break
# Capture StreamError so mark_session_completed receives
# the error message (stream_and_publish yields but does
# not publish StreamError — that's done by mark_session_completed).
if isinstance(chunk, StreamError):
error_msg = chunk.errorText
break
current_time = time.monotonic()
if current_time - last_refresh >= refresh_interval:
cluster_lock.refresh()
last_refresh = current_time
# Skip StreamFinish — mark_session_completed publishes it.
if isinstance(chunk, StreamFinish):
continue
try:
await stream_registry.publish_chunk(entry.turn_id, chunk)
except Exception as e:
log.error(
f"Error publishing chunk {type(chunk).__name__}: {e}",
exc_info=True,
)
# Stream loop completed
if cancel.is_set():
log.info("Stream cancelled by user")


@@ -123,6 +123,7 @@ async def get_provider_token(user_id: str, provider: str) -> str | None:
[c for c in creds_list if c.type == "oauth2"],
key=lambda c: 0 if "repo" in (cast(OAuth2Credentials, c).scopes or []) else 1,
)
refresh_failed = False
for creds in oauth2_creds:
if creds.type == "oauth2":
try:
@@ -141,6 +142,7 @@ async def get_provider_token(user_id: str, provider: str) -> str | None:
# Do NOT fall back to the stale token — it is likely expired
# or revoked. Returning None forces the caller to re-auth,
# preventing the LLM from receiving a non-functional token.
refresh_failed = True
continue
_token_cache[cache_key] = token
return token
@@ -152,8 +154,12 @@ async def get_provider_token(user_id: str, provider: str) -> str | None:
_token_cache[cache_key] = token
return token
# No credentials found — cache to avoid repeated DB hits.
_null_cache[cache_key] = True
# Only cache "not connected" when the user truly has no credentials for this
# provider. If we had OAuth credentials but refresh failed (e.g. transient
# network error, event-loop mismatch), do NOT cache the negative result —
# the next call should retry the refresh instead of being blocked for 60 s.
if not refresh_failed:
_null_cache[cache_key] = True
return None


@@ -129,8 +129,15 @@ class TestGetProviderToken:
assert result == "oauth-tok"
@pytest.mark.asyncio(loop_scope="session")
async def test_oauth2_refresh_failure_returns_none(self):
"""On refresh failure, return None instead of caching a stale token."""
async def test_oauth2_refresh_failure_returns_none_without_null_cache(self):
"""On refresh failure, return None but do NOT cache in null_cache.
The user has credentials — they just couldn't be refreshed right now
(e.g. transient network error or event-loop mismatch in the copilot
executor). Caching a negative result would block all credential
lookups for 60 s even though the creds exist and may refresh fine
on the next attempt.
"""
oauth_creds = _make_oauth2_creds("stale-oauth-tok")
mock_manager = MagicMock()
mock_manager.store.get_creds_by_provider = AsyncMock(return_value=[oauth_creds])
@@ -141,6 +148,8 @@ class TestGetProviderToken:
# Stale tokens must NOT be returned — forces re-auth.
assert result is None
# Must NOT cache negative result when refresh failed — next call retries.
assert (_USER, _PROVIDER) not in _null_cache
@pytest.mark.asyncio(loop_scope="session")
async def test_no_credentials_caches_null_entry(self):
@@ -176,6 +185,96 @@ class TestGetProviderToken:
assert _NULL_CACHE_TTL < _TOKEN_CACHE_TTL
class TestThreadSafetyLocks:
"""Bug reproduction: shared AsyncRedisKeyedMutex across threads caused
'Future attached to a different loop' when copilot workers accessed
credentials from different event loops."""
@pytest.mark.asyncio(loop_scope="session")
async def test_store_locks_returns_per_thread_instance(self):
"""IntegrationCredentialsStore.locks() must return different instances
for different threads (via @thread_cached)."""
import asyncio
import concurrent.futures
from backend.integrations.credentials_store import IntegrationCredentialsStore
store = IntegrationCredentialsStore()
async def get_locks_id():
mock_redis = AsyncMock()
with patch(
"backend.integrations.credentials_store.get_redis_async",
return_value=mock_redis,
):
locks = await store.locks()
return id(locks)
# Get locks from main thread
main_id = await get_locks_id()
# Get locks from a worker thread
def run_in_thread():
loop = asyncio.new_event_loop()
try:
return loop.run_until_complete(get_locks_id())
finally:
loop.close()
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
worker_id = await asyncio.get_event_loop().run_in_executor(
pool, run_in_thread
)
assert main_id != worker_id, (
"Store.locks() returned the same instance across threads. "
"This would cause 'Future attached to a different loop' errors."
)
@pytest.mark.asyncio(loop_scope="session")
async def test_manager_delegates_to_store_locks(self):
"""IntegrationCredentialsManager.locks() should delegate to store."""
from backend.integrations.creds_manager import IntegrationCredentialsManager
manager = IntegrationCredentialsManager()
mock_redis = AsyncMock()
with patch(
"backend.integrations.credentials_store.get_redis_async",
return_value=mock_redis,
):
locks = await manager.locks()
# Should have gotten it from the store
assert locks is not None
class TestRefreshUnlockedPath:
"""Bug reproduction: copilot worker threads need lock-free refresh because
Redis-backed asyncio.Lock created on one event loop can't be used on another."""
@pytest.mark.asyncio(loop_scope="session")
async def test_refresh_if_needed_lock_false_skips_redis(self):
"""refresh_if_needed(lock=False) must not touch Redis locks at all."""
from backend.integrations.creds_manager import IntegrationCredentialsManager
manager = IntegrationCredentialsManager()
creds = _make_oauth2_creds()
mock_handler = MagicMock()
mock_handler.needs_refresh = MagicMock(return_value=False)
with patch(
"backend.integrations.creds_manager._get_provider_oauth_handler",
new_callable=AsyncMock,
return_value=mock_handler,
):
result = await manager.refresh_if_needed(_USER, creds, lock=False)
# Should return credentials without touching locks
assert result.id == creds.id
class TestGetIntegrationEnvVars:
@pytest.mark.asyncio(loop_scope="session")
async def test_injects_all_env_vars_for_provider(self):


@@ -46,6 +46,16 @@ def _get_session_cache_key(session_id: str) -> str:
# ===================== Chat data models ===================== #
class ChatSessionMetadata(BaseModel):
"""Typed metadata stored in the ``metadata`` JSON column of ChatSession.
Add new session-level flags here instead of adding DB columns —
no migration required for new fields as long as a default is provided.
"""
dry_run: bool = False
class ChatMessage(BaseModel):
role: str
content: str | None = None
@@ -54,6 +64,7 @@ class ChatMessage(BaseModel):
refusal: str | None = None
tool_calls: list[dict] | None = None
function_call: dict | None = None
duration_ms: int | None = None
@staticmethod
def from_db(prisma_message: PrismaChatMessage) -> "ChatMessage":
@@ -66,6 +77,7 @@ class ChatMessage(BaseModel):
refusal=prisma_message.refusal,
tool_calls=_parse_json_field(prisma_message.toolCalls),
function_call=_parse_json_field(prisma_message.functionCall),
duration_ms=prisma_message.durationMs,
)
@@ -88,6 +100,12 @@ class ChatSessionInfo(BaseModel):
updated_at: datetime
successful_agent_runs: dict[str, int] = {}
successful_agent_schedules: dict[str, int] = {}
metadata: ChatSessionMetadata = ChatSessionMetadata()
@property
def dry_run(self) -> bool:
"""Convenience accessor for ``metadata.dry_run``."""
return self.metadata.dry_run
@classmethod
def from_db(cls, prisma_session: PrismaChatSession) -> Self:
@@ -101,6 +119,10 @@ class ChatSessionInfo(BaseModel):
prisma_session.successfulAgentSchedules, default={}
)
# Parse typed metadata from the JSON column.
raw_metadata = _parse_json_field(prisma_session.metadata, default={})
metadata = ChatSessionMetadata.model_validate(raw_metadata)
# Calculate usage from token counts.
# NOTE: Per-turn cache_read_tokens / cache_creation_tokens breakdown
# is lost after persistence — the DB only stores aggregate prompt and
@@ -126,6 +148,7 @@ class ChatSessionInfo(BaseModel):
updated_at=prisma_session.updatedAt,
successful_agent_runs=successful_agent_runs,
successful_agent_schedules=successful_agent_schedules,
metadata=metadata,
)
@@ -133,7 +156,7 @@ class ChatSession(ChatSessionInfo):
messages: list[ChatMessage]
@classmethod
def new(cls, user_id: str) -> Self:
def new(cls, user_id: str, *, dry_run: bool) -> Self:
return cls(
session_id=str(uuid.uuid4()),
user_id=user_id,
@@ -143,6 +166,7 @@ class ChatSession(ChatSessionInfo):
credentials={},
started_at=datetime.now(UTC),
updated_at=datetime.now(UTC),
metadata=ChatSessionMetadata(dry_run=dry_run),
)
@classmethod
@@ -530,6 +554,7 @@ async def _save_session_to_db(
await db.create_chat_session(
session_id=session.session_id,
user_id=session.user_id,
metadata=session.metadata,
)
existing_message_count = 0
@@ -607,21 +632,27 @@ async def append_and_save_message(session_id: str, message: ChatMessage) -> Chat
return session
async def create_chat_session(user_id: str) -> ChatSession:
async def create_chat_session(user_id: str, *, dry_run: bool) -> ChatSession:
"""Create a new chat session and persist it.
Args:
user_id: The authenticated user ID.
dry_run: When True, run_block and run_agent tool calls in this
session are forced to use dry-run simulation mode.
Raises:
DatabaseError: If the database write fails. We fail fast to ensure
callers never receive a non-persisted session that only exists
in cache (which would be lost when the cache expires).
"""
session = ChatSession.new(user_id)
session = ChatSession.new(user_id, dry_run=dry_run)
# Create in database first - fail fast if this fails
try:
await chat_db().create_chat_session(
session_id=session.session_id,
user_id=user_id,
metadata=session.metadata,
)
except Exception as e:
logger.error(f"Failed to create session {session.session_id} in database: {e}")


@@ -46,7 +46,7 @@ messages = [
@pytest.mark.asyncio(loop_scope="session")
async def test_chatsession_serialization_deserialization():
s = ChatSession.new(user_id="abc123")
s = ChatSession.new(user_id="abc123", dry_run=False)
s.messages = messages
s.usage = [Usage(prompt_tokens=100, completion_tokens=200, total_tokens=300)]
serialized = s.model_dump_json()
@@ -57,7 +57,7 @@ async def test_chatsession_serialization_deserialization():
@pytest.mark.asyncio(loop_scope="session")
async def test_chatsession_redis_storage(setup_test_user, test_user_id):
s = ChatSession.new(user_id=test_user_id)
s = ChatSession.new(user_id=test_user_id, dry_run=False)
s.messages = messages
s = await upsert_chat_session(s)
@@ -75,7 +75,7 @@ async def test_chatsession_redis_storage_user_id_mismatch(
setup_test_user, test_user_id
):
s = ChatSession.new(user_id=test_user_id)
s = ChatSession.new(user_id=test_user_id, dry_run=False)
s.messages = messages
s = await upsert_chat_session(s)
@@ -90,7 +90,7 @@ async def test_chatsession_db_storage(setup_test_user, test_user_id):
from backend.data.redis_client import get_redis_async
# Create session with messages including assistant message
s = ChatSession.new(user_id=test_user_id)
s = ChatSession.new(user_id=test_user_id, dry_run=False)
s.messages = messages # Contains user, assistant, and tool messages
assert s.session_id is not None, "Session id is not set"
# Upsert to save to both cache and DB
@@ -241,7 +241,7 @@ _raw_tc2 = {
def test_add_tool_call_appends_to_existing_assistant():
"""When the last assistant is from the current turn, tool_call is added to it."""
session = ChatSession.new(user_id="u")
session = ChatSession.new(user_id="u", dry_run=False)
session.messages = [
ChatMessage(role="user", content="hi"),
ChatMessage(role="assistant", content="working on it"),
@@ -254,7 +254,7 @@ def test_add_tool_call_appends_to_existing_assistant():
def test_add_tool_call_creates_assistant_when_none_exists():
"""When there's no current-turn assistant, a new one is created."""
session = ChatSession.new(user_id="u")
session = ChatSession.new(user_id="u", dry_run=False)
session.messages = [
ChatMessage(role="user", content="hi"),
]
@@ -267,7 +267,7 @@ def test_add_tool_call_creates_assistant_when_none_exists():
def test_add_tool_call_does_not_cross_user_boundary():
"""A user message acts as a boundary — previous assistant is not modified."""
session = ChatSession.new(user_id="u")
session = ChatSession.new(user_id="u", dry_run=False)
session.messages = [
ChatMessage(role="assistant", content="old turn"),
ChatMessage(role="user", content="new message"),
@@ -282,7 +282,7 @@ def test_add_tool_call_does_not_cross_user_boundary():
def test_add_tool_call_multiple_times():
"""Multiple long-running tool calls accumulate on the same assistant."""
session = ChatSession.new(user_id="u")
session = ChatSession.new(user_id="u", dry_run=False)
session.messages = [
ChatMessage(role="user", content="hi"),
ChatMessage(role="assistant", content="doing stuff"),
@@ -300,7 +300,7 @@ def test_add_tool_call_multiple_times():
def test_to_openai_messages_merges_split_assistants():
"""End-to-end: session with split assistants produces valid OpenAI messages."""
session = ChatSession.new(user_id="u")
session = ChatSession.new(user_id="u", dry_run=False)
session.messages = [
ChatMessage(role="user", content="build agent"),
ChatMessage(role="assistant", content="Let me build that"),
@@ -352,7 +352,7 @@ async def test_concurrent_saves_collision_detection(setup_test_user, test_user_i
import asyncio
# Create a session with initial messages
session = ChatSession.new(user_id=test_user_id)
session = ChatSession.new(user_id=test_user_id, dry_run=False)
for i in range(3):
session.messages.append(
ChatMessage(
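
The permissions module added in the next file describes blacklist/whitelist semantics for tools and a parent/child merge that can only narrow access. A minimal sketch of the tool half, assuming a plain dataclass in place of the Pydantic model and a made-up four-tool universe, might look like this:

```python
from dataclasses import dataclass, field


@dataclass
class Permissions:
    """Simplified stand-in for CopilotPermissions (tools only)."""

    tools: list[str] = field(default_factory=list)
    tools_exclude: bool = True  # True: blacklist, False: whitelist

    def effective_allowed(self, all_tools: frozenset[str]) -> frozenset[str]:
        if not self.tools:
            return all_tools  # empty list means no filtering
        listed = frozenset(self.tools)
        return all_tools - listed if self.tools_exclude else all_tools & listed

    def merged_with_parent(
        self, parent: "Permissions", all_tools: frozenset[str]
    ) -> "Permissions":
        # Intersect both effective sets and store the result as a whitelist.
        merged = self.effective_allowed(all_tools) & parent.effective_allowed(all_tools)
        return Permissions(tools=sorted(merged), tools_exclude=False)


# Illustrative universe; the real one is derived from the ToolName Literal.
ALL = frozenset({"run_block", "web_fetch", "bash_exec", "Read"})
parent = Permissions(tools=["bash_exec"], tools_exclude=True)  # deny bash_exec
child = Permissions(tools=["run_block", "bash_exec"], tools_exclude=False)
merged = child.merged_with_parent(parent, ALL)
```

Here the child asks for `run_block` and `bash_exec`, the parent denies `bash_exec`, and the merge keeps only `run_block`. One consequence of the empty-list rule is worth keeping in mind: in this sketch an empty intersection would be stored as an empty whitelist, which `effective_allowed` treats as "no filtering"; whether the real code guards against that case elsewhere is not visible in this part of the diff.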


@@ -0,0 +1,431 @@
"""Copilot execution permissions — tool and block allow/deny filtering.
:class:`CopilotPermissions` is the single model used everywhere:
- ``AutoPilotBlock`` reads four block-input fields and builds one instance.
- ``stream_chat_completion_sdk`` applies it when constructing
``ClaudeAgentOptions.allowed_tools`` / ``disallowed_tools``.
- ``run_block`` reads it from the contextvar to gate block execution.
- Recursive (sub-agent) invocations merge parent and child so children
can only be *more* restrictive, never more permissive.
Tool names
----------
Users specify the **short name** as it appears in ``TOOL_REGISTRY`` (e.g.
``run_block``, ``web_fetch``) or as an SDK built-in (e.g. ``Read``,
``Task``, ``WebSearch``). Internally these are mapped to the full SDK
format (``mcp__copilot__run_block``, ``Read``, …) by
:func:`apply_tool_permissions`.
Block identifiers
-----------------
Each entry in ``blocks`` may be one of:
- A **full UUID** (``c069dc6b-c3ed-4c12-b6e5-d47361e64ce6``)
- A **partial UUID** — the first 8-character hex segment (``c069dc6b``)
- A **block name** (case-insensitive, e.g. ``"HTTP Request"``)
:func:`validate_block_identifiers` resolves all entries against the live
block registry and returns any that could not be matched.
Semantics
---------
``tools_exclude=True`` (default) — ``tools`` is a **blacklist**; listed
tools are denied and everything else is allowed. An empty list means
"allow all" (no filtering).
``tools_exclude=False`` — ``tools`` is a **whitelist**; only listed tools
are allowed.
``blocks_exclude`` follows the same pattern for ``blocks``.
Recursion inheritance
---------------------
:meth:`CopilotPermissions.merged_with_parent` produces a new instance that
is at most as permissive as the parent:
- Tools: effective-allowed sets are intersected then stored as a whitelist.
- Blocks: the parent is stored in ``_parent`` and consulted during every
:meth:`is_block_allowed` call so both constraints must pass.
"""
from __future__ import annotations
import re
from typing import Literal, get_args
from pydantic import BaseModel, PrivateAttr
# ---------------------------------------------------------------------------
# Constants — single source of truth for all accepted tool names
# ---------------------------------------------------------------------------
# Literal type combining all valid tool names — used by AutoPilotBlock.Input
# so the frontend renders a multi-select dropdown.
# This is the SINGLE SOURCE OF TRUTH. All other name sets are derived from it.
ToolName = Literal[
# Platform tools (must match keys in TOOL_REGISTRY)
"add_understanding",
"ask_question",
"bash_exec",
"browser_act",
"browser_navigate",
"browser_screenshot",
"connect_integration",
"continue_run_block",
"create_agent",
"create_feature_request",
"create_folder",
"customize_agent",
"delete_folder",
"delete_workspace_file",
"edit_agent",
"find_agent",
"find_block",
"find_library_agent",
"fix_agent_graph",
"get_agent_building_guide",
"get_doc_page",
"get_mcp_guide",
"list_folders",
"list_workspace_files",
"move_agents_to_folder",
"move_folder",
"read_workspace_file",
"run_agent",
"run_block",
"run_mcp_tool",
"search_docs",
"search_feature_requests",
"update_folder",
"validate_agent_graph",
"view_agent_output",
"web_fetch",
"write_workspace_file",
# SDK built-ins
"Edit",
"Glob",
"Grep",
"Read",
"Task",
"TodoWrite",
"WebSearch",
"Write",
]
# Frozen set of all valid tool names — derived from the Literal.
ALL_TOOL_NAMES: frozenset[str] = frozenset(get_args(ToolName))
# SDK built-in tool names — uppercase-initial names are SDK built-ins.
SDK_BUILTIN_TOOL_NAMES: frozenset[str] = frozenset(
n for n in ALL_TOOL_NAMES if n[0].isupper()
)
# Platform tool names — everything that isn't an SDK built-in.
PLATFORM_TOOL_NAMES: frozenset[str] = ALL_TOOL_NAMES - SDK_BUILTIN_TOOL_NAMES
# Compiled regex patterns for block identifier classification.
_FULL_UUID_RE = re.compile(
r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
re.IGNORECASE,
)
_PARTIAL_UUID_RE = re.compile(r"^[0-9a-f]{8}$", re.IGNORECASE)
# ---------------------------------------------------------------------------
# Helper — block identifier matching
# ---------------------------------------------------------------------------
def _block_matches(identifier: str, block_id: str, block_name: str) -> bool:
"""Return True if *identifier* resolves to the given block.
Resolution order:
1. Full UUID — exact case-insensitive match against *block_id*.
2. Partial UUID (8 hex chars, first segment) — prefix match.
3. Name — case-insensitive equality against *block_name*.
"""
ident = identifier.strip()
if _FULL_UUID_RE.match(ident):
return ident.lower() == block_id.lower()
if _PARTIAL_UUID_RE.match(ident):
return block_id.lower().startswith(ident.lower())
return ident.lower() == block_name.lower()
# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------
class CopilotPermissions(BaseModel):
"""Capability filter for a single copilot execution.
Attributes:
tools: Tool names to filter (short names, e.g. ``run_block``).
tools_exclude: When True (default) ``tools`` is a blacklist;
when False it is a whitelist. Ignored when *tools* is empty.
blocks: Block identifiers (name, full UUID, or 8-char partial UUID).
blocks_exclude: Same semantics as *tools_exclude* but for blocks.
"""
tools: list[str] = []
tools_exclude: bool = True
blocks: list[str] = []
blocks_exclude: bool = True
# Private: parent permissions for recursion inheritance.
# Set only by merged_with_parent(); never exposed in block input schema.
_parent: CopilotPermissions | None = PrivateAttr(default=None)
# ------------------------------------------------------------------
# Tool helpers
# ------------------------------------------------------------------
def effective_allowed_tools(self, all_tools: frozenset[str]) -> frozenset[str]:
"""Compute the set of short tool names that are permitted.
Args:
all_tools: Universe of valid short tool names.
Returns:
Subset of *all_tools* that pass the filter.
"""
if not self.tools:
return frozenset(all_tools)
tool_set = frozenset(self.tools)
if self.tools_exclude:
return all_tools - tool_set
return all_tools & tool_set
# ------------------------------------------------------------------
# Block helpers
# ------------------------------------------------------------------
def is_block_allowed(self, block_id: str, block_name: str) -> bool:
"""Return True if the block may be executed under these permissions.
Checks this instance first, then consults the parent (if any) so
the entire inheritance chain is respected.
"""
if not self._check_block_locally(block_id, block_name):
return False
if self._parent is not None:
return self._parent.is_block_allowed(block_id, block_name)
return True
def _check_block_locally(self, block_id: str, block_name: str) -> bool:
"""Check *only* this instance's block filter (ignores parent)."""
if not self.blocks:
return True # No filter → allow all
matched = any(
_block_matches(identifier, block_id, block_name)
for identifier in self.blocks
)
return not matched if self.blocks_exclude else matched
# ------------------------------------------------------------------
# Recursion / merging
# ------------------------------------------------------------------
def merged_with_parent(
self,
parent: CopilotPermissions,
all_tools: frozenset[str],
) -> CopilotPermissions:
"""Return a new instance that is at most as permissive as *parent*.
- Tools: intersection of effective-allowed sets, stored as a whitelist.
- Blocks: parent is stored internally; both constraints are applied
during :meth:`is_block_allowed`.
"""
merged_tools = self.effective_allowed_tools(
all_tools
) & parent.effective_allowed_tools(all_tools)
result = CopilotPermissions(
tools=sorted(merged_tools),
tools_exclude=False,
blocks=self.blocks,
blocks_exclude=self.blocks_exclude,
)
result._parent = parent
return result
# ------------------------------------------------------------------
# Convenience
# ------------------------------------------------------------------
def is_empty(self) -> bool:
"""Return True when no filtering is configured (allow-all passthrough)."""
return not self.tools and not self.blocks and self._parent is None
# ---------------------------------------------------------------------------
# Validation helpers
# ---------------------------------------------------------------------------
def all_known_tool_names() -> frozenset[str]:
"""Return all short tool names accepted in *tools*.
Returns the pre-computed ``ALL_TOOL_NAMES`` set (derived from the
``ToolName`` Literal). On first call, also verifies consistency with
the live ``TOOL_REGISTRY``.
"""
_assert_tool_names_consistent()
return ALL_TOOL_NAMES
def validate_tool_names(tools: list[str]) -> list[str]:
"""Return entries in *tools* that are not valid tool names.
Args:
tools: List of short tool name strings to validate.
Returns:
List of invalid names (empty if all are valid).
"""
return [t for t in tools if t not in ALL_TOOL_NAMES]
_tool_names_checked = False
def _assert_tool_names_consistent() -> None:
"""Verify that ``PLATFORM_TOOL_NAMES`` matches ``TOOL_REGISTRY`` keys.
Called once lazily (TOOL_REGISTRY has heavy imports). Raises
``AssertionError`` with a helpful diff if they diverge.
"""
global _tool_names_checked
if _tool_names_checked:
return
_tool_names_checked = True
from backend.copilot.tools import TOOL_REGISTRY
registry_keys: frozenset[str] = frozenset(TOOL_REGISTRY.keys())
declared: frozenset[str] = PLATFORM_TOOL_NAMES
if registry_keys != declared:
missing = registry_keys - declared
extra = declared - registry_keys
parts: list[str] = [
"PLATFORM_TOOL_NAMES in permissions.py is out of sync with TOOL_REGISTRY."
]
if missing:
parts.append(f" Missing from PLATFORM_TOOL_NAMES: {sorted(missing)}")
if extra:
parts.append(f" Extra in PLATFORM_TOOL_NAMES: {sorted(extra)}")
parts.append(" Update the ToolName Literal to match.")
raise AssertionError("\n".join(parts))
async def validate_block_identifiers(
identifiers: list[str],
) -> list[str]:
"""Resolve each block identifier and return those that could not be matched.
Args:
identifiers: List of block identifiers (name, full UUID, or partial UUID).
Returns:
List of identifiers that matched no known block.
"""
from backend.blocks import get_blocks
# get_blocks() returns dict[block_id_str, BlockClass]; instantiate once to get names.
block_registry = get_blocks()
block_info = {bid: cls().name for bid, cls in block_registry.items()}
invalid: list[str] = []
for ident in identifiers:
matched = any(
_block_matches(ident, bid, bname) for bid, bname in block_info.items()
)
if not matched:
invalid.append(ident)
return invalid
# ---------------------------------------------------------------------------
# SDK tool-list application
# ---------------------------------------------------------------------------
def apply_tool_permissions(
    permissions: CopilotPermissions,
    *,
    use_e2b: bool = False,
) -> tuple[list[str], list[str]]:
    """Compute (allowed_tools, extra_disallowed) for :class:`ClaudeAgentOptions`.

    Takes the base allowed/disallowed lists from
    :func:`~backend.copilot.sdk.tool_adapter.get_copilot_tool_names` /
    :func:`~backend.copilot.sdk.tool_adapter.get_sdk_disallowed_tools` and
    applies *permissions* on top.

    Returns:
        ``(allowed_tools, extra_disallowed)`` where *allowed_tools* is the
        possibly-narrowed list to pass to ``ClaudeAgentOptions.allowed_tools``
        and *extra_disallowed* is the list to pass to
        ``ClaudeAgentOptions.disallowed_tools``.
    """
    from backend.copilot.sdk.tool_adapter import (
        _READ_TOOL_NAME,
        MCP_TOOL_PREFIX,
        get_copilot_tool_names,
        get_sdk_disallowed_tools,
    )
    from backend.copilot.tools import TOOL_REGISTRY

    base_allowed = get_copilot_tool_names(use_e2b=use_e2b)
    base_disallowed = get_sdk_disallowed_tools(use_e2b=use_e2b)

    if permissions.is_empty():
        return base_allowed, base_disallowed

    all_tools = all_known_tool_names()
    effective = permissions.effective_allowed_tools(all_tools)

    # In E2B mode, SDK built-in file tools (Read, Write, Edit, Glob, Grep)
    # are replaced by MCP equivalents (read_file, write_file, ...).
    # Map each SDK built-in name to its E2B MCP name so users can use the
    # familiar names in their permissions and the E2B tools are included.
    _SDK_TO_E2B: dict[str, str] = {}
    if use_e2b:
        from backend.copilot.sdk.e2b_file_tools import E2B_FILE_TOOL_NAMES

        _SDK_TO_E2B = dict(
            zip(
                ["Read", "Write", "Edit", "Glob", "Grep"],
                E2B_FILE_TOOL_NAMES,
                strict=False,
            )
        )

    # Build an updated allowed list by mapping short names → SDK names and
    # keeping only those present in the original base_allowed list.
    def to_sdk_names(short: str) -> list[str]:
        names: list[str] = []
        if short in TOOL_REGISTRY:
            names.append(f"{MCP_TOOL_PREFIX}{short}")
        elif short in _SDK_TO_E2B:
            # E2B mode: map SDK built-in file tool to its MCP equivalent.
            names.append(f"{MCP_TOOL_PREFIX}{_SDK_TO_E2B[short]}")
        else:
            names.append(short)  # SDK built-in — used as-is
        return names

    # short names permitted by permissions
    permitted_sdk: set[str] = set()
    for s in effective:
        permitted_sdk.update(to_sdk_names(s))
    # Always include the internal Read tool (used by SDK for large/truncated outputs)
    permitted_sdk.add(f"{MCP_TOOL_PREFIX}{_READ_TOOL_NAME}")

    filtered_allowed = [t for t in base_allowed if t in permitted_sdk]

    # Extra disallowed = tools that were in base_allowed but are now removed
    removed = set(base_allowed) - set(filtered_allowed)
    extra_disallowed = list(set(base_disallowed) | removed)
    return filtered_allowed, extra_disallowed
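
Review note: the narrowing step in `apply_tool_permissions` can be tried in isolation. The sketch below is a simplified stand-in, not the function from this PR: `filter_allowed`, `MCP_PREFIX`, and `READ_TOOL` are hypothetical names, the prefix value is taken from the tool names used in the tests, the E2B mapping is left out, and the disallowed list is sorted only to make the result deterministic.

```python
MCP_PREFIX = "mcp__copilot__"  # assumed value, inferred from the test fixtures
READ_TOOL = "Read"  # assumed short name of the always-kept internal read tool


def filter_allowed(
    base_allowed: list[str],
    base_disallowed: list[str],
    effective: set[str],
    registry: set[str],
) -> tuple[list[str], list[str]]:
    """Narrow base_allowed to the tools whose short names are in `effective`."""
    # Registry tools are exposed under the MCP prefix; anything else is
    # treated as an SDK built-in and keeps its short name.
    permitted = {
        f"{MCP_PREFIX}{name}" if name in registry else name for name in effective
    }
    permitted.add(f"{MCP_PREFIX}{READ_TOOL}")  # kept regardless of permissions
    allowed = [t for t in base_allowed if t in permitted]
    # Everything that was dropped from the allowed list becomes disallowed.
    removed = set(base_allowed) - set(allowed)
    return allowed, sorted(set(base_disallowed) | removed)


allowed, disallowed = filter_allowed(
    base_allowed=[
        "mcp__copilot__run_block",
        "mcp__copilot__web_fetch",
        "mcp__copilot__Read",
        "Task",
    ],
    base_disallowed=["Bash"],
    effective={"run_block"},  # result of a whitelist containing only run_block
    registry={"run_block", "web_fetch"},
)
```

With a whitelist of only `run_block`, the internal read tool survives while `web_fetch` and `Task` move to the disallowed side, which mirrors what the whitelist tests in this PR assert.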


@@ -0,0 +1,579 @@
"""Tests for CopilotPermissions — tool/block capability filtering."""
from __future__ import annotations
import pytest
from backend.copilot.permissions import (
ALL_TOOL_NAMES,
PLATFORM_TOOL_NAMES,
SDK_BUILTIN_TOOL_NAMES,
CopilotPermissions,
_block_matches,
all_known_tool_names,
apply_tool_permissions,
validate_block_identifiers,
validate_tool_names,
)
from backend.copilot.tools import TOOL_REGISTRY
# ---------------------------------------------------------------------------
# _block_matches
# ---------------------------------------------------------------------------
class TestBlockMatches:
BLOCK_ID = "c069dc6b-c3ed-4c12-b6e5-d47361e64ce6"
BLOCK_NAME = "HTTP Request"
def test_full_uuid_match(self):
assert _block_matches(self.BLOCK_ID, self.BLOCK_ID, self.BLOCK_NAME)
def test_full_uuid_case_insensitive(self):
assert _block_matches(self.BLOCK_ID.upper(), self.BLOCK_ID, self.BLOCK_NAME)
def test_full_uuid_no_match(self):
other = "aaaaaaaa-0000-0000-0000-000000000000"
assert not _block_matches(other, self.BLOCK_ID, self.BLOCK_NAME)
def test_partial_uuid_match(self):
assert _block_matches("c069dc6b", self.BLOCK_ID, self.BLOCK_NAME)
def test_partial_uuid_case_insensitive(self):
assert _block_matches("C069DC6B", self.BLOCK_ID, self.BLOCK_NAME)
def test_partial_uuid_no_match(self):
assert not _block_matches("deadbeef", self.BLOCK_ID, self.BLOCK_NAME)
def test_name_match(self):
assert _block_matches("HTTP Request", self.BLOCK_ID, self.BLOCK_NAME)
def test_name_case_insensitive(self):
assert _block_matches("http request", self.BLOCK_ID, self.BLOCK_NAME)
assert _block_matches("HTTP REQUEST", self.BLOCK_ID, self.BLOCK_NAME)
def test_name_no_match(self):
assert not _block_matches("Unknown Block", self.BLOCK_ID, self.BLOCK_NAME)
def test_partial_uuid_not_matching_as_name(self):
# "c069dc6b" is 8 hex chars → treated as partial UUID, NOT name match
assert not _block_matches(
"c069dc6b", "ffffffff-0000-0000-0000-000000000000", "c069dc6b"
)
# ---------------------------------------------------------------------------
# CopilotPermissions.effective_allowed_tools
# ---------------------------------------------------------------------------
ALL_TOOLS = frozenset(
["run_block", "web_fetch", "bash_exec", "find_agent", "Task", "Read"]
)
class TestEffectiveAllowedTools:
def test_empty_list_allows_all(self):
perms = CopilotPermissions(tools=[], tools_exclude=True)
assert perms.effective_allowed_tools(ALL_TOOLS) == ALL_TOOLS
def test_empty_whitelist_allows_all(self):
# edge: tools_exclude=False but empty list → allow all
perms = CopilotPermissions(tools=[], tools_exclude=False)
assert perms.effective_allowed_tools(ALL_TOOLS) == ALL_TOOLS
def test_blacklist_removes_listed(self):
perms = CopilotPermissions(tools=["bash_exec", "web_fetch"], tools_exclude=True)
result = perms.effective_allowed_tools(ALL_TOOLS)
assert "bash_exec" not in result
assert "web_fetch" not in result
assert "run_block" in result
assert "Task" in result
def test_whitelist_keeps_only_listed(self):
perms = CopilotPermissions(tools=["run_block", "Task"], tools_exclude=False)
result = perms.effective_allowed_tools(ALL_TOOLS)
assert result == frozenset(["run_block", "Task"])
def test_whitelist_unknown_tool_yields_empty(self):
perms = CopilotPermissions(tools=["nonexistent"], tools_exclude=False)
result = perms.effective_allowed_tools(ALL_TOOLS)
assert result == frozenset()
def test_blacklist_unknown_tool_ignored(self):
perms = CopilotPermissions(tools=["nonexistent"], tools_exclude=True)
result = perms.effective_allowed_tools(ALL_TOOLS)
assert result == ALL_TOOLS
# ---------------------------------------------------------------------------
# CopilotPermissions.is_block_allowed
# ---------------------------------------------------------------------------
BLOCK_ID = "c069dc6b-c3ed-4c12-b6e5-d47361e64ce6"
BLOCK_NAME = "HTTP Request"
class TestIsBlockAllowed:
def test_empty_allows_everything(self):
perms = CopilotPermissions(blocks=[], blocks_exclude=True)
assert perms.is_block_allowed(BLOCK_ID, BLOCK_NAME)
def test_blacklist_blocks_listed(self):
perms = CopilotPermissions(blocks=["HTTP Request"], blocks_exclude=True)
assert not perms.is_block_allowed(BLOCK_ID, BLOCK_NAME)
def test_blacklist_allows_unlisted(self):
perms = CopilotPermissions(blocks=["Other Block"], blocks_exclude=True)
assert perms.is_block_allowed(BLOCK_ID, BLOCK_NAME)
def test_whitelist_allows_listed(self):
perms = CopilotPermissions(blocks=["HTTP Request"], blocks_exclude=False)
assert perms.is_block_allowed(BLOCK_ID, BLOCK_NAME)
def test_whitelist_blocks_unlisted(self):
perms = CopilotPermissions(blocks=["Other Block"], blocks_exclude=False)
assert not perms.is_block_allowed(BLOCK_ID, BLOCK_NAME)
def test_partial_uuid_blacklist(self):
perms = CopilotPermissions(blocks=["c069dc6b"], blocks_exclude=True)
assert not perms.is_block_allowed(BLOCK_ID, BLOCK_NAME)
def test_full_uuid_whitelist(self):
perms = CopilotPermissions(blocks=[BLOCK_ID], blocks_exclude=False)
assert perms.is_block_allowed(BLOCK_ID, BLOCK_NAME)
def test_parent_blocks_when_child_allows(self):
parent = CopilotPermissions(blocks=["HTTP Request"], blocks_exclude=True)
child = CopilotPermissions(blocks=[], blocks_exclude=True)
child._parent = parent
assert not child.is_block_allowed(BLOCK_ID, BLOCK_NAME)
def test_parent_allows_when_child_blocks(self):
parent = CopilotPermissions(blocks=[], blocks_exclude=True)
child = CopilotPermissions(blocks=["HTTP Request"], blocks_exclude=True)
child._parent = parent
assert not child.is_block_allowed(BLOCK_ID, BLOCK_NAME)
def test_both_must_allow(self):
parent = CopilotPermissions(blocks=["HTTP Request"], blocks_exclude=False)
child = CopilotPermissions(blocks=["HTTP Request"], blocks_exclude=False)
child._parent = parent
assert child.is_block_allowed(BLOCK_ID, BLOCK_NAME)
def test_grandparent_blocks_propagate(self):
grandparent = CopilotPermissions(blocks=["HTTP Request"], blocks_exclude=True)
parent = CopilotPermissions(blocks=[], blocks_exclude=True)
parent._parent = grandparent
child = CopilotPermissions(blocks=[], blocks_exclude=True)
child._parent = parent
assert not child.is_block_allowed(BLOCK_ID, BLOCK_NAME)
# ---------------------------------------------------------------------------
# CopilotPermissions.merged_with_parent
# ---------------------------------------------------------------------------
class TestMergedWithParent:
def test_tool_intersection(self):
all_t = frozenset(["run_block", "web_fetch", "bash_exec"])
parent = CopilotPermissions(tools=["bash_exec"], tools_exclude=True)
child = CopilotPermissions(tools=["web_fetch"], tools_exclude=True)
merged = child.merged_with_parent(parent, all_t)
effective = merged.effective_allowed_tools(all_t)
assert "bash_exec" not in effective
assert "web_fetch" not in effective
assert "run_block" in effective
def test_parent_whitelist_narrows_child(self):
all_t = frozenset(["run_block", "web_fetch", "bash_exec"])
parent = CopilotPermissions(tools=["run_block"], tools_exclude=False)
child = CopilotPermissions(tools=[], tools_exclude=True) # allow all
merged = child.merged_with_parent(parent, all_t)
effective = merged.effective_allowed_tools(all_t)
assert effective == frozenset(["run_block"])
def test_child_cannot_expand_parent_whitelist(self):
all_t = frozenset(["run_block", "web_fetch", "bash_exec"])
parent = CopilotPermissions(tools=["run_block"], tools_exclude=False)
child = CopilotPermissions(
tools=["run_block", "bash_exec"], tools_exclude=False
)
merged = child.merged_with_parent(parent, all_t)
effective = merged.effective_allowed_tools(all_t)
# bash_exec was not in parent's whitelist → must not appear
assert "bash_exec" not in effective
assert "run_block" in effective
def test_merged_stored_as_whitelist(self):
all_t = frozenset(["run_block", "web_fetch"])
parent = CopilotPermissions(tools=[], tools_exclude=True)
child = CopilotPermissions(tools=[], tools_exclude=True)
merged = child.merged_with_parent(parent, all_t)
assert not merged.tools_exclude # stored as whitelist
assert set(merged.tools) == {"run_block", "web_fetch"}
def test_block_parent_stored(self):
all_t = frozenset(["run_block"])
parent = CopilotPermissions(blocks=["HTTP Request"], blocks_exclude=True)
child = CopilotPermissions(blocks=[], blocks_exclude=True)
merged = child.merged_with_parent(parent, all_t)
# Parent restriction is preserved via _parent
assert not merged.is_block_allowed(BLOCK_ID, BLOCK_NAME)
# ---------------------------------------------------------------------------
# CopilotPermissions.is_empty
# ---------------------------------------------------------------------------
class TestIsEmpty:
def test_default_is_empty(self):
assert CopilotPermissions().is_empty()
def test_with_tools_not_empty(self):
assert not CopilotPermissions(tools=["bash_exec"]).is_empty()
def test_with_blocks_not_empty(self):
assert not CopilotPermissions(blocks=["HTTP Request"]).is_empty()
def test_with_parent_not_empty(self):
perms = CopilotPermissions()
perms._parent = CopilotPermissions(tools=["bash_exec"])
assert not perms.is_empty()
# ---------------------------------------------------------------------------
# validate_tool_names
# ---------------------------------------------------------------------------
class TestValidateToolNames:
def test_valid_registry_tool(self):
assert validate_tool_names(["run_block", "web_fetch"]) == []
def test_valid_sdk_builtin(self):
assert validate_tool_names(["Read", "Task", "WebSearch"]) == []
def test_invalid_tool(self):
result = validate_tool_names(["nonexistent_tool"])
assert "nonexistent_tool" in result
def test_mixed(self):
result = validate_tool_names(["run_block", "fake_tool"])
assert "fake_tool" in result
assert "run_block" not in result
def test_empty_list(self):
assert validate_tool_names([]) == []
# ---------------------------------------------------------------------------
# validate_block_identifiers (async)
# ---------------------------------------------------------------------------
@pytest.mark.asyncio
class TestValidateBlockIdentifiers:
async def test_empty_list(self):
result = await validate_block_identifiers([])
assert result == []
async def test_valid_full_uuid(self, mocker):
mock_block = mocker.MagicMock()
mock_block.return_value.name = "HTTP Request"
mocker.patch(
"backend.blocks.get_blocks",
return_value={"c069dc6b-c3ed-4c12-b6e5-d47361e64ce6": mock_block},
)
result = await validate_block_identifiers(
["c069dc6b-c3ed-4c12-b6e5-d47361e64ce6"]
)
assert result == []
async def test_invalid_identifier(self, mocker):
mock_block = mocker.MagicMock()
mock_block.return_value.name = "HTTP Request"
mocker.patch(
"backend.blocks.get_blocks",
return_value={"c069dc6b-c3ed-4c12-b6e5-d47361e64ce6": mock_block},
)
result = await validate_block_identifiers(["totally_unknown"])
assert "totally_unknown" in result
async def test_partial_uuid_match(self, mocker):
mock_block = mocker.MagicMock()
mock_block.return_value.name = "HTTP Request"
mocker.patch(
"backend.blocks.get_blocks",
return_value={"c069dc6b-c3ed-4c12-b6e5-d47361e64ce6": mock_block},
)
result = await validate_block_identifiers(["c069dc6b"])
assert result == []
async def test_name_match(self, mocker):
mock_block = mocker.MagicMock()
mock_block.return_value.name = "HTTP Request"
mocker.patch(
"backend.blocks.get_blocks",
return_value={"c069dc6b-c3ed-4c12-b6e5-d47361e64ce6": mock_block},
)
result = await validate_block_identifiers(["http request"])
assert result == []
# ---------------------------------------------------------------------------
# apply_tool_permissions
# ---------------------------------------------------------------------------
class TestApplyToolPermissions:
def test_empty_permissions_returns_base_unchanged(self, mocker):
mocker.patch(
"backend.copilot.sdk.tool_adapter.get_copilot_tool_names",
return_value=["mcp__copilot__run_block", "mcp__copilot__web_fetch", "Task"],
)
mocker.patch(
"backend.copilot.sdk.tool_adapter.get_sdk_disallowed_tools",
return_value=["Bash"],
)
mocker.patch(
"backend.copilot.sdk.tool_adapter.TOOL_REGISTRY",
{"run_block": object(), "web_fetch": object()},
)
perms = CopilotPermissions()
allowed, disallowed = apply_tool_permissions(perms, use_e2b=False)
assert "mcp__copilot__run_block" in allowed
assert "mcp__copilot__web_fetch" in allowed
def test_blacklist_removes_tool(self, mocker):
mocker.patch(
"backend.copilot.sdk.tool_adapter.get_copilot_tool_names",
return_value=[
"mcp__copilot__run_block",
"mcp__copilot__web_fetch",
"mcp__copilot__bash_exec",
"Task",
],
)
mocker.patch(
"backend.copilot.sdk.tool_adapter.get_sdk_disallowed_tools",
return_value=["Bash"],
)
mocker.patch(
"backend.copilot.sdk.tool_adapter.TOOL_REGISTRY",
{
"run_block": object(),
"web_fetch": object(),
"bash_exec": object(),
},
)
mocker.patch(
"backend.copilot.permissions.all_known_tool_names",
return_value=frozenset(["run_block", "web_fetch", "bash_exec", "Task"]),
)
perms = CopilotPermissions(tools=["bash_exec"], tools_exclude=True)
allowed, _ = apply_tool_permissions(perms, use_e2b=False)
assert "mcp__copilot__bash_exec" not in allowed
assert "mcp__copilot__run_block" in allowed
def test_whitelist_keeps_only_listed(self, mocker):
mocker.patch(
"backend.copilot.sdk.tool_adapter.get_copilot_tool_names",
return_value=[
"mcp__copilot__run_block",
"mcp__copilot__web_fetch",
"Task",
"WebSearch",
],
)
mocker.patch(
"backend.copilot.sdk.tool_adapter.get_sdk_disallowed_tools",
return_value=["Bash"],
)
mocker.patch(
"backend.copilot.sdk.tool_adapter.TOOL_REGISTRY",
{"run_block": object(), "web_fetch": object()},
)
mocker.patch(
"backend.copilot.permissions.all_known_tool_names",
return_value=frozenset(["run_block", "web_fetch", "Task", "WebSearch"]),
)
perms = CopilotPermissions(tools=["run_block"], tools_exclude=False)
allowed, _ = apply_tool_permissions(perms, use_e2b=False)
assert "mcp__copilot__run_block" in allowed
assert "mcp__copilot__web_fetch" not in allowed
assert "Task" not in allowed
def test_read_tool_always_included_even_when_blacklisted(self, mocker):
"""mcp__copilot__Read must stay in allowed even if Read is explicitly blacklisted."""
mocker.patch(
"backend.copilot.sdk.tool_adapter.get_copilot_tool_names",
return_value=[
"mcp__copilot__run_block",
"mcp__copilot__Read",
"Task",
],
)
mocker.patch(
"backend.copilot.sdk.tool_adapter.get_sdk_disallowed_tools",
return_value=[],
)
mocker.patch(
"backend.copilot.sdk.tool_adapter.TOOL_REGISTRY",
{"run_block": object()},
)
mocker.patch(
"backend.copilot.permissions.all_known_tool_names",
return_value=frozenset(["run_block", "Read", "Task"]),
)
# Explicitly blacklist Read
perms = CopilotPermissions(tools=["Read"], tools_exclude=True)
allowed, _ = apply_tool_permissions(perms, use_e2b=False)
assert "mcp__copilot__Read" in allowed # always preserved for SDK internals
assert "mcp__copilot__run_block" in allowed
assert "Task" in allowed
def test_read_tool_always_included_with_narrow_whitelist(self, mocker):
"""mcp__copilot__Read must stay in allowed even when not in a whitelist."""
mocker.patch(
"backend.copilot.sdk.tool_adapter.get_copilot_tool_names",
return_value=[
"mcp__copilot__run_block",
"mcp__copilot__Read",
"Task",
],
)
mocker.patch(
"backend.copilot.sdk.tool_adapter.get_sdk_disallowed_tools",
return_value=[],
)
mocker.patch(
"backend.copilot.sdk.tool_adapter.TOOL_REGISTRY",
{"run_block": object()},
)
mocker.patch(
"backend.copilot.permissions.all_known_tool_names",
return_value=frozenset(["run_block", "Read", "Task"]),
)
# Whitelist only run_block — Read not listed
perms = CopilotPermissions(tools=["run_block"], tools_exclude=False)
allowed, _ = apply_tool_permissions(perms, use_e2b=False)
assert "mcp__copilot__Read" in allowed # always preserved for SDK internals
assert "mcp__copilot__run_block" in allowed
def test_e2b_file_tools_included_when_sdk_builtin_whitelisted(self, mocker):
"""In E2B mode, whitelisting 'Read' must include mcp__copilot__read_file."""
mocker.patch(
"backend.copilot.sdk.tool_adapter.get_copilot_tool_names",
return_value=[
"mcp__copilot__run_block",
"mcp__copilot__Read",
"mcp__copilot__read_file",
"mcp__copilot__write_file",
"Task",
],
)
mocker.patch(
"backend.copilot.sdk.tool_adapter.get_sdk_disallowed_tools",
return_value=["Bash", "Read", "Write", "Edit", "Glob", "Grep"],
)
mocker.patch(
"backend.copilot.sdk.tool_adapter.TOOL_REGISTRY",
{"run_block": object()},
)
mocker.patch(
"backend.copilot.permissions.all_known_tool_names",
return_value=frozenset(["run_block", "Read", "Write", "Task"]),
)
mocker.patch(
"backend.copilot.sdk.e2b_file_tools.E2B_FILE_TOOL_NAMES",
["read_file", "write_file", "edit_file", "glob", "grep"],
)
# Whitelist Read and run_block — E2B read_file should be included
perms = CopilotPermissions(tools=["Read", "run_block"], tools_exclude=False)
allowed, _ = apply_tool_permissions(perms, use_e2b=True)
assert "mcp__copilot__read_file" in allowed
assert "mcp__copilot__run_block" in allowed
# Write not whitelisted — write_file should NOT be included
assert "mcp__copilot__write_file" not in allowed
def test_e2b_file_tools_excluded_when_sdk_builtin_blacklisted(self, mocker):
"""In E2B mode, blacklisting 'Read' must also remove mcp__copilot__read_file."""
mocker.patch(
"backend.copilot.sdk.tool_adapter.get_copilot_tool_names",
return_value=[
"mcp__copilot__run_block",
"mcp__copilot__Read",
"mcp__copilot__read_file",
"Task",
],
)
mocker.patch(
"backend.copilot.sdk.tool_adapter.get_sdk_disallowed_tools",
return_value=["Bash", "Read", "Write", "Edit", "Glob", "Grep"],
)
mocker.patch(
"backend.copilot.sdk.tool_adapter.TOOL_REGISTRY",
{"run_block": object()},
)
mocker.patch(
"backend.copilot.permissions.all_known_tool_names",
return_value=frozenset(["run_block", "Read", "Task"]),
)
mocker.patch(
"backend.copilot.sdk.e2b_file_tools.E2B_FILE_TOOL_NAMES",
["read_file", "write_file", "edit_file", "glob", "grep"],
)
# Blacklist Read — E2B read_file should also be removed
perms = CopilotPermissions(tools=["Read"], tools_exclude=True)
allowed, _ = apply_tool_permissions(perms, use_e2b=True)
assert "mcp__copilot__read_file" not in allowed
assert "mcp__copilot__run_block" in allowed
# mcp__copilot__Read is always preserved for SDK internals
assert "mcp__copilot__Read" in allowed
# ---------------------------------------------------------------------------
# SDK_BUILTIN_TOOL_NAMES sanity check
# ---------------------------------------------------------------------------
class TestSdkBuiltinToolNames:
def test_expected_builtins_present(self):
expected = {
"Read",
"Write",
"Edit",
"Glob",
"Grep",
"Task",
"WebSearch",
"TodoWrite",
}
assert expected.issubset(SDK_BUILTIN_TOOL_NAMES)
def test_platform_names_match_tool_registry(self):
"""PLATFORM_TOOL_NAMES (derived from ToolName Literal) must match TOOL_REGISTRY keys."""
registry_keys = frozenset(TOOL_REGISTRY.keys())
assert PLATFORM_TOOL_NAMES == registry_keys, (
f"ToolName Literal is out of sync with TOOL_REGISTRY. "
f"Missing: {registry_keys - PLATFORM_TOOL_NAMES}, "
f"Extra: {PLATFORM_TOOL_NAMES - registry_keys}"
)
def test_all_tool_names_is_union(self):
"""ALL_TOOL_NAMES must equal PLATFORM_TOOL_NAMES | SDK_BUILTIN_TOOL_NAMES."""
assert ALL_TOOL_NAMES == PLATFORM_TOOL_NAMES | SDK_BUILTIN_TOOL_NAMES
def test_no_overlap_between_platform_and_sdk(self):
"""Platform and SDK built-in names must not overlap."""
assert PLATFORM_TOOL_NAMES.isdisjoint(SDK_BUILTIN_TOOL_NAMES)
def test_known_tools_includes_registry_and_builtins(self):
known = all_known_tool_names()
assert "run_block" in known
assert "Read" in known
assert "Task" in known
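
Review note: `_block_matches` itself is not part of this excerpt, so the following is only a guess at logic that would satisfy `TestBlockMatches`. The function name `block_matches` and both regular expressions are assumptions made for illustration.

```python
import re

# Assumed identifier shapes: a full lowercase UUID, or its first 8 hex characters.
_FULL_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)
_PARTIAL_UUID = re.compile(r"^[0-9a-f]{8}$")


def block_matches(identifier: str, block_id: str, block_name: str) -> bool:
    """Match an identifier against a block by full UUID, UUID prefix, or name."""
    ident = identifier.lower()
    if _FULL_UUID.match(ident):
        return ident == block_id.lower()
    if _PARTIAL_UUID.match(ident):
        # Eight hex characters are always read as a UUID prefix, never as a name.
        return block_id.lower().startswith(ident)
    return ident == block_name.lower()


BLOCK_ID = "c069dc6b-c3ed-4c12-b6e5-d47361e64ce6"
by_prefix = block_matches("C069DC6B", BLOCK_ID, "HTTP Request")
by_name = block_matches("http request", BLOCK_ID, "HTTP Request")
```

The ordering of the checks is the design point: classifying the identifier before comparing is what keeps a block literally named `c069dc6b` from matching by name, as `test_partial_uuid_not_matching_as_name` requires.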


@@ -12,34 +12,18 @@ from backend.copilot.tools import TOOL_REGISTRY
# Shared technical notes that apply to both SDK and baseline modes
_SHARED_TOOL_NOTES = f"""\
### Sharing files with the user
After saving a file to the persistent workspace with `write_workspace_file`,
share it with the user by embedding the `download_url` from the response in
your message as a Markdown link or image:
### Sharing files
After `write_workspace_file`, embed the `download_url` in Markdown:
- File: `[report.csv](workspace://file_id#text/csv)`
- Image: `![chart](workspace://file_id#image/png)`
- Video: `![recording](workspace://file_id#video/mp4)`
- **Any file** — shows as a clickable download link:
`[report.csv](workspace://file_id#text/csv)`
- **Image** — renders inline in chat:
`![chart](workspace://file_id#image/png)`
- **Video** — renders inline in chat with player controls:
`![recording](workspace://file_id#video/mp4)`
The `download_url` field in the `write_workspace_file` response is already
in the correct format — paste it directly after the `(` in the Markdown.
### Passing file content to tools — @@agptfile: references
Instead of copying large file contents into a tool argument, pass a file
reference and the platform will load the content for you.
Syntax: `@@agptfile:<uri>[<start>-<end>]`
- `<uri>` **must** start with `workspace://` or `/` (absolute path):
- `workspace://<file_id>` — workspace file by ID
- `workspace:///<path>` — workspace file by virtual path
- `/absolute/local/path` — ephemeral or sdk_cwd file
- E2B sandbox absolute path (e.g. `/home/user/script.py`)
- `[<start>-<end>]` is an optional 1-indexed inclusive line range.
- URIs that do not start with `workspace://` or `/` are **not** expanded.
### File references — @@agptfile:
Pass large file content to tools by reference: `@@agptfile:<uri>[<start>-<end>]`
- `workspace://<file_id>` or `workspace:///<path>` — workspace files
- `/absolute/path` — local/sandbox files
- `[start-end]` — optional 1-indexed line range
- Multiple refs per argument supported. Only `workspace://` and absolute paths are expanded.
Examples:
```
@@ -50,21 +34,9 @@ Examples:
@@agptfile:/home/user/script.py
```
You can embed a reference inside any string argument, or use it as the entire
value. Multiple references in one argument are all expanded.
**Structured data**: When the entire argument is a single file reference, the platform auto-parses by extension/MIME. Supported: JSON, JSONL, CSV, TSV, YAML, TOML, Parquet, Excel (.xlsx only; legacy `.xls` is NOT supported). Unrecognised formats return plain string.
**Structured data**: When the **entire** argument value is a single file
reference (no surrounding text), the platform automatically parses the file
content based on its extension or MIME type. Supported formats: JSON, JSONL,
CSV, TSV, YAML, TOML, Parquet, and Excel (.xlsx — first sheet only).
For example, pass `@@agptfile:workspace://<id>` where the file is a `.csv` and
the rows will be parsed into `list[list[str]]` automatically. If the format is
unrecognised or parsing fails, the content is returned as a plain string.
Legacy `.xls` files are **not** supported — only the modern `.xlsx` format.
**Type coercion**: The platform also coerces expanded values to match the
block's expected input types. For example, if a block expects `list[list[str]]`
and the expanded value is a JSON string, it will be parsed into the correct type.
**Type coercion**: The platform auto-coerces expanded string values to match block input types (e.g. JSON string → `list[list[str]]`).
### Media file inputs (format: "file")
Some block inputs accept media files — their schema shows `"format": "file"`.
@@ -91,6 +63,57 @@ Example — committing an image file to GitHub:
}}
```
### Writing large files — CRITICAL
**Never write an entire large document in a single tool call.** When the
content you want to write exceeds ~2000 words the tool call's output token
limit will silently truncate the arguments, producing an empty `{{}}` input
that fails repeatedly.
**Preferred: compose from file references.** If the data is already in
files (tool outputs, workspace files), compose the report in one call
using `@@agptfile:` references — the system expands them inline:
```bash
cat > report.md << 'EOF'
# Research Report
## Data from web research
@@agptfile:/home/user/web_results.txt
## Block execution output
@@agptfile:workspace://<file_id>
## Conclusion
<brief synthesis>
EOF
```
**Fallback: write section-by-section.** When you must generate content
from conversation context (no files to reference), split into multiple
`bash_exec` calls — one section per call:
```bash
cat > report.md << 'EOF'
# Section 1
<content from your earlier tool call results>
EOF
```
```bash
cat >> report.md << 'EOF'
# Section 2
<content from your earlier tool call results>
EOF
```
Use `cat >` for the first chunk and `cat >>` to append subsequent chunks.
Do not re-fetch or re-generate data you already have from prior tool calls.
After building the file, reference it with `@@agptfile:` in other tools:
`@@agptfile:/home/user/report.md`
### Web search best practices
- If 3 similar web searches don't return the specific data you need, conclude
it isn't publicly available and work with what you have.
- Prefer fewer, well-targeted searches over many variations of the same query.
- When spawning sub-agents for research, ensure each has a distinct
non-overlapping scope to avoid redundant searches.
### Sub-agent tasks
- When using the Task tool, NEVER set `run_in_background` to true.
All tasks must run in the foreground.
@@ -166,17 +189,12 @@ def _build_storage_supplement(
## Tool notes
### Shell commands
- The SDK built-in Bash tool is NOT available. Use the `bash_exec` MCP tool
for shell commands — it runs {sandbox_type}.
### Working directory
- Your working directory is: `{working_dir}`
- All SDK file tools AND `bash_exec` operate on the same filesystem
- Use relative paths or absolute paths under `{working_dir}` for all file operations
### Shell & filesystem
- The SDK built-in Bash tool is NOT available. Use `bash_exec` for shell commands ({sandbox_type}). Working dir: `{working_dir}`
- SDK file tools (Read/Write/Edit/Glob/Grep) and `bash_exec` share one filesystem — use relative or absolute paths under this dir.
- `read_workspace_file`/`write_workspace_file` operate on **persistent cloud workspace storage** (separate from the working dir).
### Two storage systems — CRITICAL to understand
1. **{storage_system_1_name}** (`{working_dir}`):
{characteristics}
{persistence}
@@ -194,9 +212,10 @@ Important files (code, configs, outputs) should be saved to workspace to ensure
### SDK tool-result files
When tool outputs are large, the SDK truncates them and saves the full output to
a local file under `~/.claude/projects/.../tool-results/`. To read these files,
always use `read_file` or `Read` (NOT `read_workspace_file`).
`read_workspace_file` reads from cloud workspace storage, where SDK
tool-results are NOT stored.
always use `Read` (NOT `bash_exec`, NOT `read_workspace_file`).
These files are on the host filesystem — `bash_exec` runs in the sandbox and
CANNOT access them. `read_workspace_file` reads from cloud workspace storage,
where SDK tool-results are NOT stored.
{_SHARED_TOOL_NOTES}{extra_notes}"""
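
Review note: the `@@agptfile:<uri>[<start>-<end>]` syntax described in the prompt text can be illustrated with a small parser. This is a sketch under assumptions, not the platform's expansion code: `FileRef`, `find_refs`, `select_lines`, and the regular expression are hypothetical, and the pattern simply stops a URI at whitespace or a bracket.

```python
import re
from typing import NamedTuple, Optional

# Only URIs starting with workspace:// or / are recognised, per the prompt text.
_REF = re.compile(r"@@agptfile:((?:workspace://|/)[^\s\[\]]+)(?:\[(\d+)-(\d+)\])?")


class FileRef(NamedTuple):
    uri: str
    start: Optional[int]  # 1-indexed first line, or None for the whole file
    end: Optional[int]  # 1-indexed last line (inclusive), or None


def find_refs(text: str) -> list[FileRef]:
    """Return every file reference embedded in a string argument."""
    refs = []
    for m in _REF.finditer(text):
        start = int(m.group(2)) if m.group(2) else None
        end = int(m.group(3)) if m.group(3) else None
        refs.append(FileRef(m.group(1), start, end))
    return refs


def select_lines(content: str, start: int, end: int) -> str:
    """Apply a 1-indexed inclusive line range to file content."""
    return "\n".join(content.splitlines()[start - 1 : end])


refs = find_refs(
    "diff @@agptfile:workspace://abc123[2-3] against "
    "@@agptfile:/home/user/script.py and @@agptfile:relative/path"
)
```

The third reference is skipped because it starts with neither `workspace://` nor `/`, matching the rule that only those two forms are expanded.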


@@ -0,0 +1,28 @@
"""Tests for agent generation guide — verifies clarification section."""
from pathlib import Path
class TestAgentGenerationGuideContainsClarifySection:
"""The agent generation guide must include the clarification section."""
def test_guide_includes_clarify_section(self):
guide_path = Path(__file__).parent / "sdk" / "agent_generation_guide.md"
content = guide_path.read_text(encoding="utf-8")
assert "Before or During Building" in content
def test_guide_mentions_find_block_for_clarification(self):
guide_path = Path(__file__).parent / "sdk" / "agent_generation_guide.md"
content = guide_path.read_text(encoding="utf-8")
clarify_section = content.split("Before or During Building")[1].split(
"### Workflow"
)[0]
assert "find_block" in clarify_section
def test_guide_mentions_ask_question_tool(self):
guide_path = Path(__file__).parent / "sdk" / "agent_generation_guide.md"
content = guide_path.read_text(encoding="utf-8")
clarify_section = content.split("Before or During Building")[1].split(
"### Workflow"
)[0]
assert "ask_question" in clarify_section
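
Review note: the last two tests repeat the same `split(...)[1].split(...)[0]` chain, which raises a bare `IndexError` if the heading is ever renamed. One possible refactor is a small helper; `section_between` is a hypothetical name and the sample guide text below is invented for the demonstration.

```python
def section_between(text: str, start_marker: str, end_marker: str) -> str:
    """Return the text after start_marker and before the next end_marker."""
    _, found, rest = text.partition(start_marker)
    if not found:
        # Fail with a message that names the missing heading.
        raise ValueError(f"start marker not found: {start_marker!r}")
    section, _, _ = rest.partition(end_marker)
    return section


# Invented sample text standing in for agent_generation_guide.md.
sample_guide = (
    "### Before or During Building\n"
    "Use find_block to check feasibility, then ask_question if unsure.\n"
    "### Workflow\n"
    "1. Build the agent.\n"
)
clarify_section = section_between(
    sample_guide, "Before or During Building", "### Workflow"
)
```

If the end marker is absent, the helper returns everything after the start marker, which is the same behaviour as the original `split(...)[0]`.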


@@ -36,6 +36,10 @@ class CoPilotUsageStatus(BaseModel):
daily: UsageWindow
weekly: UsageWindow
reset_cost: int = Field(
default=0,
description="Credit cost (in cents) to reset the daily limit. 0 = feature disabled.",
)
class RateLimitExceeded(Exception):
@@ -61,6 +65,7 @@ async def get_usage_status(
user_id: str,
daily_token_limit: int,
weekly_token_limit: int,
rate_limit_reset_cost: int = 0,
) -> CoPilotUsageStatus:
"""Get current usage status for a user.
@@ -68,6 +73,7 @@ async def get_usage_status(
user_id: The user's ID.
daily_token_limit: Max tokens per day (0 = unlimited).
weekly_token_limit: Max tokens per week (0 = unlimited).
rate_limit_reset_cost: Credit cost (cents) to reset daily limit (0 = disabled).
Returns:
CoPilotUsageStatus with current usage and limits.
@@ -97,6 +103,7 @@ async def get_usage_status(
limit=weekly_token_limit,
resets_at=_weekly_reset_time(now=now),
),
reset_cost=rate_limit_reset_cost,
)
@@ -141,6 +148,110 @@ async def check_rate_limit(
raise RateLimitExceeded("weekly", _weekly_reset_time(now=now))
async def reset_daily_usage(user_id: str, daily_token_limit: int = 0) -> bool:
    """Reset a user's daily token usage counter in Redis.

    Called after a user pays credits to extend their daily limit.
    Also reduces the weekly usage counter by ``daily_token_limit`` tokens
    (clamped to 0) so the user effectively gets one extra day's worth of
    weekly capacity.

    Args:
        user_id: The user's ID.
        daily_token_limit: The configured daily token limit. When positive,
            the weekly counter is reduced by this amount.

    Fails open: returns False if Redis is unavailable (consistent with
    the fail-open design of this module).
    """
    now = datetime.now(UTC)
    try:
        redis = await get_redis_async()
        # Use a MULTI/EXEC transaction so that DELETE (daily) and DECRBY
        # (weekly) either both execute or neither does. This prevents the
        # scenario where the daily counter is cleared but the weekly
        # counter is not decremented — which would let the caller refund
        # credits even though the daily limit was already reset.
        d_key = _daily_key(user_id, now=now)
        w_key = _weekly_key(user_id, now=now) if daily_token_limit > 0 else None
        pipe = redis.pipeline(transaction=True)
        pipe.delete(d_key)
        if w_key is not None:
            pipe.decrby(w_key, daily_token_limit)
        results = await pipe.execute()
        # Clamp negative weekly counter to 0 (best-effort; not critical).
        if w_key is not None:
            new_val = results[1]  # DECRBY result
            if new_val < 0:
                await redis.set(w_key, 0, keepttl=True)
        logger.info("Reset daily usage for user %s", user_id[:8])
        return True
    except (RedisError, ConnectionError, OSError):
        logger.warning("Redis unavailable for resetting daily usage")
        return False
_RESET_LOCK_PREFIX = "copilot:reset_lock"
_RESET_COUNT_PREFIX = "copilot:reset_count"
async def acquire_reset_lock(user_id: str, ttl_seconds: int = 10) -> bool:
"""Acquire a short-lived lock to serialize rate limit resets per user."""
try:
redis = await get_redis_async()
key = f"{_RESET_LOCK_PREFIX}:{user_id}"
return bool(await redis.set(key, "1", nx=True, ex=ttl_seconds))
except (RedisError, ConnectionError, OSError) as exc:
logger.warning("Redis unavailable for reset lock, rejecting reset: %s", exc)
return False
async def release_reset_lock(user_id: str) -> None:
"""Release the per-user reset lock."""
try:
redis = await get_redis_async()
await redis.delete(f"{_RESET_LOCK_PREFIX}:{user_id}")
except (RedisError, ConnectionError, OSError):
pass # Lock will expire via TTL
async def get_daily_reset_count(user_id: str) -> int | None:
"""Get how many times the user has reset today.
Returns None when Redis is unavailable so callers can fail closed
for billed operations (as opposed to failing open for read-only
rate-limit checks).
"""
now = datetime.now(UTC)
try:
redis = await get_redis_async()
key = f"{_RESET_COUNT_PREFIX}:{user_id}:{now.strftime('%Y-%m-%d')}"
val = await redis.get(key)
return int(val or 0)
except (RedisError, ConnectionError, OSError):
logger.warning("Redis unavailable for reading daily reset count")
return None
async def increment_daily_reset_count(user_id: str) -> None:
"""Increment and track how many resets this user has done today."""
now = datetime.now(UTC)
try:
redis = await get_redis_async()
key = f"{_RESET_COUNT_PREFIX}:{user_id}:{now.strftime('%Y-%m-%d')}"
pipe = redis.pipeline(transaction=True)
pipe.incr(key)
seconds_until_reset = int((_daily_reset_time(now=now) - now).total_seconds())
pipe.expire(key, max(seconds_until_reset, 1))
await pipe.execute()
except (RedisError, ConnectionError, OSError):
logger.warning("Redis unavailable for tracking reset count")
async def record_token_usage(
user_id: str,
prompt_tokens: int,
@@ -231,6 +342,67 @@ async def record_token_usage(
)
async def get_global_rate_limits(
user_id: str,
config_daily: int,
config_weekly: int,
) -> tuple[int, int]:
"""Resolve global rate limits from LaunchDarkly, falling back to config.
Args:
user_id: User ID for LD flag evaluation context.
config_daily: Fallback daily limit from ChatConfig.
config_weekly: Fallback weekly limit from ChatConfig.
Returns:
(daily_token_limit, weekly_token_limit) tuple.
"""
# Lazy import to avoid circular dependency:
# rate_limit -> feature_flag -> settings -> ... -> rate_limit
from backend.util.feature_flag import Flag, get_feature_flag_value
daily_raw = await get_feature_flag_value(
Flag.COPILOT_DAILY_TOKEN_LIMIT.value, user_id, config_daily
)
weekly_raw = await get_feature_flag_value(
Flag.COPILOT_WEEKLY_TOKEN_LIMIT.value, user_id, config_weekly
)
try:
daily = max(0, int(daily_raw))
except (TypeError, ValueError):
logger.warning("Invalid LD value for daily token limit: %r", daily_raw)
daily = config_daily
try:
weekly = max(0, int(weekly_raw))
except (TypeError, ValueError):
logger.warning("Invalid LD value for weekly token limit: %r", weekly_raw)
weekly = config_weekly
return daily, weekly
async def reset_user_usage(user_id: str, *, reset_weekly: bool = False) -> None:
"""Reset a user's usage counters.
Always deletes the daily Redis key. When *reset_weekly* is ``True``,
the weekly key is deleted as well.
Unlike read paths (``get_usage_status``, ``check_rate_limit``) which
fail-open on Redis errors, resets intentionally re-raise so the caller
knows the operation did not succeed. A silent failure here would leave
the admin believing the counters were zeroed when they were not.
"""
now = datetime.now(UTC)
keys_to_delete = [_daily_key(user_id, now=now)]
if reset_weekly:
keys_to_delete.append(_weekly_key(user_id, now=now))
try:
redis = await get_redis_async()
await redis.delete(*keys_to_delete)
except (RedisError, ConnectionError, OSError):
logger.warning("Redis unavailable for resetting user usage")
raise
# ---------------------------------------------------------------------------
# Private helpers
# ---------------------------------------------------------------------------
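The per-user reset lock above relies on `SET key value NX EX ttl`: the first caller creates the key and later callers are rejected until it is deleted or expires. The sketch below illustrates those semantics only. `FakeRedis`, `acquire`, `release`, and `demo` are hypothetical stand-ins, not part of the module, and the TTL is accepted but not simulated.

```python
from __future__ import annotations

import asyncio

_RESET_LOCK_PREFIX = "copilot:reset_lock"


class FakeRedis:
    """In-memory stand-in for the async Redis client (hypothetical, illustration only)."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    async def set(self, key: str, value: str, nx: bool = False, ex: int | None = None):
        # With nx=True, refuse to overwrite an existing key, as SET ... NX does.
        # The expiry (ex) is accepted but not simulated in this sketch.
        if nx and key in self._data:
            return None
        self._data[key] = value
        return True

    async def delete(self, key: str) -> int:
        # Return the number of keys removed, like DEL.
        return 1 if self._data.pop(key, None) is not None else 0


async def acquire(redis: FakeRedis, user_id: str, ttl_seconds: int = 10) -> bool:
    key = f"{_RESET_LOCK_PREFIX}:{user_id}"
    return bool(await redis.set(key, "1", nx=True, ex=ttl_seconds))


async def release(redis: FakeRedis, user_id: str) -> None:
    await redis.delete(f"{_RESET_LOCK_PREFIX}:{user_id}")


async def demo() -> list[bool]:
    redis = FakeRedis()
    first = await acquire(redis, "user-1")   # lock is free
    second = await acquire(redis, "user-1")  # already held, so rejected
    await release(redis, "user-1")
    third = await acquire(redis, "user-1")   # free again after release
    return [first, second, third]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the real module the TTL bounds how long a crashed request can hold the lock, which is why `release_reset_lock` can safely swallow Redis errors.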


@@ -12,6 +12,7 @@ from .rate_limit import (
check_rate_limit,
get_usage_status,
record_token_usage,
reset_daily_usage,
)
_USER = "test-user-rl"
@@ -332,3 +333,91 @@ class TestRecordTokenUsage:
):
# Should not raise — fail-open
await record_token_usage(_USER, prompt_tokens=100, completion_tokens=50)
# ---------------------------------------------------------------------------
# reset_daily_usage
# ---------------------------------------------------------------------------
class TestResetDailyUsage:
@staticmethod
def _make_pipeline_mock(decrby_result: int = 0) -> MagicMock:
"""Create a pipeline mock that returns [delete_result, decrby_result]."""
pipe = MagicMock()
pipe.execute = AsyncMock(return_value=[1, decrby_result])
return pipe
@pytest.mark.asyncio
async def test_deletes_daily_key(self):
mock_pipe = self._make_pipeline_mock(decrby_result=0)
mock_redis = AsyncMock()
mock_redis.pipeline = lambda **_kw: mock_pipe
with patch(
"backend.copilot.rate_limit.get_redis_async",
return_value=mock_redis,
):
result = await reset_daily_usage(_USER, daily_token_limit=10000)
assert result is True
mock_pipe.delete.assert_called_once()
@pytest.mark.asyncio
async def test_reduces_weekly_usage_via_decrby(self):
"""Weekly counter should be reduced via DECRBY in the pipeline."""
mock_pipe = self._make_pipeline_mock(decrby_result=35000)
mock_redis = AsyncMock()
mock_redis.pipeline = lambda **_kw: mock_pipe
with patch(
"backend.copilot.rate_limit.get_redis_async",
return_value=mock_redis,
):
await reset_daily_usage(_USER, daily_token_limit=10000)
mock_pipe.decrby.assert_called_once()
mock_redis.set.assert_not_called() # 35000 > 0, no clamp needed
@pytest.mark.asyncio
async def test_clamps_negative_weekly_to_zero(self):
"""If DECRBY goes negative, SET to 0 (outside the pipeline)."""
mock_pipe = self._make_pipeline_mock(decrby_result=-5000)
mock_redis = AsyncMock()
mock_redis.pipeline = lambda **_kw: mock_pipe
with patch(
"backend.copilot.rate_limit.get_redis_async",
return_value=mock_redis,
):
await reset_daily_usage(_USER, daily_token_limit=10000)
mock_pipe.decrby.assert_called_once()
mock_redis.set.assert_called_once()
@pytest.mark.asyncio
async def test_no_weekly_reduction_when_daily_limit_zero(self):
"""When daily_token_limit is 0, weekly counter should not be touched."""
mock_pipe = self._make_pipeline_mock()
mock_pipe.execute = AsyncMock(return_value=[1]) # only delete result
mock_redis = AsyncMock()
mock_redis.pipeline = lambda **_kw: mock_pipe
with patch(
"backend.copilot.rate_limit.get_redis_async",
return_value=mock_redis,
):
await reset_daily_usage(_USER, daily_token_limit=0)
mock_pipe.delete.assert_called_once()
mock_pipe.decrby.assert_not_called()
@pytest.mark.asyncio
async def test_returns_false_when_redis_unavailable(self):
with patch(
"backend.copilot.rate_limit.get_redis_async",
side_effect=ConnectionError("Redis down"),
):
result = await reset_daily_usage(_USER, daily_token_limit=10000)
assert result is False
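The weekly adjustment exercised by these tests reduces to simple arithmetic: subtract one day's allowance from the weekly counter and clamp at zero. A minimal sketch follows, assuming a hypothetical pure helper `weekly_after_reset` that does not exist in the module. The real code performs the subtraction with `DECRBY` and clamps with a follow-up `SET`.

```python
def weekly_after_reset(weekly_used: int, daily_token_limit: int) -> int:
    """Return the weekly counter value expected after a paid daily reset."""
    if daily_token_limit <= 0:
        # No configured daily limit: the weekly counter is left untouched.
        return weekly_used
    # Subtract one day's worth of tokens, never going below zero.
    return max(0, weekly_used - daily_token_limit)


if __name__ == "__main__":
    print(weekly_after_reset(45_000, 10_000))
    print(weekly_after_reset(5_000, 10_000))
    print(weekly_after_reset(5_000, 0))
```

The three calls correspond to the cases covered above: a normal reduction, a negative result that is clamped, and a zero daily limit that skips the weekly adjustment.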


@@ -0,0 +1,294 @@
"""Unit tests for the POST /usage/reset endpoint."""
from __future__ import annotations
from datetime import UTC, datetime, timedelta
from unittest.mock import AsyncMock, MagicMock, patch
import pytest
from fastapi import HTTPException
from backend.api.features.chat.routes import reset_copilot_usage
from backend.copilot.rate_limit import CoPilotUsageStatus, UsageWindow
from backend.util.exceptions import InsufficientBalanceError
# Minimal config mock matching ChatConfig fields used by the endpoint.
def _make_config(
rate_limit_reset_cost: int = 500,
daily_token_limit: int = 2_500_000,
weekly_token_limit: int = 12_500_000,
max_daily_resets: int = 5,
):
cfg = MagicMock()
cfg.rate_limit_reset_cost = rate_limit_reset_cost
cfg.daily_token_limit = daily_token_limit
cfg.weekly_token_limit = weekly_token_limit
cfg.max_daily_resets = max_daily_resets
return cfg
def _usage(daily_used: int = 3_000_000, daily_limit: int = 2_500_000):
return CoPilotUsageStatus(
daily=UsageWindow(
used=daily_used,
limit=daily_limit,
resets_at=datetime.now(UTC) + timedelta(hours=6),
),
weekly=UsageWindow(
used=5_000_000,
limit=12_500_000,
resets_at=datetime.now(UTC) + timedelta(days=3),
),
)
_MODULE = "backend.api.features.chat.routes"
def _mock_settings(enable_credit: bool = True):
"""Return a mock Settings object with the given enable_credit flag."""
mock = MagicMock()
mock.config.enable_credit = enable_credit
return mock
@pytest.mark.asyncio
class TestResetCopilotUsage:
async def test_feature_disabled_returns_400(self):
"""When rate_limit_reset_cost=0, endpoint returns 400."""
with patch(f"{_MODULE}.config", _make_config(rate_limit_reset_cost=0)):
with pytest.raises(HTTPException) as exc_info:
await reset_copilot_usage(user_id="user-1")
assert exc_info.value.status_code == 400
assert "not available" in exc_info.value.detail
async def test_no_daily_limit_returns_400(self):
"""When daily_token_limit=0 (unlimited), endpoint returns 400."""
with (
patch(f"{_MODULE}.config", _make_config(daily_token_limit=0)),
patch(f"{_MODULE}.settings", _mock_settings()),
):
with pytest.raises(HTTPException) as exc_info:
await reset_copilot_usage(user_id="user-1")
assert exc_info.value.status_code == 400
assert "nothing to reset" in exc_info.value.detail.lower()
async def test_not_at_limit_returns_400(self):
"""When user hasn't hit their daily limit, returns 400."""
cfg = _make_config()
with (
patch(f"{_MODULE}.config", cfg),
patch(f"{_MODULE}.settings", _mock_settings()),
patch(f"{_MODULE}.get_daily_reset_count", AsyncMock(return_value=0)),
patch(f"{_MODULE}.acquire_reset_lock", AsyncMock(return_value=True)),
patch(f"{_MODULE}.release_reset_lock", AsyncMock()) as mock_release,
patch(
f"{_MODULE}.get_usage_status",
AsyncMock(return_value=_usage(daily_used=1_000_000)),
),
):
with pytest.raises(HTTPException) as exc_info:
await reset_copilot_usage(user_id="user-1")
assert exc_info.value.status_code == 400
assert "not reached" in exc_info.value.detail
mock_release.assert_awaited_once()
async def test_insufficient_credits_returns_402(self):
"""When user doesn't have enough credits, returns 402."""
mock_credit_model = AsyncMock()
mock_credit_model.spend_credits.side_effect = InsufficientBalanceError(
message="Insufficient balance",
user_id="user-1",
balance=50,
amount=200,
)
cfg = _make_config()
with (
patch(f"{_MODULE}.config", cfg),
patch(f"{_MODULE}.settings", _mock_settings()),
patch(f"{_MODULE}.get_daily_reset_count", AsyncMock(return_value=0)),
patch(f"{_MODULE}.acquire_reset_lock", AsyncMock(return_value=True)),
patch(f"{_MODULE}.release_reset_lock", AsyncMock()) as mock_release,
patch(
f"{_MODULE}.get_usage_status",
AsyncMock(return_value=_usage()),
),
patch(
f"{_MODULE}.get_user_credit_model",
AsyncMock(return_value=mock_credit_model),
),
):
with pytest.raises(HTTPException) as exc_info:
await reset_copilot_usage(user_id="user-1")
assert exc_info.value.status_code == 402
mock_release.assert_awaited_once()
async def test_happy_path(self):
"""Successful reset: charges credits, resets usage, returns response."""
mock_credit_model = AsyncMock()
mock_credit_model.spend_credits.return_value = 1500 # remaining balance
cfg = _make_config()
updated_usage = _usage(daily_used=0)
with (
patch(f"{_MODULE}.config", cfg),
patch(f"{_MODULE}.settings", _mock_settings()),
patch(f"{_MODULE}.get_daily_reset_count", AsyncMock(return_value=0)),
patch(f"{_MODULE}.acquire_reset_lock", AsyncMock(return_value=True)),
patch(f"{_MODULE}.release_reset_lock", AsyncMock()),
patch(
f"{_MODULE}.get_usage_status",
AsyncMock(side_effect=[_usage(), updated_usage]),
),
patch(
f"{_MODULE}.get_user_credit_model",
AsyncMock(return_value=mock_credit_model),
),
patch(
f"{_MODULE}.reset_daily_usage", AsyncMock(return_value=True)
) as mock_reset,
patch(f"{_MODULE}.increment_daily_reset_count", AsyncMock()) as mock_incr,
):
result = await reset_copilot_usage(user_id="user-1")
assert result.success is True
assert result.credits_charged == 500
assert result.remaining_balance == 1500
mock_reset.assert_awaited_once()
mock_incr.assert_awaited_once()
async def test_max_daily_resets_exceeded(self):
"""When user has exhausted daily resets, returns 429."""
cfg = _make_config(max_daily_resets=3)
with (
patch(f"{_MODULE}.config", cfg),
patch(f"{_MODULE}.settings", _mock_settings()),
patch(f"{_MODULE}.get_daily_reset_count", AsyncMock(return_value=3)),
):
with pytest.raises(HTTPException) as exc_info:
await reset_copilot_usage(user_id="user-1")
assert exc_info.value.status_code == 429
async def test_credit_system_disabled_returns_400(self):
"""When enable_credit=False, endpoint returns 400."""
with (
patch(f"{_MODULE}.config", _make_config()),
patch(f"{_MODULE}.settings", _mock_settings(enable_credit=False)),
):
with pytest.raises(HTTPException) as exc_info:
await reset_copilot_usage(user_id="user-1")
assert exc_info.value.status_code == 400
assert "credit system is disabled" in exc_info.value.detail.lower()
async def test_weekly_limit_exhausted_returns_400(self):
"""When the weekly limit is also exhausted, resetting daily won't help."""
cfg = _make_config()
weekly_exhausted = CoPilotUsageStatus(
daily=UsageWindow(
used=3_000_000,
limit=2_500_000,
resets_at=datetime.now(UTC) + timedelta(hours=6),
),
weekly=UsageWindow(
used=12_500_000,
limit=12_500_000,
resets_at=datetime.now(UTC) + timedelta(days=3),
),
)
with (
patch(f"{_MODULE}.config", cfg),
patch(f"{_MODULE}.settings", _mock_settings()),
patch(f"{_MODULE}.get_daily_reset_count", AsyncMock(return_value=0)),
patch(f"{_MODULE}.acquire_reset_lock", AsyncMock(return_value=True)),
patch(f"{_MODULE}.release_reset_lock", AsyncMock()) as mock_release,
patch(
f"{_MODULE}.get_usage_status",
AsyncMock(return_value=weekly_exhausted),
),
):
with pytest.raises(HTTPException) as exc_info:
await reset_copilot_usage(user_id="user-1")
assert exc_info.value.status_code == 400
assert "weekly" in exc_info.value.detail.lower()
mock_release.assert_awaited_once()
async def test_redis_failure_for_reset_count_returns_503(self):
"""When Redis is unavailable for get_daily_reset_count, returns 503."""
with (
patch(f"{_MODULE}.config", _make_config()),
patch(f"{_MODULE}.settings", _mock_settings()),
patch(f"{_MODULE}.get_daily_reset_count", AsyncMock(return_value=None)),
):
with pytest.raises(HTTPException) as exc_info:
await reset_copilot_usage(user_id="user-1")
assert exc_info.value.status_code == 503
assert "verify" in exc_info.value.detail.lower()
async def test_redis_reset_failure_refunds_credits(self):
"""When reset_daily_usage fails, credits are refunded and 503 returned."""
mock_credit_model = AsyncMock()
mock_credit_model.spend_credits.return_value = 1500
cfg = _make_config()
with (
patch(f"{_MODULE}.config", cfg),
patch(f"{_MODULE}.settings", _mock_settings()),
patch(f"{_MODULE}.get_daily_reset_count", AsyncMock(return_value=0)),
patch(f"{_MODULE}.acquire_reset_lock", AsyncMock(return_value=True)),
patch(f"{_MODULE}.release_reset_lock", AsyncMock()),
patch(
f"{_MODULE}.get_usage_status",
AsyncMock(return_value=_usage()),
),
patch(
f"{_MODULE}.get_user_credit_model",
AsyncMock(return_value=mock_credit_model),
),
patch(f"{_MODULE}.reset_daily_usage", AsyncMock(return_value=False)),
):
with pytest.raises(HTTPException) as exc_info:
await reset_copilot_usage(user_id="user-1")
assert exc_info.value.status_code == 503
assert "not been charged" in exc_info.value.detail
mock_credit_model.top_up_credits.assert_awaited_once()
async def test_redis_reset_failure_refund_also_fails(self):
"""When both reset and refund fail, error message reflects the truth."""
mock_credit_model = AsyncMock()
mock_credit_model.spend_credits.return_value = 1500
mock_credit_model.top_up_credits.side_effect = RuntimeError("db down")
cfg = _make_config()
with (
patch(f"{_MODULE}.config", cfg),
patch(f"{_MODULE}.settings", _mock_settings()),
patch(f"{_MODULE}.get_daily_reset_count", AsyncMock(return_value=0)),
patch(f"{_MODULE}.acquire_reset_lock", AsyncMock(return_value=True)),
patch(f"{_MODULE}.release_reset_lock", AsyncMock()),
patch(
f"{_MODULE}.get_usage_status",
AsyncMock(return_value=_usage()),
),
patch(
f"{_MODULE}.get_user_credit_model",
AsyncMock(return_value=mock_credit_model),
),
patch(f"{_MODULE}.reset_daily_usage", AsyncMock(return_value=False)),
):
with pytest.raises(HTTPException) as exc_info:
await reset_copilot_usage(user_id="user-1")
assert exc_info.value.status_code == 503
assert "contact support" in exc_info.value.detail.lower()
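Taken together, the tests above imply a sequence of pre-checks that runs before any credits are charged. The sketch below condenses them into one pure function. `ResetRequest` and `precheck_status` are hypothetical names, and the ordering of the checks is an assumption inferred from which mocks each test patches, not the endpoint's confirmed implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ResetRequest:
    """Inputs the endpoint appears to consult (hypothetical container)."""

    reset_cost: int
    credit_enabled: bool
    daily_limit: int
    resets_today: int | None  # None models "Redis unavailable"
    max_daily_resets: int
    daily_used: int
    weekly_used: int
    weekly_limit: int


def precheck_status(req: ResetRequest) -> int | None:
    """Return the rejection status code, or None if charging may proceed."""
    if req.reset_cost <= 0:
        return 400  # feature disabled
    if not req.credit_enabled:
        return 400  # credit system disabled
    if req.daily_limit <= 0:
        return 400  # unlimited, nothing to reset
    if req.resets_today is None:
        return 503  # cannot verify the reset count, so fail closed
    if req.resets_today >= req.max_daily_resets:
        return 429  # daily reset allowance exhausted
    if req.daily_used < req.daily_limit:
        return 400  # daily limit not reached yet
    if req.weekly_limit > 0 and req.weekly_used >= req.weekly_limit:
        return 400  # weekly limit also exhausted, a daily reset would not help
    return None
```

The 402 (insufficient credits) and the refund-on-failure 503 paths happen after these checks, once `spend_credits` and `reset_daily_usage` have been called, so they are outside this sketch.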


@@ -3,6 +3,29 @@
You can create, edit, and customize agents directly. You ARE the brain —
generate the agent JSON yourself using block schemas, then validate and save.
### Clarifying — Before or During Building
Use `ask_question` whenever the user's intent is ambiguous — whether
that's before starting or midway through the workflow. Common moments:
- **Before building**: output format, delivery channel, data source, or
trigger is unspecified.
- **During block discovery**: multiple blocks could fit and the user
should choose.
- **During JSON generation**: a wiring decision depends on user
preference.
Steps:
1. Call `find_block` (or another discovery tool) to learn what the
platform actually supports for the ambiguous dimension.
2. Call `ask_question` with a concrete question listing the discovered
options (e.g. "The platform supports Gmail, Slack, and Google Docs —
which should the agent use for delivery?").
3. **Wait for the user's answer** before continuing.
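As a rough illustration of steps 1 and 2, the sketch below turns a list of discovered options into `ask_question` arguments. The argument names `question` and `options` follow the tool description in the commit message. The helper names and the exact payload shape are assumptions made for illustration.

```python
from __future__ import annotations


def join_options(options: list[str]) -> str:
    """Join options as 'A', 'A and B', or 'A, B, and C'."""
    if len(options) <= 1:
        return "".join(options)
    if len(options) == 2:
        return " and ".join(options)
    return ", ".join(options[:-1]) + ", and " + options[-1]


def build_ask_question_args(options: list[str], purpose: str) -> dict:
    """Build hypothetical arguments for an ask_question tool call."""
    question = (
        f"The platform supports {join_options(options)}. "
        f"Which should the agent use for {purpose}?"
    )
    return {"question": question, "options": list(options)}


if __name__ == "__main__":
    print(build_ask_question_args(["Gmail", "Slack", "Google Docs"], "delivery"))
```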
**Skip this** when the goal already specifies all dimensions (e.g.
"scrape prices from Amazon and email me daily").
### Workflow for Creating/Editing Agents
1. **Discover blocks**: Call `find_block(query, include_schemas=true)` to
@@ -67,9 +90,17 @@ These define the agent's interface — what it accepts and what it produces.
**AgentInputBlock** (ID: `c0a8e994-ebf1-4a9c-a4d8-89d09c86741b`):
- Defines a user-facing input field on the agent
- Required `input_default` fields: `name` (str), `value` (default: null)
- Optional: `title`, `description`, `placeholder_values` (for dropdowns)
- Optional: `title`, `description`
- Output: `result` — the user-provided value at runtime
- Create one AgentInputBlock per distinct input the agent needs
- For dropdown/select inputs, use **AgentDropdownInputBlock** instead (see below)
**AgentDropdownInputBlock** (ID: `655d6fdf-a334-421c-b733-520549c07cd1`):
- Specialized input block that presents a dropdown/select to the user
- Required `input_default` fields: `name` (str)
- Optional: `options` (list of dropdown values; when omitted/empty, input behaves as free-text), `title`, `description`, `value` (default selection)
- Output: `result` — the user-selected value at runtime
- Use this instead of AgentInputBlock when the user should pick from a fixed set of options
**AgentOutputBlock** (ID: `363ae599-353e-4804-937e-b2ee3cef3da4`):
- Defines a user-facing output displayed after the agent runs
@@ -143,11 +174,11 @@ To use an MCP (Model Context Protocol) tool as a node in the agent:
tool_arguments.
6. Output: `result` (the tool's return value) and `error` (error message)
### Using SmartDecisionMakerBlock (AI Orchestrator with Agent Mode)
### Using OrchestratorBlock (AI Orchestrator with Agent Mode)
To create an agent where AI autonomously decides which tools or sub-agents to
call in a loop until the task is complete:
1. Create a `SmartDecisionMakerBlock` node
1. Create an `OrchestratorBlock` node
(ID: `3b191d9f-356f-482d-8238-ba04b6d18381`)
2. Set `input_default`:
- `agent_mode_max_iterations`: Choose based on task complexity:
@@ -169,8 +200,8 @@ call in a loop until the task is complete:
3. Wire the `prompt` input from an `AgentInputBlock` (the user's task)
4. Create downstream tool blocks — regular blocks **or** `AgentExecutorBlock`
nodes that call sub-agents
5. Link each tool to the SmartDecisionMaker: set `source_name: "tools"` on
the SmartDecisionMaker side and `sink_name: <input_field>` on each tool
5. Link each tool to the Orchestrator: set `source_name: "tools"` on
the Orchestrator side and `sink_name: <input_field>` on each tool
block's input. Create one link per input field the tool needs.
6. Wire the `finished` output to an `AgentOutputBlock` for the final result
7. Credentials (LLM API key) are configured by the user in the platform UI
@@ -178,35 +209,49 @@ call in a loop until the task is complete:
**Example — Orchestrator calling two sub-agents:**
- Node 1: `AgentInputBlock` (input_default: `{"name": "task"}`)
- Node 2: `SmartDecisionMakerBlock` (input_default:
- Node 2: `OrchestratorBlock` (input_default:
`{"agent_mode_max_iterations": 10, "conversation_compaction": true}`)
- Node 3: `AgentExecutorBlock` (sub-agent A — set `graph_id`, `graph_version`,
`input_schema`, `output_schema` from library agent)
- Node 4: `AgentExecutorBlock` (sub-agent B — same pattern)
- Node 5: `AgentOutputBlock` (input_default: `{"name": "result"}`)
- Links:
- Input→SDM: `source_name: "result"`, `sink_name: "prompt"`
- SDM→Agent A (per input field): `source_name: "tools"`,
- Input→Orchestrator: `source_name: "result"`, `sink_name: "prompt"`
- Orchestrator→Agent A (per input field): `source_name: "tools"`,
`sink_name: "<agent_a_input_field>"`
- SDM→Agent B (per input field): `source_name: "tools"`,
- Orchestrator→Agent B (per input field): `source_name: "tools"`,
`sink_name: "<agent_b_input_field>"`
- SDM→Output: `source_name: "finished"`, `sink_name: "value"`
- Orchestrator→Output: `source_name: "finished"`, `sink_name: "value"`
**Example — Orchestrator calling regular blocks as tools:**
- Node 1: `AgentInputBlock` (input_default: `{"name": "task"}`)
- Node 2: `SmartDecisionMakerBlock` (input_default:
- Node 2: `OrchestratorBlock` (input_default:
`{"agent_mode_max_iterations": 5, "conversation_compaction": true}`)
- Node 3: `GetWebpageBlock` (regular block — the AI calls it as a tool)
- Node 4: `AITextGeneratorBlock` (another regular block as a tool)
- Node 5: `AgentOutputBlock` (input_default: `{"name": "result"}`)
- Links:
- Input→SDM: `source_name: "result"`, `sink_name: "prompt"`
- SDM→GetWebpage: `source_name: "tools"`, `sink_name: "url"`
- SDM→AITextGenerator: `source_name: "tools"`, `sink_name: "prompt"`
- SDM→Output: `source_name: "finished"`, `sink_name: "value"`
- Input→Orchestrator: `source_name: "result"`, `sink_name: "prompt"`
- Orchestrator→GetWebpage: `source_name: "tools"`, `sink_name: "url"`
- Orchestrator→AITextGenerator: `source_name: "tools"`, `sink_name: "prompt"`
- Orchestrator→Output: `source_name: "finished"`, `sink_name: "value"`
Regular blocks work exactly like sub-agents as tools — wire each input
field from `source_name: "tools"` on the SmartDecisionMaker side.
field from `source_name: "tools"` on the Orchestrator side.
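The link list for the regular-blocks example can be sketched as plain Python data. The `source_name` and `sink_name` values come from the example above. The `source_id` and `sink_id` keys, the node identifiers, and the `tool_links` helper are assumptions for illustration and may not match the actual agent JSON schema.

```python
from __future__ import annotations


def tool_links(orchestrator_id: str, tool_id: str, input_fields: list[str]) -> list[dict]:
    """Create one link per tool input field, all sourced from the 'tools' pin."""
    return [
        {
            "source_id": orchestrator_id,
            "source_name": "tools",
            "sink_id": tool_id,
            "sink_name": field,
        }
        for field in input_fields
    ]


# Hypothetical node identifiers standing in for real node IDs.
links = [
    {"source_id": "input", "source_name": "result", "sink_id": "orch", "sink_name": "prompt"},
    *tool_links("orch", "get_webpage", ["url"]),
    *tool_links("orch", "ai_text", ["prompt"]),
    {"source_id": "orch", "source_name": "finished", "sink_id": "output", "sink_name": "value"},
]

if __name__ == "__main__":
    for link in links:
        print(link)
```

A tool with several inputs would simply pass more field names to `tool_links`, yielding one link per field.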
### Testing with Dry Run
After saving an agent, suggest a dry run to validate wiring without consuming
real API calls, credentials, or credits:
1. **Run**: Call `run_agent` or `run_block` with `dry_run=True` and provide
sample inputs. This executes the graph with mock outputs, verifying that
links resolve correctly and required inputs are satisfied.
2. **Check results**: Call `view_agent_output` with `show_execution_details=True`
to inspect the full node-by-node execution trace. This shows what each node
received as input and produced as output, making it easy to spot wiring issues.
3. **Iterate**: If the dry run reveals wiring issues or missing inputs, fix
the agent JSON and re-save before suggesting a real execution.
### Example: Simple AI Text Processor


@@ -7,7 +7,35 @@ without implementing their own event loop.
from __future__ import annotations
from typing import Any
import logging
import uuid
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager
from typing import TYPE_CHECKING, Any
if TYPE_CHECKING:
from backend.copilot.permissions import CopilotPermissions
from pydantic import BaseModel, Field
from redis.exceptions import RedisError
from .. import stream_registry
from ..response_model import (
StreamError,
StreamTextDelta,
StreamToolInputAvailable,
StreamToolOutputAvailable,
StreamUsage,
)
from .service import stream_chat_completion_sdk
logger = logging.getLogger(__name__)
# Identifiers used when registering AutoPilot-originated streams in the
# stream registry. Distinct from "chat_stream"/"chat" used by the HTTP SSE
# endpoint, making it easy to filter AutoPilot streams in logs/observability.
AUTOPILOT_TOOL_CALL_ID = "autopilot_stream"
AUTOPILOT_TOOL_NAME = "autopilot"
class CopilotResult:
@@ -33,26 +61,131 @@ class CopilotResult:
self.total_tokens: int = 0
class _RegistryHandle(BaseModel):
"""Tracks stream registry session state for cleanup."""
publish_turn_id: str = ""
error_msg: str | None = None
error_already_published: bool = False
@asynccontextmanager
async def _registry_session(
session_id: str, user_id: str, turn_id: str
) -> AsyncIterator[_RegistryHandle]:
"""Create a stream registry session and ensure it is finalized."""
handle = _RegistryHandle(publish_turn_id=turn_id)
try:
await stream_registry.create_session(
session_id=session_id,
user_id=user_id,
tool_call_id=AUTOPILOT_TOOL_CALL_ID,
tool_name=AUTOPILOT_TOOL_NAME,
turn_id=turn_id,
)
except (RedisError, ConnectionError, OSError):
logger.warning(
"[collect] Failed to create stream registry session for %s, "
"frontend will not receive real-time updates",
session_id[:12],
exc_info=True,
)
# Disable chunk publishing but keep finalization enabled so
# mark_session_completed can clean up any partial registry state.
handle.publish_turn_id = ""
try:
yield handle
finally:
try:
await stream_registry.mark_session_completed(
session_id,
error_message=handle.error_msg,
skip_error_publish=handle.error_already_published,
)
except (RedisError, ConnectionError, OSError):
logger.warning(
"[collect] Failed to mark stream completed for %s",
session_id[:12],
exc_info=True,
)
class _ToolCallEntry(BaseModel):
"""A single tool call observed during stream consumption."""
tool_call_id: str
tool_name: str
input: Any
output: Any = None
success: bool | None = None
class _EventAccumulator(BaseModel):
"""Mutable accumulator for stream events."""
response_parts: list[str] = Field(default_factory=list)
tool_calls: list[_ToolCallEntry] = Field(default_factory=list)
tool_calls_by_id: dict[str, _ToolCallEntry] = Field(default_factory=dict)
prompt_tokens: int = 0
completion_tokens: int = 0
total_tokens: int = 0
def _process_event(event: object, acc: _EventAccumulator) -> str | None:
"""Process a single stream event and return error_msg if StreamError.
Uses structural pattern matching for dispatch per project guidelines.
"""
match event:
case StreamTextDelta(delta=delta):
acc.response_parts.append(delta)
case StreamToolInputAvailable() as e:
entry = _ToolCallEntry(
tool_call_id=e.toolCallId,
tool_name=e.toolName,
input=e.input,
)
acc.tool_calls.append(entry)
acc.tool_calls_by_id[e.toolCallId] = entry
case StreamToolOutputAvailable() as e:
if tc := acc.tool_calls_by_id.get(e.toolCallId):
tc.output = e.output
tc.success = e.success
else:
logger.debug(
"Received tool output for unknown tool_call_id: %s",
e.toolCallId,
)
case StreamUsage() as e:
acc.prompt_tokens += e.prompt_tokens
acc.completion_tokens += e.completion_tokens
acc.total_tokens += e.total_tokens
case StreamError(errorText=err):
return err
return None
async def collect_copilot_response(
*,
session_id: str,
message: str,
user_id: str,
is_user_message: bool = True,
permissions: "CopilotPermissions | None" = None,
) -> CopilotResult:
"""Consume :func:`stream_chat_completion_sdk` and return aggregated results.
This is the recommended entry-point for callers that need a simple
request-response interface (e.g. the AutoPilot block) rather than
streaming individual events. It avoids duplicating the event-collection
logic and does NOT wrap the stream in ``asyncio.timeout`` — the SDK
manages its own heartbeat-based timeouts internally.
Registers with the stream registry so the frontend can connect via SSE
and receive real-time updates while the AutoPilot block is executing.
Args:
session_id: Chat session to use.
message: The user message / prompt.
user_id: Authenticated user ID.
is_user_message: Whether this is a user-initiated message.
permissions: Optional capability filter. When provided, restricts
which tools and blocks the copilot may use during this execution.
Returns:
A :class:`CopilotResult` with the aggregated response text,
@@ -61,48 +194,39 @@ async def collect_copilot_response(
Raises:
RuntimeError: If the stream yields a ``StreamError`` event.
"""
from backend.copilot.response_model import (
StreamError,
StreamTextDelta,
StreamToolInputAvailable,
StreamToolOutputAvailable,
StreamUsage,
)
turn_id = str(uuid.uuid4())
async with _registry_session(session_id, user_id, turn_id) as handle:
try:
raw_stream = stream_chat_completion_sdk(
session_id=session_id,
message=message,
is_user_message=is_user_message,
user_id=user_id,
permissions=permissions,
)
published_stream = stream_registry.stream_and_publish(
session_id=session_id,
turn_id=handle.publish_turn_id,
stream=raw_stream,
)
from .service import stream_chat_completion_sdk
acc = _EventAccumulator()
async for event in published_stream:
if err := _process_event(event, acc):
handle.error_msg = err
# stream_and_publish skips StreamError events, so
# mark_session_completed must publish the error to Redis.
handle.error_already_published = False
raise RuntimeError(f"Copilot error: {err}")
except Exception:
if handle.error_msg is None:
handle.error_msg = "AutoPilot execution failed"
raise
result = CopilotResult()
response_parts: list[str] = []
tool_calls_by_id: dict[str, dict[str, Any]] = {}
async for event in stream_chat_completion_sdk(
session_id=session_id,
message=message,
is_user_message=is_user_message,
user_id=user_id,
):
if isinstance(event, StreamTextDelta):
response_parts.append(event.delta)
elif isinstance(event, StreamToolInputAvailable):
entry: dict[str, Any] = {
"tool_call_id": event.toolCallId,
"tool_name": event.toolName,
"input": event.input,
"output": None,
"success": None,
}
result.tool_calls.append(entry)
tool_calls_by_id[event.toolCallId] = entry
elif isinstance(event, StreamToolOutputAvailable):
if tc := tool_calls_by_id.get(event.toolCallId):
tc["output"] = event.output
tc["success"] = event.success
elif isinstance(event, StreamUsage):
result.prompt_tokens += event.prompt_tokens
result.completion_tokens += event.completion_tokens
result.total_tokens += event.total_tokens
elif isinstance(event, StreamError):
raise RuntimeError(f"Copilot error: {event.errorText}")
result.response_text = "".join(response_parts)
result.response_text = "".join(acc.response_parts)
result.tool_calls = [tc.model_dump() for tc in acc.tool_calls]
result.prompt_tokens = acc.prompt_tokens
result.completion_tokens = acc.completion_tokens
result.total_tokens = acc.total_tokens
return result


@@ -0,0 +1,177 @@
"""Tests for collect_copilot_response stream registry integration."""
from unittest.mock import AsyncMock, patch
import pytest
from backend.copilot.response_model import (
StreamError,
StreamFinish,
StreamTextDelta,
StreamToolInputAvailable,
StreamToolOutputAvailable,
StreamUsage,
)
from backend.copilot.sdk.collect import collect_copilot_response
def _mock_stream_fn(*events):
"""Return a callable that returns an async generator."""
async def _gen(**_kwargs):
for e in events:
yield e
return _gen
@pytest.fixture
def mock_registry():
"""Patch stream_registry module used by collect."""
with patch("backend.copilot.sdk.collect.stream_registry") as m:
m.create_session = AsyncMock()
m.publish_chunk = AsyncMock()
m.mark_session_completed = AsyncMock()
# stream_and_publish: pass-through that also publishes (real logic)
# We re-implement the pass-through here so the event loop works,
# but still track publish_chunk calls via the mock.
async def _stream_and_publish(session_id, turn_id, stream):
async for event in stream:
if turn_id and not isinstance(event, (StreamFinish, StreamError)):
await m.publish_chunk(turn_id, event)
yield event
m.stream_and_publish = _stream_and_publish
yield m
@pytest.fixture
def stream_fn_patch():
"""Helper to patch stream_chat_completion_sdk."""
def _patch(events):
return patch(
"backend.copilot.sdk.collect.stream_chat_completion_sdk",
new=_mock_stream_fn(*events),
)
return _patch
@pytest.mark.asyncio
async def test_stream_registry_called_on_success(mock_registry, stream_fn_patch):
"""Stream registry create/publish/complete are called correctly on success."""
events = [
StreamTextDelta(id="t1", delta="Hello "),
StreamTextDelta(id="t1", delta="world"),
StreamUsage(prompt_tokens=10, completion_tokens=5, total_tokens=15),
StreamFinish(),
]
with stream_fn_patch(events):
result = await collect_copilot_response(
session_id="test-session",
message="hi",
user_id="user-1",
)
assert result.response_text == "Hello world"
assert result.total_tokens == 15
mock_registry.create_session.assert_awaited_once()
# StreamFinish should NOT be published (mark_session_completed does it)
published_types = [
type(call.args[1]).__name__
for call in mock_registry.publish_chunk.call_args_list
]
assert "StreamFinish" not in published_types
assert "StreamTextDelta" in published_types
mock_registry.mark_session_completed.assert_awaited_once()
_, kwargs = mock_registry.mark_session_completed.call_args
assert kwargs.get("error_message") is None
@pytest.mark.asyncio
async def test_stream_registry_error_on_stream_error(mock_registry, stream_fn_patch):
"""mark_session_completed receives error message when StreamError occurs."""
events = [
StreamTextDelta(id="t1", delta="partial"),
StreamError(errorText="something broke"),
]
with stream_fn_patch(events):
with pytest.raises(RuntimeError, match="something broke"):
await collect_copilot_response(
session_id="test-session",
message="hi",
user_id="user-1",
)
_, kwargs = mock_registry.mark_session_completed.call_args
assert kwargs.get("error_message") == "something broke"
# stream_and_publish skips StreamError, so mark_session_completed must
# publish it (skip_error_publish=False).
assert kwargs.get("skip_error_publish") is False
# StreamError should NOT be published via publish_chunk — mark_session_completed
# handles it to avoid double-publication.
published_types = [
type(call.args[1]).__name__
for call in mock_registry.publish_chunk.call_args_list
]
assert "StreamError" not in published_types
@pytest.mark.asyncio
async def test_graceful_degradation_when_create_session_fails(
mock_registry, stream_fn_patch
):
"""AutoPilot still works when stream registry create_session raises."""
events = [
StreamTextDelta(id="t1", delta="works"),
StreamFinish(),
]
mock_registry.create_session = AsyncMock(side_effect=ConnectionError("Redis down"))
with stream_fn_patch(events):
result = await collect_copilot_response(
session_id="test-session",
message="hi",
user_id="user-1",
)
assert result.response_text == "works"
# publish_chunk should NOT be called because turn_id was cleared
mock_registry.publish_chunk.assert_not_awaited()
# mark_session_completed IS still called to clean up any partial state
mock_registry.mark_session_completed.assert_awaited_once()
@pytest.mark.asyncio
async def test_tool_calls_published_and_collected(mock_registry, stream_fn_patch):
"""Tool call events are both published to registry and collected in result."""
events = [
StreamToolInputAvailable(
toolCallId="tc-1", toolName="read_file", input={"path": "/tmp"}
),
StreamToolOutputAvailable(
toolCallId="tc-1", output="file contents", success=True
),
StreamTextDelta(id="t1", delta="done"),
StreamFinish(),
]
with stream_fn_patch(events):
result = await collect_copilot_response(
session_id="test-session",
message="hi",
user_id="user-1",
)
assert len(result.tool_calls) == 1
assert result.tool_calls[0]["tool_name"] == "read_file"
assert result.tool_calls[0]["output"] == "file contents"
assert result.tool_calls[0]["success"] is True
assert result.response_text == "done"


@@ -25,7 +25,7 @@ from backend.copilot.sdk.compaction import (
def _make_session() -> ChatSession:
return ChatSession.new(user_id="test-user")
return ChatSession.new(user_id="test-user", dry_run=False)
# ---------------------------------------------------------------------------


@@ -25,24 +25,64 @@ def build_test_transcript(pairs: list[tuple[str, str]]) -> str:
Use this helper in any copilot SDK test that needs a well-formed
transcript without hitting the real storage layer.
Delegates to ``build_structured_transcript`` — plain content strings
are automatically wrapped in ``[{"type": "text", "text": ...}]`` for
assistant messages.
"""
# Cast widening: tuple[str, str] is structurally compatible with
# tuple[str, str | list[dict]] but list invariance requires explicit
# annotation.
widened: list[tuple[str, str | list[dict]]] = list(pairs)
return build_structured_transcript(widened)
def build_structured_transcript(
entries: list[tuple[str, str | list[dict]]],
) -> str:
"""Build a JSONL transcript with structured content blocks.
Each entry is (role, content) where content is either a plain string
(for user messages) or a list of content block dicts (for assistant
messages with thinking/tool_use/text blocks).
Example::
build_structured_transcript([
("user", "Hello"),
("assistant", [
{"type": "thinking", "thinking": "...", "signature": "sig1"},
{"type": "text", "text": "Hi there"},
]),
])
"""
lines: list[str] = []
last_uuid: str | None = None
for role, content in pairs:
for role, content in entries:
uid = str(uuid4())
entry_type = "assistant" if role == "assistant" else "user"
msg: dict = {"role": role, "content": content}
if role == "assistant":
msg.update(
{
"model": "",
"id": f"msg_{uid[:8]}",
"type": "message",
"content": [{"type": "text", "text": content}],
"stop_reason": "end_turn",
"stop_sequence": None,
}
)
if role == "assistant" and isinstance(content, list):
msg: dict = {
"role": "assistant",
"model": "claude-test",
"id": f"msg_{uid[:8]}",
"type": "message",
"content": content,
"stop_reason": "end_turn",
"stop_sequence": None,
}
elif role == "assistant":
msg = {
"role": "assistant",
"model": "claude-test",
"id": f"msg_{uid[:8]}",
"type": "message",
"content": [{"type": "text", "text": content}],
"stop_reason": "end_turn",
"stop_sequence": None,
}
else:
msg = {"role": role, "content": content}
entry = {
"type": entry_type,
"uuid": uid,


@@ -2,7 +2,7 @@
When E2B is active, these tools replace the SDK built-in Read/Write/Edit/
Glob/Grep so that all file operations share the same ``/home/user``
filesystem as ``bash_exec``.
and ``/tmp`` filesystems as ``bash_exec``.
SDK-internal paths (``~/.claude/projects/…/tool-results/``) are handled
by the separate ``Read`` MCP tool registered in ``tool_adapter.py``.
@@ -16,10 +16,13 @@ import shlex
from typing import Any, Callable
from backend.copilot.context import (
E2B_ALLOWED_DIRS,
E2B_ALLOWED_DIRS_STR,
E2B_WORKDIR,
get_current_sandbox,
get_sdk_cwd,
is_allowed_local_path,
is_within_allowed_dirs,
resolve_sandbox_path,
)
@@ -36,7 +39,7 @@ async def _check_sandbox_symlink_escape(
``readlink -f`` follows actual symlinks on the sandbox filesystem.
Returns the canonical parent path, or ``None`` if the path escapes
``E2B_WORKDIR``.
the allowed sandbox directories.
Note: There is an inherent TOCTOU window between this check and the
subsequent ``sandbox.files.write()``. A symlink could theoretically be
@@ -52,10 +55,7 @@ async def _check_sandbox_symlink_escape(
if (
canonical_res.exit_code != 0
or not canonical_parent
or (
canonical_parent != E2B_WORKDIR
and not canonical_parent.startswith(E2B_WORKDIR + "/")
)
or not is_within_allowed_dirs(canonical_parent)
):
return None
return canonical_parent
@@ -89,6 +89,38 @@ def _get_sandbox_and_path(
return sandbox, remote
async def _sandbox_write(sandbox: Any, path: str, content: str) -> None:
    """Write *content* to *path* inside the sandbox.
    The E2B filesystem API (``sandbox.files.write``) and the command API
    (``sandbox.commands.run``) run as **different users**. On ``/tmp``
    (which has the sticky bit set) this means ``sandbox.files.write`` can
    create new files but cannot overwrite files previously created by
    ``sandbox.commands.run`` (or itself), because the sticky bit restricts
    deletion/rename to the file owner.
    To work around this, writes targeting ``/tmp`` are base64-encoded and
    decoded with ``base64 -d`` into a shell redirect through the command
    API, which runs as the sandbox ``user`` and can therefore always
    overwrite user-owned files.
    """
    if path == "/tmp" or path.startswith("/tmp/"):
        import base64 as _b64
        # base64 keeps arbitrary content safe from shell interpretation.
        encoded = _b64.b64encode(content.encode()).decode()
        result = await sandbox.commands.run(
            f"echo {shlex.quote(encoded)} | base64 -d > {shlex.quote(path)}",
            cwd=E2B_WORKDIR,
            timeout=10,
        )
        if result.exit_code != 0:
            raise RuntimeError(
                f"shell write failed (exit {result.exit_code}): "
                + (result.stderr or "").strip()
            )
    else:
        await sandbox.files.write(path, content)
# Tool handlers
@@ -139,13 +171,16 @@ async def _handle_write_file(args: dict[str, Any]) -> dict[str, Any]:
try:
parent = os.path.dirname(remote)
if parent and parent != E2B_WORKDIR:
if parent and parent not in E2B_ALLOWED_DIRS:
await sandbox.files.make_dir(parent)
canonical_parent = await _check_sandbox_symlink_escape(sandbox, parent)
if canonical_parent is None:
return _mcp(f"Path must be within {E2B_WORKDIR}: {parent}", error=True)
return _mcp(
f"Path must be within {E2B_ALLOWED_DIRS_STR}: {os.path.basename(parent)}",
error=True,
)
remote = os.path.join(canonical_parent, os.path.basename(remote))
await sandbox.files.write(remote, content)
await _sandbox_write(sandbox, remote, content)
except Exception as exc:
return _mcp(f"Failed to write {remote}: {exc}", error=True)
@@ -172,7 +207,10 @@ async def _handle_edit_file(args: dict[str, Any]) -> dict[str, Any]:
parent = os.path.dirname(remote)
canonical_parent = await _check_sandbox_symlink_escape(sandbox, parent)
if canonical_parent is None:
return _mcp(f"Path must be within {E2B_WORKDIR}: {parent}", error=True)
return _mcp(
f"Path must be within {E2B_ALLOWED_DIRS_STR}: {os.path.basename(parent)}",
error=True,
)
remote = os.path.join(canonical_parent, os.path.basename(remote))
try:
@@ -197,7 +235,7 @@ async def _handle_edit_file(args: dict[str, Any]) -> dict[str, Any]:
else content.replace(old_string, new_string, 1)
)
try:
await sandbox.files.write(remote, updated)
await _sandbox_write(sandbox, remote, updated)
except Exception as exc:
return _mcp(f"Failed to write {remote}: {exc}", error=True)
@@ -290,14 +328,14 @@ def _read_local(file_path: str, offset: int, limit: int) -> dict[str, Any]:
E2B_FILE_TOOLS: list[tuple[str, str, dict[str, Any], Callable[..., Any]]] = [
(
"read_file",
"Read a file from the cloud sandbox (/home/user). "
"Read a file from the cloud sandbox (/home/user or /tmp). "
"Use offset and limit for large files.",
{
"type": "object",
"properties": {
"file_path": {
"type": "string",
"description": "Path (relative to /home/user, or absolute).",
"description": "Path (relative to /home/user, or absolute under /home/user or /tmp).",
},
"offset": {
"type": "integer",
@@ -314,7 +352,7 @@ E2B_FILE_TOOLS: list[tuple[str, str, dict[str, Any], Callable[..., Any]]] = [
),
(
"write_file",
"Write or create a file in the cloud sandbox (/home/user). "
"Write or create a file in the cloud sandbox (/home/user or /tmp). "
"Parent directories are created automatically. "
"To copy a workspace file into the sandbox, use "
"read_workspace_file with save_to_path instead.",
@@ -323,7 +361,7 @@ E2B_FILE_TOOLS: list[tuple[str, str, dict[str, Any], Callable[..., Any]]] = [
"properties": {
"file_path": {
"type": "string",
"description": "Path (relative to /home/user, or absolute).",
"description": "Path (relative to /home/user, or absolute under /home/user or /tmp).",
},
"content": {"type": "string", "description": "Content to write."},
},
@@ -340,7 +378,7 @@ E2B_FILE_TOOLS: list[tuple[str, str, dict[str, Any], Callable[..., Any]]] = [
"properties": {
"file_path": {
"type": "string",
"description": "Path (relative to /home/user, or absolute).",
"description": "Path (relative to /home/user, or absolute under /home/user or /tmp).",
},
"old_string": {"type": "string", "description": "Text to find."},
"new_string": {"type": "string", "description": "Replacement text."},
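The diff imports `is_within_allowed_dirs` from `backend.copilot.context` but does not show its body. The sketch below illustrates the kind of boundary-aware prefix check the replaced `E2B_WORKDIR` comparison suggests. The function name `within_allowed_dirs` and the `ALLOWED_DIRS` tuple are assumptions for illustration, not the project's actual implementation.

```python
import posixpath

# Assumed to mirror E2B_ALLOWED_DIRS; the real constant is not shown in the diff.
ALLOWED_DIRS = ("/home/user", "/tmp")


def within_allowed_dirs(path: str, allowed: tuple[str, ...] = ALLOWED_DIRS) -> bool:
    """Return True if *path* is an allowed directory or lies beneath one.

    The path is normalised lexically first, so ``..`` segments cannot be
    used to step outside. Symlinks are NOT resolved here; that is what the
    separate ``readlink -f`` check inside the sandbox is for.
    """
    normalized = posixpath.normpath(path)
    return any(
        # Compare against "base/" rather than "base" so that a sibling
        # such as /tmp_evil does not pass as a child of /tmp.
        normalized == base or normalized.startswith(base + "/")
        for base in allowed
    )
```

Appending the separator before the prefix comparison is what makes the `/tmp_evil` case in the accompanying tests fail as intended.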


@@ -15,6 +15,7 @@ from backend.copilot.context import E2B_WORKDIR, SDK_PROJECTS_DIR, _current_proj
from .e2b_file_tools import (
_check_sandbox_symlink_escape,
_read_local,
_sandbox_write,
resolve_sandbox_path,
)
@@ -39,23 +40,23 @@ class TestResolveSandboxPath:
assert resolve_sandbox_path("./README.md") == f"{E2B_WORKDIR}/README.md"
def test_traversal_blocked(self):
with pytest.raises(ValueError, match=f"must be within {E2B_WORKDIR}"):
with pytest.raises(ValueError, match="must be within"):
resolve_sandbox_path("../../etc/passwd")
def test_absolute_traversal_blocked(self):
with pytest.raises(ValueError, match=f"must be within {E2B_WORKDIR}"):
with pytest.raises(ValueError, match="must be within"):
resolve_sandbox_path(f"{E2B_WORKDIR}/../../etc/passwd")
def test_absolute_outside_sandbox_blocked(self):
with pytest.raises(ValueError, match=f"must be within {E2B_WORKDIR}"):
with pytest.raises(ValueError, match="must be within"):
resolve_sandbox_path("/etc/passwd")
def test_root_blocked(self):
with pytest.raises(ValueError, match=f"must be within {E2B_WORKDIR}"):
with pytest.raises(ValueError, match="must be within"):
resolve_sandbox_path("/")
def test_home_other_user_blocked(self):
with pytest.raises(ValueError, match=f"must be within {E2B_WORKDIR}"):
with pytest.raises(ValueError, match="must be within"):
resolve_sandbox_path("/home/other/file.txt")
def test_deep_nested_allowed(self):
@@ -68,6 +69,24 @@ class TestResolveSandboxPath:
"""Path that resolves back within E2B_WORKDIR is allowed."""
assert resolve_sandbox_path("a/b/../c.txt") == f"{E2B_WORKDIR}/a/c.txt"
def test_tmp_absolute_allowed(self):
assert resolve_sandbox_path("/tmp/data.txt") == "/tmp/data.txt"
def test_tmp_nested_allowed(self):
assert resolve_sandbox_path("/tmp/a/b/c.txt") == "/tmp/a/b/c.txt"
def test_tmp_itself_allowed(self):
assert resolve_sandbox_path("/tmp") == "/tmp"
def test_tmp_escape_blocked(self):
with pytest.raises(ValueError, match="must be within"):
resolve_sandbox_path("/tmp/../etc/passwd")
def test_tmp_prefix_collision_blocked(self):
"""A path like /tmp_evil should be blocked (not a prefix match)."""
with pytest.raises(ValueError, match="must be within"):
resolve_sandbox_path("/tmp_evil/malicious.txt")
# ---------------------------------------------------------------------------
# _read_local — host filesystem reads with allowlist enforcement
@@ -227,3 +246,92 @@ class TestCheckSandboxSymlinkEscape:
sandbox = _make_sandbox(stdout=f"{E2B_WORKDIR}/a/b/c/d\n", exit_code=0)
result = await _check_sandbox_symlink_escape(sandbox, f"{E2B_WORKDIR}/a/b/c/d")
assert result == f"{E2B_WORKDIR}/a/b/c/d"
@pytest.mark.asyncio
async def test_tmp_path_allowed(self):
"""Paths resolving to /tmp are allowed."""
sandbox = _make_sandbox(stdout="/tmp/workdir\n", exit_code=0)
result = await _check_sandbox_symlink_escape(sandbox, "/tmp/workdir")
assert result == "/tmp/workdir"
@pytest.mark.asyncio
async def test_tmp_itself_allowed(self):
"""The /tmp directory itself is allowed."""
sandbox = _make_sandbox(stdout="/tmp\n", exit_code=0)
result = await _check_sandbox_symlink_escape(sandbox, "/tmp")
assert result == "/tmp"
# ---------------------------------------------------------------------------
# _sandbox_write — routing writes through shell for /tmp paths
# ---------------------------------------------------------------------------
class TestSandboxWrite:
@pytest.mark.asyncio
async def test_tmp_path_uses_shell_command(self):
"""Writes to /tmp should use commands.run (shell) instead of files.write."""
run_result = SimpleNamespace(stdout="", stderr="", exit_code=0)
commands = SimpleNamespace(run=AsyncMock(return_value=run_result))
files = SimpleNamespace(write=AsyncMock())
sandbox = SimpleNamespace(commands=commands, files=files)
await _sandbox_write(sandbox, "/tmp/test.py", "print('hello')")
commands.run.assert_called_once()
files.write.assert_not_called()
@pytest.mark.asyncio
async def test_home_user_path_uses_files_api(self):
"""Writes to /home/user should use sandbox.files.write."""
run_result = SimpleNamespace(stdout="", stderr="", exit_code=0)
commands = SimpleNamespace(run=AsyncMock(return_value=run_result))
files = SimpleNamespace(write=AsyncMock())
sandbox = SimpleNamespace(commands=commands, files=files)
await _sandbox_write(sandbox, "/home/user/test.py", "print('hello')")
files.write.assert_called_once_with("/home/user/test.py", "print('hello')")
commands.run.assert_not_called()
@pytest.mark.asyncio
async def test_tmp_nested_path_uses_shell_command(self):
"""Writes to nested /tmp paths should use commands.run."""
run_result = SimpleNamespace(stdout="", stderr="", exit_code=0)
commands = SimpleNamespace(run=AsyncMock(return_value=run_result))
files = SimpleNamespace(write=AsyncMock())
sandbox = SimpleNamespace(commands=commands, files=files)
await _sandbox_write(sandbox, "/tmp/subdir/file.txt", "content")
commands.run.assert_called_once()
files.write.assert_not_called()
@pytest.mark.asyncio
async def test_tmp_write_shell_failure_raises(self):
"""Shell write failure should raise RuntimeError."""
run_result = SimpleNamespace(stdout="", stderr="No space left", exit_code=1)
commands = SimpleNamespace(run=AsyncMock(return_value=run_result))
sandbox = SimpleNamespace(commands=commands)
with pytest.raises(RuntimeError, match="shell write failed"):
await _sandbox_write(sandbox, "/tmp/test.txt", "content")
@pytest.mark.asyncio
async def test_tmp_write_preserves_content_with_special_chars(self):
"""Content with special shell characters should be preserved via base64."""
import base64
run_result = SimpleNamespace(stdout="", stderr="", exit_code=0)
commands = SimpleNamespace(run=AsyncMock(return_value=run_result))
sandbox = SimpleNamespace(commands=commands)
content = "print(\"Hello $USER\")\n# a `backtick` and 'quotes'\n"
await _sandbox_write(sandbox, "/tmp/special.py", content)
# Verify the command contains base64-encoded content
call_args = commands.run.call_args[0][0]
# Extract the base64 string from the command
encoded_in_cmd = call_args.split("echo ")[1].split(" |")[0].strip("'")
decoded = base64.b64decode(encoded_in_cmd).decode()
assert decoded == content
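The last test recovers the payload by splitting the command string on `"echo "` and `" |"`. A slightly sturdier way to check the same round trip is to tokenise the command with `shlex.split`, as sketched below. The helper names `build_write_command` and `decode_payload` are hypothetical; only the command shape is taken from `_sandbox_write` in the diff.

```python
import base64
import shlex


def build_write_command(path: str, content: str) -> str:
    """Build the shell command used for /tmp writes (same shape as the diff)."""
    encoded = base64.b64encode(content.encode()).decode()
    return f"echo {shlex.quote(encoded)} | base64 -d > {shlex.quote(path)}"


def decode_payload(command: str) -> str:
    """Recover the original content from a command built above."""
    # Tokens are: echo, <base64>, |, base64, -d, >, <path>
    tokens = shlex.split(command)
    return base64.b64decode(tokens[1]).decode()


content = "print(\"Hello $USER\")\n# a `backtick` and 'quotes'\n"
command = build_write_command("/tmp/my file.py", content)
```

Because the base64 alphabet contains no shell metacharacters, the payload needs no quoting at all; `shlex.quote` only does real work on the target path.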


@@ -275,7 +275,7 @@ class TestCompactionE2E:
# --- Step 7: CompactionTracker receives PreCompact hook ---
tracker = CompactionTracker()
session = ChatSession.new(user_id="test-user")
session = ChatSession.new(user_id="test-user", dry_run=False)
tracker.on_compact(str(session_file))
# --- Step 8: Next SDK message arrives → emit_start ---
@@ -376,7 +376,7 @@ class TestCompactionE2E:
monkeypatch.setenv("CLAUDE_CONFIG_DIR", str(config_dir))
tracker = CompactionTracker()
session = ChatSession.new(user_id="test")
session = ChatSession.new(user_id="test", dry_run=False)
builder = TranscriptBuilder()
# --- First query with compaction ---


@@ -0,0 +1,68 @@
"""SDK environment variable builder — importable without circular deps.
Extracted from ``service.py`` so that ``backend.blocks.orchestrator``
can reuse the same subscription / OpenRouter / direct-Anthropic logic
without pulling in the full copilot service module (which would create a
circular import through ``executor`` → ``credit`` → ``block_cost_config``).
"""
from __future__ import annotations
from backend.copilot.config import ChatConfig
from backend.copilot.sdk.subscription import validate_subscription
# ChatConfig is stateless (reads env vars) — a separate instance is fine.
# A singleton would require importing service.py which causes the circular dep
# this module was created to avoid.
config = ChatConfig()
def build_sdk_env(
    session_id: str | None = None,
    user_id: str | None = None,
) -> dict[str, str]:
    """Build env vars for the SDK CLI subprocess.
    Three modes (checked in order):
    1. **Subscription** — clears all keys; CLI uses ``claude login`` auth.
    2. **Direct Anthropic** — returns ``{}``; subprocess inherits
       ``ANTHROPIC_API_KEY`` from the parent environment.
    3. **OpenRouter** (default) — overrides base URL and auth token to
       route through the proxy, with Langfuse trace headers.
    """
    # --- Mode 1: Claude Code subscription auth ---
    if config.use_claude_code_subscription:
        validate_subscription()
        return {
            "ANTHROPIC_API_KEY": "",
            "ANTHROPIC_AUTH_TOKEN": "",
            "ANTHROPIC_BASE_URL": "",
        }
    # --- Mode 2: Direct Anthropic (no proxy hop) ---
    if not config.openrouter_active:
        return {}
    # --- Mode 3: OpenRouter proxy ---
    base = (config.base_url or "").rstrip("/")
    if base.endswith("/v1"):
        base = base[:-3]
    env: dict[str, str] = {
        "ANTHROPIC_BASE_URL": base,
        "ANTHROPIC_AUTH_TOKEN": config.api_key or "",
        "ANTHROPIC_API_KEY": "",  # force CLI to use AUTH_TOKEN
    }
    # Inject broadcast headers so OpenRouter forwards traces to Langfuse.
    def _safe(v: str) -> str:
        return v.replace("\r", "").replace("\n", "").strip()[:128]
    parts = []
    if session_id:
        parts.append(f"x-session-id: {_safe(session_id)}")
    if user_id:
        parts.append(f"x-user-id: {_safe(user_id)}")
    if parts:
        env["ANTHROPIC_CUSTOM_HEADERS"] = "\n".join(parts)
    return env
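The docstring distinguishes returning `{}` (inherit everything) from returning keys set to `""` (blank them out). That distinction only matters if the caller layers the returned dict over the parent environment. The sketch below shows that layering under that assumption. `overlay_env` is a hypothetical helper, and how the SDK actually merges the dict is not shown in this diff.

```python
def overlay_env(parent: dict[str, str], overrides: dict[str, str]) -> dict[str, str]:
    """Layer *overrides* on top of *parent* without mutating either dict."""
    merged = dict(parent)
    merged.update(overrides)
    return merged


# Parent process environment (illustrative values only).
parent_env = {"ANTHROPIC_API_KEY": "sk-parent", "PATH": "/usr/bin"}

# Mode 2 (direct Anthropic): an empty dict leaves the parent's key in place.
direct = overlay_env(parent_env, {})

# Mode 1 (subscription): empty strings shadow the parent's key.
subscription = overlay_env(
    parent_env,
    {
        "ANTHROPIC_API_KEY": "",
        "ANTHROPIC_AUTH_TOKEN": "",
        "ANTHROPIC_BASE_URL": "",
    },
)
```

Under this model an empty string is an explicit override, whereas omitting the key is a pass-through, which matches the wording of the three modes.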


@@ -0,0 +1,242 @@
"""Tests for build_sdk_env() — the SDK subprocess environment builder."""
from unittest.mock import patch
import pytest
from backend.copilot.config import ChatConfig
# ---------------------------------------------------------------------------
# Helpers — build a ChatConfig with explicit field values so tests don't
# depend on real environment variables.
# ---------------------------------------------------------------------------
def _make_config(**overrides) -> ChatConfig:
"""Create a ChatConfig with safe defaults, applying *overrides*."""
defaults = {
"use_claude_code_subscription": False,
"use_openrouter": False,
"api_key": None,
"base_url": None,
}
defaults.update(overrides)
return ChatConfig(**defaults)
# ---------------------------------------------------------------------------
# Mode 1 — Subscription auth
# ---------------------------------------------------------------------------
class TestBuildSdkEnvSubscription:
"""When ``use_claude_code_subscription`` is True, keys are blanked."""
@patch("backend.copilot.sdk.env.validate_subscription")
def test_returns_blanked_keys(self, mock_validate):
"""Subscription mode clears API_KEY, AUTH_TOKEN, and BASE_URL."""
cfg = _make_config(use_claude_code_subscription=True)
with patch("backend.copilot.sdk.env.config", cfg):
from backend.copilot.sdk.env import build_sdk_env
result = build_sdk_env()
assert result == {
"ANTHROPIC_API_KEY": "",
"ANTHROPIC_AUTH_TOKEN": "",
"ANTHROPIC_BASE_URL": "",
}
mock_validate.assert_called_once()
@patch(
"backend.copilot.sdk.env.validate_subscription",
side_effect=RuntimeError("CLI not found"),
)
def test_propagates_validation_error(self, mock_validate):
"""If validate_subscription fails, the error bubbles up."""
cfg = _make_config(use_claude_code_subscription=True)
with patch("backend.copilot.sdk.env.config", cfg):
from backend.copilot.sdk.env import build_sdk_env
with pytest.raises(RuntimeError, match="CLI not found"):
build_sdk_env()
# ---------------------------------------------------------------------------
# Mode 2 — Direct Anthropic (no OpenRouter)
# ---------------------------------------------------------------------------
class TestBuildSdkEnvDirectAnthropic:
"""When OpenRouter is inactive, return empty dict (inherit parent env)."""
def test_returns_empty_dict_when_openrouter_inactive(self):
cfg = _make_config(use_openrouter=False)
with patch("backend.copilot.sdk.env.config", cfg):
from backend.copilot.sdk.env import build_sdk_env
result = build_sdk_env()
assert result == {}
def test_returns_empty_dict_when_openrouter_flag_true_but_no_key(self):
"""OpenRouter flag is True but no api_key => openrouter_active is False."""
cfg = _make_config(use_openrouter=True, base_url="https://openrouter.ai/api/v1")
# Force api_key to None after construction (field_validator may pick up env vars)
object.__setattr__(cfg, "api_key", None)
assert not cfg.openrouter_active
with patch("backend.copilot.sdk.env.config", cfg):
from backend.copilot.sdk.env import build_sdk_env
result = build_sdk_env()
assert result == {}
# ---------------------------------------------------------------------------
# Mode 3 — OpenRouter proxy
# ---------------------------------------------------------------------------
class TestBuildSdkEnvOpenRouter:
"""When OpenRouter is active, return proxy env vars."""
def _openrouter_config(self, **overrides):
defaults = {
"use_openrouter": True,
"api_key": "sk-or-test-key",
"base_url": "https://openrouter.ai/api/v1",
}
defaults.update(overrides)
return _make_config(**defaults)
def test_basic_openrouter_env(self):
cfg = self._openrouter_config()
with patch("backend.copilot.sdk.env.config", cfg):
from backend.copilot.sdk.env import build_sdk_env
result = build_sdk_env()
assert result["ANTHROPIC_BASE_URL"] == "https://openrouter.ai/api"
assert result["ANTHROPIC_AUTH_TOKEN"] == "sk-or-test-key"
assert result["ANTHROPIC_API_KEY"] == ""
assert "ANTHROPIC_CUSTOM_HEADERS" not in result
def test_strips_trailing_v1(self):
"""The /v1 suffix is stripped from the base URL."""
cfg = self._openrouter_config(base_url="https://openrouter.ai/api/v1")
with patch("backend.copilot.sdk.env.config", cfg):
from backend.copilot.sdk.env import build_sdk_env
result = build_sdk_env()
assert result["ANTHROPIC_BASE_URL"] == "https://openrouter.ai/api"
def test_strips_trailing_v1_and_slash(self):
"""Trailing slash before /v1 strip is handled."""
cfg = self._openrouter_config(base_url="https://openrouter.ai/api/v1/")
with patch("backend.copilot.sdk.env.config", cfg):
from backend.copilot.sdk.env import build_sdk_env
result = build_sdk_env()
# rstrip("/") first, then remove /v1
assert result["ANTHROPIC_BASE_URL"] == "https://openrouter.ai/api"
def test_no_v1_suffix_left_alone(self):
"""A base URL without /v1 is used as-is."""
cfg = self._openrouter_config(base_url="https://custom-proxy.example.com")
with patch("backend.copilot.sdk.env.config", cfg):
from backend.copilot.sdk.env import build_sdk_env
result = build_sdk_env()
assert result["ANTHROPIC_BASE_URL"] == "https://custom-proxy.example.com"
def test_session_id_header(self):
cfg = self._openrouter_config()
with patch("backend.copilot.sdk.env.config", cfg):
from backend.copilot.sdk.env import build_sdk_env
result = build_sdk_env(session_id="sess-123")
assert "ANTHROPIC_CUSTOM_HEADERS" in result
assert "x-session-id: sess-123" in result["ANTHROPIC_CUSTOM_HEADERS"]
def test_user_id_header(self):
cfg = self._openrouter_config()
with patch("backend.copilot.sdk.env.config", cfg):
from backend.copilot.sdk.env import build_sdk_env
result = build_sdk_env(user_id="user-456")
assert "x-user-id: user-456" in result["ANTHROPIC_CUSTOM_HEADERS"]
def test_both_headers(self):
cfg = self._openrouter_config()
with patch("backend.copilot.sdk.env.config", cfg):
from backend.copilot.sdk.env import build_sdk_env
result = build_sdk_env(session_id="s1", user_id="u2")
headers = result["ANTHROPIC_CUSTOM_HEADERS"]
assert "x-session-id: s1" in headers
assert "x-user-id: u2" in headers
# They should be newline-separated
assert "\n" in headers
def test_header_sanitisation_strips_newlines(self):
"""Newlines/carriage-returns in header values are stripped."""
cfg = self._openrouter_config()
with patch("backend.copilot.sdk.env.config", cfg):
from backend.copilot.sdk.env import build_sdk_env
result = build_sdk_env(session_id="bad\r\nvalue")
header_val = result["ANTHROPIC_CUSTOM_HEADERS"]
# The _safe helper removes \r and \n
assert "\r" not in header_val.split(": ", 1)[1]
assert "badvalue" in header_val
def test_header_value_truncated_to_128_chars(self):
"""Header values are truncated to 128 characters."""
cfg = self._openrouter_config()
with patch("backend.copilot.sdk.env.config", cfg):
from backend.copilot.sdk.env import build_sdk_env
long_id = "x" * 200
result = build_sdk_env(session_id=long_id)
# The value after "x-session-id: " should be at most 128 chars
header_line = result["ANTHROPIC_CUSTOM_HEADERS"]
value = header_line.split(": ", 1)[1]
assert len(value) == 128
# ---------------------------------------------------------------------------
# Mode priority
# ---------------------------------------------------------------------------
class TestBuildSdkEnvModePriority:
"""Subscription mode takes precedence over OpenRouter."""
@patch("backend.copilot.sdk.env.validate_subscription")
def test_subscription_overrides_openrouter(self, mock_validate):
cfg = _make_config(
use_claude_code_subscription=True,
use_openrouter=True,
api_key="sk-or-key",
base_url="https://openrouter.ai/api/v1",
)
with patch("backend.copilot.sdk.env.config", cfg):
from backend.copilot.sdk.env import build_sdk_env
result = build_sdk_env()
# Should get subscription result, not OpenRouter
assert result == {
"ANTHROPIC_API_KEY": "",
"ANTHROPIC_AUTH_TOKEN": "",
"ANTHROPIC_BASE_URL": "",
}


@@ -38,7 +38,7 @@ class TestFlattenAssistantContent:
def test_tool_use_blocks(self):
blocks = [{"type": "tool_use", "name": "read_file", "input": {}}]
assert _flatten_assistant_content(blocks) == "[tool_use: read_file]"
assert _flatten_assistant_content(blocks) == ""
def test_mixed_blocks(self):
blocks = [
@@ -47,19 +47,22 @@ class TestFlattenAssistantContent:
]
result = _flatten_assistant_content(blocks)
assert "Let me read that." in result
assert "[tool_use: Read]" in result
# tool_use blocks are dropped entirely to prevent model mimicry
assert "Read" not in result
def test_raw_strings(self):
assert _flatten_assistant_content(["hello", "world"]) == "hello\nworld"
def test_unknown_block_type_preserved_as_placeholder(self):
def test_unknown_block_type_dropped(self):
blocks = [
{"type": "text", "text": "See this image:"},
{"type": "image", "source": {"type": "base64", "data": "..."}},
]
result = _flatten_assistant_content(blocks)
assert "See this image:" in result
assert "[__image__]" in result
# Unknown block types are dropped to prevent model mimicry
assert "[__image__]" not in result
assert "base64" not in result
def test_empty(self):
assert _flatten_assistant_content([]) == ""
@@ -279,7 +282,8 @@ class TestTranscriptToMessages:
messages = _transcript_to_messages(content)
assert len(messages) == 2
assert "Let me check." in messages[0]["content"]
assert "[tool_use: read_file]" in messages[0]["content"]
# tool_use blocks are dropped entirely to prevent model mimicry
assert "read_file" not in messages[0]["content"]
assert messages[1]["content"] == "file contents"
@@ -442,8 +446,11 @@ class TestCompactTranscript:
assert result is not None
assert validate_transcript(result)
msgs = _transcript_to_messages(result)
assert len(msgs) == 2
# 3 messages: compressed prefix (2) + preserved last assistant (1)
assert len(msgs) == 3
assert msgs[1]["content"] == "Summarized response"
# The last assistant entry is preserved verbatim from original
assert msgs[2]["content"] == "Details"
@pytest.mark.asyncio
async def test_returns_none_on_compression_failure(self, mock_chat_config):
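The updated assertions describe a flattening rule in which only text survives: `tool_use` blocks and unknown block types are dropped instead of being rendered as placeholders. A minimal sketch consistent with those assertions is given below. It is not the project's `_flatten_assistant_content`, whose body is not part of this diff; the function name and the handling of non-dict, non-string items are assumptions.

```python
from typing import Any


def flatten_assistant_content(blocks: list[Any]) -> str:
    """Join the text of an assistant message, dropping everything else."""
    parts: list[str] = []
    for block in blocks:
        if isinstance(block, str):
            # Raw strings are passed through unchanged.
            parts.append(block)
        elif isinstance(block, dict) and block.get("type") == "text":
            text = block.get("text", "")
            if text:
                parts.append(text)
        # tool_use, image and any other block types are skipped entirely,
        # so no "[tool_use: ...]" placeholder can be echoed back by the model.
    return "\n".join(parts)
```

Dropping the blocks outright, with no placeholder left behind, is what the "prevent model mimicry" comments in the tests refer to.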


@@ -49,22 +49,22 @@ def test_format_assistant_tool_calls():
)
]
result = _format_conversation_context(msgs)
assert result is not None
assert 'You called tool: search({"q": "test"})' in result
# Assistant with no content and tool_calls omitted produces no lines
assert result is None
def test_format_tool_result():
msgs = [ChatMessage(role="tool", content='{"result": "ok"}')]
result = _format_conversation_context(msgs)
assert result is not None
assert 'Tool result: {"result": "ok"}' in result
assert 'Tool output: {"result": "ok"}' in result
def test_format_tool_result_none_content():
msgs = [ChatMessage(role="tool", content=None)]
result = _format_conversation_context(msgs)
assert result is not None
assert "Tool result: " in result
assert "Tool output: " in result
def test_format_full_conversation():
@@ -84,8 +84,8 @@ def test_format_full_conversation():
assert result is not None
assert "User: find agents" in result
assert "You responded: I'll search for agents." in result
assert "You called tool: find_agents" in result
assert "Tool result:" in result
# tool_calls are omitted to prevent model mimicry
assert "Tool output:" in result
assert "You responded: Found Agent1." in result


@@ -15,6 +15,7 @@ from claude_agent_sdk import (
ResultMessage,
SystemMessage,
TextBlock,
ThinkingBlock,
ToolResultBlock,
ToolUseBlock,
UserMessage,
@@ -26,6 +27,7 @@ from backend.copilot.response_model import (
StreamError,
StreamFinish,
StreamFinishStep,
StreamHeartbeat,
StreamStart,
StreamStartStep,
StreamTextDelta,
@@ -75,6 +77,12 @@ class SDKResponseAdapter:
# Open the first step (matches non-SDK: StreamStart then StreamStartStep)
responses.append(StreamStartStep())
self.step_open = True
elif sdk_message.subtype == "task_progress":
# Emit a heartbeat so publish_chunk is called during long
# sub-agent runs. Without this, the Redis stream and meta
# key TTLs expire during gaps where no real chunks are
# produced (task_progress events were previously silent).
responses.append(StreamHeartbeat())
elif isinstance(sdk_message, AssistantMessage):
# Flush any SDK built-in tool calls that didn't get a UserMessage
@@ -100,6 +108,11 @@ class SDKResponseAdapter:
StreamTextDelta(id=self.text_block_id, delta=block.text)
)
elif isinstance(block, ThinkingBlock):
# Thinking blocks are preserved in the transcript but
# not streamed to the frontend — skip silently.
pass
elif isinstance(block, ToolUseBlock):
self._end_text_if_open(responses)


@@ -18,6 +18,7 @@ from backend.copilot.response_model import (
StreamError,
StreamFinish,
StreamFinishStep,
StreamHeartbeat,
StreamStart,
StreamStartStep,
StreamTextDelta,
@@ -28,6 +29,7 @@ from backend.copilot.response_model import (
StreamToolOutputAvailable,
)
from .compaction import compaction_events
from .response_adapter import SDKResponseAdapter
from .tool_adapter import MCP_TOOL_PREFIX
from .tool_adapter import _pending_tool_outputs as _pto
@@ -59,6 +61,14 @@ def test_system_non_init_emits_nothing():
assert results == []
def test_task_progress_emits_heartbeat():
"""task_progress events emit a StreamHeartbeat to keep Redis TTL alive."""
adapter = _adapter()
results = adapter.convert_message(SystemMessage(subtype="task_progress", data={}))
assert len(results) == 1
assert isinstance(results[0], StreamHeartbeat)
# -- AssistantMessage with TextBlock -----------------------------------------
@@ -680,3 +690,102 @@ def test_already_resolved_tool_skipped_in_user_message():
assert (
len(output_events) == 0
), "Already-resolved tool should not emit duplicate output"
# -- _end_text_if_open before compaction -------------------------------------
def test_end_text_if_open_emits_text_end_before_finish_step():
"""StreamTextEnd must be emitted before StreamFinishStep during compaction.
When ``emit_end_if_ready`` fires compaction events while a text block is
still open, ``_end_text_if_open`` must close it first. If StreamFinishStep
arrives before StreamTextEnd, the Vercel AI SDK clears ``activeTextParts``
and raises "Received text-end for missing text part".
"""
adapter = _adapter()
# Open a text block by processing an AssistantMessage with text
msg = AssistantMessage(content=[TextBlock(text="partial response")], model="test")
adapter.convert_message(msg)
assert adapter.has_started_text
assert not adapter.has_ended_text
# Simulate what service.py does before yielding compaction events
pre_close: list[StreamBaseResponse] = []
adapter._end_text_if_open(pre_close)
combined = pre_close + list(compaction_events("Compacted transcript"))
text_end_idx = next(
(i for i, e in enumerate(combined) if isinstance(e, StreamTextEnd)), None
)
finish_step_idx = next(
(i for i, e in enumerate(combined) if isinstance(e, StreamFinishStep)), None
)
assert text_end_idx is not None, "StreamTextEnd must be present"
assert finish_step_idx is not None, "StreamFinishStep must be present"
assert text_end_idx < finish_step_idx, (
f"StreamTextEnd (idx={text_end_idx}) must precede "
f"StreamFinishStep (idx={finish_step_idx}) — otherwise the Vercel AI SDK "
"clears activeTextParts before text-end arrives"
)
def test_step_open_must_reset_after_compaction_finish_step():
"""Adapter step_open must be reset when compaction emits StreamFinishStep.
Compaction events bypass the adapter, so service.py must explicitly clear
step_open after yielding a StreamFinishStep from compaction. Without this,
the next AssistantMessage skips StreamStartStep because the adapter still
thinks a step is open.
"""
adapter = _adapter()
# Open a step + text block via an AssistantMessage
msg = AssistantMessage(content=[TextBlock(text="thinking...")], model="test")
adapter.convert_message(msg)
assert adapter.step_open is True
# Simulate what service.py does: close text, then check compaction events
pre_close: list[StreamBaseResponse] = []
adapter._end_text_if_open(pre_close)
events = list(compaction_events("Compacted transcript"))
if any(isinstance(ev, StreamFinishStep) for ev in events):
adapter.step_open = False
assert (
adapter.step_open is False
), "step_open must be False after compaction emits StreamFinishStep"
# Next AssistantMessage must open a new step
msg2 = AssistantMessage(content=[TextBlock(text="continued")], model="test")
results = adapter.convert_message(msg2)
assert any(
isinstance(r, StreamStartStep) for r in results
), "A new StreamStartStep must be emitted after compaction closed the step"
def test_end_text_if_open_no_op_when_no_text_open():
"""_end_text_if_open emits nothing when no text block is open."""
adapter = _adapter()
results: list[StreamBaseResponse] = []
adapter._end_text_if_open(results)
assert results == []
def test_end_text_if_open_no_op_after_text_already_ended():
"""_end_text_if_open emits nothing when the text block is already closed."""
adapter = _adapter()
msg = AssistantMessage(content=[TextBlock(text="hello")], model="test")
adapter.convert_message(msg)
# Close it once
first: list[StreamBaseResponse] = []
adapter._end_text_if_open(first)
assert len(first) == 1
assert isinstance(first[0], StreamTextEnd)
# Second call must be a no-op
second: list[StreamBaseResponse] = []
adapter._end_text_if_open(second)
assert second == []
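The ordering and idempotence rules exercised by the `_end_text_if_open` tests can be sketched in isolation. The class below is a simplified, hypothetical model of the adapter's text-block state; the event names are plain strings chosen for illustration, not the real `Stream*` classes.

```python
from dataclasses import dataclass


@dataclass
class TextState:
    """Hypothetical, simplified text-block state (not the real adapter)."""

    has_started_text: bool = False
    has_ended_text: bool = False

    def start_text(self, out: list[str]) -> None:
        # Open a block only if none is currently open.
        if not self.has_started_text or self.has_ended_text:
            self.has_started_text = True
            self.has_ended_text = False
            out.append("text-start")

    def end_text_if_open(self, out: list[str]) -> None:
        # Emit text-end only when a block was started and not yet ended,
        # which makes repeated calls a no-op.
        if self.has_started_text and not self.has_ended_text:
            out.append("text-end")
            self.has_ended_text = True


def close_then_compact(state: TextState, compaction_events: list[str]) -> list[str]:
    """Close any open text block before appending compaction events."""
    out: list[str] = []
    state.end_text_if_open(out)
    return out + compaction_events
```

With this shape, `text-end` always precedes a `finish-step` coming from compaction, and a second close attempt emits nothing.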


@@ -124,8 +124,11 @@ class TestScenarioCompactAndRetry:
assert result != original # Must be different
assert validate_transcript(result)
msgs = _transcript_to_messages(result)
assert len(msgs) == 2
# 3 messages: compressed prefix (2) + preserved last assistant (1)
assert len(msgs) == 3
assert msgs[0]["content"] == "[summary of conversation]"
# Last assistant preserved verbatim
assert msgs[2]["content"] == "Long answer 2"
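The updated expectation (two summarized messages plus the last assistant message kept verbatim) suggests a compaction shape along the following lines. This is only a sketch under stated assumptions: `compact_preserving_last_assistant` and the `summarize` callback are hypothetical names, and the real compaction code is not shown in this diff.

```python
from typing import Callable

Message = dict[str, str]


def compact_preserving_last_assistant(
    messages: list[Message],
    summarize: Callable[[list[Message]], list[Message]],
) -> list[Message]:
    """Hypothetical sketch: summarize the prefix, keep the tail verbatim."""
    # Locate the last assistant message, scanning from the end.
    last_idx = next(
        (
            i
            for i in range(len(messages) - 1, -1, -1)
            if messages[i]["role"] == "assistant"
        ),
        None,
    )
    if last_idx is None:
        # Nothing to preserve: summarize everything.
        return summarize(messages)
    prefix = messages[:last_idx]
    tail = messages[last_idx:]  # last assistant message and anything after it
    return summarize(prefix) + tail
```

If the summarizer returns a user/assistant pair, the result has three messages, which is the count the updated tests assert.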
def test_compacted_transcript_loads_into_builder(self):
"""TranscriptBuilder can load a compacted transcript and continue."""
@@ -737,7 +740,10 @@ class TestRetryEdgeCases:
assert result is not None
assert result != transcript
msgs = _transcript_to_messages(result)
assert len(msgs) == 2
# 3 messages: compressed prefix (2) + preserved last assistant (1)
assert len(msgs) == 3
# Last assistant preserved verbatim
assert msgs[2]["content"] == "Answer 19"
def test_messages_to_transcript_roundtrip_preserves_content(self):
"""Verify messages → transcript → messages preserves all content."""
@@ -898,14 +904,14 @@ class TestTranscriptEdgeCases:
assert restored[1]["content"] == "Second"
def test_flatten_assistant_with_only_tool_use(self):
"""Assistant message with only tool_use blocks (no text)."""
"""Assistant message with only tool_use blocks (no text) flattens to empty."""
blocks = [
{"type": "tool_use", "name": "bash", "input": {"cmd": "ls"}},
{"type": "tool_use", "name": "read", "input": {"path": "/f"}},
]
result = _flatten_assistant_content(blocks)
assert "[tool_use: bash]" in result
assert "[tool_use: read]" in result
# tool_use blocks are dropped entirely to prevent model mimicry
assert result == ""
def test_flatten_tool_result_nested_image(self):
"""Tool result containing image blocks uses placeholder."""
@@ -1004,7 +1010,7 @@ def _make_sdk_patches(
(f"{_SVC}.create_security_hooks", dict(return_value=MagicMock())),
(f"{_SVC}.get_copilot_tool_names", dict(return_value=[])),
(f"{_SVC}.get_sdk_disallowed_tools", dict(return_value=[])),
(f"{_SVC}._build_sdk_env", dict(return_value=None)),
(f"{_SVC}.build_sdk_env", dict(return_value=None)),
(f"{_SVC}._resolve_sdk_model", dict(return_value=None)),
(f"{_SVC}.set_execution_context", {}),
(
@@ -1408,3 +1414,261 @@ class TestStreamChatCompletionRetryIntegration:
# Verify user-friendly message (not raw SDK text)
assert "Authentication" in errors[0].errorText
assert any(isinstance(e, StreamStart) for e in events)
@pytest.mark.asyncio
async def test_result_message_prompt_too_long_triggers_compaction(self):
"""CLI returns ResultMessage(subtype="error") with "Prompt is too long".
When the Claude CLI rejects the prompt pre-API (model=<synthetic>,
duration_api_ms=0), it sends a ResultMessage with is_error=True
instead of raising a Python exception. The retry loop must still
detect this as a context-length error and trigger compaction.
"""
import contextlib
from claude_agent_sdk import ResultMessage
from backend.copilot.response_model import StreamError, StreamStart
from backend.copilot.sdk.service import stream_chat_completion_sdk
session = self._make_session()
success_result = self._make_result_message()
attempt_count = [0]
error_result = ResultMessage(
subtype="error",
result="Prompt is too long",
duration_ms=100,
duration_api_ms=0,
is_error=True,
num_turns=0,
session_id="test-session-id",
)
def _client_factory(*args, **kwargs):
attempt_count[0] += 1
if attempt_count[0] == 1:
# First attempt: CLI returns error ResultMessage
return self._make_client_mock(result_message=error_result)
# Second attempt (after compaction): succeeds
return self._make_client_mock(result_message=success_result)
original_transcript = _build_transcript(
[("user", "prior question"), ("assistant", "prior answer")]
)
compacted_transcript = _build_transcript(
[("user", "[summary]"), ("assistant", "summary reply")]
)
patches = _make_sdk_patches(
session,
original_transcript=original_transcript,
compacted_transcript=compacted_transcript,
client_side_effect=_client_factory,
)
events = []
with contextlib.ExitStack() as stack:
for target, kwargs in patches:
stack.enter_context(patch(target, **kwargs))
async for event in stream_chat_completion_sdk(
session_id="test-session-id",
message="hello",
is_user_message=True,
user_id="test-user",
session=session,
):
events.append(event)
assert attempt_count[0] == 2, (
f"Expected 2 SDK attempts (CLI error ResultMessage "
f"should trigger compaction retry), got {attempt_count[0]}"
)
errors = [e for e in events if isinstance(e, StreamError)]
assert not errors, f"Unexpected StreamError: {errors}"
assert any(isinstance(e, StreamStart) for e in events)
@pytest.mark.asyncio
async def test_result_message_success_subtype_prompt_too_long_triggers_compaction(
self,
):
"""CLI returns ResultMessage(subtype="success") with result="Prompt is too long".
The SDK internally compacts but the transcript is still too long. It
returns subtype="success" (process completed) with result="Prompt is
too long" (the actual rejection message). The retry loop must detect
this as a context-length error and trigger compaction — the subtype
"success" must not fool it into treating this as a real response.
"""
import contextlib
from claude_agent_sdk import ResultMessage
from backend.copilot.response_model import StreamError, StreamStart
from backend.copilot.sdk.service import stream_chat_completion_sdk
session = self._make_session()
success_result = self._make_result_message()
attempt_count = [0]
error_result = ResultMessage(
subtype="success",
result="Prompt is too long",
duration_ms=100,
duration_api_ms=0,
is_error=False,
num_turns=1,
session_id="test-session-id",
)
def _client_factory(*args, **kwargs):
attempt_count[0] += 1
async def _receive_error():
yield error_result
async def _receive_success():
yield success_result
client = MagicMock()
client._transport = MagicMock()
client._transport.write = AsyncMock()
client.query = AsyncMock()
if attempt_count[0] == 1:
client.receive_response = _receive_error
else:
client.receive_response = _receive_success
cm = AsyncMock()
cm.__aenter__.return_value = client
cm.__aexit__.return_value = None
return cm
original_transcript = _build_transcript(
[("user", "prior question"), ("assistant", "prior answer")]
)
compacted_transcript = _build_transcript(
[("user", "[summary]"), ("assistant", "summary reply")]
)
patches = _make_sdk_patches(
session,
original_transcript=original_transcript,
compacted_transcript=compacted_transcript,
client_side_effect=_client_factory,
)
events = []
with contextlib.ExitStack() as stack:
for target, kwargs in patches:
stack.enter_context(patch(target, **kwargs))
async for event in stream_chat_completion_sdk(
session_id="test-session-id",
message="hello",
is_user_message=True,
user_id="test-user",
session=session,
):
events.append(event)
assert attempt_count[0] == 2, (
f"Expected 2 SDK attempts (subtype='success' with 'Prompt is too long' "
f"result should trigger compaction retry), got {attempt_count[0]}"
)
errors = [e for e in events if isinstance(e, StreamError)]
assert not errors, f"Unexpected StreamError: {errors}"
assert any(isinstance(e, StreamStart) for e in events)
@pytest.mark.asyncio
async def test_assistant_message_error_content_prompt_too_long_triggers_compaction(
self,
):
"""AssistantMessage.error="invalid_request" with content "Prompt is too long".
The SDK returns error type "invalid_request" but puts the actual
rejection message ("Prompt is too long") in the content blocks.
The retry loop must detect this via content inspection (sdk_error
being set confirms it's an error message, not user content).
"""
import contextlib
from claude_agent_sdk import AssistantMessage, ResultMessage, TextBlock
from backend.copilot.response_model import StreamError, StreamStart
from backend.copilot.sdk.service import stream_chat_completion_sdk
session = self._make_session()
success_result = self._make_result_message()
attempt_count = [0]
def _client_factory(*args, **kwargs):
attempt_count[0] += 1
async def _receive_error():
# SDK returns invalid_request with "Prompt is too long" in content.
# ResultMessage.result is a non-PTL value ("done") to isolate
# the AssistantMessage content detection path exclusively.
yield AssistantMessage(
content=[TextBlock(text="Prompt is too long")],
model="<synthetic>",
error="invalid_request",
)
yield ResultMessage(
subtype="success",
result="done",
duration_ms=100,
duration_api_ms=0,
is_error=False,
num_turns=1,
session_id="test-session-id",
)
async def _receive_success():
yield success_result
client = MagicMock()
client._transport = MagicMock()
client._transport.write = AsyncMock()
client.query = AsyncMock()
if attempt_count[0] == 1:
client.receive_response = _receive_error
else:
client.receive_response = _receive_success
cm = AsyncMock()
cm.__aenter__.return_value = client
cm.__aexit__.return_value = None
return cm
original_transcript = _build_transcript(
[("user", "prior question"), ("assistant", "prior answer")]
)
compacted_transcript = _build_transcript(
[("user", "[summary]"), ("assistant", "summary reply")]
)
patches = _make_sdk_patches(
session,
original_transcript=original_transcript,
compacted_transcript=compacted_transcript,
client_side_effect=_client_factory,
)
events = []
with contextlib.ExitStack() as stack:
for target, kwargs in patches:
stack.enter_context(patch(target, **kwargs))
async for event in stream_chat_completion_sdk(
session_id="test-session-id",
message="hello",
is_user_message=True,
user_id="test-user",
session=session,
):
events.append(event)
assert attempt_count[0] == 2, (
f"Expected 2 SDK attempts (AssistantMessage error content 'Prompt is "
f"too long' should trigger compaction retry), got {attempt_count[0]}"
)
errors = [e for e in events if isinstance(e, StreamError)]
assert not errors, f"Unexpected StreamError: {errors}"
assert any(isinstance(e, StreamStart) for e in events)
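The three tests above share one idea: a context-length rejection has to be recognised from message text, whichever field or subtype carries it. A minimal sketch of such a check follows. The function names and the single pattern are illustrative assumptions; the project's real `_PROMPT_TOO_LONG_PATTERNS` tuple is not shown in this diff and may contain other entries.

```python
from typing import Optional

# Illustrative only: the real pattern list is not visible in this diff.
PROMPT_TOO_LONG_PATTERNS: tuple = ("prompt is too long",)


def is_prompt_too_long(err: BaseException) -> bool:
    """Match patterns case-insensitively against the whole exception chain."""
    seen: set = set()
    current: Optional[BaseException] = err
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against cyclic chains
        text = str(current).lower()
        if any(pattern in text for pattern in PROMPT_TOO_LONG_PATTERNS):
            return True
        current = current.__cause__ or current.__context__
    return False


def result_signals_prompt_too_long(result_text: Optional[str]) -> bool:
    """Check a result string regardless of the reported subtype."""
    return is_prompt_too_long(Exception(result_text or ""))
```

Wrapping a plain string in `Exception(...)` mirrors the pattern visible later in this diff, where error text from an `AssistantMessage` is passed through the same detector.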


@@ -313,8 +313,7 @@ def create_security_hooks(
.replace("\r", "")
)
logger.info(
"[SDK] Context compaction triggered: %s, user=%s, "
"transcript_path=%s",
"[SDK] Context compaction triggered: %s, user=%s, transcript_path=%s",
trigger,
user_id,
transcript_path,


@@ -11,7 +11,11 @@ import pytest
from backend.copilot.context import _current_project_dir
from .security_hooks import _validate_tool_access, _validate_user_isolation
from .security_hooks import (
_validate_tool_access,
_validate_user_isolation,
create_security_hooks,
)
SDK_CWD = "/tmp/copilot-abc123"
@@ -220,8 +224,6 @@ def test_bash_builtin_blocked_message_clarity():
@pytest.fixture()
def _hooks():
"""Create security hooks and return (pre, post, post_failure) handlers."""
from .security_hooks import create_security_hooks
hooks = create_security_hooks(user_id="u1", sdk_cwd=SDK_CWD, max_subtasks=2)
pre = hooks["PreToolUse"][0].hooks[0]
post = hooks["PostToolUse"][0].hooks[0]


@@ -2,19 +2,20 @@
import asyncio
import base64
import functools
import json
import logging
import os
import re
import shutil
import subprocess
import sys
import time
import uuid
from collections.abc import AsyncGenerator, AsyncIterator
from dataclasses import dataclass
from typing import Any, NamedTuple, cast
from typing import TYPE_CHECKING, Any, NamedTuple, cast
if TYPE_CHECKING:
from backend.copilot.permissions import CopilotPermissions
from claude_agent_sdk import (
AssistantMessage,
@@ -31,6 +32,7 @@ from langsmith.integrations.claude_agent_sdk import configure_claude_agent_sdk
from pydantic import BaseModel
from backend.copilot.context import get_workspace_manager
from backend.copilot.permissions import apply_tool_permissions
from backend.data.redis_client import get_redis_async
from backend.executor.cluster_lock import AsyncClusterLock
from backend.util.exceptions import NotFoundError
@@ -57,11 +59,14 @@ from ..response_model import (
StreamBaseResponse,
StreamError,
StreamFinish,
StreamFinishStep,
StreamHeartbeat,
StreamStart,
StreamStartStep,
StreamStatus,
StreamTextDelta,
StreamToolInputAvailable,
StreamToolInputStart,
StreamToolOutputAvailable,
StreamUsage,
)
@@ -75,12 +80,15 @@ from ..tools.e2b_sandbox import get_or_create_sandbox, pause_sandbox_direct
from ..tools.sandbox import WORKSPACE_PREFIX, make_session_path
from ..tracking import track_user_message
from .compaction import CompactionTracker, filter_compaction_messages
from .env import build_sdk_env # noqa: F401 — re-export for backward compat
from .response_adapter import SDKResponseAdapter
from .security_hooks import create_security_hooks
from .tool_adapter import (
create_copilot_mcp_server,
get_copilot_tool_names,
get_sdk_disallowed_tools,
reset_stash_event,
reset_tool_failure_counters,
set_execution_context,
wait_for_stash,
)
@@ -106,6 +114,21 @@ config = ChatConfig()
# Non-context errors (network, auth, rate-limit) are NOT retried.
_MAX_STREAM_ATTEMPTS = 3
# Hard circuit breaker: abort the stream if the model sends this many
# consecutive tool calls with empty parameters (a sign of context
# saturation or serialization failure). The MCP wrapper now returns
# guidance on the first empty call, giving the model a chance to
# self-correct. The limit is generous to allow recovery attempts.
_EMPTY_TOOL_CALL_LIMIT = 5
# User-facing error shown when the empty-tool-call circuit breaker trips.
_CIRCUIT_BREAKER_ERROR_MSG = (
"AutoPilot was unable to complete the tool call "
"— this usually happens when the response is "
"too large to fit in a single tool call. "
"Try breaking your request into smaller parts."
)
# Patterns that indicate the prompt/request exceeds the model's context limit.
# Matched case-insensitively against the full exception chain.
_PROMPT_TOO_LONG_PATTERNS: tuple[str, ...] = (
@@ -164,6 +187,37 @@ def _is_prompt_too_long(err: BaseException) -> bool:
return False
def _is_sdk_disconnect_error(exc: BaseException) -> bool:
"""Return True if *exc* is an expected SDK cleanup error from client disconnect.
Two known patterns occur when ``GeneratorExit`` tears down the async
generator and the SDK's ``__aexit__`` runs in a different context/task:
* ``RuntimeError``: cancel scope exited in wrong task (anyio)
* ``ValueError``: ContextVar token created in a different Context (OTEL)
These are suppressed to avoid polluting Sentry with non-actionable noise.
"""
if isinstance(exc, RuntimeError) and "cancel scope" in str(exc):
return True
if isinstance(exc, ValueError) and "was created in a different Context" in str(exc):
return True
return False
def _is_tool_only_message(sdk_msg: object) -> bool:
"""Return True if *sdk_msg* is an AssistantMessage containing only ToolUseBlocks.
Such a message represents a parallel tool-call batch (no text output yet).
The ``bool(…content)`` guard prevents vacuous-truth evaluation on an empty list.
"""
return (
isinstance(sdk_msg, AssistantMessage)
and bool(sdk_msg.content)
and all(isinstance(b, ToolUseBlock) for b in sdk_msg.content)
)
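The non-empty guard in `_is_tool_only_message` matters because `all()` over an empty iterable returns `True`. The sketch below shows the effect with two hypothetical stand-in classes (`ToolUse` and `Text` are not the SDK's block types).

```python
from dataclasses import dataclass


@dataclass
class ToolUse:  # hypothetical stand-in for ToolUseBlock
    name: str


@dataclass
class Text:  # hypothetical stand-in for TextBlock
    text: str


def is_tool_only(content: list) -> bool:
    # all() over an empty list is True, so require non-empty content first.
    return bool(content) and all(isinstance(b, ToolUse) for b in content)
```

Without the `bool(content)` check, a message with no content blocks at all would be classified as a tool-call batch.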
class ReducedContext(NamedTuple):
builder: TranscriptBuilder
use_resume: bool
@@ -375,6 +429,63 @@ _HEARTBEAT_INTERVAL = 10.0 # seconds
STREAM_LOCK_PREFIX = "copilot:stream:lock:"
async def _safe_close_sdk_client(
sdk_client: ClaudeSDKClient,
log_prefix: str,
) -> None:
"""Close a ClaudeSDKClient, suppressing errors from client disconnect.
When the SSE client disconnects mid-stream, ``GeneratorExit`` propagates
through the async generator stack and causes ``ClaudeSDKClient.__aexit__``
to run in a different async context or task than where the client was
opened. This triggers two known error classes:
* ``ValueError``: ``<Token var=<ContextVar name='current_context'>>
was created in a different Context`` — OpenTelemetry's
``context.detach()`` fails because the OTEL context token was
created in the original generator coroutine but detach runs in
the GC / cleanup coroutine (Sentry: AUTOGPT-SERVER-8BT).
* ``RuntimeError``: ``Attempted to exit cancel scope in a different
task than it was entered in`` — anyio's ``TaskGroup.__aexit__``
detects that the cancel scope was entered in one task but is
being exited in another (Sentry: AUTOGPT-SERVER-8BW).
Both are harmless — the TCP connection is already dead and no
resources leak. Logging them at ``debug`` level keeps observability
without polluting Sentry.
"""
try:
await sdk_client.__aexit__(None, None, None)
except (ValueError, RuntimeError) as exc:
if _is_sdk_disconnect_error(exc):
# Expected during client disconnect — suppress to avoid Sentry noise.
logger.debug(
"%s SDK client cleanup error suppressed (client disconnect): %s: %s",
log_prefix,
type(exc).__name__,
exc,
)
else:
raise
except GeneratorExit:
# GeneratorExit can propagate through __aexit__ — suppress it here
# since the generator is already being torn down.
logger.debug(
"%s SDK client cleanup GeneratorExit suppressed (client disconnect)",
log_prefix,
)
except Exception:
# Unexpected cleanup error — log at error level so Sentry captures it
# (via its logging integration), but don't propagate since we're in
# teardown and the caller cannot meaningfully handle this.
logger.error(
"%s Unexpected SDK client cleanup error",
log_prefix,
exc_info=True,
)
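The suppress-or-reraise logic of `_safe_close_sdk_client` can be tried out without the SDK. The sketch below uses a hypothetical `FakeClient` whose `__aexit__` raises a configurable error; the classifier repeats the two message checks from `_is_sdk_disconnect_error`, and `safe_close` returns a flag instead of logging.

```python
import asyncio
from typing import Optional


def is_disconnect_error(exc: BaseException) -> bool:
    """Same two message checks as described in the diff above."""
    if isinstance(exc, RuntimeError) and "cancel scope" in str(exc):
        return True
    if isinstance(exc, ValueError) and "was created in a different Context" in str(exc):
        return True
    return False


class FakeClient:
    """Hypothetical client whose __aexit__ can fail like a disconnect would."""

    def __init__(self, exit_error: Optional[Exception] = None) -> None:
        self.exit_error = exit_error

    async def __aenter__(self) -> "FakeClient":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        if self.exit_error is not None:
            raise self.exit_error


async def safe_close(client: FakeClient) -> bool:
    """Return True if an expected disconnect error was suppressed."""
    try:
        await client.__aexit__(None, None, None)
    except (ValueError, RuntimeError) as exc:
        if is_disconnect_error(exc):
            return True
        raise  # anything else is a real error and must propagate
    return False
```

Calling `__aexit__` by hand, rather than relying on `async with`, is what gives the caller a place to filter these errors.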
async def _iter_sdk_messages(
client: ClaudeSDKClient,
) -> AsyncGenerator[Any, None]:
@@ -458,91 +569,6 @@ def _resolve_sdk_model() -> str | None:
return model
@functools.cache
def _validate_claude_code_subscription() -> None:
"""Validate Claude CLI is installed and responds to `--version`.
Cached so the blocking subprocess check runs at most once per process
lifetime. A failure (CLI not installed) is a config error that requires
a process restart anyway.
"""
claude_path = shutil.which("claude")
if not claude_path:
raise RuntimeError(
"Claude Code CLI not found. Install it with: "
"npm install -g @anthropic-ai/claude-code"
)
result = subprocess.run(
[claude_path, "--version"],
capture_output=True,
text=True,
timeout=10,
)
if result.returncode != 0:
raise RuntimeError(
f"Claude CLI check failed (exit {result.returncode}): "
f"{result.stderr.strip()}"
)
logger.info(
"Claude Code subscription mode: CLI version %s",
result.stdout.strip(),
)
def _build_sdk_env(
session_id: str | None = None,
user_id: str | None = None,
) -> dict[str, str]:
"""Build env vars for the SDK CLI subprocess.
Three modes (checked in order):
1. **Subscription** — clears all keys; CLI uses `claude login` auth.
2. **Direct Anthropic** — returns `{}`; subprocess inherits
`ANTHROPIC_API_KEY` from the parent environment.
3. **OpenRouter** (default) — overrides base URL and auth token to
route through the proxy, with Langfuse trace headers.
"""
# --- Mode 1: Claude Code subscription auth ---
if config.use_claude_code_subscription:
_validate_claude_code_subscription()
return {
"ANTHROPIC_API_KEY": "",
"ANTHROPIC_AUTH_TOKEN": "",
"ANTHROPIC_BASE_URL": "",
}
# --- Mode 2: Direct Anthropic (no proxy hop) ---
# `openrouter_active` checks the flag *and* credential presence.
if not config.openrouter_active:
return {}
# --- Mode 3: OpenRouter proxy ---
# Strip /v1 suffix — SDK expects the base URL without a version path.
base = (config.base_url or "").rstrip("/")
if base.endswith("/v1"):
base = base[:-3]
env: dict[str, str] = {
"ANTHROPIC_BASE_URL": base,
"ANTHROPIC_AUTH_TOKEN": config.api_key or "",
"ANTHROPIC_API_KEY": "", # force CLI to use AUTH_TOKEN
}
# Inject broadcast headers so OpenRouter forwards traces to Langfuse.
def _safe(v: str) -> str:
"""Sanitise a header value: strip newlines/whitespace and cap length."""
return v.replace("\r", "").replace("\n", "").strip()[:128]
parts = []
if session_id:
parts.append(f"x-session-id: {_safe(session_id)}")
if user_id:
parts.append(f"x-user-id: {_safe(user_id)}")
if parts:
env["ANTHROPIC_CUSTOM_HEADERS"] = "\n".join(parts)
return env
def _make_sdk_cwd(session_id: str) -> str:
"""Create a safe, session-specific working directory path.
@@ -592,7 +618,9 @@ def _format_sdk_content_blocks(blocks: list) -> list[dict[str, Any]]:
"""Convert SDK content blocks to transcript format.
Handles TextBlock, ToolUseBlock, ToolResultBlock, and ThinkingBlock.
Unknown block types are logged and skipped.
Raw dicts (e.g. ``redacted_thinking`` blocks that the SDK may not have
a typed class for) are passed through verbatim to preserve them in the
transcript. Unknown typed block objects are logged and skipped.
"""
result: list[dict[str, Any]] = []
for block in blocks or []:
@@ -624,6 +652,9 @@ def _format_sdk_content_blocks(blocks: list) -> list[dict[str, Any]]:
"signature": block.signature,
}
)
elif isinstance(block, dict) and "type" in block:
# Preserve raw dict blocks (e.g. redacted_thinking) verbatim.
result.append(block)
else:
logger.warning(
f"[SDK] Unknown content block type: {type(block).__name__}. "
@@ -717,15 +748,11 @@ def _format_conversation_context(messages: list[ChatMessage]) -> str | None:
elif msg.role == "assistant":
if msg.content:
lines.append(f"You responded: {msg.content}")
if msg.tool_calls:
for tc in msg.tool_calls:
func = tc.get("function", {})
tool_name = func.get("name", "unknown")
tool_args = func.get("arguments", "")
lines.append(f"You called tool: {tool_name}({tool_args})")
# Omit tool_calls — any text representation gets mimicked
# by the model. Tool results below provide the context.
elif msg.role == "tool":
content = msg.content or ""
lines.append(f"Tool result: {content}")
lines.append(f"Tool output: {content[:500]}")
if not lines:
return None
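The effect of this hunk (tool calls no longer rendered, tool output capped at 500 characters) can be reproduced with a small stand-alone sketch. `Msg` and `format_context` are hypothetical names, and joining the lines with newlines is an assumption, since the tail of the real function is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Msg:  # hypothetical stand-in for ChatMessage
    role: str
    content: Optional[str] = None
    tool_calls: Optional[list] = None


def format_context(messages: list, max_tool_chars: int = 500) -> Optional[str]:
    lines: list = []
    for msg in messages:
        if msg.role == "user" and msg.content:
            lines.append(f"User: {msg.content}")
        elif msg.role == "assistant":
            if msg.content:
                lines.append(f"You responded: {msg.content}")
            # tool_calls are intentionally not rendered at all.
        elif msg.role == "tool":
            content = msg.content or ""
            lines.append(f"Tool output: {content[:max_tool_chars]}")
    # Assumption: lines are newline-joined; None signals "no context".
    return "\n".join(lines) if lines else None
```

An assistant message that carries only tool calls now contributes no lines, which is why `test_format_assistant_tool_calls` expects `None`.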
@@ -1028,15 +1055,122 @@ def _dispatch_response(
return response
class _TransientErrorHandled(Exception):
class _HandledStreamError(Exception):
"""Raised by `_run_stream_attempt` after it has already yielded a
`StreamError` for a transient API error.
`StreamError` to the client (e.g. transient API error, circuit breaker).
This signals the outer retry loop that the attempt failed so it can
perform session-message rollback and set the `ended_with_stream_error`
flag, **without** yielding a duplicate `StreamError` to the client.
Attributes:
error_msg: The user-facing error message to persist.
code: Machine-readable error code (e.g. ``circuit_breaker_empty_tool_calls``).
retryable: Whether the frontend should offer a retry button.
"""
def __init__(
self,
message: str,
error_msg: str | None = None,
code: str | None = None,
retryable: bool = True,
):
super().__init__(message)
self.error_msg = error_msg
self.code = code
self.retryable = retryable
@dataclass
class _EmptyToolBreakResult:
"""Result of checking for empty tool calls in a single AssistantMessage."""
count: int # Updated consecutive counter
tripped: bool # Whether the circuit breaker fired
error: StreamError | None # StreamError to yield (if tripped)
error_msg: str | None # Error message (if tripped)
error_code: str | None # Error code (if tripped)
def _check_empty_tool_breaker(
sdk_msg: object,
consecutive: int,
ctx: _StreamContext,
state: _RetryState,
) -> _EmptyToolBreakResult:
"""Detect consecutive empty tool calls and trip the circuit breaker.
Returns an ``_EmptyToolBreakResult`` with the updated counter and, if the
breaker tripped, the ``StreamError`` to yield plus the error metadata.
"""
if not isinstance(sdk_msg, AssistantMessage):
return _EmptyToolBreakResult(consecutive, False, None, None, None)
empty_tools = [
b.name for b in sdk_msg.content if isinstance(b, ToolUseBlock) and not b.input
]
if not empty_tools:
# Reset on any AssistantMessage without empty tool calls (including
# text-only messages, for which empty_tools is an empty list).
return _EmptyToolBreakResult(0, False, None, None, None)
consecutive += 1
# Log full diagnostics on first occurrence only; subsequent hits just
# log the counter to reduce noise.
if consecutive == 1:
logger.warning(
"%s Empty tool call detected (%d/%d): "
"tools=%s, model=%s, error=%s, "
"block_types=%s, cumulative_usage=%s",
ctx.log_prefix,
consecutive,
_EMPTY_TOOL_CALL_LIMIT,
empty_tools,
sdk_msg.model,
sdk_msg.error,
[type(b).__name__ for b in sdk_msg.content],
{
"prompt": state.usage.prompt_tokens,
"completion": state.usage.completion_tokens,
"cache_read": state.usage.cache_read_tokens,
},
)
else:
logger.warning(
"%s Empty tool call detected (%d/%d): tools=%s",
ctx.log_prefix,
consecutive,
_EMPTY_TOOL_CALL_LIMIT,
empty_tools,
)
if consecutive < _EMPTY_TOOL_CALL_LIMIT:
return _EmptyToolBreakResult(consecutive, False, None, None, None)
logger.error(
"%s Circuit breaker: aborting stream after %d "
"consecutive empty tool calls. "
"This is likely caused by the model attempting "
"to write content too large for a single tool "
"call's output token limit. The model should "
"write large files in chunks using bash_exec "
"with cat >> (append).",
ctx.log_prefix,
consecutive,
)
error_msg = _CIRCUIT_BREAKER_ERROR_MSG
error_code = "circuit_breaker_empty_tool_calls"
_append_error_marker(ctx.session, error_msg, retryable=True)
return _EmptyToolBreakResult(
count=consecutive,
tripped=True,
error=StreamError(errorText=error_msg, code=error_code),
error_msg=error_msg,
error_code=error_code,
)
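Stripped of logging and error construction, the counter logic in `_check_empty_tool_breaker` reduces to a few lines. The sketch below is a simplified illustration: `update_breaker` is a hypothetical helper that takes the tool-call input dicts of one assistant message directly, rather than an SDK message object.

```python
EMPTY_TOOL_CALL_LIMIT = 5  # same value as _EMPTY_TOOL_CALL_LIMIT in the diff


def update_breaker(tool_inputs: list, consecutive: int) -> tuple:
    """Return (new_counter, tripped) for one assistant message.

    tool_inputs holds the input dict of each tool call in the message;
    an empty list means the message made no tool calls.
    """
    has_empty = any(not inp for inp in tool_inputs)
    if not has_empty:
        # Any assistant message without empty tool calls resets the streak.
        return 0, False
    consecutive += 1
    return consecutive, consecutive >= EMPTY_TOOL_CALL_LIMIT
```

The counter advances once per message rather than once per empty call, so a single message with several empty tool calls counts as one strike.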
async def _run_stream_attempt(
ctx: _StreamContext,
@@ -1071,8 +1205,32 @@ async def _run_stream_attempt(
accumulated_tool_calls=[],
)
ended_with_stream_error = False
# Stores the error message used by _append_error_marker so the outer
# retry loop can re-append the correct message after session rollback.
stream_error_msg: str | None = None
stream_error_code: str | None = None
async with ClaudeSDKClient(options=state.options) as client:
consecutive_empty_tool_calls = 0
# --- Intermediate persistence tracking ---
# Flush session messages to DB periodically so page reloads show progress
# during long-running turns (see incident d2f7cba3: 82-min turn lost on refresh).
_last_flush_time = time.monotonic()
_msgs_since_flush = 0
_FLUSH_INTERVAL_SECONDS = 30.0
_FLUSH_MESSAGE_THRESHOLD = 10
# Use manual __aenter__/__aexit__ instead of ``async with`` so we can
# suppress SDK cleanup errors that occur when the SSE client disconnects
# mid-stream. GeneratorExit causes the SDK's ``__aexit__`` to run in a
# different async context/task than where the client was opened, which
# triggers:
# - ValueError: ContextVar token mismatch (AUTOGPT-SERVER-8BT)
# - RuntimeError: cancel scope in wrong task (AUTOGPT-SERVER-8BW)
# Both are harmless — the TCP connection is already dead.
sdk_client = ClaudeSDKClient(options=state.options)
client = await sdk_client.__aenter__()
try:
logger.info(
"%s Sending query — resume=%s, total_msgs=%d, "
"query_len=%d, attached_files=%d, image_blocks=%d",
@@ -1148,6 +1306,27 @@ async def _run_stream_attempt(
error_preview,
)
# Intercept prompt-too-long errors surfaced as
# AssistantMessage.error (not as a Python exception).
# Re-raise so the outer retry loop can compact the
# transcript and retry with reduced context.
# Check both error_text and error_preview: sdk_error
# being set confirms this is an error message (not user
# content), so checking content is safe. The actual
# error description (e.g. "Prompt is too long") may be
# in the content, not the error type field
# (e.g. error="invalid_request", content="Prompt is
# too long").
if _is_prompt_too_long(Exception(error_text)) or _is_prompt_too_long(
Exception(error_preview)
):
logger.warning(
"%s Prompt-too-long detected via AssistantMessage "
"error — raising for retry",
ctx.log_prefix,
)
raise RuntimeError("Prompt is too long")
# Intercept transient API errors (socket closed,
# ECONNRESET) — replace the raw message with a
# user-friendly error text and use the retryable
@@ -1161,18 +1340,32 @@ async def _run_stream_attempt(
"suppressing raw error text",
ctx.log_prefix,
)
stream_error_msg = FRIENDLY_TRANSIENT_MSG
stream_error_code = "transient_api_error"
_append_error_marker(
ctx.session,
FRIENDLY_TRANSIENT_MSG,
stream_error_msg,
retryable=True,
)
yield StreamError(
errorText=FRIENDLY_TRANSIENT_MSG,
code="transient_api_error",
errorText=stream_error_msg,
code=stream_error_code,
)
ended_with_stream_error = True
break
# Determine if the message is a tool-only batch (all content
# items are ToolUseBlocks) — such messages have no text output yet,
# so we skip the wait_for_stash flush below.
#
# Note: parallel execution of tools is handled natively by the
# SDK CLI via readOnlyHint annotations on tool definitions.
is_tool_only = False
if isinstance(sdk_msg, AssistantMessage) and sdk_msg.content:
is_tool_only = all(
isinstance(item, ToolUseBlock) for item in sdk_msg.content
)
# Race-condition fix: SDK hooks (PostToolUse) are
# executed asynchronously via start_soon() — the next
# message can arrive before the hook stashes output.
@@ -1186,15 +1379,12 @@ async def _run_stream_attempt(
# AssistantMessages (each containing only
# ToolUseBlocks), we must NOT wait/flush — the prior
# tools are still executing concurrently.
is_parallel_continuation = isinstance(sdk_msg, AssistantMessage) and all(
isinstance(b, ToolUseBlock) for b in sdk_msg.content
)
if (
state.adapter.has_unresolved_tool_calls
and isinstance(sdk_msg, (AssistantMessage, ResultMessage))
and not is_parallel_continuation
and not is_tool_only
):
if await wait_for_stash(timeout=0.5):
if await wait_for_stash():
await asyncio.sleep(0)
else:
logger.warning(
@@ -1209,13 +1399,17 @@ async def _run_stream_attempt(
if isinstance(sdk_msg, ResultMessage):
logger.info(
"%s Received: ResultMessage %s "
"(unresolved=%d, current=%d, resolved=%d)",
"(unresolved=%d, current=%d, resolved=%d, "
"num_turns=%d, cost_usd=%s, result=%s)",
ctx.log_prefix,
sdk_msg.subtype,
len(state.adapter.current_tool_calls)
- len(state.adapter.resolved_tool_calls),
len(state.adapter.current_tool_calls),
len(state.adapter.resolved_tool_calls),
sdk_msg.num_turns,
sdk_msg.total_cost_usd,
(sdk_msg.result or "")[:200],
)
if sdk_msg.subtype in (
"error",
@@ -1227,6 +1421,16 @@ async def _run_stream_attempt(
sdk_msg.result or "(no error message provided)",
)
# Check for prompt-too-long regardless of subtype — the
# SDK may return subtype="success" with result="Prompt is
# too long" when the CLI rejects the prompt before calling
# the API (cost_usd=0, no tokens consumed). If we only
# check the "error" subtype path, the stream appears to
# complete normally, the synthetic error text is stored
# in the transcript, and the session grows without bound.
if _is_prompt_too_long(RuntimeError(sdk_msg.result or "")):
raise RuntimeError("Prompt is too long")
# Capture token usage from ResultMessage.
# Anthropic reports cached tokens separately:
# input_tokens = uncached only
@@ -1258,6 +1462,23 @@ async def _run_stream_attempt(
# Emit compaction end if SDK finished compacting.
# Sync TranscriptBuilder with the CLI's active context.
compact_result = await ctx.compaction.emit_end_if_ready(ctx.session)
if compact_result.events:
# Compaction events end with StreamFinishStep, which maps to
# Vercel AI SDK's "finish-step" — that clears activeTextParts.
# Close any open text block BEFORE the compaction events so
# the text-end arrives before finish-step, preventing
# "text-end for missing text part" errors on the frontend.
pre_close: list[StreamBaseResponse] = []
state.adapter._end_text_if_open(pre_close)
# Compaction events bypass the adapter, so sync step state
# when a StreamFinishStep is present — otherwise the adapter
# will skip StreamStartStep on the next AssistantMessage.
if any(
isinstance(ev, StreamFinishStep) for ev in compact_result.events
):
state.adapter.step_open = False
for r in pre_close:
yield r
for ev in compact_result.events:
yield ev
entries_replaced = False
@@ -1272,6 +1493,18 @@ async def _run_stream_attempt(
)
entries_replaced = True
# --- Hard circuit breaker for empty tool calls ---
breaker = _check_empty_tool_breaker(
sdk_msg, consecutive_empty_tool_calls, ctx, state
)
consecutive_empty_tool_calls = breaker.count
if breaker.tripped and breaker.error is not None:
stream_error_msg = breaker.error_msg
stream_error_code = breaker.error_code
yield breaker.error
ended_with_stream_error = True
break
# --- Dispatch adapter responses ---
for response in state.adapter.convert_message(sdk_msg):
dispatched = _dispatch_response(
@@ -1292,8 +1525,38 @@ async def _run_stream_attempt(
model=sdk_msg.model,
)
# --- Intermediate persistence ---
# Flush session messages to DB periodically so page reloads
# show progress during long-running turns.
_msgs_since_flush += 1
now = time.monotonic()
if (
_msgs_since_flush >= _FLUSH_MESSAGE_THRESHOLD
or (now - _last_flush_time) >= _FLUSH_INTERVAL_SECONDS
):
try:
await asyncio.shield(upsert_chat_session(ctx.session))
logger.debug(
"%s Intermediate flush: %d messages "
"(msgs_since=%d, elapsed=%.1fs)",
ctx.log_prefix,
len(ctx.session.messages),
_msgs_since_flush,
now - _last_flush_time,
)
except Exception as flush_err:
logger.warning(
"%s Intermediate flush failed: %s",
ctx.log_prefix,
flush_err,
)
_last_flush_time = now
_msgs_since_flush = 0
if acc.stream_completed:
break
finally:
await _safe_close_sdk_client(sdk_client, ctx.log_prefix)
# --- Post-stream processing (only on success) ---
if state.adapter.has_unresolved_tool_calls:
@@ -1352,8 +1615,10 @@ async def _run_stream_attempt(
# to the client (StreamError yielded above), raise so the outer retry
# loop can rollback session messages and set its error flags properly.
if ended_with_stream_error:
raise _TransientErrorHandled(
"Transient API error handled — StreamError already yielded"
raise _HandledStreamError(
"Stream error handled — StreamError already yielded",
error_msg=stream_error_msg,
code=stream_error_code,
)
@@ -1364,6 +1629,7 @@ async def stream_chat_completion_sdk(
user_id: str | None = None,
session: ChatSession | None = None,
file_ids: list[str] | None = None,
permissions: "CopilotPermissions | None" = None,
**_kwargs: Any,
) -> AsyncIterator[StreamBaseResponse]:
"""Stream chat completion using Claude Agent SDK.
@@ -1609,10 +1875,16 @@ async def stream_chat_completion_sdk(
yield StreamStart(messageId=message_id, sessionId=session_id)
set_execution_context(user_id, session, sandbox=e2b_sandbox, sdk_cwd=sdk_cwd)
set_execution_context(
user_id,
session,
sandbox=e2b_sandbox,
sdk_cwd=sdk_cwd,
permissions=permissions,
)
# Fail fast when no API credentials are available at all.
sdk_env = _build_sdk_env(session_id=session_id, user_id=user_id)
sdk_env = build_sdk_env(session_id=session_id, user_id=user_id)
if not config.api_key and not config.use_claude_code_subscription:
raise RuntimeError(
"No API key configured. Set OPEN_ROUTER_API_KEY, "
@@ -1635,8 +1907,11 @@ async def stream_chat_completion_sdk(
on_compact=compaction.on_compact,
)
allowed = get_copilot_tool_names(use_e2b=use_e2b)
disallowed = get_sdk_disallowed_tools(use_e2b=use_e2b)
if permissions is not None:
allowed, disallowed = apply_tool_permissions(permissions, use_e2b=use_e2b)
else:
allowed = get_copilot_tool_names(use_e2b=use_e2b)
disallowed = get_sdk_disallowed_tools(use_e2b=use_e2b)
def _on_stderr(line: str) -> None:
"""Log a stderr line emitted by the Claude CLI subprocess."""
@@ -1746,6 +2021,12 @@ async def stream_chat_completion_sdk(
)
for attempt in range(_MAX_STREAM_ATTEMPTS):
# Clear any stale stash signal from the previous attempt so
# wait_for_stash() doesn't fire prematurely on a leftover event.
reset_stash_event()
# Reset tool-level circuit breaker so failures from a previous
# (rolled-back) attempt don't carry over to the fresh attempt.
reset_tool_failure_counters()
if attempt > 0:
logger.info(
"%s Retrying with reduced context (%d/%d)",
@@ -1798,7 +2079,20 @@ async def stream_chat_completion_sdk(
try:
async for event in _run_stream_attempt(stream_ctx, state):
if not isinstance(event, StreamHeartbeat):
if not isinstance(
event,
(
StreamHeartbeat,
# Compaction UI events are cosmetic and must not
# block retry — they're emitted before the SDK
# query on compacted attempts.
StreamStartStep,
StreamFinishStep,
StreamToolInputStart,
StreamToolInputAvailable,
StreamToolOutputAvailable,
),
):
events_yielded += 1
yield event
break # Stream completed — exit retry loop
@@ -1810,24 +2104,35 @@ async def stream_chat_completion_sdk(
_MAX_STREAM_ATTEMPTS,
)
raise
except _TransientErrorHandled:
except _HandledStreamError as exc:
# _run_stream_attempt already yielded a StreamError and
# appended an error marker. We only need to rollback
# session messages and set the error flag — do NOT set
# stream_err so the post-loop code won't emit a
# duplicate StreamError.
logger.warning(
"%s Transient error handled in stream attempt "
"(attempt %d/%d, events_yielded=%d)",
"%s Stream error handled in attempt "
"(attempt %d/%d, code=%s, events_yielded=%d)",
log_prefix,
attempt + 1,
_MAX_STREAM_ATTEMPTS,
exc.code or "transient",
events_yielded,
)
session.messages = session.messages[:pre_attempt_msg_count]
# transcript_builder still contains entries from the aborted
# attempt that no longer match session.messages. Skip upload
# so a future --resume doesn't replay rolled-back content.
skip_transcript_upload = True
# Re-append the error marker so it survives the rollback
# and is persisted by the finally block (see #2947655365).
_append_error_marker(session, FRIENDLY_TRANSIENT_MSG, retryable=True)
# Use the specific error message from the attempt (e.g.
# circuit breaker msg) rather than always the generic one.
_append_error_marker(
session,
exc.error_msg or FRIENDLY_TRANSIENT_MSG,
retryable=True,
)
ended_with_stream_error = True
break
except Exception as e:
@@ -1854,11 +2159,13 @@ async def stream_chat_completion_sdk(
log_prefix,
events_yielded,
)
skip_transcript_upload = True
ended_with_stream_error = True
break
if not is_context_error:
# Non-context errors (network, auth, rate-limit) should
# not trigger compaction — surface the error immediately.
skip_transcript_upload = True
ended_with_stream_error = True
break
continue
@@ -1954,6 +2261,16 @@ async def stream_chat_completion_sdk(
log_prefix,
len(session.messages),
)
except GeneratorExit:
# GeneratorExit is raised when the async generator is closed by the
# caller (e.g. client disconnect, page refresh). We MUST release the
# stream lock here because the ``finally`` block at the end of this
# function may not execute when GeneratorExit propagates through nested
# async generators. Without this, the lock stays held for its full TTL
# and the user sees "Another stream is already active" on every retry.
logger.warning("%s GeneratorExit — releasing stream lock", log_prefix)
await lock.release()
raise
except BaseException as e:
# Catch BaseException to handle both Exception and CancelledError
# (CancelledError inherits from BaseException in Python 3.8+)
@@ -1962,9 +2279,16 @@ async def stream_chat_completion_sdk(
error_msg = "Operation cancelled"
else:
error_msg = str(e) or type(e).__name__
# SDK cleanup RuntimeError is expected during cancellation, log as warning
if isinstance(e, RuntimeError) and "cancel scope" in str(e):
logger.warning("%s SDK cleanup error: %s", log_prefix, error_msg)
# SDK cleanup errors are expected during client disconnect —
# log as warning rather than error to reduce Sentry noise.
# These are normally caught by _safe_close_sdk_client but
# can escape in edge cases (e.g. GeneratorExit timing).
if _is_sdk_disconnect_error(e):
logger.warning(
"%s SDK cleanup error (client disconnect): %s",
log_prefix,
error_msg,
)
else:
logger.error("%s Error: %s", log_prefix, error_msg, exc_info=True)
@@ -1986,10 +2310,11 @@ async def stream_chat_completion_sdk(
)
# Yield StreamError for immediate feedback (only for non-cancellation errors)
# Skip for CancelledError and RuntimeError cleanup issues (both are cancellations)
is_cancellation = isinstance(e, asyncio.CancelledError) or (
isinstance(e, RuntimeError) and "cancel scope" in str(e)
)
# Skip for CancelledError and SDK disconnect cleanup errors — these
# are not actionable by the user and the SSE connection is already dead.
is_cancellation = isinstance(
e, asyncio.CancelledError
) or _is_sdk_disconnect_error(e)
if not is_cancellation:
yield StreamError(errorText=display_msg, code=code)
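
The intermediate-persistence rule in the hunks above (flush after 10 messages or 30 seconds, whichever comes first) can be looked at in isolation. The following is a minimal sketch, not code from this PR: `FlushPolicy` and its method names are hypothetical, the default thresholds simply mirror `_FLUSH_MESSAGE_THRESHOLD` and `_FLUSH_INTERVAL_SECONDS`, and the clock is injectable only so the behavior can be checked without sleeping.

```python
import time
from collections.abc import Callable
from dataclasses import dataclass, field


@dataclass
class FlushPolicy:
    """Hypothetical helper: decides when to persist a session mid-stream."""

    message_threshold: int = 10     # mirrors _FLUSH_MESSAGE_THRESHOLD
    interval_seconds: float = 30.0  # mirrors _FLUSH_INTERVAL_SECONDS
    clock: Callable[[], float] = time.monotonic
    _last_flush: float = field(init=False, default=0.0)
    _pending: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # Start the interval from construction time, like _last_flush_time.
        self._last_flush = self.clock()

    def record_message(self) -> bool:
        """Count one message and return True when a flush is due."""
        self._pending += 1
        elapsed = self.clock() - self._last_flush
        return (
            self._pending >= self.message_threshold
            or elapsed >= self.interval_seconds
        )

    def mark_flushed(self) -> None:
        """Reset both the message counter and the interval timer."""
        self._last_flush = self.clock()
        self._pending = 0
```

In a streaming loop the caller would check `record_message()` after each dispatched message, attempt the upsert when it returns True, and then call `mark_flushed()`.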


@@ -1,21 +1,23 @@
"""Unit tests for extracted service helpers.
Covers ``_is_prompt_too_long``, ``_reduce_context``, ``_iter_sdk_messages``,
and the ``ReducedContext`` named tuple.
``ReducedContext``, and the ``is_parallel_continuation`` logic.
"""
from __future__ import annotations
import asyncio
from collections.abc import AsyncGenerator
from unittest.mock import AsyncMock, patch
from unittest.mock import AsyncMock, MagicMock, patch
import pytest
from claude_agent_sdk import AssistantMessage, TextBlock, ToolUseBlock
from .conftest import build_test_transcript as _build_transcript
from .service import (
ReducedContext,
_is_prompt_too_long,
_is_tool_only_message,
_iter_sdk_messages,
_reduce_context,
)
@@ -281,3 +283,55 @@ class TestIterSdkMessages:
first = await gen.__anext__()
assert first == "first"
await gen.aclose() # should cancel pending task cleanly
# ---------------------------------------------------------------------------
# is_parallel_continuation logic
# ---------------------------------------------------------------------------
class TestIsParallelContinuation:
"""Unit tests for the is_parallel_continuation expression in the streaming loop.
Verifies the vacuous-truth guard (empty content must return False) and the
boundary cases for mixed TextBlock+ToolUseBlock messages.
"""
def _make_tool_block(self) -> MagicMock:
block = MagicMock(spec=ToolUseBlock)
return block
def test_all_tool_use_blocks_is_parallel(self):
"""AssistantMessage with only ToolUseBlocks is a parallel continuation."""
msg = MagicMock(spec=AssistantMessage)
msg.content = [self._make_tool_block(), self._make_tool_block()]
assert _is_tool_only_message(msg) is True
def test_empty_content_is_not_parallel(self):
"""AssistantMessage with empty content must NOT be treated as parallel.
Without the bool(sdk_msg.content) guard, all() on an empty iterable
returns True via vacuous truth — this test ensures the guard is present.
"""
msg = MagicMock(spec=AssistantMessage)
msg.content = []
assert _is_tool_only_message(msg) is False
def test_mixed_text_and_tool_blocks_not_parallel(self):
"""AssistantMessage with text + tool blocks is NOT a parallel continuation."""
msg = MagicMock(spec=AssistantMessage)
text_block = MagicMock(spec=TextBlock)
msg.content = [text_block, self._make_tool_block()]
assert _is_tool_only_message(msg) is False
def test_non_assistant_message_not_parallel(self):
"""Non-AssistantMessage types are never parallel continuations."""
assert _is_tool_only_message("not a message") is False
assert _is_tool_only_message(None) is False
assert _is_tool_only_message(42) is False
def test_single_tool_block_is_parallel(self):
"""Single ToolUseBlock AssistantMessage is a parallel continuation."""
msg = MagicMock(spec=AssistantMessage)
msg.content = [self._make_tool_block()]
assert _is_tool_only_message(msg) is True
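
The tests above pin down the behavior of `_is_tool_only_message` without showing its body in this excerpt. A minimal sketch that would satisfy them is given below, under loud assumptions: the three dataclasses are stand-ins for the `claude_agent_sdk` types so the snippet runs on its own, and the function name without the leading underscore is hypothetical. The point of interest is the emptiness guard, since `all()` over an empty list is True.

```python
from dataclasses import dataclass, field


@dataclass
class ToolUseBlock:  # stand-in for claude_agent_sdk.ToolUseBlock
    name: str = "tool"


@dataclass
class TextBlock:  # stand-in for claude_agent_sdk.TextBlock
    text: str = ""


@dataclass
class AssistantMessage:  # stand-in for claude_agent_sdk.AssistantMessage
    content: list = field(default_factory=list)


def is_tool_only_message(msg: object) -> bool:
    """Return True only for an AssistantMessage made entirely of ToolUseBlocks."""
    if not isinstance(msg, AssistantMessage):
        return False
    # all() over an empty list is True (vacuous truth), so require
    # non-empty content before checking the block types.
    return bool(msg.content) and all(
        isinstance(block, ToolUseBlock) for block in msg.content
    )
```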


@@ -8,7 +8,12 @@ from unittest.mock import AsyncMock, MagicMock, patch
import pytest
from .service import _prepare_file_attachments, _resolve_sdk_model
from .service import (
_is_sdk_disconnect_error,
_prepare_file_attachments,
_resolve_sdk_model,
_safe_close_sdk_client,
)
@dataclass
@@ -499,3 +504,111 @@ class TestResolveSdkModel:
)
monkeypatch.setattr("backend.copilot.sdk.service.config", cfg)
assert _resolve_sdk_model() == "claude-opus-4-6"
# ---------------------------------------------------------------------------
# _is_sdk_disconnect_error — classify client disconnect cleanup errors
# ---------------------------------------------------------------------------
class TestIsSdkDisconnectError:
"""Tests for _is_sdk_disconnect_error — identifies expected SDK cleanup errors."""
def test_cancel_scope_runtime_error(self):
"""RuntimeError about cancel scope in wrong task is a disconnect error."""
exc = RuntimeError(
"Attempted to exit cancel scope in a different task than it was entered in"
)
assert _is_sdk_disconnect_error(exc) is True
def test_context_var_value_error(self):
"""ValueError about ContextVar token mismatch is a disconnect error."""
exc = ValueError(
"<Token var=<ContextVar name='current_context'>> "
"was created in a different Context"
)
assert _is_sdk_disconnect_error(exc) is True
def test_unrelated_runtime_error(self):
"""Unrelated RuntimeError should NOT be classified as disconnect error."""
exc = RuntimeError("something else went wrong")
assert _is_sdk_disconnect_error(exc) is False
def test_unrelated_value_error(self):
"""Unrelated ValueError should NOT be classified as disconnect error."""
exc = ValueError("invalid argument")
assert _is_sdk_disconnect_error(exc) is False
def test_other_exception_types(self):
"""Non-RuntimeError/ValueError should NOT be classified as disconnect error."""
assert _is_sdk_disconnect_error(TypeError("bad type")) is False
assert _is_sdk_disconnect_error(OSError("network down")) is False
assert _is_sdk_disconnect_error(asyncio.CancelledError()) is False
# ---------------------------------------------------------------------------
# _safe_close_sdk_client — suppress cleanup errors during disconnect
# ---------------------------------------------------------------------------
class TestSafeCloseSdkClient:
"""Tests for _safe_close_sdk_client — suppresses expected SDK cleanup errors."""
@pytest.mark.asyncio
async def test_clean_exit(self):
"""Normal __aexit__ (no error) should succeed silently."""
client = AsyncMock()
client.__aexit__ = AsyncMock(return_value=None)
await _safe_close_sdk_client(client, "[test]")
client.__aexit__.assert_awaited_once_with(None, None, None)
@pytest.mark.asyncio
async def test_cancel_scope_runtime_error_suppressed(self):
"""RuntimeError from cancel scope mismatch should be suppressed."""
client = AsyncMock()
client.__aexit__ = AsyncMock(
side_effect=RuntimeError(
"Attempted to exit cancel scope in a different task"
)
)
# Should NOT raise
await _safe_close_sdk_client(client, "[test]")
@pytest.mark.asyncio
async def test_context_var_value_error_suppressed(self):
"""ValueError from ContextVar token mismatch should be suppressed."""
client = AsyncMock()
client.__aexit__ = AsyncMock(
side_effect=ValueError(
"<Token var=<ContextVar name='current_context'>> "
"was created in a different Context"
)
)
# Should NOT raise
await _safe_close_sdk_client(client, "[test]")
@pytest.mark.asyncio
async def test_unexpected_exception_suppressed_with_error_log(self):
"""Unexpected exceptions should be caught (not propagated) but logged at error."""
client = AsyncMock()
client.__aexit__ = AsyncMock(side_effect=OSError("unexpected"))
# Should NOT raise — unexpected errors are also suppressed to
# avoid crashing the generator during teardown. Logged at error
# level so Sentry captures them via its logging integration.
await _safe_close_sdk_client(client, "[test]")
@pytest.mark.asyncio
async def test_unrelated_runtime_error_propagates(self):
"""Non-cancel-scope RuntimeError should propagate (not suppressed)."""
client = AsyncMock()
client.__aexit__ = AsyncMock(side_effect=RuntimeError("something unrelated"))
with pytest.raises(RuntimeError, match="something unrelated"):
await _safe_close_sdk_client(client, "[test]")
@pytest.mark.asyncio
async def test_unrelated_value_error_propagates(self):
"""Non-disconnect ValueError should propagate (not suppressed)."""
client = AsyncMock()
client.__aexit__ = AsyncMock(side_effect=ValueError("invalid argument"))
with pytest.raises(ValueError, match="invalid argument"):
await _safe_close_sdk_client(client, "[test]")


@@ -0,0 +1,144 @@
"""Claude Code subscription auth helpers.
Handles locating the SDK-bundled CLI binary, provisioning credentials from
environment variables, and validating that subscription auth is functional.
"""
import functools
import json
import logging
import os
import shutil
import subprocess
logger = logging.getLogger(__name__)
def find_bundled_cli() -> str:
"""Locate the Claude CLI binary bundled inside ``claude_agent_sdk``.
Falls back to ``shutil.which("claude")`` if the SDK bundle is absent.
"""
try:
from claude_agent_sdk._internal.transport.subprocess_cli import (
SubprocessCLITransport,
)
path = SubprocessCLITransport._find_bundled_cli(None) # type: ignore[arg-type]
if path:
return str(path)
except Exception:
pass
system_path = shutil.which("claude")
if system_path:
return system_path
raise RuntimeError(
"Claude CLI not found — neither the SDK-bundled binary nor a "
"system-installed `claude` could be located."
)
def provision_credentials_file() -> None:
"""Write ``~/.claude/.credentials.json`` from env when running headless.
If ``CLAUDE_CODE_OAUTH_TOKEN`` is set (an OAuth *access* token obtained
from ``claude auth status`` or extracted from the macOS keychain), this
helper writes a minimal credentials file so the bundled CLI can
authenticate without an interactive ``claude login``.
A ``CLAUDE_CODE_REFRESH_TOKEN`` env var is optional but recommended —
it lets the CLI silently refresh an expired access token.
"""
access_token = os.environ.get("CLAUDE_CODE_OAUTH_TOKEN", "").strip()
if not access_token:
return
creds_dir = os.path.expanduser("~/.claude")
creds_path = os.path.join(creds_dir, ".credentials.json")
# Don't overwrite an existing credentials file (e.g. from a volume mount).
if os.path.exists(creds_path):
logger.debug("Credentials file already exists at %s — skipping", creds_path)
return
os.makedirs(creds_dir, exist_ok=True)
creds = {
"claudeAiOauth": {
"accessToken": access_token,
"refreshToken": os.environ.get("CLAUDE_CODE_REFRESH_TOKEN", "").strip(),
"expiresAt": 0,
"scopes": [
"user:inference",
"user:profile",
"user:sessions:claude_code",
],
}
}
with open(creds_path, "w") as f:
json.dump(creds, f)
logger.info("Provisioned Claude credentials file at %s", creds_path)
@functools.cache
def validate_subscription() -> None:
"""Validate the bundled Claude CLI is reachable and authenticated.
Cached so the blocking subprocess check runs at most once per process
lifetime. On first call, also provisions ``~/.claude/.credentials.json``
from the ``CLAUDE_CODE_OAUTH_TOKEN`` env var when available.
"""
provision_credentials_file()
cli = find_bundled_cli()
result = subprocess.run(
[cli, "--version"],
capture_output=True,
text=True,
timeout=10,
)
if result.returncode != 0:
raise RuntimeError(
f"Claude CLI check failed (exit {result.returncode}): "
f"{result.stderr.strip()}"
)
logger.info(
"Claude Code subscription mode: CLI version %s",
result.stdout.strip(),
)
# Verify the CLI is actually authenticated.
auth_result = subprocess.run(
[cli, "auth", "status"],
capture_output=True,
text=True,
timeout=10,
env={
**os.environ,
"ANTHROPIC_API_KEY": "",
"ANTHROPIC_AUTH_TOKEN": "",
"ANTHROPIC_BASE_URL": "",
},
)
if auth_result.returncode != 0:
raise RuntimeError(
"Claude CLI is not authenticated. Either:\n"
" • Set CLAUDE_CODE_OAUTH_TOKEN env var (from `claude auth status` "
"or macOS keychain), or\n"
" • Mount ~/.claude/.credentials.json into the container, or\n"
" • Run `claude login` inside the container."
)
try:
status = json.loads(auth_result.stdout)
if not status.get("loggedIn"):
raise RuntimeError(
"Claude CLI reports loggedIn=false. Set CLAUDE_CODE_OAUTH_TOKEN "
"or run `claude login`."
)
logger.info(
"Claude subscription auth: method=%s, email=%s",
status.get("authMethod"),
status.get("email"),
)
except json.JSONDecodeError:
logger.warning("Could not parse `claude auth status` output")
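
`provision_credentials_file` above checks `os.path.exists` and then writes with a plain `open(..., "w")`. As a possible variation, and not something this PR does, the check and the create can be merged into one step with `O_EXCL`, with the file created owner-readable only. The helper name `write_credentials_file` and its return value are hypothetical; permission bits apply on POSIX systems and are largely ignored on Windows.

```python
import json
import os


def write_credentials_file(creds_path: str, creds: dict) -> bool:
    """Create creds_path with mode 0o600; return False if it already exists."""
    os.makedirs(os.path.dirname(creds_path), mode=0o700, exist_ok=True)
    try:
        # O_EXCL fails if the file exists, so an existing (e.g. mounted)
        # credentials file is never overwritten.
        fd = os.open(creds_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        json.dump(creds, f)
    return True
```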


@@ -0,0 +1,96 @@
"""Tests for the tool call circuit breaker in tool_adapter.py."""
import pytest
from backend.copilot.sdk.tool_adapter import (
_MAX_CONSECUTIVE_TOOL_FAILURES,
_check_circuit_breaker,
_clear_tool_failures,
_consecutive_tool_failures,
_record_tool_failure,
)
@pytest.fixture(autouse=True)
def _reset_tracker():
"""Reset the circuit breaker tracker for each test."""
token = _consecutive_tool_failures.set({})
yield
_consecutive_tool_failures.reset(token)
class TestCircuitBreaker:
def test_no_trip_below_threshold(self):
"""Circuit breaker should not trip before reaching the limit."""
args = {"file_path": "/tmp/test.txt"}
for _ in range(_MAX_CONSECUTIVE_TOOL_FAILURES - 1):
assert _check_circuit_breaker("write_file", args) is None
_record_tool_failure("write_file", args)
# Still under the limit
assert _check_circuit_breaker("write_file", args) is None
def test_trips_at_threshold(self):
"""Circuit breaker should trip after reaching the failure limit."""
args = {"file_path": "/tmp/test.txt"}
for _ in range(_MAX_CONSECUTIVE_TOOL_FAILURES):
assert _check_circuit_breaker("write_file", args) is None
_record_tool_failure("write_file", args)
# Now it should trip
result = _check_circuit_breaker("write_file", args)
assert result is not None
assert "STOP" in result
assert "write_file" in result
def test_different_args_tracked_separately(self):
"""Different args should have separate failure counters."""
args_a = {"file_path": "/tmp/a.txt"}
args_b = {"file_path": "/tmp/b.txt"}
for _ in range(_MAX_CONSECUTIVE_TOOL_FAILURES):
_record_tool_failure("write_file", args_a)
# args_a should trip
assert _check_circuit_breaker("write_file", args_a) is not None
# args_b should NOT trip
assert _check_circuit_breaker("write_file", args_b) is None
def test_different_tools_tracked_separately(self):
"""Different tools should have separate failure counters."""
args = {"file_path": "/tmp/test.txt"}
for _ in range(_MAX_CONSECUTIVE_TOOL_FAILURES):
_record_tool_failure("tool_a", args)
# tool_a should trip
assert _check_circuit_breaker("tool_a", args) is not None
# tool_b with same args should NOT trip
assert _check_circuit_breaker("tool_b", args) is None
def test_empty_args_tracked(self):
"""Empty args ({}) — the exact failure pattern from the bug — should be tracked."""
args = {}
for _ in range(_MAX_CONSECUTIVE_TOOL_FAILURES):
_record_tool_failure("write_file", args)
assert _check_circuit_breaker("write_file", args) is not None
def test_clear_resets_counter(self):
"""Clearing failures should reset the counter."""
args = {}
for _ in range(_MAX_CONSECUTIVE_TOOL_FAILURES):
_record_tool_failure("write_file", args)
_clear_tool_failures("write_file")
assert _check_circuit_breaker("write_file", args) is None
def test_success_clears_failures(self):
"""A successful call should reset the failure counter."""
args = {}
for _ in range(_MAX_CONSECUTIVE_TOOL_FAILURES - 1):
_record_tool_failure("write_file", args)
# Success clears failures
_clear_tool_failures("write_file")
# Should be able to fail again without tripping
for _ in range(_MAX_CONSECUTIVE_TOOL_FAILURES - 1):
_record_tool_failure("write_file", args)
assert _check_circuit_breaker("write_file", args) is None
def test_no_tracker_returns_none(self):
"""If tracker is not initialized, circuit breaker should not trip."""
_consecutive_tool_failures.set(None) # type: ignore[arg-type]
_record_tool_failure("write_file", {}) # should not raise
assert _check_circuit_breaker("write_file", {}) is None


@@ -0,0 +1,823 @@
"""Tests for thinking/redacted_thinking block preservation.
Validates the fix for the Anthropic API error:
"thinking or redacted_thinking blocks in the latest assistant message
cannot be modified. These blocks must remain as they were in the
original response."
The API requires that thinking blocks in the LAST assistant message are
preserved value-identical. Older assistant messages may have thinking blocks
stripped entirely. This test suite covers:
1. _flatten_assistant_content — strips thinking from older messages
2. compact_transcript — preserves last assistant's thinking blocks
3. response_adapter — handles ThinkingBlock without error
4. _format_sdk_content_blocks — preserves redacted_thinking blocks
"""
from __future__ import annotations
from unittest.mock import AsyncMock, patch
import pytest
from claude_agent_sdk import AssistantMessage, TextBlock, ThinkingBlock
from backend.copilot.response_model import (
StreamStartStep,
StreamTextDelta,
StreamTextStart,
)
from backend.util import json
from .conftest import build_structured_transcript
from .response_adapter import SDKResponseAdapter
from .service import _format_sdk_content_blocks
from .transcript import (
_find_last_assistant_entry,
_flatten_assistant_content,
_messages_to_transcript,
_rechain_tail,
_transcript_to_messages,
compact_transcript,
validate_transcript,
)
# ---------------------------------------------------------------------------
# Fixtures: realistic thinking block content
# ---------------------------------------------------------------------------
THINKING_BLOCK = {
"type": "thinking",
"thinking": "Let me analyze the user's request carefully...",
"signature": "ErUBCkYIAxgCIkD0V2MsRXPkuGolGexaW9V1kluijxXGF",
}
REDACTED_THINKING_BLOCK = {
"type": "redacted_thinking",
"data": "EmwKAhgBEgy2VEE8PJaS2oLJCPkaT...",
}
def _make_thinking_transcript() -> str:
    """Build a transcript with thinking blocks in multiple assistant turns.

    Layout:
        User 1 → Assistant 1 (thinking + text + tool_use)
        User 2 (tool_result) → Assistant 2 (thinking + text)
        User 3 → Assistant 3 (thinking + redacted_thinking + text) ← LAST
    """
    return build_structured_transcript(
        [
            ("user", "What files are in this project?"),
            (
                "assistant",
                [
                    {
                        "type": "thinking",
                        "thinking": "I should list the files.",
                        "signature": "sig_old_1",
                    },
                    {"type": "text", "text": "Let me check the files."},
                    {
                        "type": "tool_use",
                        "id": "tu1",
                        "name": "list_files",
                        "input": {"path": "/"},
                    },
                ],
            ),
            ("user", "Here are the files: a.py, b.py"),
            (
                "assistant",
                [
                    {
                        "type": "thinking",
                        "thinking": "Good, I see two Python files.",
                        "signature": "sig_old_2",
                    },
                    {"type": "text", "text": "I found a.py and b.py."},
                ],
            ),
            ("user", "Tell me about a.py"),
            (
                "assistant",
                [
                    THINKING_BLOCK,
                    REDACTED_THINKING_BLOCK,
                    {"type": "text", "text": "a.py contains the main entry point."},
                ],
            ),
        ]
    )
def _last_assistant_content(transcript_jsonl: str) -> list[dict] | None:
    """Extract the content blocks of the last assistant entry in a transcript."""
    last_content = None
    for line in transcript_jsonl.strip().split("\n"):
        entry = json.loads(line)
        msg = entry.get("message", {})
        if msg.get("role") == "assistant":
            last_content = msg.get("content")
    return last_content
# ---------------------------------------------------------------------------
# _find_last_assistant_entry — unit tests
# ---------------------------------------------------------------------------
class TestFindLastAssistantEntry:
    def test_splits_at_last_assistant(self):
        """Prefix contains everything before last assistant; tail starts at it."""
        transcript = build_structured_transcript(
            [
                ("user", "Hello"),
                ("assistant", [{"type": "text", "text": "Hi"}]),
                ("user", "More"),
                ("assistant", [{"type": "text", "text": "Details"}]),
            ]
        )
        prefix, tail = _find_last_assistant_entry(transcript)
        # 3 entries in prefix (user, assistant, user), 1 in tail (last assistant)
        assert len(prefix) == 3
        assert len(tail) == 1

    def test_no_assistant_returns_all_in_prefix(self):
        """When there's no assistant, all lines are in prefix, tail is empty."""
        transcript = build_structured_transcript(
            [("user", "Hello"), ("user", "Another question")]
        )
        prefix, tail = _find_last_assistant_entry(transcript)
        assert len(prefix) == 2
        assert tail == []

    def test_assistant_at_index_zero(self):
        """When assistant is the first entry, prefix is empty."""
        transcript = build_structured_transcript(
            [("assistant", [{"type": "text", "text": "Start"}])]
        )
        prefix, tail = _find_last_assistant_entry(transcript)
        assert prefix == []
        assert len(tail) == 1

    def test_trailing_user_included_in_tail(self):
        """User message after last assistant is part of the tail."""
        transcript = build_structured_transcript(
            [
                ("user", "Q1"),
                ("assistant", [{"type": "text", "text": "A1"}]),
                ("user", "Q2"),
            ]
        )
        prefix, tail = _find_last_assistant_entry(transcript)
        assert len(prefix) == 1  # first user
        assert len(tail) == 2  # last assistant + trailing user

    def test_multi_entry_turn_fully_preserved(self):
        """An assistant turn spanning multiple JSONL entries (same message.id)
        must be entirely in the tail, not split across prefix and tail."""
        # Build manually because build_structured_transcript generates unique ids
        lines = [
            json.dumps(
                {
                    "type": "user",
                    "uuid": "u1",
                    "parentUuid": "",
                    "message": {"role": "user", "content": "Hello"},
                }
            ),
            json.dumps(
                {
                    "type": "assistant",
                    "uuid": "a1-think",
                    "parentUuid": "u1",
                    "message": {
                        "role": "assistant",
                        "id": "msg_same_turn",
                        "type": "message",
                        "content": [THINKING_BLOCK],
                        "stop_reason": None,
                        "stop_sequence": None,
                    },
                }
            ),
            json.dumps(
                {
                    "type": "assistant",
                    "uuid": "a1-tool",
                    "parentUuid": "u1",
                    "message": {
                        "role": "assistant",
                        "id": "msg_same_turn",
                        "type": "message",
                        "content": [
                            {
                                "type": "tool_use",
                                "id": "tu1",
                                "name": "Bash",
                                "input": {},
                            },
                        ],
                        "stop_reason": "tool_use",
                        "stop_sequence": None,
                    },
                }
            ),
        ]
        transcript = "\n".join(lines) + "\n"
        prefix, tail = _find_last_assistant_entry(transcript)
        # Both assistant entries share msg_same_turn → both in tail
        assert len(prefix) == 1  # only the user entry
        assert len(tail) == 2  # both assistant entries (thinking + tool_use)

    def test_no_message_id_preserves_last_assistant(self):
        """When the last assistant entry has no message.id, it should still
        be preserved in the tail (fail closed) rather than being compressed."""
        lines = [
            json.dumps(
                {
                    "type": "user",
                    "uuid": "u1",
                    "parentUuid": "",
                    "message": {"role": "user", "content": "Hello"},
                }
            ),
            json.dumps(
                {
                    "type": "assistant",
                    "uuid": "a1",
                    "parentUuid": "u1",
                    "message": {
                        "role": "assistant",
                        "content": [THINKING_BLOCK, {"type": "text", "text": "Hi"}],
                    },
                }
            ),
        ]
        transcript = "\n".join(lines) + "\n"
        prefix, tail = _find_last_assistant_entry(transcript)
        assert len(prefix) == 1  # user entry
        assert len(tail) == 1  # assistant entry preserved
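
The splitting behavior these tests pin down can be sketched in isolation. The block below is a minimal, hypothetical illustration only: `split_at_last_assistant` is an invented name, it uses the standard-library `json` module instead of `backend.util.json`, and it is not the real `_find_last_assistant_entry`, whose implementation is not shown in this diff. It assumes that entries of one assistant turn are contiguous and share `message.id`.

```python
import json


def split_at_last_assistant(transcript: str) -> tuple[list[str], list[str]]:
    """Hypothetical sketch: split JSONL lines into (prefix, tail).

    The tail starts at the first entry of the last assistant turn and runs
    to the end of the transcript, so a trailing user entry stays in the tail.
    """
    lines = [ln for ln in transcript.split("\n") if ln.strip()]
    entries = [json.loads(ln) for ln in lines]

    # Locate the last entry whose message role is "assistant".
    last_idx = None
    for i, entry in enumerate(entries):
        if entry.get("message", {}).get("role") == "assistant":
            last_idx = i
    if last_idx is None:
        return lines, []  # no assistant: everything is prefix

    # Extend backwards over contiguous assistant entries with the same
    # message.id. With no id, keep just the last entry (fail closed).
    start = last_idx
    turn_id = entries[last_idx]["message"].get("id")
    if turn_id is not None:
        while start > 0:
            prev = entries[start - 1].get("message", {})
            if prev.get("role") == "assistant" and prev.get("id") == turn_id:
                start -= 1
            else:
                break
    return lines[:start], lines[start:]
```

Walking backwards from the last assistant entry keeps a multi-entry turn (for example a thinking entry followed by a tool_use entry) together in the tail, which is the property `test_multi_entry_turn_fully_preserved` checks.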
# ---------------------------------------------------------------------------
# _rechain_tail — UUID chain patching
# ---------------------------------------------------------------------------
class TestRechainTail:
    def test_patches_first_entry_parentuuid(self):
        """First tail entry's parentUuid should point to last prefix uuid."""
        prefix = _messages_to_transcript(
            [
                {"role": "user", "content": "Hello"},
                {"role": "assistant", "content": "Hi"},
            ]
        )
        # Get the last uuid from the prefix
        last_prefix_uuid = None
        for line in prefix.strip().split("\n"):
            entry = json.loads(line)
            last_prefix_uuid = entry.get("uuid")
        tail_lines = [
            json.dumps(
                {
                    "type": "assistant",
                    "uuid": "tail-a1",
                    "parentUuid": "old-parent",
                    "message": {
                        "role": "assistant",
                        "content": [{"type": "text", "text": "Tail msg"}],
                    },
                }
            )
        ]
        result = _rechain_tail(prefix, tail_lines)
        entry = json.loads(result.strip())
        assert entry["parentUuid"] == last_prefix_uuid
        assert entry["uuid"] == "tail-a1"  # uuid preserved

    def test_chains_multiple_tail_entries(self):
        """Subsequent tail entries chain to each other."""
        prefix = _messages_to_transcript([{"role": "user", "content": "Hi"}])
        tail_lines = [
            json.dumps(
                {
                    "type": "assistant",
                    "uuid": "t1",
                    "parentUuid": "old1",
                    "message": {"role": "assistant", "content": []},
                }
            ),
            json.dumps(
                {
                    "type": "user",
                    "uuid": "t2",
                    "parentUuid": "old2",
                    "message": {"role": "user", "content": "Follow-up"},
                }
            ),
        ]
        result = _rechain_tail(prefix, tail_lines)
        entries = [json.loads(ln) for ln in result.strip().split("\n")]
        assert len(entries) == 2
        # Second entry's parentUuid should be first entry's uuid
        assert entries[1]["parentUuid"] == "t1"

    def test_empty_tail_returns_empty(self):
        """No tail entries → empty string."""
        prefix = _messages_to_transcript([{"role": "user", "content": "Hi"}])
        assert _rechain_tail(prefix, []) == ""

    def test_preserves_message_content_verbatim(self):
        """Tail message content (including thinking blocks) must not be modified."""
        prefix = _messages_to_transcript([{"role": "user", "content": "Hi"}])
        original_content = [
            THINKING_BLOCK,
            REDACTED_THINKING_BLOCK,
            {"type": "text", "text": "Response"},
        ]
        tail_lines = [
            json.dumps(
                {
                    "type": "assistant",
                    "uuid": "t1",
                    "parentUuid": "old",
                    "message": {
                        "role": "assistant",
                        "content": original_content,
                    },
                }
            )
        ]
        result = _rechain_tail(prefix, tail_lines)
        entry = json.loads(result.strip())
        assert entry["message"]["content"] == original_content
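
The re-parenting contract exercised above can be sketched without the real module. This is a minimal illustration under stated assumptions: `rechain_tail` is a hypothetical stand-in written against the standard-library `json` module, not the actual `_rechain_tail`, and the fallback of an empty-string parent when the prefix has no entries is an assumption rather than something the tests confirm.

```python
import json


def rechain_tail(prefix: str, tail_lines: list[str]) -> str:
    """Hypothetical sketch: re-parent preserved tail entries onto a new prefix."""
    if not tail_lines:
        return ""

    # The uuid of the last prefix entry becomes the first tail entry's parent.
    parent = ""  # assumed fallback when the prefix has no entries
    for line in prefix.strip().split("\n"):
        if line.strip():
            parent = json.loads(line).get("uuid", parent)

    out = []
    for line in tail_lines:
        entry = json.loads(line)
        entry["parentUuid"] = parent  # only the link is patched
        parent = entry.get("uuid", parent)  # the next entry chains to this one
        out.append(json.dumps(entry))
    return "\n".join(out) + "\n"
```

Only `parentUuid` is rewritten; `uuid` and `message` round-trip through `json.loads` and `json.dumps` unchanged in value, which is what keeps signed thinking blocks intact.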
# ---------------------------------------------------------------------------
# _flatten_assistant_content — thinking blocks
# ---------------------------------------------------------------------------
class TestFlattenThinkingBlocks:
    def test_thinking_blocks_are_stripped(self):
        """Thinking blocks should not appear in flattened text for compression."""
        blocks = [
            {"type": "thinking", "thinking": "secret thoughts", "signature": "sig"},
            {"type": "text", "text": "Hello user"},
        ]
        result = _flatten_assistant_content(blocks)
        assert "secret thoughts" not in result
        assert "Hello user" in result

    def test_redacted_thinking_blocks_are_stripped(self):
        """Redacted thinking blocks should not appear in flattened text."""
        blocks = [
            {"type": "redacted_thinking", "data": "encrypted_data"},
            {"type": "text", "text": "Response text"},
        ]
        result = _flatten_assistant_content(blocks)
        assert "encrypted_data" not in result
        assert "Response text" in result

    def test_thinking_only_message_flattens_to_empty(self):
        """A message with only thinking blocks flattens to empty string."""
        blocks = [
            {"type": "thinking", "thinking": "just thinking...", "signature": "sig"},
        ]
        result = _flatten_assistant_content(blocks)
        assert result == ""

    def test_mixed_thinking_text_tool(self):
        """Mixed blocks: only text survives flattening; thinking and tool_use dropped."""
        blocks = [
            {"type": "thinking", "thinking": "hmm", "signature": "sig"},
            {"type": "redacted_thinking", "data": "xyz"},
            {"type": "text", "text": "I'll read the file."},
            {"type": "tool_use", "name": "Read", "input": {"path": "/x"}},
        ]
        result = _flatten_assistant_content(blocks)
        assert "hmm" not in result
        assert "xyz" not in result
        assert "I'll read the file." in result
        # tool_use blocks are dropped entirely to prevent model mimicry
        assert "Read" not in result
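
A flattening rule consistent with these four tests can be sketched as a simple filter. The function below is a hypothetical illustration: `flatten_assistant_content` without the leading underscore is an invented name, and joining text blocks with newlines is an assumption. The real `_flatten_assistant_content` may handle further block types that this diff does not show.

```python
def flatten_assistant_content(blocks: list[dict]) -> str:
    """Hypothetical sketch: keep only text blocks for the compressor input."""
    parts: list[str] = []
    for block in blocks:
        # thinking, redacted_thinking and tool_use blocks are skipped, so
        # neither private reasoning nor tool names reach the summary.
        if block.get("type") == "text":
            parts.append(block.get("text", ""))
    return "\n".join(parts)  # the newline separator is an assumption
```

An allow-list on `text` is the safer default here: any block type added later is dropped from the compression input unless it is explicitly opted in.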
# ---------------------------------------------------------------------------
# compact_transcript — thinking block preservation
# ---------------------------------------------------------------------------
class TestCompactTranscriptThinkingBlocks:
    """Verify that compact_transcript preserves thinking blocks in the
    last assistant message while stripping them from older messages."""

    @pytest.mark.asyncio
    async def test_last_assistant_thinking_blocks_preserved(self, mock_chat_config):
        """After compaction, the last assistant entry must retain its
        original thinking and redacted_thinking blocks verbatim."""
        transcript = _make_thinking_transcript()
        compacted_msgs = [
            {"role": "user", "content": "[conversation summary]"},
            {"role": "assistant", "content": "Summarized response"},
        ]
        mock_result = type(
            "CompressResult",
            (),
            {
                "was_compacted": True,
                "messages": compacted_msgs,
                "original_token_count": 800,
                "token_count": 200,
                "messages_summarized": 4,
                "messages_dropped": 0,
            },
        )()
        with patch(
            "backend.copilot.sdk.transcript._run_compression",
            new_callable=AsyncMock,
            return_value=mock_result,
        ):
            result = await compact_transcript(transcript, model="test-model")
        assert result is not None
        assert validate_transcript(result)
        last_content = _last_assistant_content(result)
        assert last_content is not None, "No assistant entry found"
        assert isinstance(last_content, list)
        # The last assistant must have the thinking blocks preserved
        block_types = [b["type"] for b in last_content]
        assert (
            "thinking" in block_types
        ), "thinking block missing from last assistant message"
        assert (
            "redacted_thinking" in block_types
        ), "redacted_thinking block missing from last assistant message"
        assert "text" in block_types
        # Verify the thinking block content is value-identical
        thinking_blocks = [b for b in last_content if b["type"] == "thinking"]
        assert len(thinking_blocks) == 1
        assert thinking_blocks[0]["thinking"] == THINKING_BLOCK["thinking"]
        assert thinking_blocks[0]["signature"] == THINKING_BLOCK["signature"]
        redacted_blocks = [b for b in last_content if b["type"] == "redacted_thinking"]
        assert len(redacted_blocks) == 1
        assert redacted_blocks[0]["data"] == REDACTED_THINKING_BLOCK["data"]
    @pytest.mark.asyncio
    async def test_older_assistant_thinking_blocks_stripped(self, mock_chat_config):
        """Older assistant messages should NOT retain thinking blocks
        after compaction (they're compressed into summaries)."""
        transcript = _make_thinking_transcript()
        # The compressor will receive messages where older assistant
        # entries have already had thinking blocks stripped.
        captured_messages: list[dict] = []

        async def mock_compression(messages, model, log_prefix):
            captured_messages.extend(messages)
            return type(
                "CompressResult",
                (),
                {
                    "was_compacted": True,
                    "messages": messages,
                    "original_token_count": 800,
                    "token_count": 400,
                    "messages_summarized": 2,
                    "messages_dropped": 0,
                },
            )()

        with patch(
            "backend.copilot.sdk.transcript._run_compression",
            side_effect=mock_compression,
        ):
            await compact_transcript(transcript, model="test-model")
        # Check that the messages sent to compression don't contain
        # thinking content from older assistant messages
        for msg in captured_messages:
            if msg["role"] == "assistant":
                content = msg.get("content", "")
                assert (
                    "I should list the files." not in content
                ), "Old thinking block content leaked into compression input"
                assert (
                    "Good, I see two Python files." not in content
                ), "Old thinking block content leaked into compression input"
    @pytest.mark.asyncio
    async def test_trailing_user_message_after_last_assistant(self, mock_chat_config):
        """When the last entry is a user message, the last *assistant*
        message's thinking blocks should still be preserved."""
        transcript = build_structured_transcript(
            [
                ("user", "Hello"),
                (
                    "assistant",
                    [
                        THINKING_BLOCK,
                        {"type": "text", "text": "Hi there"},
                    ],
                ),
                ("user", "Follow-up question"),
            ]
        )
        # The compressor only receives the prefix (1 user message); the
        # tail (assistant + trailing user) is preserved verbatim.
        compacted_msgs = [
            {"role": "user", "content": "Hello"},
        ]
        mock_result = type(
            "CompressResult",
            (),
            {
                "was_compacted": True,
                "messages": compacted_msgs,
                "original_token_count": 400,
                "token_count": 100,
                "messages_summarized": 0,
                "messages_dropped": 0,
            },
        )()
        with patch(
            "backend.copilot.sdk.transcript._run_compression",
            new_callable=AsyncMock,
            return_value=mock_result,
        ):
            result = await compact_transcript(transcript, model="test-model")
        assert result is not None
        last_content = _last_assistant_content(result)
        assert last_content is not None
        assert isinstance(last_content, list)
        block_types = [b["type"] for b in last_content]
        assert (
            "thinking" in block_types
        ), "thinking block lost from last assistant despite trailing user msg"
    @pytest.mark.asyncio
    async def test_single_assistant_with_thinking_preserved(self, mock_chat_config):
        """When there's only one assistant message (which is also the last),
        its thinking blocks must be preserved."""
        transcript = build_structured_transcript(
            [
                ("user", "Hello"),
                (
                    "assistant",
                    [
                        THINKING_BLOCK,
                        {"type": "text", "text": "World"},
                    ],
                ),
            ]
        )
        compacted_msgs = [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "World"},
        ]
        mock_result = type(
            "CompressResult",
            (),
            {
                "was_compacted": True,
                "messages": compacted_msgs,
                "original_token_count": 200,
                "token_count": 100,
                "messages_summarized": 0,
                "messages_dropped": 0,
            },
        )()
        with patch(
            "backend.copilot.sdk.transcript._run_compression",
            new_callable=AsyncMock,
            return_value=mock_result,
        ):
            result = await compact_transcript(transcript, model="test-model")
        assert result is not None
        last_content = _last_assistant_content(result)
        assert last_content is not None
        assert isinstance(last_content, list)
        block_types = [b["type"] for b in last_content]
        assert "thinking" in block_types
    @pytest.mark.asyncio
    async def test_tail_parentuuid_rewired_to_prefix(self, mock_chat_config):
        """After compaction, the first tail entry's parentUuid must point to
        the last entry in the compressed prefix, not its original parent."""
        transcript = _make_thinking_transcript()
        compacted_msgs = [
            {"role": "user", "content": "[conversation summary]"},
            {"role": "assistant", "content": "Summarized response"},
        ]
        mock_result = type(
            "CompressResult",
            (),
            {
                "was_compacted": True,
                "messages": compacted_msgs,
                "original_token_count": 800,
                "token_count": 200,
                "messages_summarized": 4,
                "messages_dropped": 0,
            },
        )()
        with patch(
            "backend.copilot.sdk.transcript._run_compression",
            new_callable=AsyncMock,
            return_value=mock_result,
        ):
            result = await compact_transcript(transcript, model="test-model")
        assert result is not None
        lines = [ln for ln in result.strip().split("\n") if ln.strip()]
        entries = [json.loads(ln) for ln in lines]
        # Find the boundary: the compressed prefix ends just before the
        # first tail entry (last assistant in original transcript).
        tail_start = None
        for i, entry in enumerate(entries):
            msg = entry.get("message", {})
            if isinstance(msg.get("content"), list):
                # Structured content = preserved tail entry
                tail_start = i
                break
        assert tail_start is not None, "Could not find preserved tail entry"
        assert tail_start > 0, "Tail should not be the first entry"
        # The tail entry's parentUuid must be the uuid of the preceding entry
        prefix_last_uuid = entries[tail_start - 1]["uuid"]
        tail_first_parent = entries[tail_start]["parentUuid"]
        assert tail_first_parent == prefix_last_uuid, (
            f"Tail parentUuid {tail_first_parent!r} != "
            f"last prefix uuid {prefix_last_uuid!r}"
        )
    @pytest.mark.asyncio
    async def test_no_thinking_blocks_still_works(self, mock_chat_config):
        """Compaction should still work normally when there are no thinking
        blocks in the transcript."""
        transcript = build_structured_transcript(
            [
                ("user", "Hello"),
                ("assistant", [{"type": "text", "text": "Hi"}]),
                ("user", "More"),
                ("assistant", [{"type": "text", "text": "Details"}]),
            ]
        )
        compacted_msgs = [
            {"role": "user", "content": "[summary]"},
            {"role": "assistant", "content": "Summary"},
        ]
        mock_result = type(
            "CompressResult",
            (),
            {
                "was_compacted": True,
                "messages": compacted_msgs,
                "original_token_count": 200,
                "token_count": 50,
                "messages_summarized": 2,
                "messages_dropped": 0,
            },
        )()
        with patch(
            "backend.copilot.sdk.transcript._run_compression",
            new_callable=AsyncMock,
            return_value=mock_result,
        ):
            result = await compact_transcript(transcript, model="test-model")
        assert result is not None
        assert validate_transcript(result)
        # Verify last assistant content is preserved even without thinking blocks
        last_content = _last_assistant_content(result)
        assert last_content is not None
        assert last_content == [{"type": "text", "text": "Details"}]
# ---------------------------------------------------------------------------
# _transcript_to_messages — thinking block handling
# ---------------------------------------------------------------------------
class TestTranscriptToMessagesThinking:
    def test_thinking_blocks_excluded_from_flattened_content(self):
        """When _transcript_to_messages flattens content, thinking block
        text should not leak into the message content string."""
        transcript = build_structured_transcript(
            [
                ("user", "Hello"),
                (
                    "assistant",
                    [
                        {
                            "type": "thinking",
                            "thinking": "SECRET_THOUGHT",
                            "signature": "sig",
                        },
                        {"type": "text", "text": "Visible response"},
                    ],
                ),
            ]
        )
        messages = _transcript_to_messages(transcript)
        assistant_msg = [m for m in messages if m["role"] == "assistant"][0]
        assert "SECRET_THOUGHT" not in assistant_msg["content"]
        assert "Visible response" in assistant_msg["content"]
# ---------------------------------------------------------------------------
# response_adapter — ThinkingBlock handling
# ---------------------------------------------------------------------------
class TestResponseAdapterThinkingBlock:
    def test_thinking_block_does_not_crash(self):
        """ThinkingBlock in AssistantMessage should not cause an error."""
        adapter = SDKResponseAdapter(message_id="msg-1", session_id="sess-1")
        msg = AssistantMessage(
            content=[
                ThinkingBlock(
                    thinking="Let me think about this...",
                    signature="sig_test_123",
                ),
                TextBlock(text="Here is my response."),
            ],
            model="claude-test",
        )
        results = adapter.convert_message(msg)
        # Should produce stream events for text only, no crash
        types = [type(r) for r in results]
        assert StreamStartStep in types
        assert StreamTextStart in types or StreamTextDelta in types

    def test_thinking_block_does_not_emit_stream_events(self):
        """ThinkingBlock should NOT produce any StreamTextDelta events
        containing thinking content."""
        adapter = SDKResponseAdapter(message_id="msg-1", session_id="sess-1")
        msg = AssistantMessage(
            content=[
                ThinkingBlock(
                    thinking="My secret thoughts",
                    signature="sig_test_456",
                ),
                TextBlock(text="Public response"),
            ],
            model="claude-test",
        )
        results = adapter.convert_message(msg)
        text_deltas = [r for r in results if isinstance(r, StreamTextDelta)]
        for delta in text_deltas:
            assert "secret thoughts" not in (delta.delta or "")
# ---------------------------------------------------------------------------
# _format_sdk_content_blocks — redacted_thinking handling
# ---------------------------------------------------------------------------
class TestFormatSdkContentBlocks:
    def test_thinking_block_preserved(self):
        """ThinkingBlock should be serialized with type, thinking, and signature."""
        blocks = [
            ThinkingBlock(thinking="My thoughts", signature="sig123"),
            TextBlock(text="Response"),
        ]
        result = _format_sdk_content_blocks(blocks)
        assert len(result) == 2
        assert result[0] == {
            "type": "thinking",
            "thinking": "My thoughts",
            "signature": "sig123",
        }
        assert result[1] == {"type": "text", "text": "Response"}

    def test_raw_dict_redacted_thinking_preserved(self):
        """Raw dict blocks (e.g. redacted_thinking) pass through unchanged."""
        raw_block = {"type": "redacted_thinking", "data": "EmwKAh...encrypted"}
        blocks = [
            raw_block,
            TextBlock(text="Response"),
        ]
        result = _format_sdk_content_blocks(blocks)
        assert len(result) == 2
        assert result[0] == raw_block
        assert result[1] == {"type": "text", "text": "Response"}
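
The serialization contract checked by these two tests can be sketched with stand-in types. In the block below, `TextBlock` and `ThinkingBlock` are local dataclasses that only mimic the fields used in the tests; they are not the `claude_agent_sdk` classes. `format_content_blocks` is a hypothetical name, not the real `_format_sdk_content_blocks`, and silently skipping unknown block types is an assumption.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TextBlock:  # stand-in, not the SDK class
    text: str


@dataclass
class ThinkingBlock:  # stand-in, not the SDK class
    thinking: str
    signature: str


def format_content_blocks(blocks: list[Any]) -> list[dict]:
    """Hypothetical sketch: serialize typed blocks, pass raw dicts through."""
    result: list[dict] = []
    for block in blocks:
        if isinstance(block, dict):
            # Raw dicts such as redacted_thinking are kept unchanged.
            result.append(block)
        elif isinstance(block, ThinkingBlock):
            result.append(
                {
                    "type": "thinking",
                    "thinking": block.thinking,
                    "signature": block.signature,
                }
            )
        elif isinstance(block, TextBlock):
            result.append({"type": "text", "text": block.text})
        # Other block types are skipped here (an assumption of this sketch).
    return result
```

Passing raw dicts through untouched matters for `redacted_thinking`, whose `data` payload is opaque and cannot be rebuilt from a typed object.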
