Mirror of https://github.com/Significant-Gravitas/AutoGPT.git
synced 2026-04-30 03:00:41 -04:00
9710dc76fca91bc47aa572dd451e41083a9a1aed
8228 Commits
9710dc76fc
fix(backend): use f-strings for DataError warning logs per backend convention
Switch %s placeholder style to f-strings in both logger.warning calls (platform_cost.py and token_tracking.py) to match the backend logging convention (f-strings for non-debug levels).
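A minimal sketch of the two styles (logger name and message text are illustrative, not the actual backend code):

```python
import logging

logger = logging.getLogger("platform_cost")

def warn_percent_style(user_id: str) -> None:
    # Previous style: lazy %s interpolation, formatted only when emitted.
    logger.warning("DataError writing PlatformCostLog for user %s", user_id)

def warn_fstring_style(user_id: str) -> None:
    # New style per the stated convention: f-strings at non-debug levels.
    logger.warning(f"DataError writing PlatformCostLog for user {user_id}")
```

Both produce identical rendered messages; the f-string trades lazy formatting for readability at levels that are effectively always emitted.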
04ed0e6a4d
test(backend): add DataError coverage for platform cost safe-log paths
Adds explicit tests verifying that prisma.errors.DataError is caught and logged at WARNING (not ERROR) in both log_platform_cost_safe and _schedule_cost_log, and that neither function raises when a DataError occurs.
2a47ecc129
fix(backend): lower PlatformCostLog DataError to warning during schema mismatch
The autogpt-database-manager pod can run a stale Prisma client immediately after a schema migration (e.g. rolling deploy of PR #12696 that added PlatformCostLog). This caused every copilot token-tracking write to raise prisma.errors.DataError ('userId'/'metadata' field not found), which was caught by logger.exception() — firing Sentry events at ERROR level. Catch DataError specifically in both log_platform_cost_safe (platform_cost.py) and the _safe_log closure in token_tracking.py, and demote to WARNING so Sentry is not spammed during deploy windows. All other exceptions still escalate to ERROR/Sentry as before.
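A minimal sketch of the demotion pattern (the `DataError` class below stands in for `prisma.errors.DataError`; the function body and messages are illustrative, not the actual code):

```python
import logging

logger = logging.getLogger("platform_cost")

class DataError(Exception):
    """Stand-in for prisma.errors.DataError (stale client vs. new schema)."""

async def log_platform_cost_safe(insert_coro) -> None:
    try:
        await insert_coro
    except DataError as e:
        # Expected during deploy windows (stale Prisma client right after
        # a schema migration): warn, don't page Sentry.
        logger.warning(f"PlatformCostLog insert skipped (schema mismatch): {e}")
    except Exception:
        # Everything else still escalates to ERROR/Sentry as before.
        logger.exception("PlatformCostLog insert failed")
```

Either way the function swallows the exception, so the caller's fire-and-forget path never raises.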
ff8cdda4e8
feat(platform/admin): cost tracking for system credentials (#12696)
## Why
When system-managed credentials are used (AutoGPT pays the API bills),
there was no visibility into which providers were being called, how much
each costs, or which users were driving usage. This makes it impossible
to set appropriate per-user limits or reconcile expenses with actual API
invoices.
## What
End-to-end platform cost tracking for all 22 system-credential providers
+ both copilot modes:
- Every block execution that uses system credentials records a
`PlatformCostLog` row (provider, cost, tokens, user, execution IDs)
- Copilot turns (SDK + baseline) are tracked with model name, token
counts, and actual USD cost
- Admin dashboard at `/admin/platform-costs` shows cost breakdown by
provider and user with date/provider/user filters and paginated raw logs
- Admin API endpoints with 30s TTL cache: `GET
/platform-costs/dashboard` and `GET /platform-costs/logs`
## How
### Core hook
`cost_tracking.py` calls `log_system_credential_cost()` after each block
node execution. It reads `NodeExecutionStats.provider_cost` (set by
`merge_stats()` inside each block) and dispatches a fire-and-forget
`INSERT` via `log_platform_cost_safe()`.
### Per-block tracking
Each block calls `self.merge_stats(NodeExecutionStats(provider_cost=...,
provider_cost_type=...))`:
| Tracking type | Providers | Amount |
|---|---|---|
| `cost_usd` | OpenRouter, Exa | Actual USD from API response |
| `tokens` | OpenAI, Anthropic, Groq, Ollama, Jina | Token count from response.usage |
| `characters` | Unreal Speech, ElevenLabs, D-ID | Input text length |
| `sandbox_seconds` | E2B | Walltime |
| `walltime_seconds` | FAL, Revid, Replicate | Walltime |
| `per_run` | Google Maps, Apollo, SmartLead, etc. | 1 per execution |
OpenRouter cost: extracted via `with_raw_response.create()` and
`raw.headers.get("x-total-cost")` with `math.isfinite` + `>= 0`
validation (replaces private `_response` access).
### Copilot tracking
`token_tracking.py` writes a `PlatformCostLog` row per copilot LLM turn
via an async fire-and-forget queue bounded by a `Semaphore(50)`. SDK
path uses `sdk_msg.total_cost_usd`; baseline path uses the
`x-total-cost` header from OpenRouter streaming responses.
### Executor drain
`drain_pending_cost_logs()` is called before `executor.shutdown()` using
a module-level loop registry (`_active_node_execution_loops`) so that
pending log tasks from each worker thread's event loop are awaited
before the process exits. Tasks are filtered by `task.get_loop() is
current_loop` to avoid cross-loop `RuntimeError` in Python ≥ 3.10.
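A stripped-down sketch of the drain pattern (task bodies and names are illustrative; the real code drains each worker thread's registered event loop before `executor.shutdown()`):

```python
import asyncio

written_rows = []

async def fire_and_forget_insert(row: str) -> None:
    # Stand-in for the PlatformCostLog INSERT.
    await asyncio.sleep(0)
    written_rows.append(row)

def drain_pending_cost_logs(loop: asyncio.AbstractEventLoop) -> None:
    # Only await tasks bound to this loop; awaiting a task that belongs
    # to a different loop raises RuntimeError on Python >= 3.10.
    pending = [t for t in asyncio.all_tasks(loop)
               if t.get_loop() is loop and not t.done()]
    for task in pending:
        loop.run_until_complete(task)

loop = asyncio.new_event_loop()
loop.create_task(fire_and_forget_insert("row-1"))  # scheduled, not awaited
drain_pending_cost_logs(loop)                      # runs before shutdown
loop.close()
```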
### CoPilot executor lifecycle
Worker threads connect Prisma on startup and disconnect on cleanup (even
on failure). If `db.connect()` fails during `@func_retry`, the event
loop is stopped and joined before re-raising so no loop is leaked across
retry attempts.
### Schema
```prisma
model PlatformCostLog {
id String @id @default(uuid())
createdAt DateTime @default(now())
userId String?
graphExecId String?
nodeExecId String?
blockName String
provider String
trackingType String
costMicrodollars BigInt @default(0)
inputTokens Int?
outputTokens Int?
duration Float?
model String?
}
```
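The `costMicrodollars BigInt` column implies costs are stored as integer microdollars rather than floats; a conversion sketch (helper names are hypothetical, and the rounding choice is an assumption, not taken from the PR):

```python
MICRO = 1_000_000  # 1 USD = 1,000,000 microdollars

def usd_to_microdollars(usd: float) -> int:
    # Round to the nearest microdollar so the BigInt column stores an
    # exact integer instead of a drifting float.
    return round(usd * MICRO)

def microdollars_to_usd(micro: int) -> float:
    return micro / MICRO
```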
### Admin dashboard
React page with three tabs (By Provider / By User / Raw Logs) driven by
two generated Orval hooks (`useGetV2GetPlatformCostDashboard`,
`useGetV2GetPlatformCostLogs`). Filters are URL-based (`searchParams`)
for bookmarkability. Pagination for raw logs. Per-provider estimated
totals using configurable cost-per-unit multipliers.
## Test plan
- [x] Migration applies cleanly
- [x] Block execution with system credentials creates PlatformCostLog
row
- [x] Copilot conversation records cost log with tokens + model
- [x] `/admin/platform-costs` dashboard renders with correct data
- [x] Date/provider/user filters work correctly
- [x] Non-admin users get 403 on cost endpoints
- [x] Executor drain completes before process exit (no lost logs)
---------
Co-authored-by: Zamil Majdy <majdyz@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
c51097d8ac
dx(orchestrate): harden agent fleet scripts — idle detection, pagination, fake-resolution guard, parallelism (#12704)
### Why / What / How

**Why:** A series of production failures exposed gaps in the agent fleet tooling:
1. Agents using `_wait_idle`/`wait_for_claude_idle` would time out waiting for `❯` while a settings-error dialog blocked progress — because the dialog can appear above the last 3 captured lines.
2. The run-loop's adaptive backoff used `POLL_CURRENT * 3 / 2`, which stalls at 1 forever in bash integer arithmetic, and printed the interval *before* recomputing it.
3. `pr-address` agents were silently missing review threads when a PR had >100 threads across multiple pages — they'd stop at page 1, address 69/111 threads, and falsely report "done".
4. `resolveReviewThread` was being called without a committed fix — producing false "0 unresolved" signals that bypassed verification.
5. The onboarding bypass in `/pr-test` had no timeout on curl calls, so the step could hang forever if the backend wasn't ready yet.
6. The orchestrator's own verification query used `first: 1`, which can't reliably count unresolved threads across all pages.

**What:**
- Idle detection hardened in both `spawn-agent.sh` and `run-loop.sh` — full-pane check for 'Enter to confirm' so the dialog is never missed
- Adaptive backoff arithmetic fixed (`POLL_CURRENT + POLL_CURRENT/2 + 1` always increments); log ordering corrected; `POLL_IDLE_MAX` made env-configurable
- `pr-address/SKILL.md`: mandatory cursor-pagination loop collecting ALL thread IDs before addressing anything; prominent ⚠️ warning with the PR #12636 incident (142 threads, 2 pages, agent stopped at 69)
- `pr-address/SKILL.md`: new "Parallel thread resolution" section — batch by file, one commit per file group, concurrent reply subshells with 3s gaps, sequential resolves
- `pr-address/SKILL.md`: "Verify actual count" section now uses a paginated loop (not a single first:100 query)
- `orchestrate/SKILL.md`: verification query fixed to paginate all pages; new "Thread resolution integrity" section with anti-patterns; fake-resolution detection query; state-staleness recovery; RUNNING-count confusion explained
- `/pr-test` onboarding bypass: `--max-time 30` on curl calls; hard-fail on bypass failure

**How:** All changes are to DX skill files and orchestration scripts — no production code modified. Each fix is a separate commit so the change history is readable.

### Changes 🏗️

**Scripts:**
- `run-loop.sh`: `wait_for_claude_idle` — add 'Enter to confirm' dialog check (reset elapsed on dialog); fix backoff arithmetic stall; fix log ordering; make `POLL_IDLE_MAX` env-configurable; reset poll interval when `waiting_approval` agents present
- `spawn-agent.sh`: `_wait_idle` — capture full pane (not just `tail -3`) for 'Enter to confirm' check; wait for idle before sending the agent objective to prevent stuck pasted text

**SKILL.md files:**
- `pr-address/SKILL.md`:
  - ⚠️ WARNING + totalCount step + cursor-pagination loop before addressing any threads
  - "Parallel thread resolution" section: group by file, batch commits, concurrent replies, sequential resolves
  - "Verify actual count" section: full paginated loop instead of a single first:100 query
  - "What counts as a valid resolution" with explicit anti-patterns (Acknowledged, Accepted, no-commit resolves)
  - Rate limits table (403 secondary vs 429 primary), 2-3 min recovery
  - `git rev-parse HEAD` pattern with `${FULL_SHA:0:9}` short SHA
- `orchestrate/SKILL.md`:
  - Thread resolution integrity section + fake-resolution detection query
  - Verification query fixed to paginate all pages
  - State file staleness recovery (stale `loop_window`, closed windows, repair recipes)
  - RUNNING count confusion: explains `waiting_approval` included in regex
  - Idle check before re-briefing agents
- `pr-test/SKILL.md`:
  - `--max-time 30` on onboarding bypass curl calls
  - Hard-fail (`exit 1`) if bypass verification fails

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Verified adaptive backoff increments correctly (no longer stalls at 1)
  - [x] Verified 'Enter to confirm' dialog handled in both wait functions
  - [x] Verified pagination loop collects all thread IDs across pages
  - [x] Verified PR #12636 onboarding bypass works end-to-end (11/11 scenarios PASS)

---------
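The integer-arithmetic stall is easy to reproduce outside bash; a Python illustration of the two backoff formulas, where `//` mirrors bash's truncating `$(( ... ))` division:

```python
def old_backoff(interval: int) -> int:
    # bash: POLL_CURRENT=$(( POLL_CURRENT * 3 / 2 )) — truncating division
    return interval * 3 // 2

def new_backoff(interval: int) -> int:
    # bash: POLL_CURRENT=$(( POLL_CURRENT + POLL_CURRENT / 2 + 1 ))
    return interval + interval // 2 + 1

# 1 * 3 // 2 == 1, so the old formula never leaves 1.
assert old_backoff(1) == 1
# The +1 term guarantees strict growth from any starting value.
assert new_backoff(1) == 2
```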
Co-authored-by: Zamil Majdy <majdy.zamil@gmail.com>
f3306d9211
Merge branch 'master' of github.com:Significant-Gravitas/AutoGPT into dev
f5e2eccda7
dx(orchestrate): fix stale-review gate and add pr-test evaluation rules to SKILL.md (#12701)
## Changes

### verify-complete.sh
- CHANGES_REQUESTED reviews are now compared against the latest commit timestamp. If the review was submitted **before** the latest commit, it is treated as stale and does not block verification.
- Added fail-closed guard: if the `gh pr view` fetch fails, the script exits 1 (rather than treating missing data as "no blocking reviews")
- Fixed edge case: a `CHANGES_REQUESTED` review with a null `submittedAt` is now counted as fresh/blocking (previously silently skipped)
- Combined two separate `gh pr view` calls into one (`--json commits,reviews`) to reduce API calls and ensure consistency

### SKILL.md (orchestrate skill)
- Added `### /pr-test result evaluation` section with an explicit pass/partial/fail handling table
- **PARTIAL on any headline feature scenario = immediate blocker**: re-brief the agent, fix, and re-run from scratch. Never approve or output ORCHESTRATOR:DONE with a PARTIAL headline result.
- Concrete incident callout: PR #12699 S5 (Apply suggestions) was PARTIAL — the AI never output JSON action blocks — but was nearly approved. This rule prevents recurrence.
- Updated the `verify-complete.sh` description throughout to include "no fresh CHANGES_REQUESTED"
- Added staleness rule documentation: a review only blocks if submitted *after* the latest commit

## Why
Two separate incidents prompted these changes:
1. **verify-complete.sh false positive**: An automated bot (autogpt-pr-reviewer) submitted a `CHANGES_REQUESTED` review in April. An agent then pushed fixing commits. The old script still blocked on the stale review, preventing the PR from being verified as done.
2. **Missed PARTIAL signal**: PR #12699 had a PARTIAL result on its headline scenario (S5 Apply button) because the AI emitted direct builder tool calls instead of JSON action blocks. The orchestrator nearly approved it. The new SKILL.md rule makes PARTIAL = blocker explicit.

## Checklist
- [x] I have read the contribution guide
- [x] My changes follow the code style of this project
- [x] Changes are limited to the scope of this PR (< 20% unrelated changes)
- [x] All new and existing tests pass
58b230ff5a
dx: add /orchestrate skill — Claude Code agent fleet supervisor with spare worktree lifecycle (#12691)
### Why
When running multiple Claude Code agents in parallel worktrees, they
frequently get stuck: an agent exits and sits at a shell prompt, freezes
mid-task, or waits on an approval prompt with no human watching. Fixing
this currently requires manually checking each tmux window.
### What
Adds a `/orchestrate` skill — a meta-agent supervisor that manages a
fleet of Claude Code agents across tmux windows and spare worktrees. It
auto-discovers available worktrees, spawns agents, monitors them, kicks
idle/stuck ones, auto-approves safe confirmations, and recycles
worktrees on completion.
### How to use
**Prerequisites:**
- One tmux session already running (the skill adds windows to it; it
does not create a new session)
- Spare worktrees on `spare/N` branches (e.g. `AutoGPT3` on `spare/3`,
`AutoGPT7` on `spare/7`)
**Basic workflow:**
```
/orchestrate capacity → see how many spare worktrees are free
/orchestrate start → enter task list, agents spawn automatically
/orchestrate status → check what's running
/orchestrate add → add one more task to the next free worktree
/orchestrate stop → mark inactive (agents finish current work)
/orchestrate poll → one manual poll cycle (debug / on-demand)
```
**Worktree lifecycle:**
```text
spare/N branch → /orchestrate add → new window + feat/branch + claude running
↓
ORCHESTRATOR:DONE
↓
kill window + git checkout spare/N
↓
spare/N (free again)
```
Windows are always capped by worktree count — no creep.
### Changes
- `.claude/skills/orchestrate/SKILL.md` — skill definition with 5
subcommands, state file schema, spawn/recycle helpers, approval policy
- `.claude/skills/orchestrate/scripts/classify-pane.sh` — pane state
classifier: `idle` (shell foreground), `running` (non-shell),
`waiting_approval` (pattern match), `complete` (ORCHESTRATOR:DONE)
- `.claude/skills/orchestrate/scripts/poll-cycle.sh` — poll loop:
reads/updates state file atomically, outputs JSON action list, stuck
detection via output-hash sampling
**State detection:**
| State | Detection method |
|---|---|
| `idle` | `pane_current_command` is a shell (zsh/bash/fish) |
| `running` | `pane_current_command` is non-shell (claude/node) |
| `stuck` | pane hash unchanged for N consecutive polls |
| `waiting_approval` | pattern match on last 40 lines of pane output |
| `complete` | `ORCHESTRATOR:DONE` string present in pane output |
**Safety policy for auto-approvals:** git ops, package installs, tests,
docker compose → approve. `rm -rf` outside worktree, force push, `sudo`,
secrets → escalate to user.
State file lives at `~/.claude/orchestrator-state.json` (outside repo,
never committed).
### Checklist
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `classify-pane.sh`: idle shell → `idle`, running process →
`running`, `ORCHESTRATOR:DONE` → `complete`, approval prompt →
`waiting_approval`, nonexistent window → `error`
- [x] `poll-cycle.sh`: inactive state → `[]`, empty agents array → `[]`,
spare worktree discovery, stuck detection (3-poll hash cycle)
- [x] Real agent spawn in `autogpt1` tmux session — agent ran, output
`ORCHESTRATOR:DONE`, recycle verified
- [x] Upfront JSON validation before `set -e`-guarded jq reads
- [x] Idle timer reset only on `idle → running` transition (not stuck),
preventing false stuck-detections
- [x] Classify fallback only triggers when output is empty (no
double-JSON on classify exit 1)
67bdef13e7
feat(platform): load copilot messages from newest first with cursor-based pagination (#12328)
Copilot chat sessions with long histories loaded all messages at once, causing slow initial loads. This PR adds cursor-based pagination so only the most recent messages load initially, with older messages fetched on demand as the user scrolls up.

### Changes 🏗️

**Backend:**
- Cursor-based pagination on `GET /sessions/{session_id}` (`limit`, `before_sequence` params)
- `user_id` relation filter on the paginated query — ownership check and message fetch now run in parallel
- Backward boundary expansion to keep tool-call / assistant message pairs intact at page edges
- Unit tests for paginated queries

**Frontend:**
- `useLoadMoreMessages` hook + `LoadMoreSentinel` (IntersectionObserver) for infinite scroll upward
- `ScrollPreserver` to maintain scroll position when older messages are prepended
- Session-keyed `Conversation` remount with one-frame opacity hide to eliminate scroll flash on switch
- Scrollbar moved to the correct scroll container; loading spinner no longer causes overflow

### Checklist 📋
- [x] Pagination: only recent messages load initially; older pages load on scroll-up
- [x] Scroll position preserved on prepend; no flash on session switch
- [x] Tool-call boundary pairs stay intact across page edges
- [x] Stream reconnection still works on initial load

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
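The `before_sequence` cursor scheme can be sketched in miniature (in-memory storage; everything except the `limit`/`before_sequence` parameter names is illustrative, not the actual backend query):

```python
def get_messages_page(messages, limit=20, before_sequence=None):
    """Return the newest `limit` messages older than the cursor, plus the
    cursor to pass when fetching the next (older) page on scroll-up."""
    window = [m for m in messages
              if before_sequence is None or m["sequence"] < before_sequence]
    window.sort(key=lambda m: m["sequence"])
    page = window[-limit:]                                 # newest slice
    next_cursor = page[0]["sequence"] if page else None    # oldest in page
    return page, next_cursor

history = [{"sequence": i, "text": f"msg {i}"} for i in range(1, 51)]
page1, cursor = get_messages_page(history, limit=10)           # 41..50
page2, _ = get_messages_page(history, limit=10,
                             before_sequence=cursor)           # 31..40
```

Using a sequence-number cursor instead of offset pagination keeps pages stable while new messages are appended at the tail.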
e67dd93ee8
refactor(frontend): remove stale feature flags and stabilize share execution (#12697)
## Why
Stale feature flags add noise to the codebase and make it harder to understand which flags are actually gating live features. Four flags were defined but never referenced anywhere in the frontend, and the "Share Execution Results" flag has been stable long enough to remove its gate.

## What
- Remove 4 unused flags from the `Flag` enum and `defaultFlags`: `NEW_BLOCK_MENU`, `GRAPH_SEARCH`, `ENABLE_ENHANCED_OUTPUT_HANDLING`, `AGENT_FAVORITING`
- Remove the `SHARE_EXECUTION_RESULTS` flag and its conditional — the `ShareRunButton` now always renders

## How
- Deleted enum entries and default values in `use-get-flag.ts`
- Removed the `useGetFlag` call and conditional wrapper around `<ShareRunButton />` in `SelectedRunActions.tsx`

## Changes
- `src/services/feature-flags/use-get-flag.ts` — removed 5 flags from enum + defaults
- `src/app/(platform)/library/.../SelectedRunActions.tsx` — removed flag import and condition; share button always renders

### Checklist
- [x] My PR is small and focused on one change
- [x] I've tested my changes locally
- [x] `pnpm format && pnpm lint` pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3140a60816
fix(frontend/builder): allow horizontal scroll for JSON output data (#12638)
Requested by @Abhi1992002

## Why
JSON output data in the "Complete Output Data" dialog and node output panel gets clipped — text overflows and is hidden with no way to scroll right. Reported by Zamil in #frontend.

## What
The `ContentRenderer` wrapper divs used `overflow-hidden`, which prevented the `JSONRenderer`'s `overflow-x-auto` from working. Changed both wrapper divs from `overflow-hidden` to `overflow-x-auto`.

```diff
- overflow-hidden [&>*]:rounded-xlarge [&>*]:!text-xs [&_pre]:whitespace-pre-wrap [&_pre]:break-words
+ overflow-x-auto [&>*]:rounded-xlarge [&>*]:!text-xs [&_pre]:whitespace-pre-wrap [&_pre]:break-words

- overflow-hidden [&>*]:rounded-xlarge [&>*]:!text-xs
+ overflow-x-auto [&>*]:rounded-xlarge [&>*]:!text-xs
```

## Scope
- 1 file changed (`ContentRenderer.tsx`)
- 2 lines: `overflow-hidden` → `overflow-x-auto`
- CSS only, no logic changes

Resolves SECRT-2206

Co-authored-by: Abhimanyu Yadav <122007096+Abhi1992002@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
41c2ee9f83
feat(platform): add copilot artifact preview panel (#12629)
### Why / What / How
Copilot artifacts were not previewing reliably: PDFs downloaded instead
of rendering, Python code could still render like markdown, JSX/TSX
artifacts were brittle, HTML dashboards/charts could fail to execute,
and users had to manually open artifact panes after generation. The pane
also got stuck at maximized width when trying to drag it smaller.
This PR adds a dedicated copilot artifact panel and preview pipeline
across the backend/frontend boundary. It preserves artifact metadata
needed for classification, adds extension-first preview routing,
introduces dedicated preview/rendering paths for HTML/CSV/code/PDF/React
artifacts, auto-opens new or edited assistant artifacts, and fixes the
maximized-pane resize path so dragging exits maximized mode immediately.
### Changes 🏗️
- add artifact card and artifact panel UI in copilot, including
persisted panel state and resize/maximize/minimize behavior
- add shared artifact extraction/classification helpers and auto-open
behavior for new or edited assistant messages with artifacts
- add preview/rendering support for HTML, CSV, PDF, code, and React
artifact files
- fix code artifacts such as Python to render through the code renderer
with a dark code surface instead of markdown-style output
- improve JSX/TSX preview behavior with provider wrapping, fallback
export selection, and explicit runtime error surfaces
- allow script execution inside HTML previews so embedded chart
dashboards can render
- update workspace artifact/backend API handling and regenerate the
frontend OpenAPI client
- add regression coverage for artifact helpers, React preview runtime,
auto-open behavior, code rendering, and panel store behavior
- post-review hardening: correct download path for cross-origin URLs,
defer scroll restore until content mounts, gate auto-open behind the
ARTIFACTS flag, parse CSVs with RFC 4180-compliant quoted newlines + BOM
handling, distinguish 413 vs 409 on upload, normalize empty session_id,
and keep AnimatePresence mounted so the panel exit animation plays
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] `pnpm format`
- [x] `pnpm lint`
- [x] `pnpm types`
- [x] `pnpm test:unit`
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Adds a new Copilot artifact preview surface that executes
user/AI-generated HTML/React in sandboxed iframes and changes workspace
file upload/listing behavior, so regressions could affect file handling
and client security assumptions despite sandboxing safeguards.
>
> **Overview**
> Adds an **Artifacts** feature (flagged by `Flag.ARTIFACTS`) to
Copilot: workspace file links/attachments now render as `ArtifactCard`s
and can open a new resizable/minimizable `ArtifactPanel` with history,
auto-open behavior, copy/download actions, and persisted panel width.
>
> Introduces a richer artifact preview pipeline with type classification
and dedicated renderers for **HTML**, **CSV**, **PDF**, **code
(Shiki-highlighted)**, and **React/TSX** (transpiled and executed in a
sandboxed iframe), plus safer download filename handling and content
caching/scroll restore.
>
> Extends the workspace backend API by adding `GET /workspace/files`
pagination, standardizing operation IDs in OpenAPI, attaching
`metadata.origin` on uploads/agent-created files, normalizing empty
`session_id`, improving upload error mapping (409 vs 413), and hardening
post-quota soft-delete error handling; updates and expands test coverage
accordingly.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
ca748ee12a
feat(frontend): refine AutoPilot onboarding — branding, auto-advance, soft cap, polish (#12686)
### Why / What / How
**Why:** The onboarding flow had inconsistent branding ("Autopilot" vs
"AutoPilot"), a heavy progress bar that dominated the header, an extra
click on the role screen, and no guidance on how many pain points to
select — leading to users selecting everything or nothing useful.
**What:** Copy & brand fixes, UX improvements (auto-advance, soft cap),
and visual polish (progress bar, checkmark badges, purple focus inputs).
**How:**
- Replaced all "Autopilot" with "AutoPilot" (capital P) across screens
1-3
- Removed the `?` tooltip on screen 1 (users will learn about AutoPilot
from the access email)
- Changed name label to conversational "What should I call you?"
- Screen 2: auto-advances 350ms after role selection (except "Other"
which still shows input + button)
- Screen 3: soft cap of 3 selections with green confirmation text and
shake animation on overflow attempt
- Thinned progress bar from ~10px to 3px (Linear/Notion style)
- Added purple checkmark badges on selected cards
- Updated Input atom focus state to purple ring
### Changes 🏗️
- **WelcomeStep**: "AutoPilot" branding, removed tooltip, conversational
label
- **RoleStep**: Updated subtitle, auto-advance on non-"Other" role
select, Continue button only for "Other"
- **PainPointsStep**: Soft cap of 3 with dynamic helper text and shake
animation
- **usePainPointsStep**: Added `atLimit`/`shaking` state, wrapped
`togglePainPoint` with cap logic
- **store.ts**: `togglePainPoint` returns early when at 3 and adding
- **ProgressBar**: 3px height, removed glow shadow
- **SelectableCard**: Added purple checkmark badge on selected state
- **Input atom**: Focus ring changed from zinc to purple
- **tailwind.config.ts**: Added `shake` keyframe and `animate-shake`
utility
### Checklist 📋
#### For code changes:
- [ ] I have clearly listed my changes in the PR description
- [ ] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Navigate through full onboarding flow (screens 1→2→3→4)
- [ ] Verify "AutoPilot" branding on all screens (no "Autopilot")
- [ ] Verify screen 2 auto-advances after tapping a role (non-"Other")
- [ ] Verify "Other" role still shows text input and Continue button
- [ ] Verify Back button works correctly from screen 2 and 3
- [ ] Select 3 pain points and verify green "3 selected" text
- [ ] Attempt 4th selection and verify shake animation + swap message
- [ ] Deselect one and verify can select a different one
- [ ] Verify checkmark badges appear on selected cards
- [ ] Verify progress bar is thin (3px) and subtle
- [ ] Verify input focus state is purple across onboarding inputs
- [ ] Verify "Something else" + other text input still works on screen 3
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
243b12778f
dx: improve pr-test skill — inline screenshots, flow captions, and test evaluation (#12692)
## Changes

### 1. Inline image enforcement (Step 7)
- Added `CRITICAL` warning: never post a bare directory tree link
- Added post-comment verification block that greps for `` inline
- Screenshot captions were too vague to be useful ("shows the page")
- No mechanism to catch incomplete test runs — agent could skip scenarios and still post a passing report

## Checklist
- [x] `.claude/skills/pr-test/SKILL.md` updated
- [x] No production code changes — skill/dx only
- [x] Pre-commit hooks pass
43c81910ae
fix(backend/copilot): skip AI blocks without model property in fix_ai_model_parameter (#12688)
### Why / What / How

**Why:** Some AI-category blocks do not expose a `"model"` input property in their `inputSchema`. The `fix_ai_model_parameter` fixer was unconditionally injecting a default model value (e.g. `"gpt-4o"`) into any node whose block has category `"AI"`, regardless of whether that block actually accepts a `model` input. This causes the agent JSON to include an invalid field for those blocks.

**What:** Guard the model-injection logic with a check that `"model"` exists in the block's `inputSchema.properties` before attempting to set or validate the field. AI blocks that have no model selector are now skipped entirely.

**How:** In `fix_ai_model_parameter`, after confirming `is_ai_block`, extract `input_properties` from the block's `inputSchema.properties` and `continue` if `"model"` is absent. The subsequent `model_schema` lookup is also simplified to reuse the already-fetched `input_properties` dict. A regression test is added to cover this case.

### Changes 🏗️
- `backend/copilot/tools/agent_generator/fixer.py`: In `fix_ai_model_parameter`, skip AI-category nodes whose block `inputSchema.properties` does not contain a `"model"` key; reuse `input_properties` for the subsequent `model_schema` lookup.
- `backend/copilot/tools/agent_generator/fixer_test.py`: Add `test_ai_block_without_model_property_is_skipped` to `TestFixAiModelParameter`.

### Checklist 📋
#### For code changes:
- [ ] I have clearly listed my changes in the PR description
- [ ] I have made a test plan
- [ ] I have tested my changes according to the test plan:
  - [ ] Run `poetry run pytest backend/copilot/tools/agent_generator/fixer_test.py` — all 50 tests pass (49 pre-existing + 1 new)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
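A sketch of the guard (node/block shapes, the default-model fallback, and the function signature are illustrative of the described fix, not the actual fixer code):

```python
DEFAULT_MODEL = "gpt-4o"

def fix_ai_model_parameter(nodes, blocks):
    for node in nodes:
        block = blocks[node["block_id"]]
        if "AI" not in block.get("categories", []):
            continue  # not an AI block
        input_properties = block.get("inputSchema", {}).get("properties", {})
        if "model" not in input_properties:
            continue  # AI block with no model selector: skip entirely
        # Reuse the already-fetched dict for the model_schema lookup.
        model_schema = input_properties["model"]
        default = model_schema.get("default", DEFAULT_MODEL)
        node.setdefault("input_default", {}).setdefault("model", default)
    return nodes
```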
a11199aa67
dx(frontend): set up React integration testing with Vitest + RTL + MSW (#12667)
## Summary
- Establish React integration tests (Vitest + RTL + MSW) as the primary
frontend testing strategy (~90% of tests)
- Update all contributor documentation (TESTING.md, CONTRIBUTING.md,
AGENTS.md) to reflect the integration-first convention
- Add `NuqsTestingAdapter` and `TooltipProvider` to the shared test
wrapper so page-level tests work out of the box
- Write 8 integration tests for the library page as a reference example
for the pattern
## Why
We had the testing infrastructure (Vitest, RTL, MSW, Orval-generated
handlers) but no established convention for page-level integration
tests. Most existing tests were for stores or small components. Since
our frontend is client-first, we need a documented, repeatable pattern
for testing full pages with mocked APIs.
## What
- **Docs**: Rewrote `TESTING.md` as a comprehensive guide. Updated
testing sections in `CONTRIBUTING.md`, `frontend/AGENTS.md`,
`platform/AGENTS.md`, and `autogpt_platform/AGENTS.md`
- **Test infra**: Added `NuqsTestingAdapter` (for `nuqs` query state
hooks) and `TooltipProvider` (for Radix tooltips) to `test-utils.tsx`
- **Reference tests**: `library/__tests__/main.test.tsx` with 8 tests
covering agent rendering, tabs, folders, search bar, and Jump Back In
## How
- Convention: tests live in `__tests__/` next to `page.tsx`, named
descriptively (`main.test.tsx`, `search.test.tsx`)
- Pattern: `setupHandlers()` → `render(<Page />)` → `findBy*` assertions
- MSW handlers from
`@/app/api/__generated__/endpoints/{tag}/{tag}.msw.ts` for API mocking
- Custom `render()` from `@/tests/integrations/test-utils` wraps all
required providers
## Test plan
- [x] All 422 unit/integration tests pass (`pnpm test:unit`)
- [x] `pnpm format` clean
- [x] `pnpm lint` clean (no new errors)
- [x] `pnpm types` — pre-existing onboarding type errors only, no new
errors
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
Co-authored-by: Reinier van der Leer <pwuts@agpt.co>
5f82a71d5f
feat(copilot): add Fast/Thinking mode toggle with full tool parity (#12623)
### Why / What / How
Users need a way to choose between fast, cheap responses (Sonnet) and deep reasoning (Opus) in the copilot. Previously only the SDK/Opus path existed, and the baseline path was a degraded fallback with no tool calling, no file attachments, no E2B sandbox, and no permission enforcement. This PR adds a copilot mode toggle and brings the baseline (fast) path to full feature parity with the SDK (extended thinking) path.

### Changes 🏗️

#### 1. Mode toggle (UI → full stack)
- Add Fast / Thinking mode toggle to ChatInput footer (Phosphor `Brain`/`Zap` icons via lucide-react)
- Thread `mode: "fast" | "extended_thinking" | null` from `StreamChatRequest` → RabbitMQ queue → executor → service selection
- Fast → baseline service (Sonnet 4 via OpenRouter), Thinking → SDK service (Opus 4.6)
- Toggle gated behind `CHAT_MODE_OPTION` feature flag with server-side enforcement
- Mode persists in localStorage with SSR-safe init

#### 2. Baseline service full tool parity
- **Tool call persistence**: Store structured `ChatMessage` entries (assistant + tool results) instead of flat concatenated text — enables the frontend to render tool call details and maintain context across turns
- **E2B sandbox**: Wire up `get_or_create_sandbox()` so `bash_exec` routes to E2B (image download, Python/PIL compression, filesystem access)
- **File attachments**: Accept `file_ids`, download workspace files, embed images as OpenAI vision blocks, save non-images to working dir
- **Permissions**: Filter tool list via `CopilotPermissions` (whitelist/blacklist)
- **URL context**: Pass `context` dict to user message for URL-shared content
- **Execution context**: Pass `sandbox`, `sdk_cwd`, `permissions` to `set_execution_context()`
- **Model**: Changed `fast_model` from `google/gemini-2.5-flash` to `anthropic/claude-sonnet-4` for reliable function calling
- **Temp dir cleanup**: Lazy `mkdtemp` (only when files attached) + `shutil.rmtree` in finally

#### 3. Transcript support for Fast mode
- Baseline service now downloads / validates / loads / appends / uploads transcripts (parity with SDK)
- Enables seamless mode switching mid-conversation via shared transcript
- Upload shielded from cancellation, bounded at 5s timeout

#### 4. Feature-flag infrastructure fixes
- `FORCE_FLAG_*` env-var overrides on both backend and frontend for local dev / E2E
- LaunchDarkly context parity (frontend mirrors backend user context)
- `CHAT_MODE_OPTION` default flipped to `false` to match backend

#### 5. Other hardening
- Double-submit ref guard in `useChatInput` + reconnect dedup in `useCopilotStream`
- `copilotModeRef` pattern to read latest mode without recreating transport
- Shared `CopilotMode` type across frontend files
- File name collision handling with numeric suffix
- Path sanitization in file description hints (`os.path.basename`)

### Test plan
- [x] 30 new unit tests: `_env_flag_override` (12), `envFlagOverride` (8), `_filter_tools_by_permissions` (4), `_prepare_baseline_attachments` (6)
- [x] E2E tested on dev: fast mode creates E2B sandbox, calls 7-10 tools, generates and renders images
- [x] Mode switching mid-session works (shared transcript + session messages)
- [x] Server-side flag gate enforced (crafted `mode=fast` stripped when flag off)
- [x] All 37 CI checks green
- [x] Verified via agent-browser: workspace images render correctly in all message positions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Zamil Majdy <majdy.zamil@gmail.com>
||
|
|
1a305db162 |
ci(frontend): add Playwright E2E coverage reporting to Codecov (#12665)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
48a653dc63 |
fix(copilot): prevent duplicate side effects from double-submit and stale-cache race (#12660)
## Why
#12604 (intermediate persistence) introduced three bugs on dev:
1. **Duplicate user messages** — `set_turn_duration` calls `invalidate_session_cache()`, which deletes the Redis key. Concurrent `get_chat_session()` calls re-populate it from the DB with stale data. The executor loads this stale cache, misses the user message, and re-appends it.
2. **Tool outputs lost on hydration** — Intermediate flushes save assistant messages to the DB before `StreamToolInputAvailable` sets `tool_calls` on them. Since `_save_session_to_db` is append-only (uses `start_sequence`), the `tool_calls` update is lost — subsequent flushes start past that index. On page refresh / SSE reconnect, tool UIs (SetupRequirementsCard, run_block output, etc.) are invisible.
3. **Sessions stuck running** — If a tool call hangs (e.g. a WebSearch provider not responding), the stream never completes, `mark_session_completed` never runs, and the `active_stream` flag stays stale in Redis.
## What
- **In-place cache update** in `set_turn_duration` — replaces `invalidate_session_cache()` with a read-modify-write that patches the duration on the cached session, eliminating the stale-cache repopulation window
- **tool_calls backfill** — tracks the flush watermark and assistant message index; when `StreamToolInputAvailable` sets `tool_calls` on an already-flushed assistant message, updates the DB record directly via `update_message_tool_calls()`
- **Improved message dedup** — `is_message_duplicate()` / `maybe_append_user_message()` scan trailing same-role messages (current turn) instead of only checking `messages[-1]`
- **Idle timeout** — aborts the stream with a retryable error if no meaningful SDK message arrives for 10 minutes, preventing hung tool calls from leaving sessions stuck
## Changes
- `copilot/db.py` — `update_message_tool_calls()`, in-place cache update in `set_turn_duration`
- `copilot/model.py` — `is_message_duplicate()`, `maybe_append_user_message()`
- `copilot/sdk/service.py` — flush watermark tracking, tool_calls backfill, idle timeout
- `copilot/baseline/service.py` — use `maybe_append_user_message()`
- `copilot/model_test.py` — unit tests for dedup
- `copilot/db_test.py` — unit tests for set_turn_duration cache update
## Checklist
- [x] My PR title follows [conventional commit](https://www.conventionalcommits.org/) format
- [x] Out-of-scope changes are less than 20% of the PR
- [x] Changes to `data/*.py` validated for user ID checks (N/A)
- [x] Protected routes updated in middleware (N/A) |
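The trailing-scan dedup described above can be sketched as follows. The helper names `is_message_duplicate` / `maybe_append_user_message` come from the PR, but the `ChatMessage` shape and the exact scan logic here are illustrative assumptions, not the production implementation:

```python
from dataclasses import dataclass


@dataclass
class ChatMessage:
    role: str
    content: str


def is_message_duplicate(messages: list[ChatMessage], candidate: ChatMessage) -> bool:
    # Walk backwards through the trailing messages of the current turn.
    # Assistant/tool messages may have been appended after the user message
    # we need to compare against, so checking only messages[-1] is not enough.
    for msg in reversed(messages):
        if msg.role != candidate.role:
            continue  # skip trailing assistant/tool output from this turn
        # First same-role message found scanning backwards: compare and stop.
        return msg.content == candidate.content
    return False


def maybe_append_user_message(messages: list[ChatMessage], text: str) -> None:
    # Append only if the current turn doesn't already contain this message,
    # which guards against the stale-cache re-append described in the PR.
    candidate = ChatMessage(role="user", content=text)
    if not is_message_duplicate(messages, candidate):
        messages.append(candidate)
```

For example, after `[user: "hi", assistant: "hello"]`, re-appending `"hi"` is a no-op, while a genuinely new user message is appended.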
||
|
|
f6ddcbc6cb |
feat(platform): Add all 12 Z.ai GLM models via OpenRouter (#12672)
## Summary
Add Z.ai (Zhipu AI) GLM model family to the platform LLM blocks, routed through OpenRouter. This enables users to select any of the 12 Z.ai models across all LLM-powered blocks (AI Text Generator, AI Conversation, AI Structured Response, AI Text Summarizer, AI List Generator).
## Gap Analysis
All 12 Z.ai models currently available on OpenRouter's API were missing from the AutoGPT platform:

| Model | Context Window | Max Output | Price Tier | Cost |
|-------|----------------|------------|------------|------|
| GLM 4 32B | 128K | N/A | Tier 1 | 1 |
| GLM 4.5 | 131K | 98K | Tier 2 | 2 |
| GLM 4.5 Air | 131K | 98K | Tier 1 | 1 |
| GLM 4.5 Air (Free) | 131K | 96K | Tier 1 | 1 |
| GLM 4.5V (vision) | 65K | 16K | Tier 2 | 2 |
| GLM 4.6 | 204K | 204K | Tier 1 | 1 |
| GLM 4.6V (vision) | 131K | 131K | Tier 1 | 1 |
| GLM 4.7 | 202K | 65K | Tier 1 | 1 |
| GLM 4.7 Flash | 202K | N/A | Tier 1 | 1 |
| GLM 5 | 80K | 131K | Tier 2 | 2 |
| GLM 5 Turbo | 202K | 131K | Tier 3 | 4 |
| GLM 5V Turbo (vision) | 202K | 131K | Tier 3 | 4 |

## Changes
- **`autogpt_platform/backend/backend/blocks/llm.py`**: Added 12 `LlmModel` enum entries and corresponding `MODEL_METADATA` with context windows, max output tokens, display names, and price tiers sourced from the OpenRouter API
- **`autogpt_platform/backend/backend/data/block_cost_config.py`**: Added `MODEL_COST` entries for all 12 models, with costs scaled to match pricing (1 for budget, 2 for mid-range, 4 for premium)
## How it works
All Z.ai models route through the existing OpenRouter provider (`open_router`) — no new provider or API client code needed. Users with an OpenRouter API key can immediately select any Z.ai model from the model dropdown in any LLM block.
## Related
- Linear: REQ-83
---------
Co-authored-by: AutoGPT CoPilot <copilot@agpt.co> |
||
|
|
98f13a6e5d |
feat(copilot): add create -> dry-run -> fix loop to agent generation (#12578)
## Summary
- Instructs the copilot LLM to automatically dry-run agents after
creating or editing them, inspect the output for wiring/data-flow
issues, and fix iteratively before presenting the agent as ready to the
user
- Updates tool descriptions (run_agent, get_agent_building_guide),
prompting supplement, and agent generation guide with clear workflow
instructions and error pattern guidance
- Adds Tool Discovery Priority to shared tool notes (find_block ->
run_mcp_tool -> SendAuthenticatedWebRequestBlock -> manual API)
- Adds 37 tests: prompt regression tests + functional tests (tool schema
validation, Pydantic model, guide workflow ordering)
- **Frontend**: Fixes host-scoped credential UX — replaces duplicate
credentials for the same host instead of stacking them, wires up delete
functionality with confirmation modal, updates button text contextually
("Update headers" vs "Add headers")
## Test plan
- [x] All 37 `dry_run_loop_test.py` tests pass (prompt content, tool
schemas, Pydantic model, guide ordering)
- [x] Existing `tool_schema_test.py` passes (110 tests including
character budget gate)
- [x] Ruff lint and format pass
- [x] Pyright type checking passes
- [x] Frontend: `pnpm lint`, `pnpm types` pass
- [x] Manual verification: confirm copilot follows the create -> dry-run
-> fix workflow when asked to build an agent
- [x] Manual verification: confirm host-scoped credentials replace
instead of duplicate
|
||
|
|
613978a611 |
ci: add gitleaks secret scanning to pre-commit hooks (#12649)
### Why / What / How
**Why:** We had no local pre-commit protection against accidentally committing secrets. The existing `detect-secrets` hook only ran on `pre-push`, which is too late — secrets are already in git history by that point. GitHub's push protection only covers known provider patterns and runs server-side.
**What:** Adds a 3-layer defense against secret leaks: local pre-commit hooks (gitleaks + detect-secrets), and a CI workflow as a safety net.
**How:**
- Moved `detect-secrets` from `pre-push` to `pre-commit` stage
- Added `gitleaks` as a second pre-commit hook (Go binary, faster and more comprehensive rule set)
- Added `.gitleaks.toml` config with allowlists for known false positives (test fixtures, dev docker JWTs, Firebase public keys, lock files, docs examples)
- Added `repo-secret-scan.yml` CI workflow using `gitleaks-action` on PRs/pushes to master/dev
### Changes 🏗️
- `.pre-commit-config.yaml`: Moved `detect-secrets` to pre-commit stage, added baseline arg, added `gitleaks` hook
- `.gitleaks.toml`: New config with tuned allowlists for this repo's false positives
- `.secrets.baseline`: Empty baseline for detect-secrets to track known findings
- `.github/workflows/repo-secret-scan.yml`: New CI workflow running gitleaks on every PR and push
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Ran `gitleaks detect --no-git` against the full repo — only `.env` files (gitignored) remain as findings
  - [x] Verified gitleaks catches a test secret file correctly
  - [x] Pre-commit hooks pass on commit (both detect-secrets and gitleaks passed)
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my changes
- [x] I have included a list of my configuration changes in the PR description (under **Changes**) |
||
|
|
2b0e8a5a9f |
feat(platform): add rate-limit tiering system for CoPilot (#12581)
## Summary
- Adds a four-tier subscription system (FREE/PRO/BUSINESS/ENTERPRISE) for CoPilot with configurable multipliers (1x/5x/20x/60x) applied on top of the base LaunchDarkly/config limits
- Stores user tier in the database (`User.subscriptionTier` column as a Prisma enum, defaults to PRO for beta testing) with admin API endpoints for tier management
- Includes tier info in usage status responses and OTEL/Langfuse trace metadata for observability
## Tier Structure

| Tier | Multiplier | Daily Tokens | Weekly Tokens | Notes |
|------|-----------|--------------|---------------|-------|
| FREE | 1x | 2.5M | 12.5M | Base tier (unused during beta) |
| PRO | 5x | 12.5M | 62.5M | Default on sign-up (beta) |
| BUSINESS | 20x | 50M | 250M | Manual upgrade for select users |
| ENTERPRISE | 60x | 150M | 750M | Highest tier, custom |

## Changes
- **`rate_limit.py`**: `SubscriptionTier` enum (FREE/PRO/BUSINESS/ENTERPRISE), `TIER_MULTIPLIERS`, `get_user_tier()`, `set_user_tier()`, update `get_global_rate_limits()` to apply tier multiplier and return 3-tuple, add `tier` field to `CoPilotUsageStatus`
- **`rate_limit_admin_routes.py`**: Add `GET/POST /admin/rate_limit/tier` endpoints, include `tier` in `UserRateLimitResponse`
- **`routes.py`** (chat): Include tier in `/usage` endpoint response
- **`sdk/service.py`**: Send `subscription_tier` in OTEL/Langfuse trace metadata
- **`schema.prisma`**: Add `SubscriptionTier` enum and `subscriptionTier` column to `User` model (default: PRO)
- **`config.py`**: Update docs to reflect tier system
- **Migration**: `20260326200000_add_rate_limit_tier` — creates enum, migrates STANDARD→PRO, adds BUSINESS, sets default to PRO
## Test plan
- [x] 72 unit tests all passing (43 rate_limit + 11 admin routes + 18 chat routes)
- [ ] Verify FREE tier users get base limits (2.5M daily, 12.5M weekly)
- [ ] Verify PRO tier users get 5x limits (12.5M daily, 62.5M weekly)
- [ ] Verify BUSINESS tier users get 20x limits (50M daily, 250M weekly)
- [ ] Verify ENTERPRISE tier users get 60x limits (150M daily, 750M weekly)
- [ ] Verify admin can read and set user tiers via API
- [ ] Verify tier info appears in Langfuse traces
- [ ] Verify migration applies cleanly (creates enum, migrates STANDARD users to PRO, adds BUSINESS, default PRO)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co> |
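The multiplier scheme above is simple enough to sketch. The enum, `TIER_MULTIPLIERS`, and the 3-tuple return match the PR's description of `rate_limit.py`; the base-limit constants and the simplified signature (real code reads base limits from LaunchDarkly/config, not module constants) are assumptions for illustration:

```python
from enum import Enum


class SubscriptionTier(str, Enum):
    FREE = "FREE"
    PRO = "PRO"
    BUSINESS = "BUSINESS"
    ENTERPRISE = "ENTERPRISE"


# Multipliers from the tier table in the PR description.
TIER_MULTIPLIERS = {
    SubscriptionTier.FREE: 1,
    SubscriptionTier.PRO: 5,
    SubscriptionTier.BUSINESS: 20,
    SubscriptionTier.ENTERPRISE: 60,
}

# FREE-tier base limits (2.5M daily / 12.5M weekly per the table).
BASE_DAILY_TOKENS = 2_500_000
BASE_WEEKLY_TOKENS = 12_500_000


def get_global_rate_limits(tier: SubscriptionTier) -> tuple[int, int, SubscriptionTier]:
    # Apply the tier multiplier to the base limits and return the
    # 3-tuple (daily, weekly, tier) described in the PR.
    m = TIER_MULTIPLIERS[tier]
    return BASE_DAILY_TOKENS * m, BASE_WEEKLY_TOKENS * m, tier
```

So a PRO user gets 12.5M daily / 62.5M weekly, and an ENTERPRISE user gets 150M / 750M, matching the table.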
||
|
|
08bb05141c |
dx: enhance pr-address skill with detailed codecov coverage guidance (#12662)
Enhanced pr-address skill codecov section with local coverage commands, priority guide, and troubleshooting steps. |
||
|
|
3ccaa5e103 |
ci(frontend): make frontend coverage checks informational (non-blocking) (#12663)
### Why / What / How
**Why:** Frontend test coverage is still ramping up. The default
component status checks (project + patch at 80%) would block merges for
insufficient coverage on frontend changes, which isn't practical yet.
**What:** Override the platform-frontend component's coverage statuses
to be `informational: true`, so they report but don't block merges.
**How:** Added explicit `statuses` to the `platform-frontend` component
in `codecov.yml` with `informational: true` on both project and patch
checks, overriding the `default_rules`.
### Changes 🏗️
- **`codecov.yml`**: Added `informational: true` to platform-frontend
component's project and patch status checks
### Checklist 📋
#### For code changes:
- [ ] I have clearly listed my changes in the PR description
- [ ] I have made a test plan
- [ ] I have tested my changes according to the test plan:
- [ ] Verify Codecov frontend status checks show as informational
(non-blocking) on PRs touching frontend code
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Low Risk**
> Low risk: Codecov configuration-only change that affects merge gating
for frontend coverage statuses but does not alter runtime code.
>
> **Overview**
> Updates `codecov.yml` to override the `platform-frontend` component’s
coverage `statuses` so both **project** and **patch** checks are marked
`informational: true` (non-blocking), while leaving the default
component coverage rules unchanged for other components.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
|
||
|
|
09e42041ce |
fix(frontend): AutoPilot notification follow-ups — branding, UX, persistence, and cross-tab sync (#12428)
AutoPilot (copilot) notifications had several follow-up issues after the initial implementation: old "Otto" branding, UX quirks, a service-worker crash, notification state that didn't persist or sync across tabs, a broken notification sound, and noisy Sentry alerts from SSR.
### Changes 🏗️
- **Rename "Otto" → "AutoPilot"** in all notification surfaces: browser notifications, document title badge, permission dialog copy, and notification banner copy
- **Agent Activity icon**: changed from `Bell` to `Pulse` (Phosphor) in the navbar dropdown
- **Centered dialog buttons**: the "Stay in the loop" permission dialog buttons are now centered instead of right-aligned
- **Service worker notification fix**: wrapped `new Notification()` in try-catch so it degrades gracefully in service worker / PWA contexts instead of throwing `TypeError: Illegal constructor`
- **Persist notification state**: `completedSessionIDs` is now stored in localStorage (`copilot-completed-sessions`) so it survives page refreshes and new tabs
- **Cross-tab sync**: a `storage` event listener keeps `completedSessionIDs` and `document.title` in sync across all open tabs — clearing a notification in one tab clears it everywhere
- **Fix notification sound**: corrected the sound file path from `/sounds/notification.mp3` to `/notification.mp3` and added a `.gitignore` exception (root `.gitignore` has a blanket `*.mp3` ignore rule from legacy AutoGPT agent days)
- **Fix SSR Sentry noise**: guarded the Copilot Zustand store initialization with a client-side check so `storage.get()` is never called during SSR, eliminating spurious Sentry alerts (BUILDER-7CB, 7CC, 7C7) while keeping the Sentry reporting in `local-storage.ts` intact for genuinely unexpected SSR access
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Verify "AutoPilot" appears (not "Otto") in browser notification, document title, permission dialog, and banner
  - [x] Verify Pulse icon in navbar Agent Activity dropdown
  - [x] Verify "Stay in the loop" dialog buttons are centered
  - [x] Open two tabs on copilot → trigger completion → both tabs show badge/checkmark
  - [x] Click completed session in tab 1 → badge clears in both tabs
  - [x] Refresh a tab → completed session state is preserved
  - [x] Verify notification sound plays on completion
  - [x] Verify no Sentry alerts from SSR localStorage access
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
a50e95f210 |
feat(backend/copilot): add include_graph option to find_library_agent (#12622)
## Why
The copilot's `edit_agent` tool requires the LLM to provide a complete agent JSON (all nodes + links), but the LLM had **no way to see the current graph structure** before editing. It was editing blindly — guessing/hallucinating the entire node+link structure and replacing the graph wholesale.
## What
- Add `include_graph` boolean parameter (default `false`) to the existing `find_library_agent` tool
- When `true`, each returned `AgentInfo` includes a `graph` field with the full graph JSON (nodes, links, `input_default` values)
- Update the agent generation guide to instruct the LLM to always fetch the current graph before editing
## How
- Added `graph: dict[str, Any] | None` field to `AgentInfo` model
- Added `_enrich_agents_with_graph()` helper in `agent_search.py` that calls the existing `get_agent_as_json()` utility to fetch full graph data
- Threaded `include_graph` parameter through `find_library_agent` → `search_agents` → `_search_library`
- Updated `agent_generation_guide.md` to add an "if editing" step that fetches the graph first
No new tools introduced — reuses the existing `find_library_agent` with one optional flag.
## Test plan
- [x] Unit tests: 2 new tests added (`test_include_graph_fetches_nodes_and_links`, `test_include_graph_false_does_not_fetch`)
- [x] All 7 `agent_search_test.py` tests pass
- [x] All pre-commit hooks pass (lint, format, typecheck)
- [ ] Verify copilot correctly uses `include_graph=true` before editing an agent (manual test) |
||
|
|
92b395d82a |
fix(backend): use OpenRouter client for simulator to support non-OpenAI models (#12656)
## Why
Dry-run block simulation is failing in production with `404 - model gemini-2.5-flash does not exist`. The simulator's default model (`google/gemini-2.5-flash`) is a non-OpenAI model that requires OpenRouter routing, but the shared `get_openai_client()` prefers the direct OpenAI key, creating a client that can't handle non-OpenAI models. The old code also stripped the provider prefix, sending `gemini-2.5-flash` to OpenAI's API.
## What
- Added `prefer_openrouter` keyword parameter to `get_openai_client()` — when True, prefers the OpenRouter key (returns None if unavailable, rather than falling back to an incompatible direct OpenAI client)
- Simulator now calls `get_openai_client(prefer_openrouter=True)` so `google/gemini-2.5-flash` routes correctly through OpenRouter
- Removed the redundant `SIMULATION_MODEL` env var override and the now-unnecessary provider prefix stripping from `_simulator_model()`
## How
`get_openai_client()` is decorated with `@cached(ttl_seconds=3600)` which keys by args, so `get_openai_client()` and `get_openai_client(prefer_openrouter=True)` are cached independently. When `prefer_openrouter=True` and no OpenRouter key exists, it returns `None` instead of falling back — the simulator already handles `None` with a clear error message.
### Checklist
- [x] All 24 dry-run tests pass
- [x] Test asserts `get_openai_client` is called with `prefer_openrouter=True`
- [x] Format, lint, and pyright pass
- [x] No changes to user-facing APIs
- [ ] Deploy to staging and verify simulation works
---------
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co> |
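The argument-keyed caching behavior this PR relies on can be illustrated with a minimal sketch. Here `functools.lru_cache` stands in for the platform's `@cached(ttl_seconds=3600)` decorator, the key constants are hypothetical config values, and dicts stand in for real OpenAI client objects:

```python
import functools

# Hypothetical config state for illustration: a direct OpenAI key exists,
# but no OpenRouter key is configured.
OPENAI_KEY = "sk-direct-example"
OPENROUTER_KEY = None


@functools.lru_cache(maxsize=None)  # stand-in for @cached(ttl_seconds=3600)
def get_openai_client(prefer_openrouter: bool = False):
    # Because the cache keys by arguments, get_openai_client() and
    # get_openai_client(prefer_openrouter=True) are cached independently.
    if prefer_openrouter:
        # Return None rather than falling back to a direct-OpenAI client
        # that can't serve non-OpenAI models like google/gemini-2.5-flash.
        if OPENROUTER_KEY:
            return {"key": OPENROUTER_KEY, "base": "https://openrouter.ai/api/v1"}
        return None
    if OPENAI_KEY:
        return {"key": OPENAI_KEY, "base": "https://api.openai.com/v1"}
    if OPENROUTER_KEY:
        return {"key": OPENROUTER_KEY, "base": "https://openrouter.ai/api/v1"}
    return None
```

With this setup the simulator's call returns `None` (and it surfaces a clear error), while the default call still returns the direct OpenAI client from cache.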
||
|
|
86abfbd394 |
feat(frontend): redesign onboarding wizard with Autopilot-first flow (#12640)
### Why / What / How <img width="800" height="827" alt="Screenshot 2026-04-02 at 15 40 24" src="https://github.com/user-attachments/assets/69a381c1-2884-434b-9406-4a3f7eec87cf" /> <img width="800" height="825" alt="Screenshot 2026-04-02 at 15 40 41" src="https://github.com/user-attachments/assets/c6191a68-a8ba-482b-ba47-c06c71d69f0c" /> <img width="800" height="825" alt="Screenshot 2026-04-02 at 15 40 48" src="https://github.com/user-attachments/assets/31b632b9-59cb-4bf7-a6a0-6158846fcf9a" /> <img width="800" height="812" alt="Screenshot 2026-04-02 at 15 40 54" src="https://github.com/user-attachments/assets/64e38a15-2e56-4c0e-bd84-987bf6076bf7" /> **Why:** The existing onboarding flow was outdated and didn't align with the new Autopilot-first experience. New users need a streamlined, visually polished wizard that collects their role and pain points to personalize Autopilot suggestions. **What:** Complete redesign of the onboarding wizard as a 4-step flow: Welcome → Role selection → Pain points → Preparing workspace. Uses the design system throughout (atoms/molecules), adds animations, and syncs steps with URL search params. **How:** - Zustand store manages wizard state (name, role, pain points, current step) - Steps synced to `?step=N` URL params for browser navigation support - Pain points reordered based on selected role (e.g. 
Sales sees "Finding leads" first) - Design system components used exclusively (no raw shadcn `ui/` imports) - New reusable components: `FadeIn` (atom), `TypingText` (molecule) with Storybook stories - `AutoGPTLogo` made sizeable via Tailwind className prop, migrated in Navbar - Fixed `SetupAnalytics` crash (client component was rendered inside `<head>`) ### Changes 🏗️ - **New onboarding wizard** (`steps/WelcomeStep`, `RoleStep`, `PainPointsStep`, `PreparingStep`) - **New shared components**: `ProgressBar`, `StepIndicator`, `SelectableCard`, `CardCarousel` - **New design system components**: `FadeIn` atom with stories, `TypingText` molecule with stories - **`AutoGPTLogo`** — size now controlled via `className` prop instead of numeric `size` - **Navbar** — migrated from legacy `IconAutoGPTLogo` to design system `AutoGPTLogo` - **Layout fix** — moved `SetupAnalytics` from `<head>` to `<body>` to fix React hydration crash - **Role-based pain point ordering** — top picks surfaced first based on role selection - **URL-synced steps** — `?step=N` search params for back/forward navigation - Removed old onboarding pages (1-welcome through 6-congrats, reset page) - Emoji/image assets for role selection cards ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Complete onboarding flow from step 1 through 4 as a new user - [x] Verify back button navigates to previous step - [x] Verify progress bar advances correctly (hidden on step 4) - [x] Verify step indicator dots show for steps 1-3 - [x] Verify role selection reorders pain points on next step - [x] Verify "Other" role/pain point shows text input - [x] Verify typing animation on PreparingStep title - [x] Verify fade-in animations on all steps - [x] Verify URL updates with `?step=N` on navigation - [x] Verify browser back/forward works with step URLs - [x] Verify mobile horizontal 
scroll on card grids - [x] Verify `pnpm types` passes cleanly --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
a7f4093424 |
ci(platform): set up Codecov coverage reporting across platform and classic (#12655)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
e33b1e2105 |
feat(classic): update classic autogpt a bit to make it more useful for my day to day (#11797)
## Summary
This PR modernizes AutoGPT Classic to make it more useful for day-to-day
autonomous agent development. Major changes include consolidating the
project structure, adding new prompt strategies, modernizing the
benchmark system, and improving the development experience.
**Note: AutoGPT Classic is an experimental, unsupported project
preserved for educational/historical purposes. Dependencies will not be
actively updated.**
## Changes 🏗️
### Project Structure & Build System
- **Consolidated Poetry projects** - Merged `forge/`,
`original_autogpt/`, and benchmark packages into a single
`pyproject.toml` at `classic/` root
- **Removed old benchmark infrastructure** - Deleted the complex
`agbenchmark` package (3000+ lines) in favor of the new
`direct_benchmark` harness
- **Removed frontend** - Deleted `benchmark/frontend/` React app (no
longer needed)
- **Cleaned up CI workflows** - Simplified GitHub Actions workflows for
the consolidated project structure
- **Added CLAUDE.md** - Documentation for working with the codebase
using Claude Code
### New Direct Benchmark System
- **`direct_benchmark` harness** - New streamlined benchmark runner
with:
- Rich TUI with multi-panel layout showing parallel test execution
- Incremental resume and selective reset capabilities
- CI mode for non-interactive environments
- Step-level logging with colored prefixes
- "Would have passed" tracking for timed-out challenges
- Copy-paste completion blocks for sharing results
### Multiple Prompt Strategies
Added pluggable prompt strategy system supporting:
- **one_shot** - Single-prompt completion
- **plan_execute** - Plan first, then execute steps
- **rewoo** - Reasoning without observation (deferred tool execution)
- **react** - Reason + Act iterative loop
- **lats** - Language Agent Tree Search (MCTS-based exploration)
- **sub_agent** - Multi-agent delegation architecture
- **debate** - Multi-agent debate for consensus
### LLM Provider Improvements
- Added support for modern **Anthropic Claude models**
(claude-3.5-sonnet, claude-3-haiku, etc.)
- Added **Groq** provider support
- Improved tool call error feedback for LLM self-correction
- Fixed deprecated API usage
### Web Components
- **Replaced Selenium with Playwright** for web browsing (better async
support, faster)
- Added **lightweight web fetch component** for simple URL fetching
- **Modernized web search** with tiered provider system (Tavily, Serper,
Google)
### Agent Capabilities
- **Workspace permissions system** - Pattern-based allow/deny lists for
agent commands
- **Rich interactive selector** for command approval with scopes
(once/agent/workspace/deny)
- **TodoComponent** with LLM-powered task decomposition
- **Platform blocks integration** - Connect to AutoGPT Platform API for
additional blocks
- **Sub-agent architecture** - Agents can spawn and coordinate
sub-agents
### Developer Experience
- **Python 3.12+ support** with CI testing on 3.12, 3.13, 3.14
- **Current working directory as default workspace** - Run `autogpt`
from any project directory
- Simplified log format (removed timestamps)
- Improved configuration and setup flow
- External benchmark adapters for GAIA, SWE-bench, and AgentBench
### Bug Fixes
- Fixed N/A command loop when using native tool calling
- Fixed auto-advance plan steps in Plan-Execute strategy
- Fixed approve+feedback to execute command then send feedback
- Fixed parallel tool calls in action history
- Always recreate Docker containers for code execution
- Various pyright type errors resolved
- Linting and formatting issues fixed across codebase
## Test Plan
- [x] CI lint, type, and test checks pass
- [x] Run `poetry install` from `classic/` directory
- [x] Run `poetry run autogpt` and verify CLI starts
- [x] Run `poetry run direct-benchmark run --tests ReadFile` to verify
benchmark works
## Notes
- This is a WIP PR for personal use improvements
- The project is marked as **unsupported** - no active maintenance
planned
- Contains known vulnerabilities in dependencies (intentionally not
updated)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> CI/build workflows are substantially reworked (runner matrix removal,
path/layout changes, new benchmark runner), so breakage is most likely
in automation and packaging rather than runtime behavior.
>
> **Overview**
> **Modernizes the `classic/` project layout and automation around a
single consolidated Poetry project** (root
`classic/pyproject.toml`/`poetry.lock`) and updates docs
(`classic/README.md`, new `classic/CLAUDE.md`) accordingly.
>
> **Replaces the old `agbenchmark` CI usage with `direct-benchmark` in
GitHub Actions**, including new/updated benchmark smoke and regression
workflows, standardized `working-directory: classic`, and a move to
**Python 3.12** on Ubuntu-only runners (plus updated caching, coverage
flags, and required `ANTHROPIC_API_KEY` wiring).
>
> Cleans up repo/dev tooling by removing the classic frontend workflow,
deleting the Forge VCR cassette submodule (`.gitmodules`) and associated
CI steps, consolidating `flake8`/`isort`/`pyright` pre-commit hooks to
run from `classic/`, updating ignores for new report/workspace
artifacts, and updating `classic/Dockerfile.autogpt` to build from
Python 3.12 with the consolidated project structure.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
|
||
|
|
fff101e037 |
feat(backend): add SQL query block with multi-database support for CoPilot analytics (#12569)
## Summary
- Add a read-only SQL query block for CoPilot/AutoPilot analytics access
- Supports **multiple databases**: PostgreSQL, MySQL, SQLite, MSSQL via SQLAlchemy
- Enforces read-only queries (SELECT only) with defense-in-depth SQL validation using sqlparse
- SSRF protection: blocks connections to private/internal IPs
- Credentials stored securely via the platform credential system
## Changes
- New `SQLQueryBlock` in `backend/blocks/sql_query_block.py` with `DatabaseType` enum
- SQLAlchemy-based execution with dialect-specific read-only and timeout settings
- Connection URL validation ensuring the driver matches the selected database type
- Comprehensive test suite (62 tests) including URL validation, sanitization, serialization
- Documentation in `docs/integrations/block-integrations/data.md`
- Added `DATABASE` provider to `ProviderName` enum
### Checklist 📋
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan
#### Test plan:
- [x] Unit tests pass for query validation, URL validation, error sanitization, value serialization
- [x] Read-only enforcement rejects INSERT/UPDATE/DELETE/DROP
- [x] Multi-statement injection blocked
- [x] SSRF protection blocks private IPs
- [x] Connection URL driver validation works for all 4 database types
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
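The read-only enforcement the PR describes (SELECT-only, multi-statement injection blocked) can be sketched roughly as below. Note the real block uses sqlparse for proper statement parsing; this simplified stand-in uses naive splitting and a keyword regex purely to illustrate the defense-in-depth checks, and would over-reject queries containing write keywords inside string literals:

```python
import re

# Write/DDL keywords rejected anywhere in the statement (defense in depth).
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|merge)\b",
    re.IGNORECASE,
)


def validate_read_only(sql: str) -> None:
    """Raise ValueError unless sql is a single read-only statement."""
    # Block multi-statement injection like "SELECT 1; DROP TABLE t".
    statements = [s for s in sql.split(";") if s.strip()]
    if len(statements) != 1:
        raise ValueError("multi-statement queries are not allowed")
    stmt = statements[0].strip()
    # Allowlist: statement must start as a SELECT (or a WITH ... SELECT CTE).
    if not stmt.lower().startswith(("select", "with")):
        raise ValueError("only SELECT queries are allowed")
    # Denylist: no write/DDL keywords anywhere in the statement.
    if _FORBIDDEN.search(stmt):
        raise ValueError("write/DDL keyword detected")
```

Here `validate_read_only("SELECT * FROM runs")` passes, while `"SELECT 1; DROP TABLE runs"` and `"UPDATE runs SET cost = 0"` both raise.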
||
|
|
f1ac05b2e0 |
fix(backend): propagate dry-run mode to special blocks with LLM-powered simulation (#12575)
## Summary

- **OrchestratorBlock & AgentExecutorBlock** now execute for real in dry-run mode so the orchestrator can make LLM calls and agent executors can spawn child graphs. Their downstream tool blocks and child-graph blocks are still simulated via `simulate_block()`. Credential fields from node defaults are restored since `validate_exec()` wipes them in dry-run mode. Agent-mode iterations capped at 1 in dry-run.
- **All blocks** (including MCPToolBlock) are simulated via a single generic `simulate_block()` path. The LLM prompt is grounded by `inspect.getsource(block.run)`, giving the simulator access to the exact implementation of each block's `run()` method. This produces realistic mock responses for any block type without needing block-specific simulation logic.
- Updated agent generation guide to document special block dry-run behavior.
- Minor frontend fixes: exported `formatCents` from `RateLimitResetDialog` for reuse in `UsagePanelContent`, used `useRef` for stable callback references in `useResetRateLimit` to avoid stale closures.
- 74 tests (21 existing dry-run + 53 new simulator tests covering prompt building, passthrough logic, and special block dry-run).

## Design

The simulator (`backend/executor/simulator.py`) uses a two-tier approach:

1. **Passthrough blocks** (OrchestratorBlock, AgentExecutorBlock): `prepare_dry_run()` returns modified input_data so these blocks execute for real in `manager.py`. OrchestratorBlock gets `max_iterations=1` (agent mode) or 0 (traditional mode). AgentExecutorBlock spawns real child graph executions whose blocks inherit `dry_run=True`.
2. **All other blocks**: `simulate_block()` builds an LLM prompt containing:
   - Block name and description
   - Input/output schemas (JSON Schema)
   - The block's `run()` source code via `inspect.getsource(block.run)`
   - The actual input values (with credentials stripped and long values truncated)

The LLM then role-plays the block's execution, producing realistic outputs grounded in the actual implementation. Special handling for input/output blocks: `AgentInputBlock` and `AgentOutputBlock` are pure passthrough (no LLM call needed).

## Test plan

- [x] All 74 tests pass (`pytest backend/copilot/tools/test_dry_run.py backend/executor/simulator_test.py`)
- [x] Pre-commit hooks pass (ruff, isort, black, pyright, frontend typecheck)
- [x] CI: all checks green
- [x] E2E: dry-run execution completes with `is_dry_run=true`, cost=0, no errors
- [x] E2E: normal (non-dry-run) execution unchanged
- [x] E2E: Create agent with OrchestratorBlock + tool blocks, run with `dry_run=True`, verify orchestrator makes real LLM calls while tool blocks are simulated
- [x] E2E: AgentExecutorBlock spawns child graph in dry-run, child blocks are LLM-simulated
- [x] E2E: Builder simulate button works end-to-end with special blocks

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
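The grounding idea behind `simulate_block()` can be sketched like this. The toy `EchoBlock` and the exact prompt layout are illustrative assumptions, not the real simulator code:

```python
import inspect
import json


class EchoBlock:
    """Toy stand-in for a platform block, used only to illustrate grounding."""

    name = "EchoBlock"
    description = "Echoes its input."
    input_schema = {"type": "object", "properties": {"text": {"type": "string"}}}
    output_schema = {"type": "object", "properties": {"text": {"type": "string"}}}

    def run(self, input_data: dict) -> dict:
        return {"text": input_data["text"]}


def build_simulation_prompt(block, input_values: dict) -> str:
    # Ground the LLM in the block's actual implementation, not just its schema.
    source = inspect.getsource(block.run)
    return "\n".join(
        [
            f"Block: {block.name} - {block.description}",
            f"Input schema: {json.dumps(block.input_schema)}",
            f"Output schema: {json.dumps(block.output_schema)}",
            "Implementation of run():",
            source,
            f"Inputs: {json.dumps(input_values)}",
            "Role-play this block's execution and emit realistic JSON output.",
        ]
    )
```

Because the prompt contains the `run()` source, the simulator's mock output can track the real control flow of any block without per-block simulation logic.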
||
|
|
f115607779 |
fix(copilot): recognize Agent tool name and route CLI state into workspace (#12635)
### Why / What / How
**Why:** The Claude Agent SDK CLI renamed the sub-agent tool from
`"Task"` to `"Agent"` in v2.x. Our security hooks only checked for
`"Task"`, so all sub-agent security controls were silently bypassed on
production: concurrency limiting didn't apply, and slot tracking was
broken. This was discovered via Langfuse trace analysis of session
`62b1b2b9` where background sub-agents ran unchecked.
Additionally, the CLI writes sub-agent output to `/tmp/claude-<uid>/`
and project state to `$HOME/.claude/` — both outside the per-session
workspace (`/tmp/copilot-<session>/`). This caused `PermissionError` in
E2B sandboxes and silently lost sub-agent results.
The frontend also had no rendering for the `Agent` / `TaskOutput` SDK
built-in tools — they fell through to the generic "other" category with
no context-aware display.
**What:**
1. Fix the sub-agent tool name recognition (`"Task"` → `{"Task",
"Agent"}`)
2. Allow `run_in_background` — the SDK handles async lifecycle cleanly
(returns `isAsync:true`, model polls via `TaskOutput`)
3. Route CLI state into the workspace via `CLAUDE_CODE_TMPDIR` and
`HOME` env vars
4. Add lifecycle hooks (`SubagentStart`/`SubagentStop`) for
observability
5. Add frontend `"agent"` tool category with proper UI rendering
**How:**
- Security hooks check `tool_name in _SUBAGENT_TOOLS` (frozenset of
`"Task"` and `"Agent"`)
- Background agents are allowed but still count against `max_subtasks`
concurrency limit
- Frontend detects `isAsync: true` output → shows "Agent started
(background)" not "Agent completed"
- `TaskOutput` tool shows retrieval status and collected results
- Robot icon and agent-specific accordion rendering for both foreground
and background agents
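The renamed-tool fix reduces to a set membership test; a minimal sketch mirroring the names in this description:

```python
# Matches both the pre-2.x ("Task") and 2.x ("Agent") CLI tool names,
# so sub-agent security controls apply regardless of SDK version.
_SUBAGENT_TOOLS = frozenset({"Task", "Agent"})


def is_subagent_tool(tool_name: str) -> bool:
    return tool_name in _SUBAGENT_TOOLS
```

A plain `tool_name == "Task"` comparison is exactly what silently bypassed the concurrency limiting when the CLI renamed the tool.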
### Changes 🏗️
**Backend:**
- **`security_hooks.py`**: Replace `tool_name == "Task"` with `tool_name
in _SUBAGENT_TOOLS`. Remove `run_in_background` deny block (SDK handles
async lifecycle). Add `SubagentStart`/`SubagentStop` hooks.
- **`tool_adapter.py`**: Add `"Agent"` to `_SDK_BUILTIN_ALWAYS` list
alongside `"Task"`.
- **`service.py`**: Set `CLAUDE_CODE_TMPDIR=sdk_cwd` and `HOME=sdk_cwd`
in SDK subprocess env.
- **`security_hooks_test.py`**: Update background tests (allowed, not
blocked). Add test for background agents counting against concurrency
limit.
**Frontend:**
- **`GenericTool/helpers.ts`**: Add `"agent"` tool category for `Agent`,
`Task`, `TaskOutput`. Agent-specific animation text detecting `isAsync`
output. Input summaries from description/prompt fields.
- **`GenericTool/GenericTool.tsx`**: Add `RobotIcon` for agent category.
Add `getAgentAccordionData()` with async-aware title/content.
`TaskOutput` shows retrieval status.
- **`useChatSession.ts`**: Fix pre-existing TS error (void mutation
body).
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] All security hooks tests pass (background allowed + limit
enforced)
- [x] Pre-commit hooks (ruff, black, isort, pyright, tsc) all pass
- [x] E2E test: copilot agent create+run scenario PASS
- [ ] Deploy to dev and test copilot sub-agent spawning with background
mode
#### For configuration changes:
- [x] `.env.default` is updated or already compatible
- [x] `docker-compose.yml` is updated or already compatible
|
||
|
|
1aef8b7155 |
fix(backend/copilot): fix tool output file reading between E2B and host (#12646)
### Why / What / How

**Why:** When copilot tools return large outputs (e.g. 3MB+ base64 images from API calls), the agent cannot process them in the E2B sandbox. Three compounding issues prevent seamless file access:

1. The `<tool-output-truncated path="...">` tag uses a bare `path=` attribute that the model confuses with a local filesystem path (it's actually a workspace path)
2. `is_allowed_local_path` rejects `tool-outputs/` directories (only `tool-results/` was allowed)
3. SDK-internal files read via the `Read` tool are not available in the E2B sandbox for `bash_exec` processing

**What:** Fixes all three issues so that large tool outputs can be seamlessly read and processed in both host and E2B contexts.

**How:**

- Changed `path=` → `workspace_path=` in the truncation tag to disambiguate workspace vs filesystem paths
- Added `save_to_path` guidance in the retrieval instructions for E2B users
- Extended `is_allowed_local_path` to accept both `tool-results` and `tool-outputs` directories
- Added automatic bridging: when E2B is active and `Read` accesses an SDK-internal file, the file is automatically copied to `/tmp/<filename>` in the sandbox
- Updated system prompting to explain both SDK tool-result bridging and workspace `<tool-output-truncated>` handling

### Changes 🏗️

- **`tools/base.py`**: `_persist_and_summarize` now uses `workspace_path=` attribute and includes `save_to_path` example for E2B processing
- **`context.py`**: `is_allowed_local_path` accepts both `tool-results` and `tool-outputs` directory names
- **`sdk/e2b_file_tools.py`**: `_handle_read_file` bridges SDK-internal files to `/tmp/` in E2B sandbox; new `_bridge_to_sandbox` helper
- **`prompting.py`**: Updated "SDK tool-result files" section and added "Large tool outputs saved to workspace" section
- **Tests**: Added `tool-outputs` path validation tests in `context_test.py` and `e2b_file_tools_test.py`; updated `base_test.py` assertion for `workspace_path`

### Checklist 📋

#### For code changes:

- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] `poetry run pytest backend/copilot/tools/base_test.py` — all 9 tests pass (persistence, truncation, binary fields)
  - [x] `poetry run format` and `poetry run lint` pass clean
  - [x] All pre-commit hooks pass
  - [ ] `context_test.py`, `e2b_file_tools_test.py`, `security_hooks_test.py` — blocked by pre-existing DB migration issue on worktree (missing `User.subscriptionTier` column); CI will validate these
|
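The relaxed directory check can be sketched as below. This is a simplified stand-in with an illustrative helper name; the real `is_allowed_local_path` performs more validation than a bare directory-name match:

```python
from pathlib import PurePosixPath

# Accept paths under either persisted-output directory name.
_ALLOWED_OUTPUT_DIRS = {"tool-results", "tool-outputs"}


def is_allowed_output_dir(path: str) -> bool:
    return any(part in _ALLOWED_OUTPUT_DIRS for part in PurePosixPath(path).parts)
```

With only `tool-results` in the set, workspace paths emitted by the truncation tag under `tool-outputs/` were rejected, which is the second bug this PR fixes.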
||
|
|
0da949ba42 |
feat(e2b): set git committer identity from user's GitHub profile (#12650)
## Summary
Sets git author/committer identity in E2B sandboxes using the user's
connected GitHub account profile, so commits are properly attributed.
## Changes
### `integration_creds.py`
- Added `get_github_user_git_identity(user_id)` that fetches the user's
name and email from the GitHub `/user` API
- Uses TTL cache (10 min) to avoid repeated API calls
- Falls back to GitHub noreply email
(`{id}+{login}@users.noreply.github.com`) when user has a private email
- Falls back to `login` if `name` is not set
### `bash_exec.py`
- After injecting integration env vars, calls
`get_github_user_git_identity()` and sets `GIT_AUTHOR_NAME`,
`GIT_AUTHOR_EMAIL`, `GIT_COMMITTER_NAME`, `GIT_COMMITTER_EMAIL`
- Only sets these if the user has a connected GitHub account
### `bash_exec_test.py`
- Added tests covering: identity set from GitHub profile, no identity
when GitHub not connected, no injection when no user_id
## Why
Previously, commits made inside E2B sandboxes had no author identity
set, leading to unattributed commits. This dynamically resolves identity
from the user's actual GitHub account rather than hardcoding a default.
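The fallback chain can be sketched as below; the field names follow the GitHub `/user` API payload, while the helper name is illustrative:

```python
def resolve_git_identity(profile: dict) -> tuple[str, str]:
    """Derive a git author name/email from a GitHub /user payload."""
    # Fall back to the login when the profile has no display name.
    name = profile.get("name") or profile["login"]
    # Private-email accounts return email=None; use the noreply address.
    email = (
        profile.get("email")
        or f"{profile['id']}+{profile['login']}@users.noreply.github.com"
    )
    return name, email
```

The resulting pair would populate `GIT_AUTHOR_NAME`/`GIT_AUTHOR_EMAIL` and the committer equivalents in the sandbox environment.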
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Adds outbound calls to GitHub’s `/user` API during `bash_exec` runs
and injects returned identity into the sandbox environment, which could
impact reliability (network/timeouts) and attribution behavior. Caching
mitigates repeated calls but incorrect/expired tokens or API failures
may lead to missing identity in commits.
>
> **Overview**
> Sets git author/committer environment variables in the E2B `bash_exec`
path by fetching the connected user’s GitHub profile and injecting
`GIT_AUTHOR_*`/`GIT_COMMITTER_*` into the sandbox env.
>
> Introduces `get_github_user_git_identity()` with TTL caching
(including a short-lived null cache), fallback to GitHub noreply email
when needed, and ensures `invalidate_user_provider_cache()` also clears
identity caches for the `github` provider. Updates tests to cover
identity injection behavior and the new cache invalidation semantics.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
|
||
|
|
6b031085bd |
feat(platform): add generic ask_question copilot tool (#12647)
### Why / What / How

**Why:** The copilot can ask clarifying questions in plain text, but that text gets collapsed into hidden "reasoning" UI when the LLM also calls tools in the same turn. This makes clarification questions invisible to users. The existing `ClarificationNeededResponse` model and `ClarificationQuestionsCard` UI component were built for this purpose but had no tool wiring them up.

**What:** Adds a generic `ask_question` tool that produces a visible, interactive clarification card instead of collapsible plain text. Unlike the agent-generation-specific `clarify_agent_request` proposed in #12601, this tool is workflow-agnostic — usable for agent building, editing, troubleshooting, or any flow needing user input.

**How:**

- Backend: New `AskQuestionTool` reuses existing `ClarificationNeededResponse` model. Registered in `TOOL_REGISTRY` and `ToolName` permissions.
- Frontend: New `AskQuestion/` renderer reuses `ClarificationQuestionsCard` from CreateAgent. Registered in `CUSTOM_TOOL_TYPES` (prevents collapse into reasoning) and `MessagePartRenderer`.
- Guide: `agent_generation_guide.md` updated to reference `ask_question` for the clarification step.

### Changes 🏗️

- **`copilot/tools/ask_question.py`** — New generic tool: takes `question`, optional `options[]` and `keyword`, returns `ClarificationNeededResponse`
- **`copilot/tools/__init__.py`** — Register `ask_question` in `TOOL_REGISTRY`
- **`copilot/permissions.py`** — Add `ask_question` to `ToolName` literal
- **`copilot/sdk/agent_generation_guide.md`** — Reference `ask_question` tool in clarification step
- **`ChatMessagesContainer/helpers.ts`** — Add `tool-ask_question` to `CUSTOM_TOOL_TYPES`
- **`MessagePartRenderer.tsx`** — Add switch case for `tool-ask_question`
- **`AskQuestion/AskQuestion.tsx`** — Renderer reusing `ClarificationQuestionsCard`
- **`AskQuestion/helpers.ts`** — Output parsing and animation text

### Checklist 📋

#### For code changes:

- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Backend format + pyright pass
  - [x] Frontend lint + types pass
  - [x] Pre-commit hooks pass
  - [ ] Manual test: copilot uses `ask_question` and card renders visibly (not collapsed)
|
||
|
|
11b846dd49 |
fix(blocks): rename placeholder_values to options on AgentDropdownInputBlock (#12595)
## Summary

Resolves [REQ-78](https://linear.app/autogpt/issue/REQ-78): The `placeholder_values` field on `AgentDropdownInputBlock` is misleadingly named. In every major UI framework "placeholder" means non-binding hint text that disappears on focus, but this field actually creates a dropdown selector that restricts the user to only those values.

## Changes

### Core rename (`autogpt_platform/backend/backend/blocks/io.py`)

- Renamed `placeholder_values` → `options` on `AgentDropdownInputBlock.Input`
- Added clear field description: *"If provided, renders the input as a dropdown selector restricted to these values. Leave empty for free-text input."*
- Updated class docstring to describe actual behavior
- Overrode `model_construct()` to remap legacy `placeholder_values` → `options` for **backward compatibility** with existing persisted agent JSON

### Tests (`autogpt_platform/backend/backend/blocks/test/test_block.py`)

- Updated existing tests to use canonical `options` field name
- Added 2 new backward-compat tests verifying legacy `placeholder_values` still works through both `model_construct()` and `Graph._generate_schema()` paths

### Documentation

- Updated `autogpt_platform/backend/backend/copilot/sdk/agent_generation_guide.md` — changed field name in CoPilot SDK guide
- Updated `docs/integrations/block-integrations/basic.md` — changed field name and description in public docs

### Load tests (`autogpt_platform/backend/load-tests/tests/api/graph-execution-test.js`)

- Removed spurious `placeholder_values: {}` from AgentInputBlock node (this field never existed on AgentInputBlock)
- Fixed execution input to use `value` instead of `placeholder_values`

## Backward Compatibility

Existing agents with `placeholder_values` in their persisted `input_default` JSON will continue to work — the `model_construct()` override transparently remaps the old key to `options`. No database migration needed since the field is stored inside a JSON blob, not as a dedicated column.

## Testing

- All existing tests updated and passing
- 2 new backward-compat tests added
- No frontend changes needed (frontend reads `enum` from generated JSON Schema, not the field name directly)

---------

Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
|
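The backward-compat remap is, in essence, the following. This is a standalone sketch of what the `model_construct()` override does to its input dict, not the actual pydantic code:

```python
def remap_legacy_input(values: dict) -> dict:
    """Remap the legacy `placeholder_values` key to the canonical `options`.

    An explicitly provided `options` value is never clobbered.
    """
    if "placeholder_values" in values and "options" not in values:
        values = dict(values)  # avoid mutating the caller's dict
        values["options"] = values.pop("placeholder_values")
    return values
```

Because the field lives inside a JSON blob rather than a dedicated column, this in-code remap is all the migration that is needed.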
||
|
|
b9e29c96bd |
fix(backend/copilot): detect prompt-too-long in AssistantMessage content and ResultMessage success subtype (#12642)
## Why

PR #12625 fixed the prompt-too-long retry mechanism for most paths, but two SDK-specific paths were still broken. The dev session `d2f7cba3` kept accumulating synthetic "Prompt is too long" error entries on every turn, growing the transcript from 2.5 MB → 3.2 MB, making recovery impossible.

Root causes identified from production logs (`[T25]`, `[T28]`):

**Path 1 — AssistantMessage content check:** When the Claude API rejects a prompt, the SDK surfaces it as `AssistantMessage(error="invalid_request", content=[TextBlock("Prompt is too long")])`. Our check only inspected `error_text = str(sdk_error)` which is `"invalid_request"` — not a prompt-too-long pattern. The content was then streamed out as `StreamText`, setting `events_yielded = 1`, which blocked retry even when the ResultMessage fired.

**Path 2 — ResultMessage success subtype:** After the SDK auto-compacts internally (via `PreCompact` hook) and the compacted transcript is _still_ too long, the SDK returns `ResultMessage(subtype="success", result="Prompt is too long")`. Our check only ran for `subtype="error"`. With `subtype="success"`, the stream "completed normally", appended the synthetic error entry to the transcript via `transcript_builder`, and uploaded it to GCS — causing the transcript to grow on each failed turn.

## What

- **AssistantMessage handler**: when `sdk_error` is set, also check the content text. `sdk_error` being non-`None` confirms this is an API error message (not user-generated content), so content inspection is safe.
- **ResultMessage handler**: check `result` for prompt-too-long patterns regardless of `subtype`, covering the SDK auto-compact path where `subtype="success"` with `result="Prompt is too long"`.

## How

Two targeted one-line condition expansions in `_run_stream_attempt`, plus two new integration tests in `retry_scenarios_test.py` that reproduce each broken path and verify retry fires correctly.

## Changes

- `backend/copilot/sdk/service.py`: fix AssistantMessage content check + ResultMessage subtype-independent check
- `backend/copilot/sdk/retry_scenarios_test.py`: add 2 integration tests for the new scenarios

## Checklist

- [x] Tests added for both new scenarios (45 total, all pass)
- [x] Formatted (`poetry run format`)
- [x] No false-positive risk: AssistantMessage check gated behind `sdk_error is not None`
- [x] Root cause verified from production pod logs
|
||
|
|
4ac0ba570a |
fix(backend): fix copilot credential loading across event loops (#12628)
## Why

CoPilot autopilot sessions are inconsistently failing to load user credentials (specifically GitHub OAuth). Some sessions proceed normally, some show "provide credentials" prompts despite the user having valid creds, and some are completely blocked. Production logs confirmed the root cause: `RuntimeError: Task got Future <Future pending> attached to a different loop` in the credential refresh path, cascading into null-cache poisoning that blocks credential lookups for 60 seconds.

## What

Three interrelated bugs in the credential system:

1. **`refresh_if_needed` always acquired Redis locks even with `lock=False`** — The `lock` parameter only controlled the inner credential lock, but the outer "refresh" scope lock was always acquired. The copilot executor uses multiple worker threads with separate event loops; the `asyncio.Lock` inside `AsyncRedisKeyedMutex` was bound to one loop and failed on others.
2. **Stale event loop in `locks()` singleton** — Both `IntegrationCredentialsManager` and `IntegrationCredentialsStore` cached their `AsyncRedisKeyedMutex` without tracking which event loop created it. When a different worker thread (with a different loop) reused the singleton, it got the "Future attached to different loop" error.
3. **Null-cache poisoning on refresh failure** — When OAuth refresh failed (due to the event loop error), the code fell through to cache "no credentials found" for 60 seconds via `_null_cache`. This blocked ALL subsequent credential lookups for that user+provider, even though the credentials existed and could refresh fine on retry.

## How

- Split `refresh_if_needed` into `_refresh_locked` / `_refresh_unlocked` so `lock=False` truly skips ALL Redis locking (safe for copilot's best-effort background injection)
- Added event loop tracking to `locks()` in both `IntegrationCredentialsManager` and `IntegrationCredentialsStore` — recreates the mutex when the running loop changes
- Only populate `_null_cache` when the user genuinely has no credentials; skip caching when OAuth refresh failed transiently
- Updated existing test to verify null-cache is not poisoned on refresh failure

## Test plan

- [x] All 14 existing `integration_creds_test.py` tests pass
- [x] Updated `test_oauth2_refresh_failure_returns_none_without_null_cache` verifies null-cache is not populated on refresh failure
- [x] Format, lint, and typecheck pass
- [ ] Deploy to staging and verify copilot sessions consistently load GitHub credentials
|
||
|
|
d61a2c6cd0 |
Revert "fix(backend/copilot): detect prompt-too-long in AssistantMessage content and ResultMessage success subtype"
This reverts commit
|
||
|
|
1c301b4b61 |
fix(backend/copilot): detect prompt-too-long in AssistantMessage content and ResultMessage success subtype
The SDK returns AssistantMessage(error="invalid_request", content=[TextBlock("Prompt is too long")])
followed by ResultMessage(subtype="success", result="Prompt is too long") when the transcript is
rejected after internal auto-compaction. Both paths bypassed the retry mechanism:
- AssistantMessage handler only checked error_text ("invalid_request"), not the content which
holds the actual error description. The content was then streamed as text, setting events_yielded=1,
which blocked retry even when ResultMessage fired.
- ResultMessage handler only triggered prompt-too-long detection for subtype="error", not
subtype="success". The stream "completed normally", stored the synthetic error entry in the
transcript, and uploaded it — causing the transcript to grow unboundedly on each failed turn.
Fixes:
1. AssistantMessage handler: when sdk_error is set (confirmed error message), also check content
text. sdk_error being set guarantees this is an API error, not user-generated content, so
content inspection is safe.
2. ResultMessage handler: check result for prompt-too-long regardless of subtype, covering the
case where the SDK auto-compacts internally but the result is still too long.
Adds integration tests for both new scenarios.
|
||
|
|
24d0c35ed3 |
fix(backend/copilot): prompt-too-long retry, compaction churn, model-aware compression, and truncated tool call recovery (#12625)
## Why
CoPilot has several context management issues that degrade long
sessions:
1. "Prompt is too long" errors crash the session instead of triggering
retry/compaction
2. Stale thinking blocks bloat transcripts, causing unnecessary
compaction every turn
3. Compression target is hardcoded regardless of model context window
size
4. Truncated tool calls (empty `{}` args from max_tokens) kill the
session instead of guiding the model to self-correct
## What
**Fix 1: Prompt-too-long retry bypass (SENTRY-1207)**
The SDK surfaces "prompt too long" via `AssistantMessage.error` and
`ResultMessage.result` — neither triggered the retry/compaction loop
(only Python exceptions did). Now both paths are intercepted and
re-raised.
**Fix 2: Strip stale thinking blocks before upload**
Thinking/redacted_thinking blocks in non-last assistant entries are
10-50K tokens each but only needed for API signature verification in the
*last* message. Stripping before upload reduces transcript size and
prevents per-turn compaction.
**Fix 3: Model-aware compression target**
`compress_context()` now computes `target_tokens` from the model's
context window (e.g. 140K for Opus 200K) instead of a hardcoded 120K
default. Larger models retain more history; smaller models compress more
aggressively.
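The model-aware target computation is roughly the following; the window lookup table and constant values here are illustrative assumptions, not the real model registry:

```python
_CONTEXT_WINDOWS = {  # illustrative values, not the real model table
    "claude-opus": 200_000,
}
_RESPONSE_OVERHEAD = 60_000  # headroom for system prompt + response tokens
_DEFAULT_TARGET = 120_000


def get_compression_target(model: str) -> int:
    # Compress down to (context window - overhead); unknown models keep
    # the previous hardcoded default.
    window = _CONTEXT_WINDOWS.get(model)
    if window is None:
        return _DEFAULT_TARGET
    return window - _RESPONSE_OVERHEAD
```

A 200K-window model thus retains up to 140K tokens of history, while smaller or unknown models fall back to more aggressive compression.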
**Fix 4: Self-correcting truncated tool calls**
When the model's response exceeds max_tokens, tool call inputs get
silently truncated to `{}`. Previously this tripped a circuit breaker
after 3 attempts. Now the MCP wrapper detects empty args and returns
guidance: "write in chunks with `cat >>`, pass via
`@@agptfile:filename`". The model can self-correct instead of the
session dying.
## How
- **service.py**: `_is_prompt_too_long` checks in both
`AssistantMessage.error` and `ResultMessage` error handlers. Circuit
breaker limit raised from 3→5.
- **transcript.py**: `strip_stale_thinking_blocks()` reverse-scans for
last assistant `message.id`, strips thinking blocks from all others.
Called in `upload_transcript()`.
- **prompt.py**: `get_compression_target(model)` computes
`context_window - 60K overhead`. `compress_context()` uses it when
`target_tokens` is None.
- **tool_adapter.py**: `_truncating` wrapper intercepts empty args on
tools with required params, returns actionable guidance instead of
failing.
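The reverse-scan behind Fix 2 can be sketched like this; the transcript entry shape (`role`/`content` dicts) is a simplified assumption about the real format:

```python
_THINKING_TYPES = {"thinking", "redacted_thinking"}


def strip_stale_thinking_blocks(entries: list[dict]) -> list[dict]:
    """Drop thinking blocks from every assistant entry except the last one."""
    # Reverse-scan to find the last assistant entry, whose thinking blocks
    # must survive for API signature verification.
    last_assistant = None
    for i in range(len(entries) - 1, -1, -1):
        if entries[i].get("role") == "assistant":
            last_assistant = i
            break
    stripped = []
    for i, entry in enumerate(entries):
        if entry.get("role") == "assistant" and i != last_assistant:
            content = [
                block
                for block in entry.get("content", [])
                if block.get("type") not in _THINKING_TYPES
            ]
            entry = {**entry, "content": content}
        stripped.append(entry)
    return stripped
```

Running this before upload keeps only the last assistant turn's thinking blocks, which is what prevents the per-turn compaction churn.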
## Related
- Fixes SENTRY-1207
- Sessions: `d2f7cba3` (repeated compaction), `08b807d4` (prompt too
long), `130d527c` (truncated tool calls)
- Extends #12413, consolidates #12626
## Test plan
- [x] 6 unit tests for `strip_stale_thinking_blocks`
- [x] 1 integration test for ResultMessage prompt-too-long → compaction
retry
- [x] Pyright clean (0 errors), all pre-commit hooks pass
- [ ] E2E: Load transcripts from affected sessions and verify behavior
|
||
|
|
8aae7751dc |
fix(backend/copilot): prevent duplicate block execution from pre-launch arg mismatch (#12632)
## Why

CoPilot sessions are duplicating Linear tickets and GitHub PRs. Investigation of 5 production sessions (March 31st) found that 3/5 created duplicate Linear issues — each with consecutive IDs at the exact same timestamp, but only one visible in Langfuse traces. Production gcloud logs confirm: **279 arg mismatch warnings per day**, **37 duplicate block execution pairs**, and all LinearCreateIssueBlock failures in pairs.

Related: SECRT-2204

## What

Replace the speculative pre-launch mechanism with the SDK's native parallel dispatch via `readOnlyHint` tool annotations. Remove ~580 lines of pre-launch infrastructure code.

## How

### Root cause

The pre-launch mechanism had three compounding bugs:

1. **Arg mismatch**: The SDK CLI normalises args between the `AssistantMessage` (used for pre-launch) and the MCP `tools/call` dispatch, causing frequent mismatches (279/day in prod)
2. **FIFO desync on denial**: Security hooks can deny tool calls, causing the CLI to skip the MCP dispatch — but the pre-launched task stays in the FIFO queue, misaligning all subsequent matches
3. **Cancel race**: `task.cancel()` is best-effort in asyncio — if the HTTP call to Linear/GitHub already completed, the side effect is irreversible

### Fix

- **Removed** `pre_launch_tool_call()`, `cancel_pending_tool_tasks()`, `_tool_task_queues` ContextVar, all FIFO queue logic, and all 4 `cancel_pending_tool_tasks()` calls in `service.py`
- **Added** `readOnlyHint=True` annotations on 15+ read-only tools (`find_block`, `search_docs`, `list_workspace_files`, etc.) — the SDK CLI natively dispatches these in parallel ([ref: anthropics/claude-code#14353](https://github.com/anthropics/claude-code/issues/14353))
- Side-effect tools (`run_block`, `bash_exec`, `create_agent`, etc.) have no annotation → CLI runs them sequentially → no duplicate execution risk

### Net change: -578 lines, +105 lines
|
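The annotation scheme reduces to tagging read-only tools; a sketch, with the tool set abbreviated from the description above and an illustrative helper name:

```python
_READ_ONLY_TOOLS = frozenset({"find_block", "search_docs", "list_workspace_files"})


def tool_annotations(tool_name: str) -> dict:
    # readOnlyHint=True lets the SDK CLI dispatch the tool in parallel;
    # side-effect tools get no annotation and therefore run sequentially,
    # removing the duplicate-execution risk of speculative pre-launch.
    if tool_name in _READ_ONLY_TOOLS:
        return {"readOnlyHint": True}
    return {}
```

Declaring the property on the tool and letting the CLI schedule dispatch replaces the fragile FIFO-matched pre-launch queue entirely.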
||
|
|
725da7e887 |
dx(backend/copilot): clarify ambiguous agent goals using find_block before generation (#12601)
### Why / What / How

**Why:** When a user asks CoPilot to build an agent with an ambiguous goal (output format, delivery channel, data source, or trigger unspecified), the agent generator previously made assumptions and jumped straight into JSON generation. This produced agents that didn't match what the user actually wanted, requiring multiple correction cycles.

**What:** Adds a "Clarifying Before Building" section to the agent generation guide. When the goal is ambiguous, CoPilot first calls `find_block` to discover what the platform actually supports for the ambiguous dimension, then asks the user one concrete question grounded in real platform options (e.g. "The platform supports Gmail, Slack, and Google Docs — which should the agent use for delivery?"). Only after the user answers does the full agent generation workflow proceed.

**How:** The clarification instruction is added to `agent_generation_guide.md` — the guide loaded on-demand via `get_agent_building_guide` when the LLM is about to build an agent. This avoids polluting the system prompt supplement (which loads for every CoPilot conversation, not just agent building). No dedicated tool is needed — the LLM asks naturally in conversation text after discovering real platform options via `find_block`.

### Changes 🏗️

- `backend/copilot/sdk/agent_generation_guide.md`: Adds "Clarifying Before Building" section before the workflow steps. Instructs the model to call `find_block` for the ambiguous dimension, ask the user one grounded question, wait for the answer, then proceed to generation.
- `backend/copilot/prompting_test.py`: New test file verifying the guide contains the clarification section and references `find_block`.

### Checklist 📋

#### For code changes:

- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [ ] Ask CoPilot to "build an agent to send a report" (ambiguous output) — verify it calls `find_block` for delivery options and asks one grounded question before generating JSON
  - [ ] Ask CoPilot to "build an agent to scrape prices from Amazon and email me daily" (specific goal) — verify it skips clarification and proceeds directly to agent generation
  - [ ] Verify the clarification question lists real block options (e.g. Gmail, Slack, Google Docs) rather than abstract options

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
|
||
|
|
bd9e9ec614 |
fix(frontend): remove LaunchDarkly local storage bootstrapping (#12606)
### Why / What / How

This PR fixes [BUILDER-7HD](https://sentry.io/organizations/significant-gravitas/issues/7374387984/). The issue: the LaunchDarkly SDK fails to construct its streaming URL due to a non-string `_url` produced from malformed `localStorage` bootstrap data.

Removed the `bootstrap: "localStorage"` option from the LaunchDarkly provider configuration.

This change ensures that LaunchDarkly no longer attempts to load initial feature flag values from local storage. Flag values will now always be fetched directly from the LaunchDarkly service, preventing potential issues with stale local storage data.

### Changes 🏗️

- Removed the `bootstrap: "localStorage"` option from the LaunchDarkly provider configuration.
- LaunchDarkly will now always fetch flag values directly from its service, bypassing local storage.

### Checklist 📋

#### For code changes:

- [x] I have clearly listed my changes in the PR description
- [ ] I have made a test plan
- [ ] I have tested my changes according to the test plan:
  - [ ] Verify that LaunchDarkly flags are loaded correctly without issues.
  - [ ] Ensure no errors related to `localStorage` or streaming URL construction appear in the console.

<details>
<summary>Example test plan</summary>

- [ ] Create from scratch and execute an agent with at least 3 blocks
- [ ] Import an agent from file upload, and confirm it executes correctly
- [ ] Upload agent to marketplace
- [ ] Import an agent from marketplace and confirm it executes correctly
- [ ] Edit an agent from monitor, and confirm it executes correctly

</details>

#### For configuration changes:

- [ ] `.env.default` is updated or already compatible with my changes
- [ ] `docker-compose.yml` is updated or already compatible with my changes
- [ ] I have included a list of my configuration changes in the PR description (under **Changes**)

<details>
<summary>Examples of configuration changes</summary>

- Changing ports
- Adding new services that need to communicate with each other
- Secrets or environment variable changes
- New or infrastructure changes such as databases

</details>

---------

Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co>
Co-authored-by: seer-by-sentry[bot] <157164994+seer-by-sentry[bot]@users.noreply.github.com>
|
||
|
|
88589764b5 |
dx(platform): normalize agent instructions for Claude and Codex (#12592)
### Why / What / How
Why: repo guidance was split between Claude-specific `CLAUDE.md` files
and Codex-specific `AGENTS.md` files, which duplicated instruction
content and made the same repository behave differently across agents.
The repo also had Claude skills under `.claude/skills` but no
Codex-visible repo skill path.
What: this PR bridges the repo's Claude skills into Codex and normalizes
shared instruction files so `AGENTS.md` becomes the canonical source
while each `CLAUDE.md` imports its sibling `AGENTS.md`.
How: add a repo-local `.agents/skills` symlink pointing to
`../.claude/skills`; move nested `CLAUDE.md` content into sibling
`AGENTS.md` files; replace each repo `CLAUDE.md` with a one-line
`@AGENTS.md` shim so Claude and Codex read the same scoped guidance
without duplicating text. The root `CLAUDE.md` now imports the root
`AGENTS.md` rather than symlinking to it.
Note: the instruction-file normalization commit was created with
`--no-verify` because the repo's frontend pre-commit `tsc` hook
currently fails on unrelated existing errors, largely missing
`autogpt_platform/frontend/src/app/api/__generated__/*` modules.
### Changes 🏗️
- Add `.agents/skills` as a repo-local symlink to `../.claude/skills` so
Codex discovers the existing Claude repo skills.
- Add a real root `CLAUDE.md` shim that imports the canonical root
`AGENTS.md`.
- Promote nested scoped instruction content into sibling `AGENTS.md`
files under `autogpt_platform/`, `autogpt_platform/backend/`,
`autogpt_platform/frontend/`, `autogpt_platform/frontend/src/tests/`,
and `docs/`.
- Replace the corresponding nested `CLAUDE.md` files with one-line
`@AGENTS.md` shims.
- Preserve the existing scoped instruction hierarchy while making the
shared content cross-compatible between Claude and Codex.
### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Verified `.agents/skills` resolves to `../.claude/skills`
- [x] Verified each repo `CLAUDE.md` now contains only `@AGENTS.md`
- [x] Verified the expected `AGENTS.md` files exist at the root and
nested scoped directories
- [x] Verified the branch contains only the intended agent-guidance
commits relative to `dev` and the working tree is clean
#### For configuration changes:
- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)
No runtime configuration changes are included in this PR.
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Low Risk**
> Low risk: documentation/instruction-file reshuffle plus an
`.agents/skills` pointer; no runtime code paths are modified.
>
> **Overview**
> Unifies agent guidance so **`AGENTS.md` becomes canonical** and all
corresponding `CLAUDE.md` files become 1-line shims (`@AGENTS.md`) at
the repo root, `autogpt_platform/`, backend, frontend, frontend tests,
and `docs/`.
>
> Adds `.agents/skills` pointing to `../.claude/skills` so non-Claude
agents discover the same shared skills/instructions, eliminating
duplicated/agent-specific guidance content.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
|
||
|
|
c659f3b058 |
fix(copilot): fix dry-run simulation showing INCOMPLETE/error status (#12580)
## Summary
- **Backend**: Strip empty `error` pins from dry-run simulation outputs that the simulator always includes (set to `""` meaning "no error"). This was causing the LLM to misinterpret successful simulations as failures and report "INCOMPLETE" status to users
- **Backend**: Add explicit "Status: COMPLETED" to dry-run response message to prevent LLM misinterpretation
- **Backend**: Update simulation prompt to exclude `error` from the "MUST include" keys list, and instruct LLM to omit error unless simulating a logical failure
- **Frontend**: Fix `isRunBlockErrorOutput()` type guard that was too broad (`"error" in output` matched BlockOutputResponse objects, not just ErrorResponse), causing dry-run results to be displayed as errors
- **Frontend**: Fix `parseOutput()` fallback matching to not classify BlockOutputResponse as ErrorResponse
- **Frontend**: Filter out empty error pins from `BlockOutputCard` display and accordion metadata output key counting
- **Frontend**: Clear stale execution results before dry-run/no-input runs so the UI shows fresh output
- **Frontend**: Fix first-click simulate race condition by invalidating execution details query after WebSocket subscription confirms

## Test plan
- [x] All 12 existing + 5 new dry-run tests pass (`poetry run pytest backend/copilot/tools/test_dry_run.py -x -v`)
- [x] All 23 helpers tests pass (`poetry run pytest backend/copilot/tools/helpers_test.py -x -v`)
- [x] All 13 run_block tests pass (`poetry run pytest backend/copilot/tools/run_block_test.py -x -v`)
- [x] Backend linting passes (ruff check + format)
- [x] Frontend linting passes (next lint)
- [ ] Manual: trigger dry-run on a block with error output pin (e.g. Komodo Image Generator) — should show "Simulated" status with clean output, no misleading "error" section
- [ ] Manual: first click on Simulate button should immediately show results (no race condition)

---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> |
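The empty-error-pin stripping described above can be sketched as follows (a TypeScript sketch with a hypothetical `stripEmptyErrorPins` helper; the actual fix lives in the Python backend's dry-run tooling):

```typescript
// The simulator always emits an `error` pin set to "" meaning "no error",
// which the LLM misread as a failure. Drop that sentinel before handing
// outputs to the model, but keep genuine (non-empty) simulated errors.
type SimOutputs = Record<string, unknown>;

function stripEmptyErrorPins(outputs: SimOutputs): SimOutputs {
  const cleaned: SimOutputs = {};
  for (const [pin, value] of Object.entries(outputs)) {
    if (pin === "error" && value === "") continue; // drop the "no error" sentinel
    cleaned[pin] = value;
  }
  return cleaned;
}

// Successful simulation: the empty error pin disappears, real pins survive.
stripEmptyErrorPins({ result: 42, error: "" }); // → { result: 42 }
// Simulated logical failure: the non-empty error is preserved.
stripEmptyErrorPins({ error: "rate limit exceeded" }); // → { error: "rate limit exceeded" }
```

The same "empty string is not an error" rule is what the frontend side of the PR applies when counting and displaying output keys.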
||
|
|
80581a8364 |
fix(copilot): add tool call circuit breakers and intermediate persistence (#12604)
## Why
CoPilot session `d2f7cba3` took **82 minutes** and cost **$20.66** for a single user message. Root causes:
1. Redis session meta key expired after 1h, making the session invisible to the resume endpoint — causing an empty page on reload
2. Redis stream key also expired during sub-agent gaps (task_progress events produced no chunks)
3. No intermediate persistence — session messages were only saved to the DB after the entire turn completed
4. Sub-agents retried similar WebSearch queries (addressed via prompt guidance)

## What
### Redis TTL fixes (root cause of empty session on reload)
- `publish_chunk()` now periodically refreshes **both** the session meta key AND stream key TTL (every 60s).
- `task_progress` SDK events now emit `StreamHeartbeat` chunks, ensuring `publish_chunk` is called even during long sub-agent gaps where no real chunks are produced.
- Without this fix, turns exceeding the 1h `stream_ttl` lose their "running" status and stream data, making `get_active_session()` return False.

### Intermediate DB persistence
- Session messages are flushed to the DB every **30 seconds** or **10 new messages** during the stream loop.
- Uses `asyncio.shield(upsert_chat_session())`, matching the existing `finally` block pattern.

### Orphaned message cleanup on rollback
- On stream attempt rollback, orphaned messages persisted by intermediate flushes are now cleaned up from the DB via `delete_messages_from_sequence`.
- Prevents stale messages from resurfacing on page reload after a failed retry.

### Prompt guidance
- Added web search best practices to the code supplement (search efficiency, sub-agent scope separation).

### Approach: root cause fixes, not capability limits
- **No tool call caps** — artificial limits on WebSearch or total tool calls would reduce autopilot capability without addressing why searches were redundant.
- **Task tool remains enabled** — sub-agent delegation via Task is a core capability. The existing `max_subtasks` concurrency guard is sufficient.
- The real fixes (TTL refresh, persistence, prompt guidance) address the underlying bugs and behavioral issues.

## How
### Files changed
- `stream_registry.py` — Redis meta + stream key TTL refresh in `publish_chunk()`, module-level keepalive tracker
- `response_adapter.py` — `task_progress` SystemMessage → StreamHeartbeat emission
- `service.py` — intermediate DB persistence in `_run_stream_attempt` stream loop, orphan cleanup on rollback
- `db.py` — `delete_messages_from_sequence` for rollback cleanup
- `prompting.py` — web search best practices

### GCP log evidence
```
# Meta key expired during 82-min turn:
09:49 — GET_SESSION: active_session=False, msg_count=1  ← meta gone
10:18 — Session persisted in finally with 189 messages  ← turn completed

# T13 (1h45min) same bug reproduced live:
16:20 — task_progress events still arriving, but active_session=False

# Actual cost:
Turn usage: cache_read=347916, cache_create=212472, output=12375, cost_usd=20.66
```

### Test plan
- [x] task_progress emits StreamHeartbeat
- [x] Task background blocked, foreground allowed, slot release on completion/failure
- [x] CI green (lint, type-check, tests, e2e, CodeQL)

---------
Co-authored-by: Zamil Majdy <majdy.zamil@gmail.com> |
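The 60-second TTL keepalive can be sketched like this (TypeScript, with a hypothetical `expire` callback standing in for the Redis client; the real implementation is Python in `stream_registry.py`, and the key names here are invented):

```typescript
// Refresh both the session meta key and the stream key TTL at most once per
// 60s per session, tracked in a module-level map, so a long-running turn
// never lets either key expire mid-stream.
const REFRESH_INTERVAL_MS = 60_000;
const lastKeepalive = new Map<string, number>(); // sessionId -> last refresh time

function maybeRefreshTtl(
  sessionId: string,
  now: number,
  expire: (key: string) => void, // stands in for redis.expire(key, ttl)
): boolean {
  const last = lastKeepalive.get(sessionId);
  if (last !== undefined && now - last < REFRESH_INTERVAL_MS) {
    return false; // throttled: refreshed recently
  }
  lastKeepalive.set(sessionId, now);
  // Illustrative key names, not the real Redis schema.
  expire(`chat:meta:${sessionId}`);
  expire(`chat:stream:${sessionId}`);
  return true;
}
```

Calling such a throttle from every published chunk, including the heartbeat chunks emitted on `task_progress` events, is what keeps the keys alive even during long sub-agent gaps where no real content is streamed.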
||
|
|
3c046eb291 |
fix(frontend): show all agent outputs instead of only the last one (#12504)
Fixes #9175

### Changes 🏗️
The Agent Outputs panel only displayed the last execution result per output node, discarding all prior outputs during a run.

**Root cause:** In `AgentOutputs.tsx`, the `outputs` useMemo extracted only the last element from `nodeExecutionResults`:
```tsx
const latestResult = executionResults[executionResults.length - 1];
```

**Fix:** Changed `.map()` to `.flatMap()` over output nodes, iterating through all `executionResults` for each node. Each execution result now gets its own renderer lookup and metadata entry, so the panel shows every output produced during the run.

### Checklist 📋
#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Verified TypeScript compiles without errors
  - [x] Confirmed the flatMap logic correctly iterates all execution results
  - [x] Verified existing filter for null renderers is preserved
  - [x] Run an agent with multiple outputs and confirm all show in the panel

---------
Signed-off-by: majiayu000 <1835304752@qq.com>
Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co> |
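The `.map()` to `.flatMap()` change can be shown in isolation (simplified stand-in types; the real code operates on node execution results in `AgentOutputs.tsx`):

```typescript
// Simplified stand-ins for the real node/execution types.
type ExecutionResult = { output: string };
type OutputNode = { id: string; executionResults: ExecutionResult[] };

// Before: one entry per node, keeping only the LAST execution result.
// Earlier outputs produced during the run were silently dropped.
function lastOutputsOnly(nodes: OutputNode[]): string[] {
  return nodes.map((n) => n.executionResults[n.executionResults.length - 1].output);
}

// After: flatMap over every execution result, so the panel shows all
// outputs each node produced during the run.
function allOutputs(nodes: OutputNode[]): string[] {
  return nodes.flatMap((n) => n.executionResults.map((r) => r.output));
}
```

For a node that emitted `"a"` then `"b"`, the old logic returns only `["b"]` while the new logic returns `["a", "b"]`, which is exactly the behavior difference the PR describes.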
Fixes #9175 ### Changes 🏗️ The Agent Outputs panel only displayed the last execution result per output node, discarding all prior outputs during a run. **Root cause:** In `AgentOutputs.tsx`, the `outputs` useMemo extracted only the last element from `nodeExecutionResults`: ```tsx const latestResult = executionResults[executionResults.length - 1]; ``` **Fix:** Changed `.map()` to `.flatMap()` over output nodes, iterating through all `executionResults` for each node. Each execution result now gets its own renderer lookup and metadata entry, so the panel shows every output produced during the run. ### Checklist 📋 #### For code changes: - [x] I have clearly listed my changes in the PR description - [x] I have made a test plan - [x] I have tested my changes according to the test plan: - [x] Verified TypeScript compiles without errors - [x] Confirmed the flatMap logic correctly iterates all execution results - [x] Verified existing filter for null renderers is preserved - [x] Run an agent with multiple outputs and confirm all show in the panel --------- Signed-off-by: majiayu000 <1835304752@qq.com> Co-authored-by: Zamil Majdy <zamil.majdy@agpt.co> |