Compare commits

...

12 Commits

Author SHA1 Message Date
Zamil Majdy
e51c287ae4 test: PR #12841 native E2E v3 — link rendering + click-through 2026-04-18 18:45:54 +07:00
Zamil Majdy
58ce293ec0 test: PR #12841 native UI proof screenshots (v2) 2026-04-18 15:51:06 +07:00
Zamil Majdy
88b515c191 test: add real-feature E2E screenshots for PR #12841 2026-04-18 13:03:50 +07:00
Zamil Majdy
39f04b8990 test: add E2E screenshots for PR #12841 2026-04-18 09:01:08 +07:00
Zamil Majdy
b1b45e57e2 chore(skill/pr-test): add native-mode option, keep docker as fallback
Running `poetry run app` + `pnpm dev` against infra-only docker is 3–8
minutes faster per iteration than rebuilding the full compose stack on
every backend change. Document it as the preferred path for iterative
PR testing and keep the existing docker-compose path as an explicit
fallback for Dockerfile/compose-level changes or production-parity runs.
2026-04-18 08:37:37 +07:00
Zamil Majdy
6b199d2b9c fix(backend/copilot): check_background_tool — auth, dry-run, list=true support
Self-review findings from a fresh pass — all 🟠 Should Fix.

- requires_auth=True on CheckBackgroundToolTool for consistency with
  other stateful tools (run_agent, run_block, continue_run_block).
- Dry-run guard on cancel=true: return a simulated 'cancelled' status
  without actually calling task.cancel(), matching run_block /
  run_mcp_tool's dry-run semantics.
- list=true parameter: enumerates every active background task in the
  session as BackgroundToolList → list[BackgroundToolListEntry]. Closes
  the UX gap where an agent that loses its background_ids to context
  compaction can no longer reach a parked task. No other params needed
  when list=true.
- background_id is now optional (required only when list=false).
- BackgroundToolList exported via tools/models.py and added to
  ToolResponseUnion in chat routes so frontend codegen picks it up.
- Registry gains list_background_tasks() snapshot helper.
- Regenerated openapi.json for the new type.

Tests:
- list=true returns active tasks with real bg_ids, age>=0, done=False
- list=true on empty registry returns []
- cancel=true under session.dry_run doesn't kill the real task
- requires_auth is True
- list_background_tasks registry-level snapshot
2026-04-18 08:10:37 +07:00
Zamil Majdy
38d3c506a1 fix(backend/copilot): register check_background_tool in PLATFORM_TOOL_NAMES
Adds 'check_background_tool' to the ToolName Literal so the permission
registry check (_assert_tool_names_consistent) passes.
2026-04-18 07:14:36 +07:00
Zamil Majdy
be500ba0e3 chore(backend): regenerate openapi.json for BackgroundToolStatus 2026-04-18 07:02:54 +07:00
Zamil Majdy
8915b2958c fix(backend/copilot): guard cancel race, assert cleanup with real bg_ids
- check_background_tool with cancel=true now checks task.done() before
  calling task.cancel(). If the task finished between the registry
  lookup and the cancel request, surface the real completed/error
  result instead of reporting 'cancelled' and losing the output.
- Registry test for cancel_all_background_tasks now captures the real
  bg_ids returned by register_background_task and asserts they're gone
  (instead of checking fabricated IDs).
- New test pins the cancel-after-completed race guard.
2026-04-18 06:54:59 +07:00
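The cancel-race guard this commit pins down can be sketched like this. It is a minimal sketch, not the real check_background_tool code — the function name and the return-dict shapes are illustrative:

```python
import asyncio
from typing import Any


async def cancel_with_race_guard(task: asyncio.Task) -> dict[str, Any]:
    # If the task finished between the registry lookup and the cancel
    # request, surface the real result instead of reporting 'cancelled'
    # and losing the output.
    if task.done():
        if task.cancelled():
            return {"status": "cancelled"}
        if task.exception() is not None:
            return {"status": "error", "error": str(task.exception())}
        return {"status": "completed", "result": task.result()}
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return {"status": "cancelled"}
```

The `task.done()` check is the whole fix: without it, a cancel request that loses the race would report 'cancelled' for work that actually completed.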
Zamil Majdy
453e90d0f4 fix(backend/copilot): address PR review — CancelledError propagation, error status, BackgroundToolStatus in codegen union
- _execute_tool_sync now catches asyncio.CancelledError and cancels the
  unregistered child task before re-raising. Prevents orphans when the
  handler is torn down before the per-tool timeout fires (child is not
  yet in the registry so cancel_all_background_tasks can't clean it up).
- check_background_tool now maps result.success=False to status='error'
  (not 'completed'), so an agent doesn't treat a failed finish as a win.
- BackgroundToolStatus moved to tools/models.py and added to the
  ToolResponseUnion in chat routes so frontend codegen picks it up.
- Tests: replace broad `except (CancelledError, BaseException)` catches
  with contextlib.suppress(asyncio.CancelledError) in the cleanup paths.
- New tests: handler cancellation propagates to child task; success=False
  result reports status='error'.
2026-04-18 06:51:37 +07:00
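The CancelledError propagation fix can be sketched as follows — a minimal sketch of the pattern, assuming a wait-based handler like `_execute_tool_sync`; the helper name is illustrative:

```python
import asyncio
from typing import Any


async def run_with_budget(coro, timeout: float) -> Any:
    task = asyncio.create_task(coro)
    try:
        await asyncio.wait({task}, timeout=timeout)
    except asyncio.CancelledError:
        # The handler itself was cancelled mid-wait. The child isn't in
        # the background registry yet, so cancel it here before
        # re-raising — otherwise it would keep running untracked.
        if not task.done():
            task.cancel()
        raise
    if not task.done():
        return task  # the real handler would park this in the registry
    return task.result()
```

Note that cancelling a coroutine blocked in `asyncio.wait` does not cancel the awaited tasks (unlike `asyncio.wait_for`), which is exactly why the explicit child cancel in the except branch is needed.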
Zamil Majdy
bca21e84e4 fix(backend/copilot): address PR review — orphan cleanup, cap, clamp wording, trim
- Cancel all background tasks in the stream's finally block
  (cancel_all_background_tasks) so orphan long-running work doesn't
  outlive the session when the user leaves or the stream errors.
- Cap per-session registry at MAX_BACKGROUND_TASKS_PER_SESSION=32;
  overflow evicts + cancels the oldest entry.
- Document ContextVar scoping: sub-AutoPilots get an isolated registry.
- Trim the 'still running' background payload message; structured
  fields (type, tool, background_id, timeout_seconds) carry the rest.
- Clarify check_background_tool's wait_seconds: values above the max
  are clamped (not rejected) and the agent should call again to wait
  longer.
- Comment the intentional 10-min default on BashExecTool (its own
  subprocess timeout is capped at 120s so the budget never fires in
  the normal path).
- Add registry tests: register/lookup, unregister, cancel_all,
  overflow eviction.
2026-04-18 06:45:02 +07:00
Zamil Majdy
c32a4017fe fix(backend/copilot): non-cancelling per-tool timeouts + check_background_tool
When a tool call exceeds its per-call time budget the handler no longer
cancels the task — it parks the asyncio.Task in a per-session background
registry and returns a synthetic tool result with a background_id. The
agent then uses the new check_background_tool to wait longer, poll
status, or cancel. This keeps the autopilot in control of slow sub-agent
and graph-execution work instead of the handler making an irreversible
choice, and removes the need for an exemption list.

Design
------
- BaseTool.timeout_seconds (default 10 min, None disables) decides when
  to park.
- run_agent / run_block / continue_run_block declare None — they manage
  their own lifecycles.
- _execute_tool_sync wraps the tool coroutine in asyncio.wait(timeout=...)
  (non-cancelling). On timeout → register_background_task + synthetic
  result with type='background' and background_id.
- New tool check_background_tool exposes wait_seconds (0..540) and
  cancel=true to the agent; drives its own wait via asyncio.wait.
- Background registry lives in its own module (sdk/background_registry.py)
  to avoid a TOOL_REGISTRY import cycle.
- Stream-level idle timeout kept as last-resort safety net (30 min) and
  now logs the unresolved tool calls for monitoring.

Security / ops
--------------
- _redact_args_for_log replaces values of sensitive keys (api_key, token,
  password, secret, credentials, authorization, auth) with '<redacted>'
  before logging, on top of the existing 200-char truncation.

Docs
----
- _SHARED_TOOL_NOTES now documents the background lifecycle and tells
  the agent to keep polling for legitimate long-running work rather than
  cancelling.

Tests
-----
- TestToolTimeout: timeout parks task (doesn't cancel), synthetic
  result has type='background' and a bg_id, None disables timeout.
- TestBaseToolDefaultTimeout: default 600s, per-tool overrides.
- TestCheckBackgroundTool: missing/unknown id → error, wait=0 → status,
  wait returns completed/still_running, cancel=true propagates, errored
  tasks surface as status='error'.

Ref: SECRT-2247
2026-04-18 06:39:40 +07:00
37 changed files with 1554 additions and 20 deletions

View File

@@ -5,7 +5,7 @@ user-invocable: true
argument-hint: "[worktree path or PR number] — tests the PR in the given worktree. Optional flags: --fix (auto-fix issues found)"
metadata:
author: autogpt-team
version: "2.0.0"
version: "2.1.0"
---
# Manual E2E Test
@@ -248,7 +248,87 @@ docker ps --format "{{.Names}}" | grep -E "rest_server|executor|copilot|websocke
done
```
### 3e. Build and start
**Native mode also:** when running the app natively (see 3e-native), kill any stray host processes and free the app ports before starting — otherwise `poetry run app` and `pnpm dev` will fail to bind.
```bash
# Kill stray native app processes from prior runs
pkill -9 -f "python.*backend" 2>/dev/null || true
pkill -9 -f "poetry run app" 2>/dev/null || true
pkill -9 -f "next-server|next dev" 2>/dev/null || true
# Free app ports (errors per port are ignored — port may simply be unused)
for port in 3000 8006 8001 8002 8005 8008; do
lsof -ti :$port -sTCP:LISTEN | xargs -r kill -9 2>/dev/null || true
done
```
### 3e-native. Run the app natively (PREFERRED for iterative dev)
Native mode runs infra (postgres, supabase, redis, rabbitmq, clamav) in docker but runs the backend and frontend directly on the host. This avoids the 3-8 minute `docker compose build` cycle on every backend change — code edits are picked up on process restart (seconds) instead of a full image rebuild.
**When to prefer native mode (default for this skill):**
- Iterative dev/debug loops where you're editing backend or frontend code between test runs
- Any PR that touches Python/TS source but not Dockerfiles, compose config, or infra images
- Fast repro of a failing scenario — restart `poetry run app` in a couple of seconds
**When to prefer docker mode (3e fallback):**
- Testing changes to `Dockerfile`, `docker-compose.yml`, or base images
- Production-parity smoke tests (exact container env, networking, volumes)
- CI-equivalent runs where you need the exact image that'll ship
**Note on 3b (copilot auth):** in native mode, the runtime `npm install -g @anthropic-ai/claude-code` step is NOT required. The `claude_agent_sdk` bundled CLI ships with the poetry venv and is on `PATH` when you run commands via `poetry run`. The OAuth token extraction still applies (same `refresh_claude_token.sh` call).
**Preamble:** before starting native, run the kill-stray + free-ports block from 3c's "Native mode also" subsection.
**1. Start infra only (one-time per session):**
```bash
cd $PLATFORM_DIR && docker compose --profile local up deps --detach --remove-orphans --build
```
This brings up postgres/supabase/redis/rabbitmq/clamav and skips all app services.
**2. Start the backend natively:**
```bash
cd $BACKEND_DIR && (poetry run app 2>&1 | tee .ign.application.logs) &
```
`poetry run app` spawns **all** app subprocesses — `rest_server`, `executor`, `copilot_executor`, `websocket`, `scheduler`, `notification_server`, `database_manager` — inside ONE parent process. No separate containers, no separate terminals. The `.ign.` filename prefix is already gitignored.
**3. Wait for the backend on :8006 BEFORE starting the frontend.** This ordering matters — the frontend's `pnpm dev` startup invokes `generate-api-queries`, which fetches `/openapi.json` from the backend. If the backend isn't listening yet, `pnpm dev` fails immediately.
```bash
for i in $(seq 1 60); do
if [ "$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8006/docs 2>/dev/null)" = "200" ]; then
echo "Backend ready"
break
fi
sleep 2
done
```
**4. Start the frontend natively:**
```bash
cd $FRONTEND_DIR && (pnpm dev 2>&1 | tee .ign.frontend.logs) &
```
**5. Wait for the frontend on :3000:**
```bash
for i in $(seq 1 60); do
if [ "$(curl -s -o /dev/null -w '%{http_code}' http://localhost:3000 2>/dev/null)" = "200" ]; then
echo "Frontend ready"
break
fi
sleep 2
done
```
Once both are up, skip 3e/3f and go straight to **3g/3h** (feature flags / test user creation).
### 3e. Build and start (docker — fallback)
```bash
cd $PLATFORM_DIR && docker compose build --no-cache 2>&1 | tail -20
@@ -442,6 +522,22 @@ agent-browser --session-name pr-test snapshot | grep "text:"
### Checking logs
**Native mode:** when running via `poetry run app` + `pnpm dev`, all app logs stream to the `.ign.*.logs` files written by the `tee` pipes in 3e-native. `rest_server`, `executor`, `copilot_executor`, `websocket`, `scheduler`, `notification_server`, and `database_manager` are all subprocesses of the single `poetry run app` parent, so their output is interleaved in `.ign.application.logs`.
```bash
# Backend (all app subprocesses interleaved)
tail -f $BACKEND_DIR/.ign.application.logs
# Frontend (Next.js dev server)
tail -f $FRONTEND_DIR/.ign.frontend.logs
# Filter for errors across either log
grep -iE "error|exception|traceback" $BACKEND_DIR/.ign.application.logs | tail -20
grep -iE "error|exception|traceback" $FRONTEND_DIR/.ign.frontend.logs | tail -20
```
**Docker mode:**
```bash
# Backend REST server
docker logs autogpt_platform-rest_server-1 2>&1 | tail -30

View File

@@ -50,6 +50,8 @@ from backend.copilot.tools.models import (
AgentPreviewResponse,
AgentSavedResponse,
AgentsFoundResponse,
BackgroundToolList,
BackgroundToolStatus,
BlockDetailsResponse,
BlockListResponse,
BlockOutputResponse,
@@ -1323,6 +1325,8 @@ ToolResponseUnion = (
| MemorySearchResponse
| MemoryForgetCandidatesResponse
| MemoryForgetConfirmResponse
| BackgroundToolStatus
| BackgroundToolList
)

View File

@@ -71,6 +71,7 @@ ToolName = Literal[
"browser_act",
"browser_navigate",
"browser_screenshot",
"check_background_tool",
"connect_integration",
"continue_run_block",
"create_agent",

View File

@@ -163,6 +163,21 @@ perform multi-step work autonomously.
Use this when a task is complex enough to benefit from a separate
autopilot context, e.g. "research X and write a report" while the
parent autopilot handles orchestration.
### Long-running tool calls (backgrounded)
If any tool call exceeds its per-call time budget, the MCP handler
parks it in the background (the work keeps running) and returns a
result with ``"type": "background"``, a ``background_id`` (e.g.
``bg-abc123``), the original tool name, and a message.
Use **check_background_tool** to control the task:
- ``wait_seconds`` (0-540): wait up to N seconds for completion.
- ``cancel: true``: abort the background task and discard its result.
For legitimate long-running work (sub-autopilot, agent execution,
large code builds) **keep calling check_background_tool with a
longer wait_seconds** — do not cancel unless the task is clearly
stuck or no longer useful.
"""
# E2B-only notes — E2B has full internet access so gh CLI works there.

View File

@@ -0,0 +1,144 @@
"""Per-session registry of backgrounded tool calls.
When a tool exceeds its per-call ``timeout_seconds`` budget the in-flight
``asyncio.Task`` is parked here rather than being cancelled. The agent can
then use the ``check_background_tool`` tool (keyed by ``background_id``) to
wait longer, poll status, or cancel — keeping the autopilot in control of
slow sub-agents and graph executions.
Lives in its own module so that both ``tool_adapter.py`` (which registers
tasks during tool dispatch) and ``tools/check_background_tool.py`` (which
inspects them) can import the registry without creating a cycle via the
tool-registry import chain.
Scoping: the registry is a :class:`ContextVar`, so each execution context
(parent AutoPilot, and any sub-AutoPilot invoked via ``run_block``) gets an
independent registry. A sub-AutoPilot cannot see or cancel a parent's
background tasks — this is intentional isolation.
"""
import asyncio
import logging
import time
import uuid
from contextvars import ContextVar
from typing import Any
logger = logging.getLogger(__name__)
# Max wait a single check_background_tool call may block for. Kept below the
# stream-level idle timeout so the outer safety net still triggers if the
# whole session genuinely stalls.
MAX_BACKGROUND_WAIT_SECONDS = 9 * 60 # 9 minutes
# Upper bound on concurrent background tasks per session. Prevents a
# pathological agent from leaking asyncio.Tasks by timing out hundreds of
# tools back-to-back. When full, the oldest entry is cancelled and evicted
# so the newest registration still succeeds.
MAX_BACKGROUND_TASKS_PER_SESSION = 32
_background_tasks: ContextVar[dict[str, dict[str, Any]]] = ContextVar(
"_background_tasks",
default=None, # type: ignore[arg-type]
)
def init_registry() -> None:
"""Install a fresh per-session registry in the current context."""
_background_tasks.set({})
def register_background_task(task: asyncio.Task, tool_name: str) -> str:
"""Register *task* in the per-session background registry, returning the id.
If the registry is already at :data:`MAX_BACKGROUND_TASKS_PER_SESSION`,
the oldest entry is cancelled and evicted to make room.
"""
bg_id = f"bg-{uuid.uuid4().hex[:12]}"
registry = _background_tasks.get(None)
if registry is None:
# Registry isn't initialized (e.g. unit tests that bypass
# set_execution_context). Fall back to a fresh dict so we at least
# don't drop the task silently.
registry = {}
_background_tasks.set(registry)
if len(registry) >= MAX_BACKGROUND_TASKS_PER_SESSION:
oldest_id, oldest_entry = min(
registry.items(), key=lambda kv: kv[1]["started_at"]
)
oldest_task: asyncio.Task = oldest_entry["task"]
if not oldest_task.done():
oldest_task.cancel()
registry.pop(oldest_id, None)
logger.warning(
"Background registry full — evicted oldest entry %s (tool=%s)",
oldest_id,
oldest_entry["tool_name"],
)
registry[bg_id] = {
"task": task,
"tool_name": tool_name,
"started_at": time.monotonic(),
}
return bg_id
def get_background_task(background_id: str) -> dict[str, Any] | None:
"""Return the registered entry for *background_id*, or ``None``."""
registry = _background_tasks.get(None)
if registry is None:
return None
return registry.get(background_id)
def list_background_tasks() -> list[dict[str, Any]]:
"""Return a snapshot of every registered task in the current session.
Each entry: ``{background_id, tool_name, started_at, done}``. Used by
``check_background_tool(list=true)`` so the agent can recover IDs after
context compaction or a long pause.
"""
registry = _background_tasks.get(None)
if not registry:
return []
return [
{
"background_id": bg_id,
"tool_name": entry["tool_name"],
"started_at": entry["started_at"],
"done": entry["task"].done(),
}
for bg_id, entry in registry.items()
]
def unregister_background_task(background_id: str) -> None:
"""Drop a finished/cancelled task from the registry."""
registry = _background_tasks.get(None)
if registry is None:
return
registry.pop(background_id, None)
def cancel_all_background_tasks(reason: str = "stream ended") -> int:
"""Cancel every task in the registry and empty it.
Called from the stream's ``finally`` block so orphaned long-running
tools don't keep executing after the user leaves or the stream errors.
Returns the number of tasks that were cancelled.
"""
registry = _background_tasks.get(None)
if not registry:
return 0
cancelled = 0
for bg_id, entry in list(registry.items()):
task: asyncio.Task = entry["task"]
if not task.done():
task.cancel()
cancelled += 1
registry.pop(bg_id, None)
if cancelled:
logger.info("Cancelled %d orphaned background task(s) on %s", cancelled, reason)
return cancelled

View File

@@ -0,0 +1,155 @@
"""Tests for the background task registry."""
import asyncio
import contextlib
import pytest
from .background_registry import (
MAX_BACKGROUND_TASKS_PER_SESSION,
cancel_all_background_tasks,
get_background_task,
init_registry,
list_background_tasks,
register_background_task,
unregister_background_task,
)
@pytest.fixture(autouse=True)
def _init_for_each_test():
init_registry()
@pytest.mark.asyncio
async def test_register_and_lookup():
async def hang():
await asyncio.sleep(60)
task = asyncio.create_task(hang())
bg_id = register_background_task(task, "some_tool")
entry = get_background_task(bg_id)
assert entry is not None
assert entry["tool_name"] == "some_tool"
assert entry["task"] is task
task.cancel()
with contextlib.suppress(asyncio.CancelledError):
await task
@pytest.mark.asyncio
async def test_unregister_removes_entry():
async def hang():
await asyncio.sleep(60)
task = asyncio.create_task(hang())
bg_id = register_background_task(task, "some_tool")
unregister_background_task(bg_id)
assert get_background_task(bg_id) is None
task.cancel()
with contextlib.suppress(asyncio.CancelledError):
await task
@pytest.mark.asyncio
async def test_cancel_all_cancels_pending_tasks_and_empties_registry():
events = []
async def hang_with_cancel_trap(idx: int):
try:
await asyncio.sleep(60)
except asyncio.CancelledError:
events.append(idx)
raise
tasks = [asyncio.create_task(hang_with_cancel_trap(i)) for i in range(3)]
# Let the tasks start before cancellation.
await asyncio.sleep(0)
bg_ids = [register_background_task(t, f"tool_{i}") for i, t in enumerate(tasks)]
# Sanity check: all three actually got registered under real IDs.
for bg_id in bg_ids:
assert get_background_task(bg_id) is not None
count = cancel_all_background_tasks(reason="test")
assert count == 3
# Let the cancellations propagate.
for t in tasks:
with contextlib.suppress(asyncio.CancelledError):
await t
assert sorted(events) == [0, 1, 2]
# Registry should be empty now — verify using the actual IDs we registered.
for bg_id in bg_ids:
assert get_background_task(bg_id) is None
@pytest.mark.asyncio
async def test_registry_cap_evicts_oldest_on_overflow():
tasks: list[asyncio.Task] = []
ids: list[str] = []
async def hang():
await asyncio.sleep(60)
# Fill to capacity.
for _ in range(MAX_BACKGROUND_TASKS_PER_SESSION):
t = asyncio.create_task(hang())
tasks.append(t)
ids.append(register_background_task(t, "pool_tool"))
oldest_id = ids[0]
oldest_task = tasks[0]
assert get_background_task(oldest_id) is not None
# One more registration should evict + cancel the oldest.
extra_task = asyncio.create_task(hang())
extra_id = register_background_task(extra_task, "overflow_tool")
tasks.append(extra_task)
ids.append(extra_id)
assert get_background_task(oldest_id) is None
assert get_background_task(extra_id) is not None
# The evicted task was cancelled.
with contextlib.suppress(asyncio.CancelledError):
await oldest_task
assert oldest_task.cancelled()
# Cleanup.
for t in tasks[1:]:
t.cancel()
with contextlib.suppress(asyncio.CancelledError):
await t
@pytest.mark.asyncio
async def test_list_background_tasks_returns_snapshot():
async def hang():
await asyncio.sleep(60)
tasks = [asyncio.create_task(hang()) for _ in range(2)]
await asyncio.sleep(0)
bg_ids = [register_background_task(t, f"tool_{i}") for i, t in enumerate(tasks)]
snapshot = list_background_tasks()
assert len(snapshot) == 2
returned = {e["background_id"]: e for e in snapshot}
assert set(returned) == set(bg_ids)
for entry in snapshot:
assert entry["tool_name"].startswith("tool_")
assert entry["done"] is False
assert entry["started_at"] > 0
for t in tasks:
t.cancel()
with contextlib.suppress(asyncio.CancelledError):
await t
@pytest.mark.asyncio
async def test_list_background_tasks_empty():
assert list_background_tasks() == []

View File

@@ -107,6 +107,7 @@ from ..transcript import (
)
from ..transcript_builder import TranscriptBuilder
from .compaction import CompactionTracker, filter_compaction_messages
from .background_registry import cancel_all_background_tasks
from .env import build_sdk_env # noqa: F401 — re-export for backward compat
from .response_adapter import SDKResponseAdapter
from .security_hooks import create_security_hooks
@@ -162,9 +163,12 @@ _CIRCUIT_BREAKER_ERROR_MSG = (
)
# Idle timeout: abort the stream if no meaningful SDK message (only heartbeats)
# arrives for this many seconds. This catches hung tool calls (e.g. WebSearch
# hanging on a search provider that never responds).
_IDLE_TIMEOUT_SECONDS = 10 * 60 # 10 minutes
# arrives for this many seconds. Acts as a last-resort safety net — individual
# tools enforce their own timeouts at the MCP handler level (see BaseTool.
# timeout_seconds) and return a synthetic tool result to the agent on timeout.
# This stream-level timeout only fires if a tool's per-call timeout was
# disabled (timeout_seconds=None) or the SDK itself is stuck between messages.
_IDLE_TIMEOUT_SECONDS = 30 * 60 # 30 minutes
# Event types that are ephemeral / cosmetic and must NOT be counted toward
# ``events_yielded`` in the transient-retry loop. Counting them would prevent
@@ -1932,20 +1936,33 @@ async def _run_stream_attempt(
yield ev
yield StreamHeartbeat()
# Idle timeout: if no real SDK message for too long, a tool
# call is likely hung (e.g. WebSearch provider not responding).
# Idle timeout: last-resort safety net. Per-tool timeouts in
# the MCP handler normally catch hung tools first and return
# a synthetic tool result so the agent can recover. This only
# fires if a tool opted out of per-call timeouts or the SDK
# itself is stuck between messages.
idle_seconds = time.monotonic() - _last_real_msg_time
if idle_seconds >= _IDLE_TIMEOUT_SECONDS:
unresolved_ids = (
state.adapter.current_tool_calls.keys()
- state.adapter.resolved_tool_calls
)
unresolved_tools = {
tid: state.adapter.current_tool_calls[tid]
for tid in unresolved_ids
}
logger.error(
"%s Idle timeout after %.0fs with no SDK message — "
"aborting stream (likely hung tool call)",
"%s Idle timeout after %.0fs — unresolved tool calls: %s",
ctx.log_prefix,
idle_seconds,
", ".join(
f"{tc['name']}(id={tid[:12]})"
for tid, tc in unresolved_tools.items()
)
or "(none tracked)",
)
stream_error_msg = (
"A tool call appears to be stuck "
"(no response for 10 minutes). "
"Please try again."
"The session has been idle for too long. Please try again."
)
stream_error_code = "idle_timeout"
_append_error_marker(ctx.session, stream_error_msg, retryable=True)
@@ -2318,6 +2335,10 @@ async def _run_stream_attempt(
break
finally:
await _safe_close_sdk_client(sdk_client, ctx.log_prefix)
# Cancel any tool calls still parked in the background registry so
# orphaned long-running work (sub-AutoPilot, graph execution, etc.)
# doesn't keep running after the stream ends.
cancel_all_background_tasks(reason=f"stream ended ({ctx.log_prefix})")
# --- Post-stream processing (only on success) ---
if state.adapter.has_unresolved_tool_calls:

View File

@@ -37,6 +37,11 @@ from backend.copilot.tools import TOOL_REGISTRY
from backend.copilot.tools.base import BaseTool
from backend.util.truncate import truncate
# Background-task registry for tools that exceed their per-call timeout —
# lives in its own module to avoid a TOOL_REGISTRY import cycle with
# ``tools/check_background_tool.py``.
from .background_registry import init_registry as _init_background_registry
from .background_registry import register_background_task as _register_background_task
from .e2b_file_tools import (
E2B_FILE_TOOL_NAMES,
E2B_FILE_TOOLS,
@@ -134,6 +139,7 @@ def set_execution_context(
_pending_tool_outputs.set({})
_stash_event.set(asyncio.Event())
_consecutive_tool_failures.set({})
_init_background_registry()
def reset_stash_event() -> None:
@@ -248,15 +254,57 @@ async def _execute_tool_sync(
session: ChatSession,
args: dict[str, Any],
) -> dict[str, Any]:
"""Execute a tool synchronously and return MCP-formatted response."""
"""Execute a tool and return an MCP-formatted response.
Applies the tool's ``timeout_seconds`` budget (``None`` disables it).
On timeout the pending task is **not** cancelled — it is parked in the
background registry and a synthetic tool result is returned to the
agent along with a ``background_id``. The agent can then call
``check_background_tool`` to keep waiting, inspect status, or cancel.
This lets the autopilot decide on slow sub-agents / graph executions
instead of the handler making an irreversible choice.
"""
effective_id = f"sdk-{uuid.uuid4().hex[:12]}"
result = await base_tool.execute(
user_id=user_id,
session=session,
tool_call_id=effective_id,
**args,
task: asyncio.Task = asyncio.create_task(
base_tool.execute(
user_id=user_id,
session=session,
tool_call_id=effective_id,
**args,
),
name=f"tool:{base_tool.name}:{effective_id}",
)
timeout = base_tool.timeout_seconds
try:
if timeout is None:
result = await task
else:
# asyncio.wait (unlike wait_for) does NOT cancel on timeout — the
# task keeps running in the background.
await asyncio.wait({task}, timeout=timeout)
if not task.done():
bg_id = _register_background_task(task, base_tool.name)
logger.warning(
"Tool %s exceeded %ss budget — parked as "
"background_id=%s (args=%s)",
base_tool.name,
timeout,
bg_id,
_redact_args_for_log(args),
)
return _tool_background_result(base_tool.name, timeout, bg_id)
# Completed within budget — .result() re-raises any exception.
result = task.result()
except asyncio.CancelledError:
# The handler itself was cancelled (e.g. stream teardown) mid-wait.
# Cancel the child so it doesn't keep running untracked — the
# registry hasn't seen it yet, so cancel_all_background_tasks
# couldn't clean it up.
if not task.done():
task.cancel()
raise
text = (
result.output if isinstance(result.output, str) else json.dumps(result.output)
)
@@ -267,6 +315,65 @@ async def _execute_tool_sync(
}
def _tool_background_result(
tool_name: str, timeout: int, background_id: str
) -> dict[str, Any]:
"""Build a synthetic tool result when a call is parked as a background task.
The task is still running; the agent receives this so the stream can
continue and the autopilot can decide whether to keep waiting or cancel
via ``check_background_tool``.
"""
payload = {
"type": "background",
"tool": tool_name,
"timeout_seconds": timeout,
"background_id": background_id,
"message": (
f"Still running after {timeout}s — use check_background_tool "
"to wait longer or cancel."
),
}
return {
"content": [{"type": "text", "text": json.dumps(payload, ensure_ascii=False)}],
"isError": False,
}
# Keys that may carry credentials / PII. Values for these keys are replaced
# with '<redacted>' in monitoring logs.
_SENSITIVE_ARG_KEYS = frozenset(
{
"api_key",
"apikey",
"authorization",
"auth",
"credentials",
"password",
"secret",
"token",
}
)
def _redact_args_for_log(args: dict[str, Any]) -> str:
"""Render args for log monitoring, redacting sensitive keys and truncating
long string values."""
try:
rendered: dict[str, Any] = {}
for k, v in args.items():
if k.lower() in _SENSITIVE_ARG_KEYS:
rendered[k] = "<redacted>"
continue
if isinstance(v, str) and len(v) > 200:
rendered[k] = v[:200] + "…"
else:
rendered[k] = v
return json.dumps(rendered, default=str)[:500]
except (TypeError, ValueError):
return str(args)[:500]
def _mcp_error(message: str) -> dict[str, Any]:
return {
"content": [

View File

@@ -251,11 +251,16 @@ class TestTruncationAndStashIntegration:
# ---------------------------------------------------------------------------
def _make_mock_tool(name: str, output: str = "result") -> MagicMock:
def _make_mock_tool(
name: str,
output: str = "result",
timeout_seconds: int | None = 600,
) -> MagicMock:
"""Return a BaseTool mock that returns a successful StreamToolOutputAvailable."""
tool = MagicMock()
tool.name = name
tool.parameters = {"properties": {}, "required": []}
tool.timeout_seconds = timeout_seconds
tool.execute = AsyncMock(
return_value=StreamToolOutputAvailable(
toolCallId="test-id",
@@ -336,6 +341,216 @@ class TestCreateToolHandler:
assert mock_tool.execute.await_count == 2
class TestToolTimeout:
"""Tests for per-tool timeout behavior in _execute_tool_sync."""
@pytest.fixture(autouse=True)
def _init(self):
_init_ctx(session=_make_mock_session())
@pytest.mark.asyncio
async def test_timeout_parks_task_and_returns_background_id(self):
"""A tool that exceeds its timeout is moved to the background
registry (not cancelled); the handler returns a synthetic
type='background' result with a background_id."""
from backend.copilot.sdk.background_registry import (
get_background_task,
unregister_background_task,
)
mock_tool = _make_mock_tool("slow_tool", timeout_seconds=1)
async def hang_forever(*_args, **_kwargs):
await asyncio.sleep(60)
return StreamToolOutputAvailable(
toolCallId="t1",
output="late",
toolName="slow_tool",
success=True,
)
mock_tool.execute = AsyncMock(side_effect=hang_forever)
handler = create_tool_handler(mock_tool)
result = await handler({"arg": "v"})
# isError=False because the task is still running — the agent isn't
# being told about a failure, just about a delay.
assert result["isError"] is False
payload = json.loads(result["content"][0]["text"])
assert payload["type"] == "background"
assert payload["tool"] == "slow_tool"
assert payload["timeout_seconds"] == 1
assert payload["background_id"].startswith("bg-")
entry = get_background_task(payload["background_id"])
assert entry is not None
assert entry["tool_name"] == "slow_tool"
assert not entry["task"].done()
# Cleanup: cancel the parked task so the test doesn't leak it.
entry["task"].cancel()
try:
await entry["task"]
except BaseException:  # includes CancelledError; swallow cleanup noise
pass
unregister_background_task(payload["background_id"])
@pytest.mark.asyncio
async def test_timeout_does_not_cancel_tool_coroutine(self):
"""The task keeps running in the background after the timeout
budget is exceeded — cancellation is the agent's choice."""
from backend.copilot.sdk.background_registry import (
get_background_task,
unregister_background_task,
)
mock_tool = _make_mock_tool("slow_tool", timeout_seconds=1)
observed_cancel = asyncio.Event()
async def stays_alive(*_args, **_kwargs):
try:
await asyncio.sleep(3)
except asyncio.CancelledError:
observed_cancel.set()
raise
return StreamToolOutputAvailable(
toolCallId="t1",
output="eventual",
toolName="slow_tool",
success=True,
)
mock_tool.execute = AsyncMock(side_effect=stays_alive)
handler = create_tool_handler(mock_tool)
result = await handler({})
payload = json.loads(result["content"][0]["text"])
entry = get_background_task(payload["background_id"])
assert entry is not None
# Give the background task a brief moment; it should still be
# running and NOT cancelled.
await asyncio.sleep(0.1)
assert not observed_cancel.is_set()
assert not entry["task"].done()
# Let it complete so the test stays clean.
await entry["task"]
unregister_background_task(payload["background_id"])
@pytest.mark.asyncio
async def test_none_timeout_disables_wait_for(self):
"""When timeout_seconds is None, the tool runs to completion without
an outer timeout wrapper."""
mock_tool = _make_mock_tool(
"long_running_tool",
output="completed",
timeout_seconds=None,
)
async def slow_but_completes(*_args, **_kwargs):
await asyncio.sleep(0.05)
return StreamToolOutputAvailable(
toolCallId="t1",
output="completed",
toolName="long_running_tool",
success=True,
)
mock_tool.execute = AsyncMock(side_effect=slow_but_completes)
handler = create_tool_handler(mock_tool)
result = await handler({})
assert result["isError"] is False
assert "completed" in result["content"][0]["text"]
@pytest.mark.asyncio
async def test_handler_cancellation_cancels_child_task(self):
"""If the handler itself is cancelled before the tool completes,
the child task is cancelled too (no leak into the background
registry, since it wasn't parked yet)."""
import contextlib
mock_tool = _make_mock_tool("slow_tool", timeout_seconds=60)
child_cancelled = asyncio.Event()
async def hang_until_cancelled(*_args, **_kwargs):
try:
await asyncio.sleep(60)
except asyncio.CancelledError:
child_cancelled.set()
raise
mock_tool.execute = AsyncMock(side_effect=hang_until_cancelled)
from backend.copilot.sdk.tool_adapter import _execute_tool_sync
outer_task = asyncio.create_task(
_execute_tool_sync(mock_tool, "u", _make_mock_session(), {})
)
# Let the handler start waiting on the child.
await asyncio.sleep(0.05)
outer_task.cancel()
with contextlib.suppress(asyncio.CancelledError):
await outer_task
await asyncio.sleep(0)
assert child_cancelled.is_set()
@pytest.mark.asyncio
async def test_fast_tool_within_timeout_succeeds(self):
"""Tools that complete well under the timeout are unaffected."""
mock_tool = _make_mock_tool(
"fast_tool",
output="fast-ok",
timeout_seconds=30,
)
handler = create_tool_handler(mock_tool)
result = await handler({})
assert result["isError"] is False
assert "fast-ok" in result["content"][0]["text"]
class TestBaseToolDefaultTimeout:
"""The BaseTool default timeout and per-tool overrides."""
def test_default_timeout_is_ten_minutes(self):
from backend.copilot.tools.base import BaseTool
class _Plain(BaseTool):
@property
def name(self):
return "plain"
@property
def description(self):
return ""
@property
def parameters(self):
return {"type": "object", "properties": {}}
assert _Plain().timeout_seconds == 600
def test_run_agent_opts_out(self):
from backend.copilot.tools.run_agent import RunAgentTool
assert RunAgentTool().timeout_seconds is None
def test_run_block_opts_out(self):
from backend.copilot.tools.run_block import RunBlockTool
assert RunBlockTool().timeout_seconds is None
def test_continue_run_block_opts_out(self):
from backend.copilot.tools.continue_run_block import ContinueRunBlockTool
assert ContinueRunBlockTool().timeout_seconds is None
# ---------------------------------------------------------------------------
# Regression tests: bugs fixed by removing pre-launch mechanism
#

View File

@@ -13,6 +13,7 @@ from .agent_output import AgentOutputTool
from .ask_question import AskQuestionTool
from .base import BaseTool
from .bash_exec import BashExecTool
from .check_background_tool import CheckBackgroundToolTool
from .connect_integration import ConnectIntegrationTool
from .continue_run_block import ContinueRunBlockTool
from .create_agent import CreateAgentTool
@@ -81,6 +82,7 @@ TOOL_REGISTRY: dict[str, BaseTool] = {
"run_agent": RunAgentTool(),
"run_block": RunBlockTool(),
"continue_run_block": ContinueRunBlockTool(),
"check_background_tool": CheckBackgroundToolTool(),
"run_mcp_tool": RunMCPToolTool(),
"get_mcp_guide": GetMCPGuideTool(),
"view_agent_output": AgentOutputTool(),

View File

@@ -140,6 +140,21 @@ class BaseTool:
"""
return True
@property
def timeout_seconds(self) -> int | None:
"""Maximum seconds a single invocation may run before soft-timing out.
On timeout the MCP handler cancels the call and returns a synthetic
tool result to the agent (rather than hard-killing the stream), so
the agent can decide to retry, check progress via another tool, or
move on.
Return ``None`` to disable the per-call timeout — appropriate for
tools that manage their own lifecycle (e.g. ``run_agent`` polls an
execution, ``run_block`` can delegate to a sub-AutoPilot).
"""
return 10 * 60 # 10 minutes
def as_openai_tool(self) -> ChatCompletionToolParam:
"""Convert to OpenAI tool format."""
return ChatCompletionToolParam(

View File
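The soft-timeout contract this docstring describes (never cancel on expiry; hand the still-running task back for parking) can be sketched with stdlib asyncio alone. `run_with_soft_timeout` below is a hypothetical stand-in for the real `_execute_tool_sync`, which additionally registers the parked task in the background registry:

```python
import asyncio

async def run_with_soft_timeout(coro, timeout: float):
    """Wait up to `timeout` seconds, but never cancel the task on expiry."""
    task = asyncio.create_task(coro)
    done, _pending = await asyncio.wait({task}, timeout=timeout)
    if task in done:
        return "completed", task.result()
    # Timed out: the task keeps running; the caller parks it and returns
    # a synthetic type='background' result to the agent.
    return "background", task

async def demo():
    async def slow():
        await asyncio.sleep(0.2)
        return "late"

    status, task = await run_with_soft_timeout(slow(), timeout=0.05)
    assert status == "background" and not task.done()
    result = await task  # the parked task still completes normally
    return status, result

print(asyncio.run(demo()))
```

The key design point, mirrored in the tests above: `asyncio.wait` with a timeout leaves the task untouched on expiry, unlike `asyncio.wait_for`, which cancels it.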

@@ -42,6 +42,11 @@ class BashExecTool(BaseTool):
def name(self) -> str:
return "bash_exec"
# BaseTool.timeout_seconds=600 is inherited but never fires in practice:
# the `timeout` parameter on each call is capped at 120s by this tool's
# own subprocess timeout, so the MCP handler's budget is only a safety
# net for pathological hangs around sandbox setup/teardown.
@property
def description(self) -> str:
return (

View File

@@ -0,0 +1,290 @@
"""Tool for waiting on, polling, or cancelling a backgrounded tool call.
Long-running tool calls that exceed their per-call timeout are parked in the
background registry by :func:`_execute_tool_sync`. This tool lets the agent
decide whether to keep waiting, poll status, or cancel — so the autopilot
stays in control rather than the handler making an irreversible choice.
"""
import asyncio
import logging
import time
from typing import Any
from backend.copilot.model import ChatSession
from backend.copilot.sdk.background_registry import (
MAX_BACKGROUND_WAIT_SECONDS as _MAX_BACKGROUND_WAIT_SECONDS,
)
from backend.copilot.sdk.background_registry import (
get_background_task,
list_background_tasks,
unregister_background_task,
)
from .base import BaseTool
from .models import (
BackgroundToolList,
BackgroundToolListEntry,
BackgroundToolStatus,
ErrorResponse,
ToolResponseBase,
)
logger = logging.getLogger(__name__)
class CheckBackgroundToolTool(BaseTool):
"""Inspect, wait on, or cancel a backgrounded tool call."""
@property
def name(self) -> str:
return "check_background_tool"
@property
def requires_auth(self) -> bool:
# Parked tasks almost always originate from authenticated tools
# (run_agent, run_block). Require auth here too for consistency
# with those tools even though ContextVar scoping already prevents
# cross-session leakage.
return True
@property
def timeout_seconds(self) -> int | None:
# This tool drives its own wait loop up to _MAX_BACKGROUND_WAIT_SECONDS.
# Applying a second timeout on top would be redundant and could cancel
# the wait prematurely.
return None
@property
def description(self) -> str:
return (
"Inspect a backgrounded tool call by its background_id. "
"Use when a prior tool call returned type='background'. "
"Options: list=true to enumerate all active background tasks, "
"wait for completion up to wait_seconds (default 60, max "
f"{_MAX_BACKGROUND_WAIT_SECONDS}), just check status with "
"wait_seconds=0, or cancel=true to abort the task and "
"discard its result."
)
@property
def parameters(self) -> dict[str, Any]:
return {
"type": "object",
"properties": {
"list": {
"type": "boolean",
"description": (
"If true, return every active background task in "
"this session (no other params needed). Use to "
"recover background_ids after a context compaction."
),
"default": False,
},
"background_id": {
"type": "string",
"description": (
"The background_id returned by the timed-out tool. "
"Required unless list=true."
),
},
"wait_seconds": {
"type": "integer",
"description": (
"Max seconds to wait for completion. 0 = just check "
"status. Values above "
f"{_MAX_BACKGROUND_WAIT_SECONDS} are clamped to that "
"maximum — call again to keep waiting."
),
"default": 60,
},
"cancel": {
"type": "boolean",
"description": (
"If true, cancel the background task and discard "
"its result. Takes precedence over wait_seconds."
),
"default": False,
},
},
}
async def _execute(
self,
user_id: str | None,
session: ChatSession,
*,
list: bool = False,
background_id: str = "",
wait_seconds: int = 60,
cancel: bool = False,
**kwargs,
) -> ToolResponseBase:
if list:
return _list_response(session)
if not background_id:
return ErrorResponse(
message=(
"background_id is required (or pass list=true to "
"enumerate active tasks)."
),
session_id=session.session_id,
)
entry = get_background_task(background_id)
if entry is None:
return ErrorResponse(
message=(
f"No background task with id {background_id}. It may "
"have already completed (and been consumed) or never "
"existed."
),
session_id=session.session_id,
)
task: asyncio.Task = entry["task"]
tool_name: str = entry["tool_name"]
if cancel:
# Race guard: the task may have finished between the registry
# lookup and the cancel. If so, surface the real result rather
# than reporting 'cancelled' and losing the output.
if task.done():
return _status_from_finished_task(
session, tool_name, background_id, task
)
# Dry-run: simulate cancellation without touching the task, so
# the LLM can reason about the flow without real side effects.
if session.dry_run:
return BackgroundToolStatus(
message=(
f"[dry-run] Would cancel background task for '{tool_name}'."
),
session_id=session.session_id,
status="cancelled",
tool=tool_name,
background_id=background_id,
)
task.cancel()
unregister_background_task(background_id)
logger.info(
"Cancelled background task %s for tool %s by agent request",
background_id,
tool_name,
)
return BackgroundToolStatus(
message=f"Cancelled background task for '{tool_name}'.",
session_id=session.session_id,
status="cancelled",
tool=tool_name,
background_id=background_id,
)
if task.done():
return _status_from_finished_task(session, tool_name, background_id, task)
effective_wait = max(0, min(wait_seconds, _MAX_BACKGROUND_WAIT_SECONDS))
if effective_wait == 0:
return BackgroundToolStatus(
message=(
f"'{tool_name}' is still running. Call again with "
"wait_seconds>0 to wait, or cancel=true to abort."
),
session_id=session.session_id,
status="still_running",
tool=tool_name,
background_id=background_id,
)
await asyncio.wait({task}, timeout=effective_wait)
if task.done():
return _status_from_finished_task(session, tool_name, background_id, task)
return BackgroundToolStatus(
message=(
f"'{tool_name}' still running after waiting "
f"{effective_wait}s. Call again to keep waiting, or "
"cancel=true to abort."
),
session_id=session.session_id,
status="still_running",
tool=tool_name,
background_id=background_id,
waited_seconds=effective_wait,
)
def _list_response(session: ChatSession) -> BackgroundToolList:
"""Build the response for ``check_background_tool(list=true)``."""
now = time.monotonic()
entries = [
BackgroundToolListEntry(
background_id=e["background_id"],
tool=e["tool_name"],
age_seconds=round(now - e["started_at"], 2),
done=e["done"],
)
for e in list_background_tasks()
]
count = len(entries)
msg = (
f"{count} active background task(s)."
if count
else "No active background tasks."
)
return BackgroundToolList(
message=msg,
session_id=session.session_id,
tasks=entries,
)
def _status_from_finished_task(
session: ChatSession,
tool_name: str,
background_id: str,
task: asyncio.Task,
) -> ToolResponseBase:
"""Unregister a finished task and return its status."""
unregister_background_task(background_id)
if task.cancelled():
return BackgroundToolStatus(
message=f"Background task for '{tool_name}' was cancelled.",
session_id=session.session_id,
status="cancelled",
tool=tool_name,
background_id=background_id,
)
exc = task.exception()
if exc is not None:
return BackgroundToolStatus(
message=f"'{tool_name}' raised {type(exc).__name__}: {exc}",
session_id=session.session_id,
status="error",
tool=tool_name,
background_id=background_id,
)
result = task.result()
# A tool can complete with success=False without raising — preserve
# that as status="error" so the agent doesn't treat it as a win.
if not result.success:
return BackgroundToolStatus(
message=f"'{tool_name}' completed with an error.",
session_id=session.session_id,
status="error",
tool=tool_name,
background_id=background_id,
output=result.output,
)
return BackgroundToolStatus(
message=f"'{tool_name}' completed.",
session_id=session.session_id,
status="completed",
tool=tool_name,
background_id=background_id,
output=result.output,
)

View File
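The wait-budget handling in `_execute` reduces to a clamp into `[0, _MAX_BACKGROUND_WAIT_SECONDS]`. A tiny sketch, with `300` as a stand-in for the real constant (its actual value lives in `background_registry`):

```python
MAX_BACKGROUND_WAIT_SECONDS = 300  # stand-in; real constant lives in background_registry

def clamp_wait(wait_seconds: int) -> int:
    # Negative values degrade to a pure status check (0 seconds);
    # oversized values are clamped so a single call can never block
    # the stream indefinitely -- the agent calls again to keep waiting.
    return max(0, min(wait_seconds, MAX_BACKGROUND_WAIT_SECONDS))

print(clamp_wait(-5), clamp_wait(60), clamp_wait(10_000))
```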

@@ -0,0 +1,318 @@
"""Tests for CheckBackgroundToolTool."""
import asyncio
import contextlib
from unittest.mock import MagicMock
import pytest
from backend.copilot.response_model import StreamToolOutputAvailable
from backend.copilot.sdk.background_registry import (
init_registry,
register_background_task,
)
from .check_background_tool import CheckBackgroundToolTool
from .models import BackgroundToolList, BackgroundToolStatus
def _make_session() -> MagicMock:
session = MagicMock()
session.session_id = "s1"
session.dry_run = False
return session
def _completed_result(output: str = "ok") -> StreamToolOutputAvailable:
return StreamToolOutputAvailable(
toolCallId="tc-1",
output=output,
toolName="slow_tool",
success=True,
)
@pytest.fixture(autouse=True)
def _init_registry_for_each_test():
init_registry()
class TestCheckBackgroundTool:
@pytest.mark.asyncio
async def test_missing_background_id_returns_error(self):
tool = CheckBackgroundToolTool()
response = await tool._execute(
user_id="u",
session=_make_session(),
background_id="",
)
assert response.type.value == "error"
@pytest.mark.asyncio
async def test_unknown_background_id_returns_error(self):
tool = CheckBackgroundToolTool()
response = await tool._execute(
user_id="u",
session=_make_session(),
background_id="bg-does-not-exist",
)
assert response.type.value == "error"
assert "No background task" in response.message
@pytest.mark.asyncio
async def test_wait_zero_returns_still_running(self):
async def slow():
await asyncio.sleep(10)
return _completed_result()
task = asyncio.create_task(slow())
bg_id = register_background_task(task, "slow_tool")
tool = CheckBackgroundToolTool()
response = await tool._execute(
user_id="u",
session=_make_session(),
background_id=bg_id,
wait_seconds=0,
)
assert isinstance(response, BackgroundToolStatus)
assert response.status == "still_running"
assert response.background_id == bg_id
task.cancel()
with contextlib.suppress(asyncio.CancelledError):
await task
@pytest.mark.asyncio
async def test_wait_returns_completed_when_task_finishes(self):
async def fast():
await asyncio.sleep(0.05)
return _completed_result("final-output")
task = asyncio.create_task(fast())
bg_id = register_background_task(task, "slow_tool")
tool = CheckBackgroundToolTool()
response = await tool._execute(
user_id="u",
session=_make_session(),
background_id=bg_id,
wait_seconds=5,
)
assert isinstance(response, BackgroundToolStatus)
assert response.status == "completed"
assert response.output == "final-output"
@pytest.mark.asyncio
async def test_wait_times_out_and_returns_still_running(self):
async def slow():
await asyncio.sleep(10)
return _completed_result()
task = asyncio.create_task(slow())
bg_id = register_background_task(task, "slow_tool")
tool = CheckBackgroundToolTool()
response = await tool._execute(
user_id="u",
session=_make_session(),
background_id=bg_id,
wait_seconds=1,
)
assert isinstance(response, BackgroundToolStatus)
assert response.status == "still_running"
assert response.waited_seconds == 1
task.cancel()
with contextlib.suppress(asyncio.CancelledError):
await task
@pytest.mark.asyncio
async def test_cancel_true_cancels_and_removes_from_registry(self):
observed_cancel = asyncio.Event()
async def stays_until_cancelled():
try:
await asyncio.sleep(60)
except asyncio.CancelledError:
observed_cancel.set()
raise
return _completed_result()
task = asyncio.create_task(stays_until_cancelled())
# Let the task start before we cancel it.
await asyncio.sleep(0)
bg_id = register_background_task(task, "slow_tool")
tool = CheckBackgroundToolTool()
response = await tool._execute(
user_id="u",
session=_make_session(),
background_id=bg_id,
cancel=True,
)
assert isinstance(response, BackgroundToolStatus)
assert response.status == "cancelled"
with contextlib.suppress(asyncio.CancelledError):
await task
assert observed_cancel.is_set()
from backend.copilot.sdk.background_registry import get_background_task
assert get_background_task(bg_id) is None
@pytest.mark.asyncio
async def test_cancel_after_task_completed_returns_real_result(self):
"""If the task completes between registration and the agent's
cancel=true call, surface the real result instead of reporting
'cancelled' and losing the output (race guard)."""
async def finish_quickly():
return _completed_result("final-value")
task = asyncio.create_task(finish_quickly())
await task # definitely done by the time we register
bg_id = register_background_task(task, "slow_tool")
tool = CheckBackgroundToolTool()
response = await tool._execute(
user_id="u",
session=_make_session(),
background_id=bg_id,
cancel=True,
)
assert isinstance(response, BackgroundToolStatus)
assert response.status == "completed"
assert response.output == "final-value"
@pytest.mark.asyncio
async def test_errored_task_reports_error_status(self):
async def raises():
raise ValueError("boom")
task = asyncio.create_task(raises())
# Let the task complete before we query it.
try:
await task
except ValueError:
pass
bg_id = register_background_task(task, "broken_tool")
tool = CheckBackgroundToolTool()
response = await tool._execute(
user_id="u",
session=_make_session(),
background_id=bg_id,
)
assert isinstance(response, BackgroundToolStatus)
assert response.status == "error"
assert "boom" in response.message
@pytest.mark.asyncio
async def test_finished_task_with_success_false_reports_error(self):
"""A tool that completes with success=False (without raising) is
reported as status='error', not 'completed', so the agent doesn't
treat it as a win."""
async def finish_with_failure():
return StreamToolOutputAvailable(
toolCallId="tc-1",
output="partial",
toolName="broken_tool",
success=False,
)
task = asyncio.create_task(finish_with_failure())
await task
bg_id = register_background_task(task, "broken_tool")
tool = CheckBackgroundToolTool()
response = await tool._execute(
user_id="u",
session=_make_session(),
background_id=bg_id,
)
assert isinstance(response, BackgroundToolStatus)
assert response.status == "error"
assert response.output == "partial"
@pytest.mark.asyncio
async def test_list_true_returns_active_background_tasks(self):
"""list=true enumerates registered tasks so the agent can recover
forgotten background_ids."""
async def hang():
await asyncio.sleep(60)
tasks = [asyncio.create_task(hang()) for _ in range(2)]
await asyncio.sleep(0)
bg_ids = [register_background_task(t, f"tool_{i}") for i, t in enumerate(tasks)]
tool = CheckBackgroundToolTool()
response = await tool._execute(
user_id="u",
session=_make_session(),
list=True,
)
assert isinstance(response, BackgroundToolList)
assert len(response.tasks) == 2
returned_ids = {entry.background_id for entry in response.tasks}
assert returned_ids == set(bg_ids)
for entry in response.tasks:
assert entry.tool.startswith("tool_")
assert entry.age_seconds >= 0
assert entry.done is False
for t in tasks:
t.cancel()
with contextlib.suppress(asyncio.CancelledError):
await t
@pytest.mark.asyncio
async def test_list_true_empty_when_no_tasks(self):
tool = CheckBackgroundToolTool()
response = await tool._execute(
user_id="u",
session=_make_session(),
list=True,
)
assert isinstance(response, BackgroundToolList)
assert response.tasks == []
@pytest.mark.asyncio
async def test_cancel_in_dry_run_does_not_actually_cancel_task(self):
"""Under session.dry_run, cancel=true must not kill the real task."""
async def hang():
await asyncio.sleep(60)
task = asyncio.create_task(hang())
await asyncio.sleep(0)
bg_id = register_background_task(task, "slow_tool")
session = _make_session()
session.dry_run = True
tool = CheckBackgroundToolTool()
response = await tool._execute(
user_id="u",
session=session,
background_id=bg_id,
cancel=True,
)
assert isinstance(response, BackgroundToolStatus)
assert response.status == "cancelled"
assert "[dry-run]" in response.message
# Real task is still running.
assert not task.done()
# Cleanup.
task.cancel()
with contextlib.suppress(asyncio.CancelledError):
await task
def test_requires_auth_is_true(self):
tool = CheckBackgroundToolTool()
assert tool.requires_auth is True

View File
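The race guard pinned down by `test_cancel_after_task_completed_returns_real_result` is, at its core, a `task.done()` check before cancelling. A standalone sketch with hypothetical names (`cancel_or_result` is not the real tool method):

```python
import asyncio

async def cancel_or_result(task: asyncio.Task):
    # Race guard: the task may have finished between lookup and cancel;
    # if so, surface the real result instead of reporting 'cancelled'
    # and losing the output.
    if task.done():
        return "completed", task.result()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return "cancelled", None

async def demo():
    fast = asyncio.create_task(asyncio.sleep(0, result="ok"))
    await asyncio.sleep(0.01)  # let it finish before we "cancel" it
    finished = await cancel_or_result(fast)

    slow = asyncio.create_task(asyncio.sleep(60))
    aborted = await cancel_or_result(slow)
    return finished, aborted

print(asyncio.run(demo()))
```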

@@ -28,6 +28,12 @@ class ContinueRunBlockTool(BaseTool):
def name(self) -> str:
return "continue_run_block"
@property
def timeout_seconds(self) -> int | None:
# Resumes an execution that may be a long-running (sub-AutoPilot)
# block — same lifecycle as run_block.
return None
@property
def description(self) -> str:
return "Resume block execution after a run_block call returned review_required. Pass the review_id."

View File

@@ -259,6 +259,45 @@ class ErrorResponse(ToolResponseBase):
details: dict[str, Any] | None = None
class BackgroundToolStatus(ToolResponseBase):
"""Status of a backgrounded tool call, returned by ``check_background_tool``."""
type: ResponseType = ResponseType.MCP_TOOL_OUTPUT
status: Literal["completed", "still_running", "cancelled", "error"] = Field(
description="Current state of the background task."
)
tool: str = Field(description="The name of the originally-backgrounded tool.")
background_id: str
output: Any | None = Field(
default=None,
description="Tool output when status=completed or status=error.",
)
waited_seconds: int | None = Field(default=None)
class BackgroundToolListEntry(BaseModel):
"""One row in a ``check_background_tool(list=true)`` response."""
background_id: str
tool: str = Field(description="Name of the originally-backgrounded tool.")
age_seconds: float = Field(
description="Seconds since the task was parked in the background."
)
done: bool = Field(
description="True if the task has finished but hasn't been consumed yet."
)
class BackgroundToolList(ToolResponseBase):
"""List of active background tasks, returned by ``check_background_tool(list=true)``."""
type: ResponseType = ResponseType.MCP_TOOL_OUTPUT
tasks: list[BackgroundToolListEntry] = Field(
default_factory=list,
description="All background tasks currently registered for this session.",
)
class InputValidationErrorResponse(ToolResponseBase):
"""Response when run_agent receives unknown input fields."""

View File

@@ -104,6 +104,13 @@ class RunAgentTool(BaseTool):
def name(self) -> str:
return "run_agent"
@property
def timeout_seconds(self) -> int | None:
# Agent executions can legitimately run 15-45+ min; the tool polls
# its own wait_for_result window and returns an execution_id for
# later progress checks, so the stream-level timeout isn't needed.
return None
@property
def description(self) -> str:
return (

View File

@@ -27,6 +27,13 @@ class RunBlockTool(BaseTool):
def name(self) -> str:
return "run_block"
@property
def timeout_seconds(self) -> int | None:
# May delegate to AutoPilotBlock (sub-autopilot), which runs its own
# multi-turn stream of 15-45+ min. Per-call timeout is disabled here
# and left to the block's own execution lifecycle.
return None
@property
def description(self) -> str:
return (

View File

@@ -1354,7 +1354,9 @@
},
{
"$ref": "#/components/schemas/MemoryForgetConfirmResponse"
},
{ "$ref": "#/components/schemas/BackgroundToolStatus" },
{ "$ref": "#/components/schemas/BackgroundToolList" }
],
"title": "Response Getv2[Dummy] Tool Response Type Export For Codegen"
}
@@ -8430,6 +8432,91 @@
"required": ["sso_url", "expires_at"],
"title": "AyrshareSSOResponse"
},
"BackgroundToolList": {
"properties": {
"type": {
"$ref": "#/components/schemas/ResponseType",
"default": "mcp_tool_output"
},
"message": { "type": "string", "title": "Message" },
"session_id": {
"anyOf": [{ "type": "string" }, { "type": "null" }],
"title": "Session Id"
},
"tasks": {
"items": { "$ref": "#/components/schemas/BackgroundToolListEntry" },
"type": "array",
"title": "Tasks",
"description": "All background tasks currently registered for this session."
}
},
"type": "object",
"required": ["message"],
"title": "BackgroundToolList",
"description": "List of active background tasks, returned by ``check_background_tool(list=true)``."
},
"BackgroundToolListEntry": {
"properties": {
"background_id": { "type": "string", "title": "Background Id" },
"tool": {
"type": "string",
"title": "Tool",
"description": "Name of the originally-backgrounded tool."
},
"age_seconds": {
"type": "number",
"title": "Age Seconds",
"description": "Seconds since the task was parked in the background."
},
"done": {
"type": "boolean",
"title": "Done",
"description": "True if the task has finished but hasn't been consumed yet."
}
},
"type": "object",
"required": ["background_id", "tool", "age_seconds", "done"],
"title": "BackgroundToolListEntry",
"description": "One row in a ``check_background_tool(list=true)`` response."
},
"BackgroundToolStatus": {
"properties": {
"type": {
"$ref": "#/components/schemas/ResponseType",
"default": "mcp_tool_output"
},
"message": { "type": "string", "title": "Message" },
"session_id": {
"anyOf": [{ "type": "string" }, { "type": "null" }],
"title": "Session Id"
},
"status": {
"type": "string",
"enum": ["completed", "still_running", "cancelled", "error"],
"title": "Status",
"description": "Current state of the background task."
},
"tool": {
"type": "string",
"title": "Tool",
"description": "The name of the originally-backgrounded tool."
},
"background_id": { "type": "string", "title": "Background Id" },
"output": {
"anyOf": [{}, { "type": "null" }],
"title": "Output",
"description": "Tool output when status=completed or status=error."
},
"waited_seconds": {
"anyOf": [{ "type": "integer" }, { "type": "null" }],
"title": "Waited Seconds"
}
},
"type": "object",
"required": ["message", "status", "tool", "background_id"],
"title": "BackgroundToolStatus",
"description": "Status of a backgrounded tool call, returned by ``check_background_tool``."
},
"BaseGraph-Input": {
"properties": {
"id": { "type": "string", "title": "Id" },

18 binary image files added (screenshots, 11–123 KiB each; not shown).