mirror of
https://github.com/openclaw/openclaw.git
synced 2026-02-19 18:39:20 -05:00
fix: codex and similar processes keep dying on pty, solved by refactoring process spawning (#14257)
* exec: clean up PTY resources on timeout and exit * cli: harden resume cleanup and watchdog stalled runs * cli: productionize PTY and resume reliability paths * docs: add PTY process supervision architecture plan * docs: rewrite PTY supervision plan as pre-rewrite baseline * docs: switch PTY supervision plan to one-go execution * docs: add one-line root cause to PTY supervision plan * docs: add OS contracts and test matrix to PTY supervision plan * docs: define process-supervisor package placement and scope * docs: tie supervisor plan to existing CI lanes * docs: place PTY supervisor plan under src/process * refactor(process): route exec and cli runs through supervisor * docs(process): refresh PTY supervision plan * wip * fix(process): harden supervisor timeout and PTY termination * fix(process): harden supervisor adapters env and wait handling * ci: avoid failing formal conformance on comment permissions * test(ui): fix cron request mock argument typing * fix(ui): remove leftover conflict marker * fix: supervise PTY processes (#14257) (openclaw#14257) (thanks @onutc)
This commit is contained in:
1
.github/workflows/formal-conformance.yml
vendored
1
.github/workflows/formal-conformance.yml
vendored
@@ -108,6 +108,7 @@ jobs:
|
||||
|
||||
- name: Comment on PR (informational)
|
||||
if: steps.drift.outputs.drift == 'true'
|
||||
continue-on-error: true
|
||||
uses: actions/github-script@v7
|
||||
with:
|
||||
script: |
|
||||
|
||||
@@ -108,6 +108,7 @@ Docs: https://docs.openclaw.ai
|
||||
- Tools/Write/Edit: normalize structured text-block arguments for `content`/`oldText`/`newText` before filesystem edits, preventing JSON-like file corruption and false “exact text not found” misses from block-form params. (#16778) Thanks @danielpipernz.
|
||||
- Ollama/Agents: avoid forcing `<final>` tag enforcement for Ollama models, which could suppress all output as `(no output)`. (#16191) Thanks @Glucksberg.
|
||||
- Plugins: suppress false duplicate plugin id warnings when the same extension is discovered via multiple paths (config/workspace/global vs bundled), while still warning on genuine duplicates. (#16222) Thanks @shadril238.
|
||||
- Agents/Process: supervise PTY/child process lifecycles with explicit ownership, cancellation, timeouts, and deterministic cleanup, preventing Codex/Pi PTY sessions from dying or stalling on resume. (#14257) Thanks @onutc.
|
||||
- Skills: watch `SKILL.md` only when refreshing skills snapshot to avoid file-descriptor exhaustion in large data trees. (#11325) Thanks @household-bard.
|
||||
- Memory/QMD: make `memory status` read-only by skipping QMD boot update/embed side effects for status-only manager checks.
|
||||
- Memory/QMD: keep original QMD failures when builtin fallback initialization fails (for example missing embedding API keys), instead of replacing them with fallback init errors.
|
||||
|
||||
192
docs/experiments/plans/pty-process-supervision.md
Normal file
192
docs/experiments/plans/pty-process-supervision.md
Normal file
@@ -0,0 +1,192 @@
|
||||
---
|
||||
summary: "Production plan for reliable interactive process supervision (PTY + non-PTY) with explicit ownership, unified lifecycle, and deterministic cleanup"
|
||||
owner: "openclaw"
|
||||
status: "in-progress"
|
||||
last_updated: "2026-02-15"
|
||||
title: "PTY and Process Supervision Plan"
|
||||
---
|
||||
|
||||
# PTY and Process Supervision Plan
|
||||
|
||||
## 1. Problem and goal
|
||||
|
||||
We need one reliable lifecycle for long-running command execution across:
|
||||
|
||||
- `exec` foreground runs
|
||||
- `exec` background runs
|
||||
- `process` follow up actions (`poll`, `log`, `send-keys`, `paste`, `submit`, `kill`, `remove`)
|
||||
- CLI agent runner subprocesses
|
||||
|
||||
The goal is not just to support PTY. The goal is predictable ownership, cancellation, timeout, and cleanup with no unsafe process matching heuristics.
|
||||
|
||||
## 2. Scope and boundaries
|
||||
|
||||
- Keep implementation internal in `src/process/supervisor`.
|
||||
- Do not create a new package for this.
|
||||
- Keep current behavior compatibility where practical.
|
||||
- Do not broaden scope to terminal replay or tmux style session persistence.
|
||||
|
||||
## 3. Implemented in this branch
|
||||
|
||||
### Supervisor baseline already present
|
||||
|
||||
- Supervisor module is in place under `src/process/supervisor/*`.
|
||||
- Exec runtime and CLI runner are already routed through supervisor spawn and wait.
|
||||
- Registry finalization is idempotent.
|
||||
|
||||
### This pass completed
|
||||
|
||||
1. Explicit PTY command contract
|
||||
|
||||
- `SpawnInput` is now a discriminated union in `src/process/supervisor/types.ts`.
|
||||
- PTY runs require `ptyCommand` instead of reusing generic `argv`.
|
||||
- Supervisor no longer rebuilds PTY command strings from argv joins in `src/process/supervisor/supervisor.ts`.
|
||||
- Exec runtime now passes `ptyCommand` directly in `src/agents/bash-tools.exec-runtime.ts`.
|
||||
|
||||
2. Process layer type decoupling
|
||||
|
||||
- Supervisor types no longer import `SessionStdin` from agents.
|
||||
- Process local stdin contract lives in `src/process/supervisor/types.ts` (`ManagedRunStdin`).
|
||||
- Adapters now depend only on process level types:
|
||||
- `src/process/supervisor/adapters/child.ts`
|
||||
- `src/process/supervisor/adapters/pty.ts`
|
||||
|
||||
3. Process tool lifecycle ownership improvement
|
||||
|
||||
- `src/agents/bash-tools.process.ts` now requests cancellation through supervisor first.
|
||||
- `process kill/remove` now use process-tree fallback termination when supervisor lookup misses.
|
||||
- `remove` keeps deterministic remove behavior by dropping running session entries immediately after termination is requested.
|
||||
|
||||
4. Single source watchdog defaults
|
||||
|
||||
- Added shared defaults in `src/agents/cli-watchdog-defaults.ts`.
|
||||
- `src/agents/cli-backends.ts` consumes the shared defaults.
|
||||
- `src/agents/cli-runner/reliability.ts` consumes the same shared defaults.
|
||||
|
||||
5. Dead helper cleanup
|
||||
|
||||
- Removed unused `killSession` helper path from `src/agents/bash-tools.shared.ts`.
|
||||
|
||||
6. Direct supervisor path tests added
|
||||
|
||||
- Added `src/agents/bash-tools.process.supervisor.test.ts` to cover kill and remove routing through supervisor cancellation.
|
||||
|
||||
7. Reliability gap fixes completed
|
||||
|
||||
- `src/agents/bash-tools.process.ts` now falls back to real OS-level process termination when supervisor lookup misses.
|
||||
- `src/process/supervisor/adapters/child.ts` now uses process-tree termination semantics for default cancel/timeout kill paths.
|
||||
- Added shared process-tree utility in `src/process/kill-tree.ts`.
|
||||
|
||||
8. PTY contract edge-case coverage added
|
||||
|
||||
- Added `src/process/supervisor/supervisor.pty-command.test.ts` for verbatim PTY command forwarding and empty-command rejection.
|
||||
- Added `src/process/supervisor/adapters/child.test.ts` for process-tree kill behavior in child adapter cancellation.
|
||||
|
||||
## 4. Remaining gaps and decisions
|
||||
|
||||
### Reliability status
|
||||
|
||||
The two required reliability gaps for this pass are now closed:
|
||||
|
||||
- `process kill/remove` now has a real OS termination fallback when supervisor lookup misses.
|
||||
- child cancel/timeout now uses process-tree kill semantics for default kill path.
|
||||
- Regression tests were added for both behaviors.
|
||||
|
||||
### Durability and startup reconciliation
|
||||
|
||||
Restart behavior is now explicitly defined as in-memory lifecycle only.
|
||||
|
||||
- `reconcileOrphans()` remains a no-op in `src/process/supervisor/supervisor.ts` by design.
|
||||
- Active runs are not recovered after process restart.
|
||||
- This boundary is intentional for this implementation pass to avoid partial persistence risks.
|
||||
|
||||
### Maintainability follow-ups
|
||||
|
||||
1. `runExecProcess` in `src/agents/bash-tools.exec-runtime.ts` still handles multiple responsibilities and can be split into focused helpers in a follow-up.
|
||||
|
||||
## 5. Implementation plan
|
||||
|
||||
The implementation pass for required reliability and contract items is complete.
|
||||
|
||||
Completed:
|
||||
|
||||
- `process kill/remove` fallback real termination
|
||||
- process-tree cancellation for child adapter default kill path
|
||||
- regression tests for fallback kill and child adapter kill path
|
||||
- PTY command edge-case tests under explicit `ptyCommand`
|
||||
- explicit in-memory restart boundary with `reconcileOrphans()` no-op by design
|
||||
|
||||
Optional follow-up:
|
||||
|
||||
- split `runExecProcess` into focused helpers with no behavior drift
|
||||
|
||||
## 6. File map
|
||||
|
||||
### Process supervisor
|
||||
|
||||
- `src/process/supervisor/types.ts` updated with discriminated spawn input and process local stdin contract.
|
||||
- `src/process/supervisor/supervisor.ts` updated to use explicit `ptyCommand`.
|
||||
- `src/process/supervisor/adapters/child.ts` and `src/process/supervisor/adapters/pty.ts` decoupled from agent types.
|
||||
- `src/process/supervisor/registry.ts` idempotent finalize unchanged and retained.
|
||||
|
||||
### Exec and process integration
|
||||
|
||||
- `src/agents/bash-tools.exec-runtime.ts` updated to pass PTY command explicitly and keep fallback path.
|
||||
- `src/agents/bash-tools.process.ts` updated to cancel via supervisor with real process-tree fallback termination.
|
||||
- `src/agents/bash-tools.shared.ts` removed direct kill helper path.
|
||||
|
||||
### CLI reliability
|
||||
|
||||
- `src/agents/cli-watchdog-defaults.ts` added as shared baseline.
|
||||
- `src/agents/cli-backends.ts` and `src/agents/cli-runner/reliability.ts` now consume same defaults.
|
||||
|
||||
## 7. Validation run in this pass
|
||||
|
||||
Unit tests:
|
||||
|
||||
- `pnpm vitest src/process/supervisor/registry.test.ts`
|
||||
- `pnpm vitest src/process/supervisor/supervisor.test.ts`
|
||||
- `pnpm vitest src/process/supervisor/supervisor.pty-command.test.ts`
|
||||
- `pnpm vitest src/process/supervisor/adapters/child.test.ts`
|
||||
- `pnpm vitest src/agents/cli-backends.test.ts`
|
||||
- `pnpm vitest src/agents/bash-tools.exec.pty-cleanup.test.ts`
|
||||
- `pnpm vitest src/agents/bash-tools.process.poll-timeout.test.ts`
|
||||
- `pnpm vitest src/agents/bash-tools.process.supervisor.test.ts`
|
||||
- `pnpm vitest src/process/exec.test.ts`
|
||||
|
||||
E2E targets:
|
||||
|
||||
- `pnpm test:e2e src/agents/cli-runner.e2e.test.ts`
|
||||
- `pnpm test:e2e src/agents/bash-tools.exec.pty-fallback.e2e.test.ts src/agents/bash-tools.exec.background-abort.e2e.test.ts src/agents/bash-tools.process.send-keys.e2e.test.ts`
|
||||
|
||||
Typecheck note:
|
||||
|
||||
- `pnpm tsgo` currently fails in this repo due to a pre-existing UI typing dependency issue (`@vitest/browser-playwright` resolution), unrelated to this process supervision work.
|
||||
|
||||
## 8. Operational guarantees preserved
|
||||
|
||||
- Exec env hardening behavior is unchanged.
|
||||
- Approval and allowlist flow is unchanged.
|
||||
- Output sanitization and output caps are unchanged.
|
||||
- PTY adapter still guarantees wait settlement on forced kill and listener disposal.
|
||||
|
||||
## 9. Definition of done
|
||||
|
||||
1. Supervisor is lifecycle owner for managed runs.
|
||||
2. PTY spawn uses explicit command contract with no argv reconstruction.
|
||||
3. Process layer has no type dependency on agent layer for supervisor stdin contracts.
|
||||
4. Watchdog defaults are single source.
|
||||
5. Targeted unit and e2e tests remain green.
|
||||
6. Restart durability boundary is explicitly documented or fully implemented.
|
||||
|
||||
## 10. Summary
|
||||
|
||||
The branch now has a coherent and safer supervision shape:
|
||||
|
||||
- explicit PTY contract
|
||||
- cleaner process layering
|
||||
- supervisor driven cancellation path for process operations
|
||||
- real fallback termination when supervisor lookup misses
|
||||
- process-tree cancellation for child-run default kill paths
|
||||
- unified watchdog defaults
|
||||
- explicit in-memory restart boundary (no orphan reconciliation across restart in this pass)
|
||||
@@ -1,17 +1,17 @@
|
||||
import type { AgentToolResult } from "@mariozechner/pi-agent-core";
|
||||
import type { ChildProcessWithoutNullStreams } from "node:child_process";
|
||||
import { Type } from "@sinclair/typebox";
|
||||
import path from "node:path";
|
||||
import type { ExecAsk, ExecHost, ExecSecurity } from "../infra/exec-approvals.js";
|
||||
import type { ProcessSession, SessionStdin } from "./bash-process-registry.js";
|
||||
import type { ProcessSession } from "./bash-process-registry.js";
|
||||
import type { ExecToolDetails } from "./bash-tools.exec.js";
|
||||
import type { BashSandboxConfig } from "./bash-tools.shared.js";
|
||||
import { requestHeartbeatNow } from "../infra/heartbeat-wake.js";
|
||||
import { mergePathPrepend } from "../infra/path-prepend.js";
|
||||
import { enqueueSystemEvent } from "../infra/system-events.js";
|
||||
export { applyPathPrepend, normalizePathPrepend } from "../infra/path-prepend.js";
|
||||
import type { ManagedRun } from "../process/supervisor/index.js";
|
||||
import { logWarn } from "../logger.js";
|
||||
import { formatSpawnError, spawnWithFallback } from "../process/spawn-utils.js";
|
||||
import { getProcessSupervisor } from "../process/supervisor/index.js";
|
||||
import {
|
||||
addSession,
|
||||
appendOutput,
|
||||
@@ -23,7 +23,6 @@ import {
|
||||
buildDockerExecArgs,
|
||||
chunkString,
|
||||
clampWithDefault,
|
||||
killSession,
|
||||
readEnvInt,
|
||||
} from "./bash-tools.shared.js";
|
||||
import { buildCursorPositionResponse, stripDsrRequests } from "./pty-dsr.js";
|
||||
@@ -147,26 +146,6 @@ export const execSchema = Type.Object({
|
||||
),
|
||||
});
|
||||
|
||||
type PtyExitEvent = { exitCode: number; signal?: number };
|
||||
type PtyListener<T> = (event: T) => void;
|
||||
type PtyHandle = {
|
||||
pid: number;
|
||||
write: (data: string | Buffer) => void;
|
||||
onData: (listener: PtyListener<string>) => void;
|
||||
onExit: (listener: PtyListener<PtyExitEvent>) => void;
|
||||
};
|
||||
type PtySpawn = (
|
||||
file: string,
|
||||
args: string[] | string,
|
||||
options: {
|
||||
name?: string;
|
||||
cols?: number;
|
||||
rows?: number;
|
||||
cwd?: string;
|
||||
env?: Record<string, string>;
|
||||
},
|
||||
) => PtyHandle;
|
||||
|
||||
export type ExecProcessOutcome = {
|
||||
status: "completed" | "failed";
|
||||
exitCode: number | null;
|
||||
@@ -319,138 +298,10 @@ export async function runExecProcess(opts: {
|
||||
}): Promise<ExecProcessHandle> {
|
||||
const startedAt = Date.now();
|
||||
const sessionId = createSessionSlug();
|
||||
let child: ChildProcessWithoutNullStreams | null = null;
|
||||
let pty: PtyHandle | null = null;
|
||||
let stdin: SessionStdin | undefined;
|
||||
const execCommand = opts.execCommand ?? opts.command;
|
||||
const supervisor = getProcessSupervisor();
|
||||
|
||||
const spawnFallbacks = [
|
||||
{
|
||||
label: "no-detach",
|
||||
options: { detached: false },
|
||||
},
|
||||
];
|
||||
|
||||
const handleSpawnFallback = (err: unknown, fallback: { label: string }) => {
|
||||
const errText = formatSpawnError(err);
|
||||
const warning = `Warning: spawn failed (${errText}); retrying with ${fallback.label}.`;
|
||||
logWarn(`exec: spawn failed (${errText}); retrying with ${fallback.label}.`);
|
||||
opts.warnings.push(warning);
|
||||
};
|
||||
|
||||
const spawnShellChild = async (
|
||||
shell: string,
|
||||
shellArgs: string[],
|
||||
): Promise<ChildProcessWithoutNullStreams> => {
|
||||
const { child: spawned } = await spawnWithFallback({
|
||||
argv: [shell, ...shellArgs, execCommand],
|
||||
options: {
|
||||
cwd: opts.workdir,
|
||||
env: opts.env,
|
||||
detached: process.platform !== "win32",
|
||||
stdio: ["pipe", "pipe", "pipe"],
|
||||
windowsHide: true,
|
||||
},
|
||||
fallbacks: spawnFallbacks,
|
||||
onFallback: handleSpawnFallback,
|
||||
});
|
||||
return spawned as ChildProcessWithoutNullStreams;
|
||||
};
|
||||
|
||||
// `exec` does not currently accept tool-provided stdin content. For non-PTY runs,
|
||||
// keeping stdin open can cause commands like `wc -l` (or safeBins-hardened segments)
|
||||
// to block forever waiting for input, leading to accidental backgrounding.
|
||||
// For interactive flows, callers should use `pty: true` (stdin kept open).
|
||||
const maybeCloseNonPtyStdin = () => {
|
||||
if (opts.usePty) {
|
||||
return;
|
||||
}
|
||||
try {
|
||||
// Signal EOF immediately so stdin-only commands can terminate.
|
||||
child?.stdin?.end();
|
||||
} catch {
|
||||
// ignore stdin close errors
|
||||
}
|
||||
};
|
||||
|
||||
if (opts.sandbox) {
|
||||
const { child: spawned } = await spawnWithFallback({
|
||||
argv: [
|
||||
"docker",
|
||||
...buildDockerExecArgs({
|
||||
containerName: opts.sandbox.containerName,
|
||||
command: execCommand,
|
||||
workdir: opts.containerWorkdir ?? opts.sandbox.containerWorkdir,
|
||||
env: opts.env,
|
||||
tty: opts.usePty,
|
||||
}),
|
||||
],
|
||||
options: {
|
||||
cwd: opts.workdir,
|
||||
env: process.env,
|
||||
detached: process.platform !== "win32",
|
||||
stdio: ["pipe", "pipe", "pipe"],
|
||||
windowsHide: true,
|
||||
},
|
||||
fallbacks: spawnFallbacks,
|
||||
onFallback: handleSpawnFallback,
|
||||
});
|
||||
child = spawned as ChildProcessWithoutNullStreams;
|
||||
stdin = child.stdin;
|
||||
maybeCloseNonPtyStdin();
|
||||
} else if (opts.usePty) {
|
||||
const { shell, args: shellArgs } = getShellConfig();
|
||||
try {
|
||||
const ptyModule = (await import("@lydell/node-pty")) as unknown as {
|
||||
spawn?: PtySpawn;
|
||||
default?: { spawn?: PtySpawn };
|
||||
};
|
||||
const spawnPty = ptyModule.spawn ?? ptyModule.default?.spawn;
|
||||
if (!spawnPty) {
|
||||
throw new Error("PTY support is unavailable (node-pty spawn not found).");
|
||||
}
|
||||
pty = spawnPty(shell, [...shellArgs, execCommand], {
|
||||
cwd: opts.workdir,
|
||||
env: opts.env,
|
||||
name: process.env.TERM ?? "xterm-256color",
|
||||
cols: 120,
|
||||
rows: 30,
|
||||
});
|
||||
stdin = {
|
||||
destroyed: false,
|
||||
write: (data, cb) => {
|
||||
try {
|
||||
pty?.write(data);
|
||||
cb?.(null);
|
||||
} catch (err) {
|
||||
cb?.(err as Error);
|
||||
}
|
||||
},
|
||||
end: () => {
|
||||
try {
|
||||
const eof = process.platform === "win32" ? "\x1a" : "\x04";
|
||||
pty?.write(eof);
|
||||
} catch {
|
||||
// ignore EOF errors
|
||||
}
|
||||
},
|
||||
};
|
||||
} catch (err) {
|
||||
const errText = String(err);
|
||||
const warning = `Warning: PTY spawn failed (${errText}); retrying without PTY for \`${opts.command}\`.`;
|
||||
logWarn(`exec: PTY spawn failed (${errText}); retrying without PTY for "${opts.command}".`);
|
||||
opts.warnings.push(warning);
|
||||
child = await spawnShellChild(shell, shellArgs);
|
||||
stdin = child.stdin;
|
||||
}
|
||||
} else {
|
||||
const { shell, args: shellArgs } = getShellConfig();
|
||||
child = await spawnShellChild(shell, shellArgs);
|
||||
stdin = child.stdin;
|
||||
maybeCloseNonPtyStdin();
|
||||
}
|
||||
|
||||
const session = {
|
||||
const session: ProcessSession = {
|
||||
id: sessionId,
|
||||
command: opts.command,
|
||||
scopeKey: opts.scopeKey,
|
||||
@@ -458,9 +309,9 @@ export async function runExecProcess(opts: {
|
||||
notifyOnExit: opts.notifyOnExit,
|
||||
notifyOnExitEmptySuccess: opts.notifyOnExitEmptySuccess === true,
|
||||
exitNotified: false,
|
||||
child: child ?? undefined,
|
||||
stdin,
|
||||
pid: child?.pid ?? pty?.pid,
|
||||
child: undefined,
|
||||
stdin: undefined,
|
||||
pid: undefined,
|
||||
startedAt,
|
||||
cwd: opts.workdir,
|
||||
maxOutputChars: opts.maxOutput,
|
||||
@@ -477,59 +328,9 @@ export async function runExecProcess(opts: {
|
||||
exitSignal: undefined as NodeJS.Signals | number | null | undefined,
|
||||
truncated: false,
|
||||
backgrounded: false,
|
||||
} satisfies ProcessSession;
|
||||
};
|
||||
addSession(session);
|
||||
|
||||
let settled = false;
|
||||
let timeoutTimer: NodeJS.Timeout | null = null;
|
||||
let timeoutFinalizeTimer: NodeJS.Timeout | null = null;
|
||||
let timedOut = false;
|
||||
const timeoutFinalizeMs = 1000;
|
||||
let resolveFn: ((outcome: ExecProcessOutcome) => void) | null = null;
|
||||
|
||||
const settle = (outcome: ExecProcessOutcome) => {
|
||||
if (settled) {
|
||||
return;
|
||||
}
|
||||
settled = true;
|
||||
resolveFn?.(outcome);
|
||||
};
|
||||
|
||||
const finalizeTimeout = () => {
|
||||
if (session.exited) {
|
||||
return;
|
||||
}
|
||||
markExited(session, null, "SIGKILL", "failed");
|
||||
maybeNotifyOnExit(session, "failed");
|
||||
const aggregated = session.aggregated.trim();
|
||||
const reason = `Command timed out after ${opts.timeoutSec} seconds`;
|
||||
settle({
|
||||
status: "failed",
|
||||
exitCode: null,
|
||||
exitSignal: "SIGKILL",
|
||||
durationMs: Date.now() - startedAt,
|
||||
aggregated,
|
||||
timedOut: true,
|
||||
reason: aggregated ? `${aggregated}\n\n${reason}` : reason,
|
||||
});
|
||||
};
|
||||
|
||||
const onTimeout = () => {
|
||||
timedOut = true;
|
||||
killSession(session);
|
||||
if (!timeoutFinalizeTimer) {
|
||||
timeoutFinalizeTimer = setTimeout(() => {
|
||||
finalizeTimeout();
|
||||
}, timeoutFinalizeMs);
|
||||
}
|
||||
};
|
||||
|
||||
if (opts.timeoutSec > 0) {
|
||||
timeoutTimer = setTimeout(() => {
|
||||
onTimeout();
|
||||
}, opts.timeoutSec * 1000);
|
||||
}
|
||||
|
||||
const emitUpdate = () => {
|
||||
if (!opts.onUpdate) {
|
||||
return;
|
||||
@@ -565,116 +366,208 @@ export async function runExecProcess(opts: {
|
||||
}
|
||||
};
|
||||
|
||||
if (pty) {
|
||||
const cursorResponse = buildCursorPositionResponse();
|
||||
pty.onData((data) => {
|
||||
const raw = data.toString();
|
||||
const { cleaned, requests } = stripDsrRequests(raw);
|
||||
if (requests > 0) {
|
||||
const timeoutMs =
|
||||
typeof opts.timeoutSec === "number" && opts.timeoutSec > 0
|
||||
? Math.floor(opts.timeoutSec * 1000)
|
||||
: undefined;
|
||||
|
||||
const spawnSpec:
|
||||
| {
|
||||
mode: "child";
|
||||
argv: string[];
|
||||
env: NodeJS.ProcessEnv;
|
||||
stdinMode: "pipe-open" | "pipe-closed";
|
||||
}
|
||||
| {
|
||||
mode: "pty";
|
||||
ptyCommand: string;
|
||||
childFallbackArgv: string[];
|
||||
env: NodeJS.ProcessEnv;
|
||||
stdinMode: "pipe-open";
|
||||
} = (() => {
|
||||
if (opts.sandbox) {
|
||||
return {
|
||||
mode: "child" as const,
|
||||
argv: [
|
||||
"docker",
|
||||
...buildDockerExecArgs({
|
||||
containerName: opts.sandbox.containerName,
|
||||
command: execCommand,
|
||||
workdir: opts.containerWorkdir ?? opts.sandbox.containerWorkdir,
|
||||
env: opts.env,
|
||||
tty: opts.usePty,
|
||||
}),
|
||||
],
|
||||
env: process.env,
|
||||
stdinMode: opts.usePty ? ("pipe-open" as const) : ("pipe-closed" as const),
|
||||
};
|
||||
}
|
||||
const { shell, args: shellArgs } = getShellConfig();
|
||||
const childArgv = [shell, ...shellArgs, execCommand];
|
||||
if (opts.usePty) {
|
||||
return {
|
||||
mode: "pty" as const,
|
||||
ptyCommand: execCommand,
|
||||
childFallbackArgv: childArgv,
|
||||
env: opts.env,
|
||||
stdinMode: "pipe-open" as const,
|
||||
};
|
||||
}
|
||||
return {
|
||||
mode: "child" as const,
|
||||
argv: childArgv,
|
||||
env: opts.env,
|
||||
stdinMode: "pipe-closed" as const,
|
||||
};
|
||||
})();
|
||||
|
||||
let managedRun: ManagedRun | null = null;
|
||||
let usingPty = spawnSpec.mode === "pty";
|
||||
const cursorResponse = buildCursorPositionResponse();
|
||||
|
||||
const onSupervisorStdout = (chunk: string) => {
|
||||
if (usingPty) {
|
||||
const { cleaned, requests } = stripDsrRequests(chunk);
|
||||
if (requests > 0 && managedRun?.stdin) {
|
||||
for (let i = 0; i < requests; i += 1) {
|
||||
pty.write(cursorResponse);
|
||||
managedRun.stdin.write(cursorResponse);
|
||||
}
|
||||
}
|
||||
handleStdout(cleaned);
|
||||
});
|
||||
} else if (child) {
|
||||
child.stdout.on("data", handleStdout);
|
||||
child.stderr.on("data", handleStderr);
|
||||
}
|
||||
return;
|
||||
}
|
||||
handleStdout(chunk);
|
||||
};
|
||||
|
||||
const promise = new Promise<ExecProcessOutcome>((resolve) => {
|
||||
resolveFn = resolve;
|
||||
const handleExit = (code: number | null, exitSignal: NodeJS.Signals | number | null) => {
|
||||
if (timeoutTimer) {
|
||||
clearTimeout(timeoutTimer);
|
||||
}
|
||||
if (timeoutFinalizeTimer) {
|
||||
clearTimeout(timeoutFinalizeTimer);
|
||||
try {
|
||||
const spawnBase = {
|
||||
runId: sessionId,
|
||||
sessionId: opts.sessionKey?.trim() || sessionId,
|
||||
backendId: opts.sandbox ? "exec-sandbox" : "exec-host",
|
||||
scopeKey: opts.scopeKey,
|
||||
cwd: opts.workdir,
|
||||
env: spawnSpec.env,
|
||||
timeoutMs,
|
||||
captureOutput: false,
|
||||
onStdout: onSupervisorStdout,
|
||||
onStderr: handleStderr,
|
||||
};
|
||||
managedRun =
|
||||
spawnSpec.mode === "pty"
|
||||
? await supervisor.spawn({
|
||||
...spawnBase,
|
||||
mode: "pty",
|
||||
ptyCommand: spawnSpec.ptyCommand,
|
||||
})
|
||||
: await supervisor.spawn({
|
||||
...spawnBase,
|
||||
mode: "child",
|
||||
argv: spawnSpec.argv,
|
||||
stdinMode: spawnSpec.stdinMode,
|
||||
});
|
||||
} catch (err) {
|
||||
if (spawnSpec.mode === "pty") {
|
||||
const warning = `Warning: PTY spawn failed (${String(err)}); retrying without PTY for \`${opts.command}\`.`;
|
||||
logWarn(
|
||||
`exec: PTY spawn failed (${String(err)}); retrying without PTY for "${opts.command}".`,
|
||||
);
|
||||
opts.warnings.push(warning);
|
||||
usingPty = false;
|
||||
try {
|
||||
managedRun = await supervisor.spawn({
|
||||
runId: sessionId,
|
||||
sessionId: opts.sessionKey?.trim() || sessionId,
|
||||
backendId: "exec-host",
|
||||
scopeKey: opts.scopeKey,
|
||||
mode: "child",
|
||||
argv: spawnSpec.childFallbackArgv,
|
||||
cwd: opts.workdir,
|
||||
env: spawnSpec.env,
|
||||
stdinMode: "pipe-open",
|
||||
timeoutMs,
|
||||
captureOutput: false,
|
||||
onStdout: handleStdout,
|
||||
onStderr: handleStderr,
|
||||
});
|
||||
} catch (retryErr) {
|
||||
markExited(session, null, null, "failed");
|
||||
maybeNotifyOnExit(session, "failed");
|
||||
throw retryErr;
|
||||
}
|
||||
} else {
|
||||
markExited(session, null, null, "failed");
|
||||
maybeNotifyOnExit(session, "failed");
|
||||
throw err;
|
||||
}
|
||||
}
|
||||
session.stdin = managedRun.stdin;
|
||||
session.pid = managedRun.pid;
|
||||
|
||||
const promise = managedRun
|
||||
.wait()
|
||||
.then((exit): ExecProcessOutcome => {
|
||||
const durationMs = Date.now() - startedAt;
|
||||
const wasSignal = exitSignal != null;
|
||||
const isSuccess = code === 0 && !wasSignal && !timedOut;
|
||||
const status: "completed" | "failed" = isSuccess ? "completed" : "failed";
|
||||
markExited(session, code, exitSignal, status);
|
||||
const status: "completed" | "failed" =
|
||||
exit.exitCode === 0 && exit.reason === "exit" ? "completed" : "failed";
|
||||
markExited(session, exit.exitCode, exit.exitSignal, status);
|
||||
maybeNotifyOnExit(session, status);
|
||||
if (!session.child && session.stdin) {
|
||||
session.stdin.destroyed = true;
|
||||
}
|
||||
|
||||
if (settled) {
|
||||
return;
|
||||
}
|
||||
const aggregated = session.aggregated.trim();
|
||||
if (!isSuccess) {
|
||||
const reason = timedOut
|
||||
? `Command timed out after ${opts.timeoutSec} seconds`
|
||||
: wasSignal && exitSignal
|
||||
? `Command aborted by signal ${exitSignal}`
|
||||
: code === null
|
||||
? "Command aborted before exit code was captured"
|
||||
: `Command exited with code ${code}`;
|
||||
const message = aggregated ? `${aggregated}\n\n${reason}` : reason;
|
||||
settle({
|
||||
status: "failed",
|
||||
exitCode: code ?? null,
|
||||
exitSignal: exitSignal ?? null,
|
||||
if (status === "completed") {
|
||||
return {
|
||||
status: "completed",
|
||||
exitCode: exit.exitCode ?? 0,
|
||||
exitSignal: exit.exitSignal,
|
||||
durationMs,
|
||||
aggregated,
|
||||
timedOut,
|
||||
reason: message,
|
||||
});
|
||||
return;
|
||||
timedOut: false,
|
||||
};
|
||||
}
|
||||
settle({
|
||||
status: "completed",
|
||||
exitCode: code ?? 0,
|
||||
exitSignal: exitSignal ?? null,
|
||||
const reason =
|
||||
exit.reason === "overall-timeout"
|
||||
? `Command timed out after ${opts.timeoutSec} seconds`
|
||||
: exit.reason === "no-output-timeout"
|
||||
? "Command timed out waiting for output"
|
||||
: exit.exitSignal != null
|
||||
? `Command aborted by signal ${exit.exitSignal}`
|
||||
: exit.exitCode == null
|
||||
? "Command aborted before exit code was captured"
|
||||
: `Command exited with code ${exit.exitCode}`;
|
||||
return {
|
||||
status: "failed",
|
||||
exitCode: exit.exitCode,
|
||||
exitSignal: exit.exitSignal,
|
||||
durationMs,
|
||||
aggregated,
|
||||
timedOut: exit.timedOut,
|
||||
reason: aggregated ? `${aggregated}\n\n${reason}` : reason,
|
||||
};
|
||||
})
|
||||
.catch((err): ExecProcessOutcome => {
|
||||
markExited(session, null, null, "failed");
|
||||
maybeNotifyOnExit(session, "failed");
|
||||
const aggregated = session.aggregated.trim();
|
||||
const message = aggregated ? `${aggregated}\n\n${String(err)}` : String(err);
|
||||
return {
|
||||
status: "failed",
|
||||
exitCode: null,
|
||||
exitSignal: null,
|
||||
durationMs: Date.now() - startedAt,
|
||||
aggregated,
|
||||
timedOut: false,
|
||||
});
|
||||
};
|
||||
|
||||
if (pty) {
|
||||
pty.onExit((event) => {
|
||||
const rawSignal = event.signal ?? null;
|
||||
const normalizedSignal = rawSignal === 0 ? null : rawSignal;
|
||||
handleExit(event.exitCode ?? null, normalizedSignal);
|
||||
});
|
||||
} else if (child) {
|
||||
child.once("close", (code, exitSignal) => {
|
||||
handleExit(code, exitSignal);
|
||||
});
|
||||
|
||||
child.once("error", (err) => {
|
||||
if (timeoutTimer) {
|
||||
clearTimeout(timeoutTimer);
|
||||
}
|
||||
if (timeoutFinalizeTimer) {
|
||||
clearTimeout(timeoutFinalizeTimer);
|
||||
}
|
||||
markExited(session, null, null, "failed");
|
||||
maybeNotifyOnExit(session, "failed");
|
||||
const aggregated = session.aggregated.trim();
|
||||
const message = aggregated ? `${aggregated}\n\n${String(err)}` : String(err);
|
||||
settle({
|
||||
status: "failed",
|
||||
exitCode: null,
|
||||
exitSignal: null,
|
||||
durationMs: Date.now() - startedAt,
|
||||
aggregated,
|
||||
timedOut,
|
||||
reason: message,
|
||||
});
|
||||
});
|
||||
}
|
||||
});
|
||||
reason: message,
|
||||
};
|
||||
});
|
||||
|
||||
return {
|
||||
session,
|
||||
startedAt,
|
||||
pid: session.pid ?? undefined,
|
||||
promise,
|
||||
kill: () => killSession(session),
|
||||
kill: () => {
|
||||
managedRun?.cancel("manual-cancel");
|
||||
},
|
||||
};
|
||||
}
|
||||
|
||||
73
src/agents/bash-tools.exec.pty-cleanup.test.ts
Normal file
73
src/agents/bash-tools.exec.pty-cleanup.test.ts
Normal file
@@ -0,0 +1,73 @@
|
||||
import { afterEach, expect, test, vi } from "vitest";
|
||||
import { resetProcessRegistryForTests } from "./bash-process-registry";
|
||||
|
||||
afterEach(() => {
|
||||
resetProcessRegistryForTests();
|
||||
vi.resetModules();
|
||||
vi.clearAllMocks();
|
||||
});
|
||||
|
||||
test("exec disposes PTY listeners after normal exit", async () => {
|
||||
const disposeData = vi.fn();
|
||||
const disposeExit = vi.fn();
|
||||
|
||||
vi.doMock("@lydell/node-pty", () => ({
|
||||
spawn: () => {
|
||||
return {
|
||||
pid: 0,
|
||||
write: vi.fn(),
|
||||
onData: (listener: (value: string) => void) => {
|
||||
setTimeout(() => listener("ok"), 0);
|
||||
return { dispose: disposeData };
|
||||
},
|
||||
onExit: (listener: (event: { exitCode: number; signal?: number }) => void) => {
|
||||
setTimeout(() => listener({ exitCode: 0 }), 0);
|
||||
return { dispose: disposeExit };
|
||||
},
|
||||
kill: vi.fn(),
|
||||
};
|
||||
},
|
||||
}));
|
||||
|
||||
const { createExecTool } = await import("./bash-tools.exec");
|
||||
const tool = createExecTool({ allowBackground: false });
|
||||
const result = await tool.execute("toolcall", {
|
||||
command: "echo ok",
|
||||
pty: true,
|
||||
});
|
||||
|
||||
expect(result.details.status).toBe("completed");
|
||||
expect(disposeData).toHaveBeenCalledTimes(1);
|
||||
expect(disposeExit).toHaveBeenCalledTimes(1);
|
||||
});
|
||||
|
||||
test("exec tears down PTY resources on timeout", async () => {
|
||||
const disposeData = vi.fn();
|
||||
const disposeExit = vi.fn();
|
||||
const kill = vi.fn();
|
||||
|
||||
vi.doMock("@lydell/node-pty", () => ({
|
||||
spawn: () => {
|
||||
return {
|
||||
pid: 0,
|
||||
write: vi.fn(),
|
||||
onData: () => ({ dispose: disposeData }),
|
||||
onExit: () => ({ dispose: disposeExit }),
|
||||
kill,
|
||||
};
|
||||
},
|
||||
}));
|
||||
|
||||
const { createExecTool } = await import("./bash-tools.exec");
|
||||
const tool = createExecTool({ allowBackground: false });
|
||||
await expect(
|
||||
tool.execute("toolcall", {
|
||||
command: "sleep 5",
|
||||
pty: true,
|
||||
timeout: 0.01,
|
||||
}),
|
||||
).rejects.toThrow("Command timed out");
|
||||
expect(kill).toHaveBeenCalledTimes(1);
|
||||
expect(disposeData).toHaveBeenCalledTimes(1);
|
||||
expect(disposeExit).toHaveBeenCalledTimes(1);
|
||||
});
|
||||
40
src/agents/bash-tools.exec.pty-fallback-failure.test.ts
Normal file
40
src/agents/bash-tools.exec.pty-fallback-failure.test.ts
Normal file
@@ -0,0 +1,40 @@
|
||||
import { afterEach, expect, test, vi } from "vitest";
|
||||
import { listRunningSessions, resetProcessRegistryForTests } from "./bash-process-registry";
|
||||
|
||||
const { supervisorSpawnMock } = vi.hoisted(() => ({
|
||||
supervisorSpawnMock: vi.fn(),
|
||||
}));
|
||||
|
||||
vi.mock("../process/supervisor/index.js", () => ({
|
||||
getProcessSupervisor: () => ({
|
||||
spawn: (...args: unknown[]) => supervisorSpawnMock(...args),
|
||||
cancel: vi.fn(),
|
||||
cancelScope: vi.fn(),
|
||||
reconcileOrphans: vi.fn(),
|
||||
getRecord: vi.fn(),
|
||||
}),
|
||||
}));
|
||||
|
||||
afterEach(() => {
|
||||
resetProcessRegistryForTests();
|
||||
vi.resetModules();
|
||||
vi.clearAllMocks();
|
||||
});
|
||||
|
||||
test("exec cleans session state when PTY fallback spawn also fails", async () => {
|
||||
supervisorSpawnMock
|
||||
.mockRejectedValueOnce(new Error("pty spawn failed"))
|
||||
.mockRejectedValueOnce(new Error("child fallback failed"));
|
||||
|
||||
const { createExecTool } = await import("./bash-tools.exec");
|
||||
const tool = createExecTool({ allowBackground: false });
|
||||
|
||||
await expect(
|
||||
tool.execute("toolcall", {
|
||||
command: "echo ok",
|
||||
pty: true,
|
||||
}),
|
||||
).rejects.toThrow("child fallback failed");
|
||||
|
||||
expect(listRunningSessions()).toHaveLength(0);
|
||||
});
|
||||
152
src/agents/bash-tools.process.supervisor.test.ts
Normal file
152
src/agents/bash-tools.process.supervisor.test.ts
Normal file
@@ -0,0 +1,152 @@
|
||||
import { afterEach, beforeEach, describe, expect, it, vi } from "vitest";
|
||||
import type { ProcessSession } from "./bash-process-registry.js";
|
||||
import {
|
||||
addSession,
|
||||
getFinishedSession,
|
||||
getSession,
|
||||
resetProcessRegistryForTests,
|
||||
} from "./bash-process-registry.js";
|
||||
import { createProcessTool } from "./bash-tools.process.js";
|
||||
|
||||
const { supervisorMock } = vi.hoisted(() => ({
|
||||
supervisorMock: {
|
||||
spawn: vi.fn(),
|
||||
cancel: vi.fn(),
|
||||
cancelScope: vi.fn(),
|
||||
reconcileOrphans: vi.fn(),
|
||||
getRecord: vi.fn(),
|
||||
},
|
||||
}));
|
||||
|
||||
const { killProcessTreeMock } = vi.hoisted(() => ({
|
||||
killProcessTreeMock: vi.fn(),
|
||||
}));
|
||||
|
||||
vi.mock("../process/supervisor/index.js", () => ({
|
||||
getProcessSupervisor: () => supervisorMock,
|
||||
}));
|
||||
|
||||
vi.mock("../process/kill-tree.js", () => ({
|
||||
killProcessTree: (...args: unknown[]) => killProcessTreeMock(...args),
|
||||
}));
|
||||
|
||||
function createBackgroundSession(id: string, pid?: number): ProcessSession {
|
||||
return {
|
||||
id,
|
||||
command: "sleep 999",
|
||||
startedAt: Date.now(),
|
||||
cwd: "/tmp",
|
||||
maxOutputChars: 10_000,
|
||||
pendingMaxOutputChars: 30_000,
|
||||
totalOutputChars: 0,
|
||||
pendingStdout: [],
|
||||
pendingStderr: [],
|
||||
pendingStdoutChars: 0,
|
||||
pendingStderrChars: 0,
|
||||
aggregated: "",
|
||||
tail: "",
|
||||
pid,
|
||||
exited: false,
|
||||
exitCode: undefined,
|
||||
exitSignal: undefined,
|
||||
truncated: false,
|
||||
backgrounded: true,
|
||||
};
|
||||
}
|
||||
|
||||
describe("process tool supervisor cancellation", () => {
|
||||
beforeEach(() => {
|
||||
supervisorMock.spawn.mockReset();
|
||||
supervisorMock.cancel.mockReset();
|
||||
supervisorMock.cancelScope.mockReset();
|
||||
supervisorMock.reconcileOrphans.mockReset();
|
||||
supervisorMock.getRecord.mockReset();
|
||||
killProcessTreeMock.mockReset();
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
resetProcessRegistryForTests();
|
||||
});
|
||||
|
||||
it("routes kill through supervisor when run is managed", async () => {
|
||||
supervisorMock.getRecord.mockReturnValue({
|
||||
runId: "sess",
|
||||
state: "running",
|
||||
});
|
||||
addSession(createBackgroundSession("sess"));
|
||||
const processTool = createProcessTool();
|
||||
|
||||
const result = await processTool.execute("toolcall", {
|
||||
action: "kill",
|
||||
sessionId: "sess",
|
||||
});
|
||||
|
||||
expect(supervisorMock.cancel).toHaveBeenCalledWith("sess", "manual-cancel");
|
||||
expect(getSession("sess")).toBeDefined();
|
||||
expect(getSession("sess")?.exited).toBe(false);
|
||||
expect(result.content[0]).toMatchObject({
|
||||
type: "text",
|
||||
text: "Termination requested for session sess.",
|
||||
});
|
||||
});
|
||||
|
||||
it("remove drops running session immediately when cancellation is requested", async () => {
|
||||
supervisorMock.getRecord.mockReturnValue({
|
||||
runId: "sess",
|
||||
state: "running",
|
||||
});
|
||||
addSession(createBackgroundSession("sess"));
|
||||
const processTool = createProcessTool();
|
||||
|
||||
const result = await processTool.execute("toolcall", {
|
||||
action: "remove",
|
||||
sessionId: "sess",
|
||||
});
|
||||
|
||||
expect(supervisorMock.cancel).toHaveBeenCalledWith("sess", "manual-cancel");
|
||||
expect(getSession("sess")).toBeUndefined();
|
||||
expect(getFinishedSession("sess")).toBeUndefined();
|
||||
expect(result.content[0]).toMatchObject({
|
||||
type: "text",
|
||||
text: "Removed session sess (termination requested).",
|
||||
});
|
||||
});
|
||||
|
||||
it("falls back to process-tree kill when supervisor record is missing", async () => {
|
||||
supervisorMock.getRecord.mockReturnValue(undefined);
|
||||
addSession(createBackgroundSession("sess-fallback", 4242));
|
||||
const processTool = createProcessTool();
|
||||
|
||||
const result = await processTool.execute("toolcall", {
|
||||
action: "kill",
|
||||
sessionId: "sess-fallback",
|
||||
});
|
||||
|
||||
expect(killProcessTreeMock).toHaveBeenCalledWith(4242);
|
||||
expect(getSession("sess-fallback")).toBeUndefined();
|
||||
expect(getFinishedSession("sess-fallback")).toBeDefined();
|
||||
expect(result.content[0]).toMatchObject({
|
||||
type: "text",
|
||||
text: "Killed session sess-fallback.",
|
||||
});
|
||||
});
|
||||
|
||||
it("fails remove when no supervisor record and no pid is available", async () => {
|
||||
supervisorMock.getRecord.mockReturnValue(undefined);
|
||||
addSession(createBackgroundSession("sess-no-pid"));
|
||||
const processTool = createProcessTool();
|
||||
|
||||
const result = await processTool.execute("toolcall", {
|
||||
action: "remove",
|
||||
sessionId: "sess-no-pid",
|
||||
});
|
||||
|
||||
expect(killProcessTreeMock).not.toHaveBeenCalled();
|
||||
expect(getSession("sess-no-pid")).toBeDefined();
|
||||
expect(result.details).toMatchObject({ status: "failed" });
|
||||
expect(result.content[0]).toMatchObject({
|
||||
type: "text",
|
||||
text: "Unable to remove session sess-no-pid: no active supervisor run or process id.",
|
||||
});
|
||||
});
|
||||
});
|
||||
@@ -1,7 +1,10 @@
|
||||
import type { AgentTool, AgentToolResult } from "@mariozechner/pi-agent-core";
|
||||
import { Type } from "@sinclair/typebox";
|
||||
import { formatDurationCompact } from "../infra/format-time/format-duration.ts";
|
||||
import { killProcessTree } from "../process/kill-tree.js";
|
||||
import { getProcessSupervisor } from "../process/supervisor/index.js";
|
||||
import {
|
||||
type ProcessSession,
|
||||
deleteSession,
|
||||
drainSession,
|
||||
getFinishedSession,
|
||||
@@ -11,13 +14,7 @@ import {
|
||||
markExited,
|
||||
setJobTtlMs,
|
||||
} from "./bash-process-registry.js";
|
||||
import {
|
||||
deriveSessionName,
|
||||
killSession,
|
||||
pad,
|
||||
sliceLogLines,
|
||||
truncateMiddle,
|
||||
} from "./bash-tools.shared.js";
|
||||
import { deriveSessionName, pad, sliceLogLines, truncateMiddle } from "./bash-tools.shared.js";
|
||||
import { encodeKeySequence, encodePaste } from "./pty-keys.js";
|
||||
|
||||
export type ProcessToolDefaults = {
|
||||
@@ -107,9 +104,28 @@ export function createProcessTool(
|
||||
setJobTtlMs(defaults.cleanupMs);
|
||||
}
|
||||
const scopeKey = defaults?.scopeKey;
|
||||
const supervisor = getProcessSupervisor();
|
||||
const isInScope = (session?: { scopeKey?: string } | null) =>
|
||||
!scopeKey || session?.scopeKey === scopeKey;
|
||||
|
||||
const cancelManagedSession = (sessionId: string) => {
|
||||
const record = supervisor.getRecord(sessionId);
|
||||
if (!record || record.state === "exited") {
|
||||
return false;
|
||||
}
|
||||
supervisor.cancel(sessionId, "manual-cancel");
|
||||
return true;
|
||||
};
|
||||
|
||||
const terminateSessionFallback = (session: ProcessSession) => {
|
||||
const pid = session.pid ?? session.child?.pid;
|
||||
if (typeof pid !== "number" || !Number.isFinite(pid) || pid <= 0) {
|
||||
return false;
|
||||
}
|
||||
killProcessTree(pid);
|
||||
return true;
|
||||
};
|
||||
|
||||
return {
|
||||
name: "process",
|
||||
label: "process",
|
||||
@@ -523,10 +539,25 @@ export function createProcessTool(
|
||||
if (!scopedSession.backgrounded) {
|
||||
return failText(`Session ${params.sessionId} is not backgrounded.`);
|
||||
}
|
||||
killSession(scopedSession);
|
||||
markExited(scopedSession, null, "SIGKILL", "failed");
|
||||
const canceled = cancelManagedSession(scopedSession.id);
|
||||
if (!canceled) {
|
||||
const terminated = terminateSessionFallback(scopedSession);
|
||||
if (!terminated) {
|
||||
return failText(
|
||||
`Unable to terminate session ${params.sessionId}: no active supervisor run or process id.`,
|
||||
);
|
||||
}
|
||||
markExited(scopedSession, null, "SIGKILL", "failed");
|
||||
}
|
||||
return {
|
||||
content: [{ type: "text", text: `Killed session ${params.sessionId}.` }],
|
||||
content: [
|
||||
{
|
||||
type: "text",
|
||||
text: canceled
|
||||
? `Termination requested for session ${params.sessionId}.`
|
||||
: `Killed session ${params.sessionId}.`,
|
||||
},
|
||||
],
|
||||
details: {
|
||||
status: "failed",
|
||||
name: scopedSession ? deriveSessionName(scopedSession.command) : undefined,
|
||||
@@ -555,10 +586,30 @@ export function createProcessTool(
|
||||
|
||||
case "remove": {
|
||||
if (scopedSession) {
|
||||
killSession(scopedSession);
|
||||
markExited(scopedSession, null, "SIGKILL", "failed");
|
||||
const canceled = cancelManagedSession(scopedSession.id);
|
||||
if (canceled) {
|
||||
// Keep remove semantics deterministic: drop from process registry now.
|
||||
scopedSession.backgrounded = false;
|
||||
deleteSession(params.sessionId);
|
||||
} else {
|
||||
const terminated = terminateSessionFallback(scopedSession);
|
||||
if (!terminated) {
|
||||
return failText(
|
||||
`Unable to remove session ${params.sessionId}: no active supervisor run or process id.`,
|
||||
);
|
||||
}
|
||||
markExited(scopedSession, null, "SIGKILL", "failed");
|
||||
deleteSession(params.sessionId);
|
||||
}
|
||||
return {
|
||||
content: [{ type: "text", text: `Removed session ${params.sessionId}.` }],
|
||||
content: [
|
||||
{
|
||||
type: "text",
|
||||
text: canceled
|
||||
? `Removed session ${params.sessionId} (termination requested).`
|
||||
: `Removed session ${params.sessionId}.`,
|
||||
},
|
||||
],
|
||||
details: {
|
||||
status: "failed",
|
||||
name: scopedSession ? deriveSessionName(scopedSession.command) : undefined,
|
||||
|
||||
@@ -1,11 +1,9 @@
|
||||
import type { ChildProcessWithoutNullStreams } from "node:child_process";
|
||||
import { existsSync, statSync } from "node:fs";
|
||||
import fs from "node:fs/promises";
|
||||
import { homedir } from "node:os";
|
||||
import path from "node:path";
|
||||
import { sliceUtf16Safe } from "../utils.js";
|
||||
import { assertSandboxPath } from "./sandbox-paths.js";
|
||||
import { killProcessTree } from "./shell-utils.js";
|
||||
|
||||
const CHUNK_LIMIT = 8 * 1024;
|
||||
|
||||
@@ -115,13 +113,6 @@ export async function resolveSandboxWorkdir(params: {
|
||||
}
|
||||
}
|
||||
|
||||
export function killSession(session: { pid?: number; child?: ChildProcessWithoutNullStreams }) {
|
||||
const pid = session.pid ?? session.child?.pid;
|
||||
if (pid) {
|
||||
killProcessTree(pid);
|
||||
}
|
||||
}
|
||||
|
||||
export function resolveWorkdir(workdir: string, warnings: string[]) {
|
||||
const current = safeCwd();
|
||||
const fallback = current ?? homedir();
|
||||
|
||||
36
src/agents/cli-backends.test.ts
Normal file
36
src/agents/cli-backends.test.ts
Normal file
@@ -0,0 +1,36 @@
|
||||
import { describe, expect, it } from "vitest";
|
||||
import type { OpenClawConfig } from "../config/config.js";
|
||||
import { resolveCliBackendConfig } from "./cli-backends.js";
|
||||
|
||||
describe("resolveCliBackendConfig reliability merge", () => {
|
||||
it("deep-merges reliability watchdog overrides for codex", () => {
|
||||
const cfg = {
|
||||
agents: {
|
||||
defaults: {
|
||||
cliBackends: {
|
||||
"codex-cli": {
|
||||
command: "codex",
|
||||
reliability: {
|
||||
watchdog: {
|
||||
resume: {
|
||||
noOutputTimeoutMs: 42_000,
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
} satisfies OpenClawConfig;
|
||||
|
||||
const resolved = resolveCliBackendConfig("codex-cli", cfg);
|
||||
|
||||
expect(resolved).not.toBeNull();
|
||||
expect(resolved?.config.reliability?.watchdog?.resume?.noOutputTimeoutMs).toBe(42_000);
|
||||
// Ensure defaults are retained when only one field is overridden.
|
||||
expect(resolved?.config.reliability?.watchdog?.resume?.noOutputTimeoutRatio).toBe(0.3);
|
||||
expect(resolved?.config.reliability?.watchdog?.resume?.minMs).toBe(60_000);
|
||||
expect(resolved?.config.reliability?.watchdog?.resume?.maxMs).toBe(180_000);
|
||||
expect(resolved?.config.reliability?.watchdog?.fresh?.noOutputTimeoutRatio).toBe(0.8);
|
||||
});
|
||||
});
|
||||
@@ -1,5 +1,9 @@
|
||||
import type { OpenClawConfig } from "../config/config.js";
|
||||
import type { CliBackendConfig } from "../config/types.js";
|
||||
import {
|
||||
CLI_FRESH_WATCHDOG_DEFAULTS,
|
||||
CLI_RESUME_WATCHDOG_DEFAULTS,
|
||||
} from "./cli-watchdog-defaults.js";
|
||||
import { normalizeProviderId } from "./model-selection.js";
|
||||
|
||||
export type ResolvedCliBackend = {
|
||||
@@ -49,6 +53,12 @@ const DEFAULT_CLAUDE_BACKEND: CliBackendConfig = {
|
||||
systemPromptMode: "append",
|
||||
systemPromptWhen: "first",
|
||||
clearEnv: ["ANTHROPIC_API_KEY", "ANTHROPIC_API_KEY_OLD"],
|
||||
reliability: {
|
||||
watchdog: {
|
||||
fresh: { ...CLI_FRESH_WATCHDOG_DEFAULTS },
|
||||
resume: { ...CLI_RESUME_WATCHDOG_DEFAULTS },
|
||||
},
|
||||
},
|
||||
serialize: true,
|
||||
};
|
||||
|
||||
@@ -73,6 +83,12 @@ const DEFAULT_CODEX_BACKEND: CliBackendConfig = {
|
||||
sessionMode: "existing",
|
||||
imageArg: "--image",
|
||||
imageMode: "repeat",
|
||||
reliability: {
|
||||
watchdog: {
|
||||
fresh: { ...CLI_FRESH_WATCHDOG_DEFAULTS },
|
||||
resume: { ...CLI_RESUME_WATCHDOG_DEFAULTS },
|
||||
},
|
||||
},
|
||||
serialize: true,
|
||||
};
|
||||
|
||||
@@ -96,6 +112,10 @@ function mergeBackendConfig(base: CliBackendConfig, override?: CliBackendConfig)
|
||||
if (!override) {
|
||||
return { ...base };
|
||||
}
|
||||
const baseFresh = base.reliability?.watchdog?.fresh ?? {};
|
||||
const baseResume = base.reliability?.watchdog?.resume ?? {};
|
||||
const overrideFresh = override.reliability?.watchdog?.fresh ?? {};
|
||||
const overrideResume = override.reliability?.watchdog?.resume ?? {};
|
||||
return {
|
||||
...base,
|
||||
...override,
|
||||
@@ -106,6 +126,22 @@ function mergeBackendConfig(base: CliBackendConfig, override?: CliBackendConfig)
|
||||
sessionIdFields: override.sessionIdFields ?? base.sessionIdFields,
|
||||
sessionArgs: override.sessionArgs ?? base.sessionArgs,
|
||||
resumeArgs: override.resumeArgs ?? base.resumeArgs,
|
||||
reliability: {
|
||||
...base.reliability,
|
||||
...override.reliability,
|
||||
watchdog: {
|
||||
...base.reliability?.watchdog,
|
||||
...override.reliability?.watchdog,
|
||||
fresh: {
|
||||
...baseFresh,
|
||||
...overrideFresh,
|
||||
},
|
||||
resume: {
|
||||
...baseResume,
|
||||
...overrideResume,
|
||||
},
|
||||
},
|
||||
},
|
||||
};
|
||||
}
|
||||
|
||||
|
||||
@@ -3,50 +3,69 @@ import os from "node:os";
|
||||
import path from "node:path";
|
||||
import { beforeEach, describe, expect, it, vi } from "vitest";
|
||||
import type { OpenClawConfig } from "../config/config.js";
|
||||
import type { CliBackendConfig } from "../config/types.js";
|
||||
import { runCliAgent } from "./cli-runner.js";
|
||||
import { cleanupResumeProcesses, cleanupSuspendedCliProcesses } from "./cli-runner/helpers.js";
|
||||
import { resolveCliNoOutputTimeoutMs } from "./cli-runner/helpers.js";
|
||||
|
||||
const runCommandWithTimeoutMock = vi.fn();
|
||||
const runExecMock = vi.fn();
|
||||
const supervisorSpawnMock = vi.fn();
|
||||
|
||||
vi.mock("../process/exec.js", () => ({
|
||||
runCommandWithTimeout: (...args: unknown[]) => runCommandWithTimeoutMock(...args),
|
||||
runExec: (...args: unknown[]) => runExecMock(...args),
|
||||
vi.mock("../process/supervisor/index.js", () => ({
|
||||
getProcessSupervisor: () => ({
|
||||
spawn: (...args: unknown[]) => supervisorSpawnMock(...args),
|
||||
cancel: vi.fn(),
|
||||
cancelScope: vi.fn(),
|
||||
reconcileOrphans: vi.fn(),
|
||||
getRecord: vi.fn(),
|
||||
}),
|
||||
}));
|
||||
|
||||
describe("runCliAgent resume cleanup", () => {
|
||||
type MockRunExit = {
|
||||
reason:
|
||||
| "manual-cancel"
|
||||
| "overall-timeout"
|
||||
| "no-output-timeout"
|
||||
| "spawn-error"
|
||||
| "signal"
|
||||
| "exit";
|
||||
exitCode: number | null;
|
||||
exitSignal: NodeJS.Signals | number | null;
|
||||
durationMs: number;
|
||||
stdout: string;
|
||||
stderr: string;
|
||||
timedOut: boolean;
|
||||
noOutputTimedOut: boolean;
|
||||
};
|
||||
|
||||
function createManagedRun(exit: MockRunExit, pid = 1234) {
|
||||
return {
|
||||
runId: "run-supervisor",
|
||||
pid,
|
||||
startedAtMs: Date.now(),
|
||||
stdin: undefined,
|
||||
wait: vi.fn().mockResolvedValue(exit),
|
||||
cancel: vi.fn(),
|
||||
};
|
||||
}
|
||||
|
||||
describe("runCliAgent with process supervisor", () => {
|
||||
beforeEach(() => {
|
||||
runCommandWithTimeoutMock.mockReset();
|
||||
runExecMock.mockReset();
|
||||
supervisorSpawnMock.mockReset();
|
||||
});
|
||||
|
||||
it("kills stale resume processes for codex sessions", async () => {
|
||||
const selfPid = process.pid;
|
||||
|
||||
runExecMock
|
||||
.mockResolvedValueOnce({
|
||||
stdout: " 1 999 S /bin/launchd\n",
|
||||
it("runs CLI through supervisor and returns payload", async () => {
|
||||
supervisorSpawnMock.mockResolvedValueOnce(
|
||||
createManagedRun({
|
||||
reason: "exit",
|
||||
exitCode: 0,
|
||||
exitSignal: null,
|
||||
durationMs: 50,
|
||||
stdout: "ok",
|
||||
stderr: "",
|
||||
}) // cleanupSuspendedCliProcesses (ps) — ppid 999 != selfPid, no match
|
||||
.mockResolvedValueOnce({
|
||||
stdout: [
|
||||
` ${selfPid + 1} ${selfPid} codex exec resume thread-123 --color never --sandbox read-only --skip-git-repo-check`,
|
||||
` ${selfPid + 2} 999 codex exec resume thread-123 --color never --sandbox read-only --skip-git-repo-check`,
|
||||
].join("\n"),
|
||||
stderr: "",
|
||||
}) // cleanupResumeProcesses (ps)
|
||||
.mockResolvedValueOnce({ stdout: "", stderr: "" }) // cleanupResumeProcesses (kill -TERM)
|
||||
.mockResolvedValueOnce({ stdout: "", stderr: "" }); // cleanupResumeProcesses (kill -9)
|
||||
runCommandWithTimeoutMock.mockResolvedValueOnce({
|
||||
stdout: "ok",
|
||||
stderr: "",
|
||||
code: 0,
|
||||
signal: null,
|
||||
killed: false,
|
||||
});
|
||||
timedOut: false,
|
||||
noOutputTimedOut: false,
|
||||
}),
|
||||
);
|
||||
|
||||
await runCliAgent({
|
||||
const result = await runCliAgent({
|
||||
sessionId: "s1",
|
||||
sessionFile: "/tmp/session.jsonl",
|
||||
workspaceDir: "/tmp",
|
||||
@@ -58,28 +77,80 @@ describe("runCliAgent resume cleanup", () => {
|
||||
cliSessionId: "thread-123",
|
||||
});
|
||||
|
||||
if (process.platform === "win32") {
|
||||
expect(runExecMock).not.toHaveBeenCalled();
|
||||
return;
|
||||
}
|
||||
expect(result.payloads?.[0]?.text).toBe("ok");
|
||||
expect(supervisorSpawnMock).toHaveBeenCalledTimes(1);
|
||||
const input = supervisorSpawnMock.mock.calls[0]?.[0] as {
|
||||
argv?: string[];
|
||||
mode?: string;
|
||||
timeoutMs?: number;
|
||||
noOutputTimeoutMs?: number;
|
||||
replaceExistingScope?: boolean;
|
||||
scopeKey?: string;
|
||||
};
|
||||
expect(input.mode).toBe("child");
|
||||
expect(input.argv?.[0]).toBe("codex");
|
||||
expect(input.timeoutMs).toBe(1_000);
|
||||
expect(input.noOutputTimeoutMs).toBeGreaterThanOrEqual(1_000);
|
||||
expect(input.replaceExistingScope).toBe(true);
|
||||
expect(input.scopeKey).toContain("thread-123");
|
||||
});
|
||||
|
||||
expect(runExecMock).toHaveBeenCalledTimes(4);
|
||||
it("fails with timeout when no-output watchdog trips", async () => {
|
||||
supervisorSpawnMock.mockResolvedValueOnce(
|
||||
createManagedRun({
|
||||
reason: "no-output-timeout",
|
||||
exitCode: null,
|
||||
exitSignal: "SIGKILL",
|
||||
durationMs: 200,
|
||||
stdout: "",
|
||||
stderr: "",
|
||||
timedOut: true,
|
||||
noOutputTimedOut: true,
|
||||
}),
|
||||
);
|
||||
|
||||
// Second call: cleanupResumeProcesses ps
|
||||
const psCall = runExecMock.mock.calls[1] ?? [];
|
||||
expect(psCall[0]).toBe("ps");
|
||||
await expect(
|
||||
runCliAgent({
|
||||
sessionId: "s1",
|
||||
sessionFile: "/tmp/session.jsonl",
|
||||
workspaceDir: "/tmp",
|
||||
prompt: "hi",
|
||||
provider: "codex-cli",
|
||||
model: "gpt-5.2-codex",
|
||||
timeoutMs: 1_000,
|
||||
runId: "run-2",
|
||||
cliSessionId: "thread-123",
|
||||
}),
|
||||
).rejects.toThrow("produced no output");
|
||||
});
|
||||
|
||||
// Third call: TERM, only the child PID
|
||||
const termCall = runExecMock.mock.calls[2] ?? [];
|
||||
expect(termCall[0]).toBe("kill");
|
||||
const termArgs = termCall[1] as string[];
|
||||
expect(termArgs).toEqual(["-TERM", String(selfPid + 1)]);
|
||||
it("fails with timeout when overall timeout trips", async () => {
|
||||
supervisorSpawnMock.mockResolvedValueOnce(
|
||||
createManagedRun({
|
||||
reason: "overall-timeout",
|
||||
exitCode: null,
|
||||
exitSignal: "SIGKILL",
|
||||
durationMs: 200,
|
||||
stdout: "",
|
||||
stderr: "",
|
||||
timedOut: true,
|
||||
noOutputTimedOut: false,
|
||||
}),
|
||||
);
|
||||
|
||||
// Fourth call: KILL, only the child PID
|
||||
const killCall = runExecMock.mock.calls[3] ?? [];
|
||||
expect(killCall[0]).toBe("kill");
|
||||
const killArgs = killCall[1] as string[];
|
||||
expect(killArgs).toEqual(["-9", String(selfPid + 1)]);
|
||||
await expect(
|
||||
runCliAgent({
|
||||
sessionId: "s1",
|
||||
sessionFile: "/tmp/session.jsonl",
|
||||
workspaceDir: "/tmp",
|
||||
prompt: "hi",
|
||||
provider: "codex-cli",
|
||||
model: "gpt-5.2-codex",
|
||||
timeoutMs: 1_000,
|
||||
runId: "run-3",
|
||||
cliSessionId: "thread-123",
|
||||
}),
|
||||
).rejects.toThrow("exceeded timeout");
|
||||
});
|
||||
|
||||
it("falls back to per-agent workspace when workspaceDir is missing", async () => {
|
||||
@@ -94,14 +165,18 @@ describe("runCliAgent resume cleanup", () => {
|
||||
},
|
||||
} satisfies OpenClawConfig;
|
||||
|
||||
runExecMock.mockResolvedValue({ stdout: "", stderr: "" });
|
||||
runCommandWithTimeoutMock.mockResolvedValueOnce({
|
||||
stdout: "ok",
|
||||
stderr: "",
|
||||
code: 0,
|
||||
signal: null,
|
||||
killed: false,
|
||||
});
|
||||
supervisorSpawnMock.mockResolvedValueOnce(
|
||||
createManagedRun({
|
||||
reason: "exit",
|
||||
exitCode: 0,
|
||||
exitSignal: null,
|
||||
durationMs: 25,
|
||||
stdout: "ok",
|
||||
stderr: "",
|
||||
timedOut: false,
|
||||
noOutputTimedOut: false,
|
||||
}),
|
||||
);
|
||||
|
||||
try {
|
||||
await runCliAgent({
|
||||
@@ -114,264 +189,33 @@ describe("runCliAgent resume cleanup", () => {
|
||||
provider: "codex-cli",
|
||||
model: "gpt-5.2-codex",
|
||||
timeoutMs: 1_000,
|
||||
runId: "run-1",
|
||||
runId: "run-4",
|
||||
});
|
||||
} finally {
|
||||
await fs.rm(tempDir, { recursive: true, force: true });
|
||||
}
|
||||
|
||||
const options = runCommandWithTimeoutMock.mock.calls[0]?.[1] as { cwd?: string };
|
||||
expect(options.cwd).toBe(path.resolve(fallbackWorkspace));
|
||||
const input = supervisorSpawnMock.mock.calls[0]?.[0] as { cwd?: string };
|
||||
expect(input.cwd).toBe(path.resolve(fallbackWorkspace));
|
||||
});
|
||||
});
|
||||
|
||||
it("throws when sessionKey is malformed", async () => {
|
||||
const tempDir = await fs.mkdtemp(path.join(os.tmpdir(), "openclaw-cli-runner-"));
|
||||
const mainWorkspace = path.join(tempDir, "workspace-main");
|
||||
const researchWorkspace = path.join(tempDir, "workspace-research");
|
||||
await fs.mkdir(mainWorkspace, { recursive: true });
|
||||
await fs.mkdir(researchWorkspace, { recursive: true });
|
||||
const cfg = {
|
||||
agents: {
|
||||
defaults: {
|
||||
workspace: mainWorkspace,
|
||||
describe("resolveCliNoOutputTimeoutMs", () => {
|
||||
it("uses backend-configured resume watchdog override", () => {
|
||||
const timeoutMs = resolveCliNoOutputTimeoutMs({
|
||||
backend: {
|
||||
command: "codex",
|
||||
reliability: {
|
||||
watchdog: {
|
||||
resume: {
|
||||
noOutputTimeoutMs: 42_000,
|
||||
},
|
||||
},
|
||||
},
|
||||
list: [{ id: "research", workspace: researchWorkspace }],
|
||||
},
|
||||
} satisfies OpenClawConfig;
|
||||
|
||||
try {
|
||||
await expect(
|
||||
runCliAgent({
|
||||
sessionId: "s1",
|
||||
sessionKey: "agent::broken",
|
||||
agentId: "research",
|
||||
sessionFile: "/tmp/session.jsonl",
|
||||
workspaceDir: undefined as unknown as string,
|
||||
config: cfg,
|
||||
prompt: "hi",
|
||||
provider: "codex-cli",
|
||||
model: "gpt-5.2-codex",
|
||||
timeoutMs: 1_000,
|
||||
runId: "run-2",
|
||||
}),
|
||||
).rejects.toThrow("Malformed agent session key");
|
||||
} finally {
|
||||
await fs.rm(tempDir, { recursive: true, force: true });
|
||||
}
|
||||
expect(runCommandWithTimeoutMock).not.toHaveBeenCalled();
|
||||
});
|
||||
});
|
||||
|
||||
describe("cleanupSuspendedCliProcesses", () => {
|
||||
beforeEach(() => {
|
||||
runExecMock.mockReset();
|
||||
});
|
||||
|
||||
it("skips when no session tokens are configured", async () => {
|
||||
await cleanupSuspendedCliProcesses(
|
||||
{
|
||||
command: "tool",
|
||||
} as CliBackendConfig,
|
||||
0,
|
||||
);
|
||||
|
||||
if (process.platform === "win32") {
|
||||
expect(runExecMock).not.toHaveBeenCalled();
|
||||
return;
|
||||
}
|
||||
|
||||
expect(runExecMock).not.toHaveBeenCalled();
|
||||
});
|
||||
|
||||
it("matches sessionArg-based commands", async () => {
|
||||
const selfPid = process.pid;
|
||||
runExecMock
|
||||
.mockResolvedValueOnce({
|
||||
stdout: [
|
||||
` 40 ${selfPid} T+ claude --session-id thread-1 -p`,
|
||||
` 41 ${selfPid} S claude --session-id thread-2 -p`,
|
||||
].join("\n"),
|
||||
stderr: "",
|
||||
})
|
||||
.mockResolvedValueOnce({ stdout: "", stderr: "" });
|
||||
|
||||
await cleanupSuspendedCliProcesses(
|
||||
{
|
||||
command: "claude",
|
||||
sessionArg: "--session-id",
|
||||
} as CliBackendConfig,
|
||||
0,
|
||||
);
|
||||
|
||||
if (process.platform === "win32") {
|
||||
expect(runExecMock).not.toHaveBeenCalled();
|
||||
return;
|
||||
}
|
||||
|
||||
expect(runExecMock).toHaveBeenCalledTimes(2);
|
||||
const killCall = runExecMock.mock.calls[1] ?? [];
|
||||
expect(killCall[0]).toBe("kill");
|
||||
expect(killCall[1]).toEqual(["-9", "40"]);
|
||||
});
|
||||
|
||||
it("matches resumeArgs with positional session id", async () => {
|
||||
const selfPid = process.pid;
|
||||
runExecMock
|
||||
.mockResolvedValueOnce({
|
||||
stdout: [
|
||||
` 50 ${selfPid} T codex exec resume thread-99 --color never --sandbox read-only`,
|
||||
` 51 ${selfPid} T codex exec resume other --color never --sandbox read-only`,
|
||||
].join("\n"),
|
||||
stderr: "",
|
||||
})
|
||||
.mockResolvedValueOnce({ stdout: "", stderr: "" });
|
||||
|
||||
await cleanupSuspendedCliProcesses(
|
||||
{
|
||||
command: "codex",
|
||||
resumeArgs: ["exec", "resume", "{sessionId}", "--color", "never", "--sandbox", "read-only"],
|
||||
} as CliBackendConfig,
|
||||
1,
|
||||
);
|
||||
|
||||
if (process.platform === "win32") {
|
||||
expect(runExecMock).not.toHaveBeenCalled();
|
||||
return;
|
||||
}
|
||||
|
||||
expect(runExecMock).toHaveBeenCalledTimes(2);
|
||||
const killCall = runExecMock.mock.calls[1] ?? [];
|
||||
expect(killCall[0]).toBe("kill");
|
||||
expect(killCall[1]).toEqual(["-9", "50", "51"]);
|
||||
});
|
||||
|
||||
it("only kills child processes of current process (ppid validation)", async () => {
|
||||
const selfPid = process.pid;
|
||||
const childPid = selfPid + 1;
|
||||
const unrelatedPid = 9999;
|
||||
|
||||
runExecMock
|
||||
.mockResolvedValueOnce({
|
||||
stdout: [
|
||||
` ${childPid} ${selfPid} T claude --session-id thread-1 -p`,
|
||||
` ${unrelatedPid} 100 T claude --session-id thread-2 -p`,
|
||||
].join("\n"),
|
||||
stderr: "",
|
||||
})
|
||||
.mockResolvedValueOnce({ stdout: "", stderr: "" });
|
||||
|
||||
await cleanupSuspendedCliProcesses(
|
||||
{
|
||||
command: "claude",
|
||||
sessionArg: "--session-id",
|
||||
} as CliBackendConfig,
|
||||
0,
|
||||
);
|
||||
|
||||
if (process.platform === "win32") {
|
||||
expect(runExecMock).not.toHaveBeenCalled();
|
||||
return;
|
||||
}
|
||||
|
||||
expect(runExecMock).toHaveBeenCalledTimes(2);
|
||||
const killCall = runExecMock.mock.calls[1] ?? [];
|
||||
expect(killCall[0]).toBe("kill");
|
||||
// Only childPid killed; unrelatedPid (ppid=100) excluded
|
||||
expect(killCall[1]).toEqual(["-9", String(childPid)]);
|
||||
});
|
||||
|
||||
it("skips all processes when none are children of current process", async () => {
|
||||
runExecMock.mockResolvedValueOnce({
|
||||
stdout: [
|
||||
" 200 100 T claude --session-id thread-1 -p",
|
||||
" 201 100 T claude --session-id thread-2 -p",
|
||||
].join("\n"),
|
||||
stderr: "",
|
||||
timeoutMs: 120_000,
|
||||
useResume: true,
|
||||
});
|
||||
|
||||
await cleanupSuspendedCliProcesses(
|
||||
{
|
||||
command: "claude",
|
||||
sessionArg: "--session-id",
|
||||
} as CliBackendConfig,
|
||||
0,
|
||||
);
|
||||
|
||||
if (process.platform === "win32") {
|
||||
expect(runExecMock).not.toHaveBeenCalled();
|
||||
return;
|
||||
}
|
||||
|
||||
// Only ps called — no kill because no matching ppid
|
||||
expect(runExecMock).toHaveBeenCalledTimes(1);
|
||||
});
|
||||
});
|
||||
|
||||
describe("cleanupResumeProcesses", () => {
|
||||
beforeEach(() => {
|
||||
runExecMock.mockReset();
|
||||
});
|
||||
|
||||
it("only kills resume processes owned by current process", async () => {
|
||||
const selfPid = process.pid;
|
||||
|
||||
runExecMock
|
||||
.mockResolvedValueOnce({
|
||||
stdout: [
|
||||
` ${selfPid + 1} ${selfPid} codex exec resume abc-123`,
|
||||
` ${selfPid + 2} 999 codex exec resume abc-123`,
|
||||
].join("\n"),
|
||||
stderr: "",
|
||||
})
|
||||
.mockResolvedValueOnce({ stdout: "", stderr: "" })
|
||||
.mockResolvedValueOnce({ stdout: "", stderr: "" });
|
||||
|
||||
await cleanupResumeProcesses(
|
||||
{
|
||||
command: "codex",
|
||||
resumeArgs: ["exec", "resume", "{sessionId}"],
|
||||
} as CliBackendConfig,
|
||||
"abc-123",
|
||||
);
|
||||
|
||||
if (process.platform === "win32") {
|
||||
expect(runExecMock).not.toHaveBeenCalled();
|
||||
return;
|
||||
}
|
||||
|
||||
expect(runExecMock).toHaveBeenCalledTimes(3);
|
||||
|
||||
const termCall = runExecMock.mock.calls[1] ?? [];
|
||||
expect(termCall[0]).toBe("kill");
|
||||
expect(termCall[1]).toEqual(["-TERM", String(selfPid + 1)]);
|
||||
|
||||
const killCall = runExecMock.mock.calls[2] ?? [];
|
||||
expect(killCall[0]).toBe("kill");
|
||||
expect(killCall[1]).toEqual(["-9", String(selfPid + 1)]);
|
||||
});
|
||||
|
||||
it("skips kill when no resume processes match ppid", async () => {
|
||||
runExecMock.mockResolvedValueOnce({
|
||||
stdout: [" 300 100 codex exec resume abc-123", " 301 200 codex exec resume abc-123"].join(
|
||||
"\n",
|
||||
),
|
||||
stderr: "",
|
||||
});
|
||||
|
||||
await cleanupResumeProcesses(
|
||||
{
|
||||
command: "codex",
|
||||
resumeArgs: ["exec", "resume", "{sessionId}"],
|
||||
} as CliBackendConfig,
|
||||
"abc-123",
|
||||
);
|
||||
|
||||
if (process.platform === "win32") {
|
||||
expect(runExecMock).not.toHaveBeenCalled();
|
||||
return;
|
||||
}
|
||||
|
||||
// Only ps called — no kill because no matching ppid
|
||||
expect(runExecMock).toHaveBeenCalledTimes(1);
|
||||
expect(timeoutMs).toBe(42_000);
|
||||
});
|
||||
});
|
||||
|
||||
@@ -6,20 +6,20 @@ import { resolveHeartbeatPrompt } from "../auto-reply/heartbeat.js";
|
||||
import { shouldLogVerbose } from "../globals.js";
|
||||
import { isTruthyEnvValue } from "../infra/env.js";
|
||||
import { createSubsystemLogger } from "../logging/subsystem.js";
|
||||
import { runCommandWithTimeout } from "../process/exec.js";
|
||||
import { getProcessSupervisor } from "../process/supervisor/index.js";
|
||||
import { resolveSessionAgentIds } from "./agent-scope.js";
|
||||
import { makeBootstrapWarn, resolveBootstrapContextForRun } from "./bootstrap-files.js";
|
||||
import { resolveCliBackendConfig } from "./cli-backends.js";
|
||||
import {
|
||||
appendImagePathsToPrompt,
|
||||
buildCliSupervisorScopeKey,
|
||||
buildCliArgs,
|
||||
buildSystemPrompt,
|
||||
cleanupResumeProcesses,
|
||||
cleanupSuspendedCliProcesses,
|
||||
enqueueCliRun,
|
||||
normalizeCliModel,
|
||||
parseCliJson,
|
||||
parseCliJsonl,
|
||||
resolveCliNoOutputTimeoutMs,
|
||||
resolvePromptInput,
|
||||
resolveSessionIdToSend,
|
||||
resolveSystemPromptUsage,
|
||||
@@ -226,19 +226,32 @@ export async function runCliAgent(params: {
|
||||
}
|
||||
return next;
|
||||
})();
|
||||
|
||||
// Cleanup suspended processes that have accumulated (regardless of sessionId)
|
||||
await cleanupSuspendedCliProcesses(backend);
|
||||
if (useResume && cliSessionIdToSend) {
|
||||
await cleanupResumeProcesses(backend, cliSessionIdToSend);
|
||||
}
|
||||
|
||||
const result = await runCommandWithTimeout([backend.command, ...args], {
|
||||
const noOutputTimeoutMs = resolveCliNoOutputTimeoutMs({
|
||||
backend,
|
||||
timeoutMs: params.timeoutMs,
|
||||
useResume,
|
||||
});
|
||||
const supervisor = getProcessSupervisor();
|
||||
const scopeKey = buildCliSupervisorScopeKey({
|
||||
backend,
|
||||
backendId: backendResolved.id,
|
||||
cliSessionId: useResume ? cliSessionIdToSend : undefined,
|
||||
});
|
||||
|
||||
const managedRun = await supervisor.spawn({
|
||||
sessionId: params.sessionId,
|
||||
backendId: backendResolved.id,
|
||||
scopeKey,
|
||||
replaceExistingScope: Boolean(useResume && scopeKey),
|
||||
mode: "child",
|
||||
argv: [backend.command, ...args],
|
||||
timeoutMs: params.timeoutMs,
|
||||
noOutputTimeoutMs,
|
||||
cwd: workspaceDir,
|
||||
env,
|
||||
input: stdinPayload,
|
||||
});
|
||||
const result = await managedRun.wait();
|
||||
|
||||
const stdout = result.stdout.trim();
|
||||
const stderr = result.stderr.trim();
|
||||
@@ -259,7 +272,28 @@ export async function runCliAgent(params: {
|
||||
}
|
||||
}
|
||||
|
||||
if (result.code !== 0) {
|
||||
if (result.exitCode !== 0 || result.reason !== "exit") {
|
||||
if (result.reason === "no-output-timeout" || result.noOutputTimedOut) {
|
||||
const timeoutReason = `CLI produced no output for ${Math.round(noOutputTimeoutMs / 1000)}s and was terminated.`;
|
||||
log.warn(
|
||||
`cli watchdog timeout: provider=${params.provider} model=${modelId} session=${cliSessionIdToSend ?? params.sessionId} noOutputTimeoutMs=${noOutputTimeoutMs} pid=${managedRun.pid ?? "unknown"}`,
|
||||
);
|
||||
throw new FailoverError(timeoutReason, {
|
||||
reason: "timeout",
|
||||
provider: params.provider,
|
||||
model: modelId,
|
||||
status: resolveFailoverStatus("timeout"),
|
||||
});
|
||||
}
|
||||
if (result.reason === "overall-timeout") {
|
||||
const timeoutReason = `CLI exceeded timeout (${Math.round(params.timeoutMs / 1000)}s) and was terminated.`;
|
||||
throw new FailoverError(timeoutReason, {
|
||||
reason: "timeout",
|
||||
provider: params.provider,
|
||||
model: modelId,
|
||||
status: resolveFailoverStatus("timeout"),
|
||||
});
|
||||
}
|
||||
const err = stderr || stdout || "CLI failed.";
|
||||
const reason = classifyFailoverReason(err) ?? "unknown";
|
||||
const status = resolveFailoverStatus(reason);
|
||||
|
||||
@@ -8,232 +8,27 @@ import type { ThinkLevel } from "../../auto-reply/thinking.js";
|
||||
import type { OpenClawConfig } from "../../config/config.js";
|
||||
import type { CliBackendConfig } from "../../config/types.js";
|
||||
import type { EmbeddedContextFile } from "../pi-embedded-helpers.js";
|
||||
import { runExec } from "../../process/exec.js";
|
||||
import { buildTtsSystemPromptHint } from "../../tts/tts.js";
|
||||
import { escapeRegExp, isRecord } from "../../utils.js";
|
||||
import { isRecord } from "../../utils.js";
|
||||
import { buildModelAliasLines } from "../model-alias-lines.js";
|
||||
import { resolveDefaultModelForAgent } from "../model-selection.js";
|
||||
import { detectRuntimeShell } from "../shell-utils.js";
|
||||
import { buildSystemPromptParams } from "../system-prompt-params.js";
|
||||
import { buildAgentSystemPrompt } from "../system-prompt.js";
|
||||
export { buildCliSupervisorScopeKey, resolveCliNoOutputTimeoutMs } from "./reliability.js";
|
||||
|
||||
const CLI_RUN_QUEUE = new Map<string, Promise<unknown>>();
|
||||
|
||||
function buildLooseArgOrderRegex(tokens: string[]): RegExp {
|
||||
// Scan `ps` output lines. Keep matching flexible, but require whitespace arg boundaries
|
||||
// to avoid substring matches like `codexx` or `/path/to/codexx`.
|
||||
const [head, ...rest] = tokens.map((t) => String(t ?? "").trim()).filter(Boolean);
|
||||
if (!head) {
|
||||
return /$^/;
|
||||
}
|
||||
|
||||
const headEscaped = escapeRegExp(head);
|
||||
const headFragment = `(?:^|\\s)(?:${headEscaped}|\\S+\\/${headEscaped})(?=\\s|$)`;
|
||||
const restFragments = rest.map((t) => `(?:^|\\s)${escapeRegExp(t)}(?=\\s|$)`);
|
||||
return new RegExp([headFragment, ...restFragments].join(".*"));
|
||||
}
|
||||
|
||||
async function psWithFallback(argsA: string[], argsB: string[]): Promise<string> {
|
||||
try {
|
||||
const { stdout } = await runExec("ps", argsA);
|
||||
return stdout;
|
||||
} catch {
|
||||
// fallthrough
|
||||
}
|
||||
const { stdout } = await runExec("ps", argsB);
|
||||
return stdout;
|
||||
}
|
||||
|
||||
export async function cleanupResumeProcesses(
|
||||
backend: CliBackendConfig,
|
||||
sessionId: string,
|
||||
): Promise<void> {
|
||||
if (process.platform === "win32") {
|
||||
return;
|
||||
}
|
||||
const resumeArgs = backend.resumeArgs ?? [];
|
||||
if (resumeArgs.length === 0) {
|
||||
return;
|
||||
}
|
||||
if (!resumeArgs.some((arg) => arg.includes("{sessionId}"))) {
|
||||
return;
|
||||
}
|
||||
const commandToken = path.basename(backend.command ?? "").trim();
|
||||
if (!commandToken) {
|
||||
return;
|
||||
}
|
||||
|
||||
const resumeTokens = resumeArgs.map((arg) => arg.replaceAll("{sessionId}", sessionId));
|
||||
const pattern = [commandToken, ...resumeTokens]
|
||||
.filter(Boolean)
|
||||
.map((token) => escapeRegExp(token))
|
||||
.join(".*");
|
||||
if (!pattern) {
|
||||
return;
|
||||
}
|
||||
|
||||
try {
|
||||
const stdout = await psWithFallback(
|
||||
["-axww", "-o", "pid=,ppid=,command="],
|
||||
["-ax", "-o", "pid=,ppid=,command="],
|
||||
);
|
||||
const patternRegex = buildLooseArgOrderRegex([commandToken, ...resumeTokens]);
|
||||
const toKill: number[] = [];
|
||||
|
||||
for (const line of stdout.split("\n")) {
|
||||
const trimmed = line.trim();
|
||||
if (!trimmed) {
|
||||
continue;
|
||||
}
|
||||
const match = /^(\d+)\s+(\d+)\s+(.*)$/.exec(trimmed);
|
||||
if (!match) {
|
||||
continue;
|
||||
}
|
||||
const pid = Number(match[1]);
|
||||
const ppid = Number(match[2]);
|
||||
const cmd = match[3] ?? "";
|
||||
if (!Number.isFinite(pid)) {
|
||||
continue;
|
||||
}
|
||||
if (ppid !== process.pid) {
|
||||
continue;
|
||||
}
|
||||
if (!patternRegex.test(cmd)) {
|
||||
continue;
|
||||
}
|
||||
toKill.push(pid);
|
||||
}
|
||||
|
||||
if (toKill.length > 0) {
|
||||
const pidArgs = toKill.map((pid) => String(pid));
|
||||
try {
|
||||
await runExec("kill", ["-TERM", ...pidArgs]);
|
||||
} catch {
|
||||
// ignore
|
||||
}
|
||||
await new Promise((resolve) => setTimeout(resolve, 250));
|
||||
try {
|
||||
await runExec("kill", ["-9", ...pidArgs]);
|
||||
} catch {
|
||||
// ignore
|
||||
}
|
||||
}
|
||||
} catch {
|
||||
// ignore errors - best effort cleanup
|
||||
}
|
||||
}
|
||||
|
||||
function buildSessionMatchers(backend: CliBackendConfig): RegExp[] {
|
||||
const commandToken = path.basename(backend.command ?? "").trim();
|
||||
if (!commandToken) {
|
||||
return [];
|
||||
}
|
||||
const matchers: RegExp[] = [];
|
||||
const sessionArg = backend.sessionArg?.trim();
|
||||
const sessionArgs = backend.sessionArgs ?? [];
|
||||
const resumeArgs = backend.resumeArgs ?? [];
|
||||
|
||||
const addMatcher = (args: string[]) => {
|
||||
if (args.length === 0) {
|
||||
return;
|
||||
}
|
||||
const tokens = [commandToken, ...args];
|
||||
const pattern = tokens
|
||||
.map((token, index) => {
|
||||
const tokenPattern = tokenToRegex(token);
|
||||
return index === 0 ? `(?:^|\\s)${tokenPattern}` : `\\s+${tokenPattern}`;
|
||||
})
|
||||
.join("");
|
||||
matchers.push(new RegExp(pattern));
|
||||
};
|
||||
|
||||
if (sessionArgs.some((arg) => arg.includes("{sessionId}"))) {
|
||||
addMatcher(sessionArgs);
|
||||
} else if (sessionArg) {
|
||||
addMatcher([sessionArg, "{sessionId}"]);
|
||||
}
|
||||
|
||||
if (resumeArgs.some((arg) => arg.includes("{sessionId}"))) {
|
||||
addMatcher(resumeArgs);
|
||||
}
|
||||
|
||||
return matchers;
|
||||
}
|
||||
|
||||
function tokenToRegex(token: string): string {
|
||||
if (!token.includes("{sessionId}")) {
|
||||
return escapeRegExp(token);
|
||||
}
|
||||
const parts = token.split("{sessionId}").map((part) => escapeRegExp(part));
|
||||
return parts.join("\\S+");
|
||||
}
|
||||
|
||||
/**
|
||||
* Cleanup suspended OpenClaw CLI processes that have accumulated.
|
||||
* Only cleans up if there are more than the threshold (default: 10).
|
||||
*/
|
||||
export async function cleanupSuspendedCliProcesses(
|
||||
backend: CliBackendConfig,
|
||||
threshold = 10,
|
||||
): Promise<void> {
|
||||
if (process.platform === "win32") {
|
||||
return;
|
||||
}
|
||||
const matchers = buildSessionMatchers(backend);
|
||||
if (matchers.length === 0) {
|
||||
return;
|
||||
}
|
||||
|
||||
try {
|
||||
const stdout = await psWithFallback(
|
||||
["-axww", "-o", "pid=,ppid=,stat=,command="],
|
||||
["-ax", "-o", "pid=,ppid=,stat=,command="],
|
||||
);
|
||||
const suspended: number[] = [];
|
||||
for (const line of stdout.split("\n")) {
|
||||
const trimmed = line.trim();
|
||||
if (!trimmed) {
|
||||
continue;
|
||||
}
|
||||
const match = /^(\d+)\s+(\d+)\s+(\S+)\s+(.*)$/.exec(trimmed);
|
||||
if (!match) {
|
||||
continue;
|
||||
}
|
||||
const pid = Number(match[1]);
|
||||
const ppid = Number(match[2]);
|
||||
const stat = match[3] ?? "";
|
||||
const command = match[4] ?? "";
|
||||
if (!Number.isFinite(pid)) {
|
||||
continue;
|
||||
}
|
||||
if (ppid !== process.pid) {
|
||||
continue;
|
||||
}
|
||||
if (!stat.includes("T")) {
|
||||
continue;
|
||||
}
|
||||
if (!matchers.some((matcher) => matcher.test(command))) {
|
||||
continue;
|
||||
}
|
||||
suspended.push(pid);
|
||||
}
|
||||
|
||||
if (suspended.length > threshold) {
|
||||
// Verified locally: stopped (T) processes ignore SIGTERM, so use SIGKILL.
|
||||
await runExec("kill", ["-9", ...suspended.map((pid) => String(pid))]);
|
||||
}
|
||||
} catch {
|
||||
// ignore errors - best effort cleanup
|
||||
}
|
||||
}
|
||||
export function enqueueCliRun<T>(key: string, task: () => Promise<T>): Promise<T> {
|
||||
const prior = CLI_RUN_QUEUE.get(key) ?? Promise.resolve();
|
||||
const chained = prior.catch(() => undefined).then(task);
|
||||
const tracked = chained.finally(() => {
|
||||
if (CLI_RUN_QUEUE.get(key) === tracked) {
|
||||
CLI_RUN_QUEUE.delete(key);
|
||||
}
|
||||
});
|
||||
// Keep queue continuity even when a run rejects, without emitting unhandled rejections.
|
||||
const tracked = chained
|
||||
.catch(() => undefined)
|
||||
.finally(() => {
|
||||
if (CLI_RUN_QUEUE.get(key) === tracked) {
|
||||
CLI_RUN_QUEUE.delete(key);
|
||||
}
|
||||
});
|
||||
CLI_RUN_QUEUE.set(key, tracked);
|
||||
return chained;
|
||||
}
|
||||
|
||||
88
src/agents/cli-runner/reliability.ts
Normal file
88
src/agents/cli-runner/reliability.ts
Normal file
@@ -0,0 +1,88 @@
|
||||
import path from "node:path";
|
||||
import type { CliBackendConfig } from "../../config/types.js";
|
||||
import {
|
||||
CLI_FRESH_WATCHDOG_DEFAULTS,
|
||||
CLI_RESUME_WATCHDOG_DEFAULTS,
|
||||
CLI_WATCHDOG_MIN_TIMEOUT_MS,
|
||||
} from "../cli-watchdog-defaults.js";
|
||||
|
||||
function pickWatchdogProfile(
|
||||
backend: CliBackendConfig,
|
||||
useResume: boolean,
|
||||
): {
|
||||
noOutputTimeoutMs?: number;
|
||||
noOutputTimeoutRatio: number;
|
||||
minMs: number;
|
||||
maxMs: number;
|
||||
} {
|
||||
const defaults = useResume ? CLI_RESUME_WATCHDOG_DEFAULTS : CLI_FRESH_WATCHDOG_DEFAULTS;
|
||||
const configured = useResume
|
||||
? backend.reliability?.watchdog?.resume
|
||||
: backend.reliability?.watchdog?.fresh;
|
||||
|
||||
const ratio = (() => {
|
||||
const value = configured?.noOutputTimeoutRatio;
|
||||
if (typeof value !== "number" || !Number.isFinite(value)) {
|
||||
return defaults.noOutputTimeoutRatio;
|
||||
}
|
||||
return Math.max(0.05, Math.min(0.95, value));
|
||||
})();
|
||||
const minMs = (() => {
|
||||
const value = configured?.minMs;
|
||||
if (typeof value !== "number" || !Number.isFinite(value)) {
|
||||
return defaults.minMs;
|
||||
}
|
||||
return Math.max(CLI_WATCHDOG_MIN_TIMEOUT_MS, Math.floor(value));
|
||||
})();
|
||||
const maxMs = (() => {
|
||||
const value = configured?.maxMs;
|
||||
if (typeof value !== "number" || !Number.isFinite(value)) {
|
||||
return defaults.maxMs;
|
||||
}
|
||||
return Math.max(CLI_WATCHDOG_MIN_TIMEOUT_MS, Math.floor(value));
|
||||
})();
|
||||
|
||||
return {
|
||||
noOutputTimeoutMs:
|
||||
typeof configured?.noOutputTimeoutMs === "number" &&
|
||||
Number.isFinite(configured.noOutputTimeoutMs)
|
||||
? Math.max(CLI_WATCHDOG_MIN_TIMEOUT_MS, Math.floor(configured.noOutputTimeoutMs))
|
||||
: undefined,
|
||||
noOutputTimeoutRatio: ratio,
|
||||
minMs: Math.min(minMs, maxMs),
|
||||
maxMs: Math.max(minMs, maxMs),
|
||||
};
|
||||
}
|
||||
|
||||
export function resolveCliNoOutputTimeoutMs(params: {
|
||||
backend: CliBackendConfig;
|
||||
timeoutMs: number;
|
||||
useResume: boolean;
|
||||
}): number {
|
||||
const profile = pickWatchdogProfile(params.backend, params.useResume);
|
||||
// Keep watchdog below global timeout in normal cases.
|
||||
const cap = Math.max(CLI_WATCHDOG_MIN_TIMEOUT_MS, params.timeoutMs - 1_000);
|
||||
if (profile.noOutputTimeoutMs !== undefined) {
|
||||
return Math.min(profile.noOutputTimeoutMs, cap);
|
||||
}
|
||||
const computed = Math.floor(params.timeoutMs * profile.noOutputTimeoutRatio);
|
||||
const bounded = Math.min(profile.maxMs, Math.max(profile.minMs, computed));
|
||||
return Math.min(bounded, cap);
|
||||
}
|
||||
|
||||
export function buildCliSupervisorScopeKey(params: {
|
||||
backend: CliBackendConfig;
|
||||
backendId: string;
|
||||
cliSessionId?: string;
|
||||
}): string | undefined {
|
||||
const commandToken = path
|
||||
.basename(params.backend.command ?? "")
|
||||
.trim()
|
||||
.toLowerCase();
|
||||
const backendToken = params.backendId.trim().toLowerCase();
|
||||
const sessionToken = params.cliSessionId?.trim();
|
||||
if (!sessionToken) {
|
||||
return undefined;
|
||||
}
|
||||
return `cli:${backendToken}:${commandToken}:${sessionToken}`;
|
||||
}
|
||||
13
src/agents/cli-watchdog-defaults.ts
Normal file
13
src/agents/cli-watchdog-defaults.ts
Normal file
@@ -0,0 +1,13 @@
|
||||
export const CLI_WATCHDOG_MIN_TIMEOUT_MS = 1_000;
|
||||
|
||||
export const CLI_FRESH_WATCHDOG_DEFAULTS = {
|
||||
noOutputTimeoutRatio: 0.8,
|
||||
minMs: 180_000,
|
||||
maxMs: 600_000,
|
||||
} as const;
|
||||
|
||||
export const CLI_RESUME_WATCHDOG_DEFAULTS = {
|
||||
noOutputTimeoutRatio: 0.3,
|
||||
minMs: 60_000,
|
||||
maxMs: 180_000,
|
||||
} as const;
|
||||
@@ -91,6 +91,34 @@ export type CliBackendConfig = {
|
||||
imageMode?: "repeat" | "list";
|
||||
/** Serialize runs for this CLI. */
|
||||
serialize?: boolean;
|
||||
/** Runtime reliability tuning for this backend's process lifecycle. */
|
||||
reliability?: {
|
||||
/** No-output watchdog tuning (fresh vs resumed runs). */
|
||||
watchdog?: {
|
||||
/** Fresh/new sessions (non-resume). */
|
||||
fresh?: {
|
||||
/** Fixed watchdog timeout in ms (overrides ratio when set). */
|
||||
noOutputTimeoutMs?: number;
|
||||
/** Fraction of overall timeout used when fixed timeout is not set. */
|
||||
noOutputTimeoutRatio?: number;
|
||||
/** Lower bound for computed watchdog timeout. */
|
||||
minMs?: number;
|
||||
/** Upper bound for computed watchdog timeout. */
|
||||
maxMs?: number;
|
||||
};
|
||||
/** Resume sessions. */
|
||||
resume?: {
|
||||
/** Fixed watchdog timeout in ms (overrides ratio when set). */
|
||||
noOutputTimeoutMs?: number;
|
||||
/** Fraction of overall timeout used when fixed timeout is not set. */
|
||||
noOutputTimeoutRatio?: number;
|
||||
/** Lower bound for computed watchdog timeout. */
|
||||
minMs?: number;
|
||||
/** Upper bound for computed watchdog timeout. */
|
||||
maxMs?: number;
|
||||
};
|
||||
};
|
||||
};
|
||||
};
|
||||
|
||||
export type AgentDefaultsConfig = {
|
||||
|
||||
@@ -275,6 +275,34 @@ export const CliBackendSchema = z
|
||||
imageArg: z.string().optional(),
|
||||
imageMode: z.union([z.literal("repeat"), z.literal("list")]).optional(),
|
||||
serialize: z.boolean().optional(),
|
||||
reliability: z
|
||||
.object({
|
||||
watchdog: z
|
||||
.object({
|
||||
fresh: z
|
||||
.object({
|
||||
noOutputTimeoutMs: z.number().int().min(1000).optional(),
|
||||
noOutputTimeoutRatio: z.number().min(0.05).max(0.95).optional(),
|
||||
minMs: z.number().int().min(1000).optional(),
|
||||
maxMs: z.number().int().min(1000).optional(),
|
||||
})
|
||||
.strict()
|
||||
.optional(),
|
||||
resume: z
|
||||
.object({
|
||||
noOutputTimeoutMs: z.number().int().min(1000).optional(),
|
||||
noOutputTimeoutRatio: z.number().min(0.05).max(0.95).optional(),
|
||||
minMs: z.number().int().min(1000).optional(),
|
||||
maxMs: z.number().int().min(1000).optional(),
|
||||
})
|
||||
.strict()
|
||||
.optional(),
|
||||
})
|
||||
.strict()
|
||||
.optional(),
|
||||
})
|
||||
.strict()
|
||||
.optional(),
|
||||
})
|
||||
.strict();
|
||||
|
||||
|
||||
@@ -23,6 +23,7 @@ describe("runCommandWithTimeout", () => {
|
||||
|
||||
expect(result.code).toBe(0);
|
||||
expect(result.stdout).toBe("ok");
|
||||
expect(result.termination).toBe("exit");
|
||||
});
|
||||
|
||||
it("merges custom env with process.env", async () => {
|
||||
@@ -43,8 +44,58 @@ describe("runCommandWithTimeout", () => {
|
||||
|
||||
expect(result.code).toBe(0);
|
||||
expect(result.stdout).toBe("base|ok");
|
||||
expect(result.termination).toBe("exit");
|
||||
} finally {
|
||||
envSnapshot.restore();
|
||||
}
|
||||
});
|
||||
|
||||
it("kills command when no output timeout elapses", async () => {
|
||||
const startedAt = Date.now();
|
||||
const result = await runCommandWithTimeout(
|
||||
[process.execPath, "-e", "setTimeout(() => {}, 10_000)"],
|
||||
{
|
||||
timeoutMs: 5_000,
|
||||
noOutputTimeoutMs: 300,
|
||||
},
|
||||
);
|
||||
|
||||
const durationMs = Date.now() - startedAt;
|
||||
expect(durationMs).toBeLessThan(2_500);
|
||||
expect(result.termination).toBe("no-output-timeout");
|
||||
expect(result.noOutputTimedOut).toBe(true);
|
||||
expect(result.code).not.toBe(0);
|
||||
});
|
||||
|
||||
it("resets no output timer when command keeps emitting output", async () => {
|
||||
const result = await runCommandWithTimeout(
|
||||
[
|
||||
process.execPath,
|
||||
"-e",
|
||||
'let i=0; const t=setInterval(() => { process.stdout.write("."); i += 1; if (i >= 5) { clearInterval(t); process.exit(0); } }, 50);',
|
||||
],
|
||||
{
|
||||
timeoutMs: 5_000,
|
||||
noOutputTimeoutMs: 200,
|
||||
},
|
||||
);
|
||||
|
||||
expect(result.code).toBe(0);
|
||||
expect(result.termination).toBe("exit");
|
||||
expect(result.noOutputTimedOut).toBe(false);
|
||||
expect(result.stdout.length).toBeGreaterThanOrEqual(5);
|
||||
});
|
||||
|
||||
it("reports global timeout termination when overall timeout elapses", async () => {
|
||||
const result = await runCommandWithTimeout(
|
||||
[process.execPath, "-e", "setTimeout(() => {}, 10_000)"],
|
||||
{
|
||||
timeoutMs: 200,
|
||||
},
|
||||
);
|
||||
|
||||
expect(result.termination).toBe("timeout");
|
||||
expect(result.noOutputTimedOut).toBe(false);
|
||||
expect(result.code).not.toBe(0);
|
||||
});
|
||||
});
|
||||
|
||||
@@ -76,11 +76,14 @@ export async function runExec(
|
||||
}
|
||||
|
||||
export type SpawnResult = {
|
||||
pid?: number;
|
||||
stdout: string;
|
||||
stderr: string;
|
||||
code: number | null;
|
||||
signal: NodeJS.Signals | null;
|
||||
killed: boolean;
|
||||
termination: "exit" | "timeout" | "no-output-timeout" | "signal";
|
||||
noOutputTimedOut?: boolean;
|
||||
};
|
||||
|
||||
export type CommandOptions = {
|
||||
@@ -89,6 +92,7 @@ export type CommandOptions = {
|
||||
input?: string;
|
||||
env?: NodeJS.ProcessEnv;
|
||||
windowsVerbatimArguments?: boolean;
|
||||
noOutputTimeoutMs?: number;
|
||||
};
|
||||
|
||||
export async function runCommandWithTimeout(
|
||||
@@ -97,7 +101,7 @@ export async function runCommandWithTimeout(
|
||||
): Promise<SpawnResult> {
|
||||
const options: CommandOptions =
|
||||
typeof optionsOrTimeout === "number" ? { timeoutMs: optionsOrTimeout } : optionsOrTimeout;
|
||||
const { timeoutMs, cwd, input, env } = options;
|
||||
const { timeoutMs, cwd, input, env, noOutputTimeoutMs } = options;
|
||||
const { windowsVerbatimArguments } = options;
|
||||
const hasInput = input !== undefined;
|
||||
|
||||
@@ -144,11 +148,45 @@ export async function runCommandWithTimeout(
|
||||
let stdout = "";
|
||||
let stderr = "";
|
||||
let settled = false;
|
||||
let timedOut = false;
|
||||
let noOutputTimedOut = false;
|
||||
let noOutputTimer: NodeJS.Timeout | null = null;
|
||||
const shouldTrackOutputTimeout =
|
||||
typeof noOutputTimeoutMs === "number" &&
|
||||
Number.isFinite(noOutputTimeoutMs) &&
|
||||
noOutputTimeoutMs > 0;
|
||||
|
||||
const clearNoOutputTimer = () => {
|
||||
if (!noOutputTimer) {
|
||||
return;
|
||||
}
|
||||
clearTimeout(noOutputTimer);
|
||||
noOutputTimer = null;
|
||||
};
|
||||
|
||||
const armNoOutputTimer = () => {
|
||||
if (!shouldTrackOutputTimeout || settled) {
|
||||
return;
|
||||
}
|
||||
clearNoOutputTimer();
|
||||
noOutputTimer = setTimeout(() => {
|
||||
if (settled) {
|
||||
return;
|
||||
}
|
||||
noOutputTimedOut = true;
|
||||
if (typeof child.kill === "function") {
|
||||
child.kill("SIGKILL");
|
||||
}
|
||||
}, Math.floor(noOutputTimeoutMs));
|
||||
};
|
||||
|
||||
const timer = setTimeout(() => {
|
||||
timedOut = true;
|
||||
if (typeof child.kill === "function") {
|
||||
child.kill("SIGKILL");
|
||||
}
|
||||
}, timeoutMs);
|
||||
armNoOutputTimer();
|
||||
|
||||
if (hasInput && child.stdin) {
|
||||
child.stdin.write(input ?? "");
|
||||
@@ -157,9 +195,11 @@ export async function runCommandWithTimeout(
|
||||
|
||||
child.stdout?.on("data", (d) => {
|
||||
stdout += d.toString();
|
||||
armNoOutputTimer();
|
||||
});
|
||||
child.stderr?.on("data", (d) => {
|
||||
stderr += d.toString();
|
||||
armNoOutputTimer();
|
||||
});
|
||||
child.on("error", (err) => {
|
||||
if (settled) {
|
||||
@@ -167,6 +207,7 @@ export async function runCommandWithTimeout(
|
||||
}
|
||||
settled = true;
|
||||
clearTimeout(timer);
|
||||
clearNoOutputTimer();
|
||||
reject(err);
|
||||
});
|
||||
child.on("close", (code, signal) => {
|
||||
@@ -175,7 +216,24 @@ export async function runCommandWithTimeout(
|
||||
}
|
||||
settled = true;
|
||||
clearTimeout(timer);
|
||||
resolve({ stdout, stderr, code, signal, killed: child.killed });
|
||||
clearNoOutputTimer();
|
||||
const termination = noOutputTimedOut
|
||||
? "no-output-timeout"
|
||||
: timedOut
|
||||
? "timeout"
|
||||
: signal != null
|
||||
? "signal"
|
||||
: "exit";
|
||||
resolve({
|
||||
pid: child.pid ?? undefined,
|
||||
stdout,
|
||||
stderr,
|
||||
code,
|
||||
signal,
|
||||
killed: child.killed,
|
||||
termination,
|
||||
noOutputTimedOut,
|
||||
});
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
34
src/process/kill-tree.ts
Normal file
34
src/process/kill-tree.ts
Normal file
@@ -0,0 +1,34 @@
|
||||
import { spawn } from "node:child_process";
|
||||
|
||||
/**
|
||||
* Best-effort process-tree termination.
|
||||
* - Windows: use taskkill /T to include descendants.
|
||||
* - Unix: try process-group kill first, then direct pid kill.
|
||||
*/
|
||||
export function killProcessTree(pid: number): void {
|
||||
if (!Number.isFinite(pid) || pid <= 0) {
|
||||
return;
|
||||
}
|
||||
|
||||
if (process.platform === "win32") {
|
||||
try {
|
||||
spawn("taskkill", ["/F", "/T", "/PID", String(pid)], {
|
||||
stdio: "ignore",
|
||||
detached: true,
|
||||
});
|
||||
} catch {
|
||||
// ignore taskkill failures
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
try {
|
||||
process.kill(-pid, "SIGKILL");
|
||||
} catch {
|
||||
try {
|
||||
process.kill(pid, "SIGKILL");
|
||||
} catch {
|
||||
// process already gone or inaccessible
|
||||
}
|
||||
}
|
||||
}
|
||||
116
src/process/supervisor/adapters/child.test.ts
Normal file
116
src/process/supervisor/adapters/child.test.ts
Normal file
@@ -0,0 +1,116 @@
|
||||
import type { ChildProcess } from "node:child_process";
|
||||
import { EventEmitter } from "node:events";
|
||||
import { PassThrough } from "node:stream";
|
||||
import { beforeEach, describe, expect, it, vi } from "vitest";
|
||||
|
||||
const { spawnWithFallbackMock, killProcessTreeMock } = vi.hoisted(() => ({
|
||||
spawnWithFallbackMock: vi.fn(),
|
||||
killProcessTreeMock: vi.fn(),
|
||||
}));
|
||||
|
||||
vi.mock("../../spawn-utils.js", () => ({
|
||||
spawnWithFallback: (...args: unknown[]) => spawnWithFallbackMock(...args),
|
||||
}));
|
||||
|
||||
vi.mock("../../kill-tree.js", () => ({
|
||||
killProcessTree: (...args: unknown[]) => killProcessTreeMock(...args),
|
||||
}));
|
||||
|
||||
function createStubChild(pid = 1234) {
|
||||
const child = new EventEmitter() as ChildProcess;
|
||||
child.stdin = new PassThrough() as ChildProcess["stdin"];
|
||||
child.stdout = new PassThrough() as ChildProcess["stdout"];
|
||||
child.stderr = new PassThrough() as ChildProcess["stderr"];
|
||||
child.pid = pid;
|
||||
child.killed = false;
|
||||
const killMock = vi.fn(() => true);
|
||||
child.kill = killMock as ChildProcess["kill"];
|
||||
return { child, killMock };
|
||||
}
|
||||
|
||||
describe("createChildAdapter", () => {
|
||||
beforeEach(() => {
|
||||
spawnWithFallbackMock.mockReset();
|
||||
killProcessTreeMock.mockReset();
|
||||
});
|
||||
|
||||
it("uses process-tree kill for default SIGKILL", async () => {
|
||||
const { child, killMock } = createStubChild(4321);
|
||||
spawnWithFallbackMock.mockResolvedValue({
|
||||
child,
|
||||
usedFallback: false,
|
||||
});
|
||||
const { createChildAdapter } = await import("./child.js");
|
||||
const adapter = await createChildAdapter({
|
||||
argv: ["node", "-e", "setTimeout(() => {}, 1000)"],
|
||||
stdinMode: "pipe-open",
|
||||
});
|
||||
|
||||
const spawnArgs = spawnWithFallbackMock.mock.calls[0]?.[0] as {
|
||||
options?: { detached?: boolean };
|
||||
fallbacks?: Array<{ options?: { detached?: boolean } }>;
|
||||
};
|
||||
expect(spawnArgs.options?.detached).toBe(true);
|
||||
expect(spawnArgs.fallbacks?.[0]?.options?.detached).toBe(false);
|
||||
|
||||
adapter.kill();
|
||||
|
||||
expect(killProcessTreeMock).toHaveBeenCalledWith(4321);
|
||||
expect(killMock).not.toHaveBeenCalled();
|
||||
});
|
||||
|
||||
it("uses direct child.kill for non-SIGKILL signals", async () => {
|
||||
const { child, killMock } = createStubChild(7654);
|
||||
spawnWithFallbackMock.mockResolvedValue({
|
||||
child,
|
||||
usedFallback: false,
|
||||
});
|
||||
const { createChildAdapter } = await import("./child.js");
|
||||
const adapter = await createChildAdapter({
|
||||
argv: ["node", "-e", "setTimeout(() => {}, 1000)"],
|
||||
stdinMode: "pipe-open",
|
||||
});
|
||||
|
||||
adapter.kill("SIGTERM");
|
||||
|
||||
expect(killProcessTreeMock).not.toHaveBeenCalled();
|
||||
expect(killMock).toHaveBeenCalledWith("SIGTERM");
|
||||
});
|
||||
|
||||
it("keeps inherited env when no override env is provided", async () => {
|
||||
const { child } = createStubChild(3333);
|
||||
spawnWithFallbackMock.mockResolvedValue({
|
||||
child,
|
||||
usedFallback: false,
|
||||
});
|
||||
const { createChildAdapter } = await import("./child.js");
|
||||
await createChildAdapter({
|
||||
argv: ["node", "-e", "process.exit(0)"],
|
||||
stdinMode: "pipe-open",
|
||||
});
|
||||
|
||||
const spawnArgs = spawnWithFallbackMock.mock.calls[0]?.[0] as {
|
||||
options?: { env?: NodeJS.ProcessEnv };
|
||||
};
|
||||
expect(spawnArgs.options?.env).toBeUndefined();
|
||||
});
|
||||
|
||||
it("passes explicit env overrides as strings", async () => {
|
||||
const { child } = createStubChild(4444);
|
||||
spawnWithFallbackMock.mockResolvedValue({
|
||||
child,
|
||||
usedFallback: false,
|
||||
});
|
||||
const { createChildAdapter } = await import("./child.js");
|
||||
await createChildAdapter({
|
||||
argv: ["node", "-e", "process.exit(0)"],
|
||||
env: { FOO: "bar", COUNT: "12", DROP_ME: undefined },
|
||||
stdinMode: "pipe-open",
|
||||
});
|
||||
|
||||
const spawnArgs = spawnWithFallbackMock.mock.calls[0]?.[0] as {
|
||||
options?: { env?: Record<string, string> };
|
||||
};
|
||||
expect(spawnArgs.options?.env).toEqual({ FOO: "bar", COUNT: "12" });
|
||||
});
|
||||
});
|
||||
174
src/process/supervisor/adapters/child.ts
Normal file
174
src/process/supervisor/adapters/child.ts
Normal file
@@ -0,0 +1,174 @@
|
||||
import type { ChildProcessWithoutNullStreams, SpawnOptions } from "node:child_process";
|
||||
import type { ManagedRunStdin } from "../types.js";
|
||||
import { killProcessTree } from "../../kill-tree.js";
|
||||
import { spawnWithFallback } from "../../spawn-utils.js";
|
||||
|
||||
function resolveCommand(command: string): string {
|
||||
if (process.platform !== "win32") {
|
||||
return command;
|
||||
}
|
||||
const lower = command.toLowerCase();
|
||||
if (lower.endsWith(".exe") || lower.endsWith(".cmd") || lower.endsWith(".bat")) {
|
||||
return command;
|
||||
}
|
||||
const basename = lower.split(/[\\/]/).pop() ?? lower;
|
||||
if (basename === "npm" || basename === "pnpm" || basename === "yarn" || basename === "npx") {
|
||||
return `${command}.cmd`;
|
||||
}
|
||||
return command;
|
||||
}
|
||||
|
||||
function toStringEnv(env?: NodeJS.ProcessEnv): Record<string, string> {
|
||||
if (!env) {
|
||||
return {};
|
||||
}
|
||||
const out: Record<string, string> = {};
|
||||
for (const [key, value] of Object.entries(env)) {
|
||||
if (value === undefined) {
|
||||
continue;
|
||||
}
|
||||
out[key] = String(value);
|
||||
}
|
||||
return out;
|
||||
}
|
||||
|
||||
export type ChildAdapter = {
|
||||
pid?: number;
|
||||
stdin?: ManagedRunStdin;
|
||||
onStdout: (listener: (chunk: string) => void) => void;
|
||||
onStderr: (listener: (chunk: string) => void) => void;
|
||||
wait: () => Promise<{ code: number | null; signal: NodeJS.Signals | null }>;
|
||||
kill: (signal?: NodeJS.Signals) => void;
|
||||
dispose: () => void;
|
||||
};
|
||||
|
||||
export async function createChildAdapter(params: {
|
||||
argv: string[];
|
||||
cwd?: string;
|
||||
env?: NodeJS.ProcessEnv;
|
||||
windowsVerbatimArguments?: boolean;
|
||||
input?: string;
|
||||
stdinMode?: "inherit" | "pipe-open" | "pipe-closed";
|
||||
}): Promise<ChildAdapter> {
|
||||
const resolvedArgv = [...params.argv];
|
||||
resolvedArgv[0] = resolveCommand(resolvedArgv[0] ?? "");
|
||||
|
||||
const stdinMode = params.stdinMode ?? (params.input !== undefined ? "pipe-closed" : "inherit");
|
||||
|
||||
const options: SpawnOptions = {
|
||||
cwd: params.cwd,
|
||||
env: params.env ? toStringEnv(params.env) : undefined,
|
||||
stdio: ["pipe", "pipe", "pipe"],
|
||||
detached: true,
|
||||
windowsHide: true,
|
||||
windowsVerbatimArguments: params.windowsVerbatimArguments,
|
||||
};
|
||||
if (stdinMode === "inherit") {
|
||||
options.stdio = ["inherit", "pipe", "pipe"];
|
||||
} else {
|
||||
options.stdio = ["pipe", "pipe", "pipe"];
|
||||
}
|
||||
|
||||
const spawned = await spawnWithFallback({
|
||||
argv: resolvedArgv,
|
||||
options,
|
||||
fallbacks: [
|
||||
{
|
||||
label: "no-detach",
|
||||
options: { detached: false },
|
||||
},
|
||||
],
|
||||
});
|
||||
|
||||
const child = spawned.child as ChildProcessWithoutNullStreams;
|
||||
if (child.stdin) {
|
||||
if (params.input !== undefined) {
|
||||
child.stdin.write(params.input);
|
||||
child.stdin.end();
|
||||
} else if (stdinMode === "pipe-closed") {
|
||||
child.stdin.end();
|
||||
}
|
||||
}
|
||||
|
||||
const stdin: ManagedRunStdin | undefined = child.stdin
|
||||
? {
|
||||
destroyed: false,
|
||||
write: (data: string, cb?: (err?: Error | null) => void) => {
|
||||
try {
|
||||
child.stdin.write(data, cb);
|
||||
} catch (err) {
|
||||
cb?.(err as Error);
|
||||
}
|
||||
},
|
||||
end: () => {
|
||||
try {
|
||||
child.stdin.end();
|
||||
} catch {
|
||||
// ignore close errors
|
||||
}
|
||||
},
|
||||
destroy: () => {
|
||||
try {
|
||||
child.stdin.destroy();
|
||||
} catch {
|
||||
// ignore destroy errors
|
||||
}
|
||||
},
|
||||
}
|
||||
: undefined;
|
||||
|
||||
const onStdout = (listener: (chunk: string) => void) => {
|
||||
child.stdout.on("data", (chunk) => {
|
||||
listener(chunk.toString());
|
||||
});
|
||||
};
|
||||
|
||||
const onStderr = (listener: (chunk: string) => void) => {
|
||||
child.stderr.on("data", (chunk) => {
|
||||
listener(chunk.toString());
|
||||
});
|
||||
};
|
||||
|
||||
const wait = async () =>
|
||||
await new Promise<{ code: number | null; signal: NodeJS.Signals | null }>((resolve, reject) => {
|
||||
child.once("error", reject);
|
||||
child.once("close", (code, signal) => {
|
||||
resolve({ code, signal });
|
||||
});
|
||||
});
|
||||
|
||||
const kill = (signal?: NodeJS.Signals) => {
|
||||
const pid = child.pid ?? undefined;
|
||||
if (signal === undefined || signal === "SIGKILL") {
|
||||
if (pid) {
|
||||
killProcessTree(pid);
|
||||
} else {
|
||||
try {
|
||||
child.kill("SIGKILL");
|
||||
} catch {
|
||||
// ignore kill errors
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
try {
|
||||
child.kill(signal);
|
||||
} catch {
|
||||
// ignore kill errors for non-kill signals
|
||||
}
|
||||
};
|
||||
|
||||
const dispose = () => {
|
||||
child.removeAllListeners();
|
||||
};
|
||||
|
||||
return {
|
||||
pid: child.pid ?? undefined,
|
||||
stdin,
|
||||
onStdout,
|
||||
onStderr,
|
||||
wait,
|
||||
kill,
|
||||
dispose,
|
||||
};
|
||||
}
|
||||
161
src/process/supervisor/adapters/pty.test.ts
Normal file
161
src/process/supervisor/adapters/pty.test.ts
Normal file
@@ -0,0 +1,161 @@
|
||||
import { afterEach, beforeEach, describe, expect, it, vi } from "vitest";
|
||||
|
||||
const { spawnMock, ptyKillMock, killProcessTreeMock } = vi.hoisted(() => ({
|
||||
spawnMock: vi.fn(),
|
||||
ptyKillMock: vi.fn(),
|
||||
killProcessTreeMock: vi.fn(),
|
||||
}));
|
||||
|
||||
vi.mock("@lydell/node-pty", () => ({
|
||||
spawn: (...args: unknown[]) => spawnMock(...args),
|
||||
}));
|
||||
|
||||
vi.mock("../../kill-tree.js", () => ({
|
||||
killProcessTree: (...args: unknown[]) => killProcessTreeMock(...args),
|
||||
}));
|
||||
|
||||
function createStubPty(pid = 1234) {
|
||||
let exitListener: ((event: { exitCode: number; signal?: number }) => void) | null = null;
|
||||
return {
|
||||
pid,
|
||||
write: vi.fn(),
|
||||
onData: vi.fn(() => ({ dispose: vi.fn() })),
|
||||
onExit: vi.fn((listener: (event: { exitCode: number; signal?: number }) => void) => {
|
||||
exitListener = listener;
|
||||
return { dispose: vi.fn() };
|
||||
}),
|
||||
kill: (signal?: string) => ptyKillMock(signal),
|
||||
emitExit: (event: { exitCode: number; signal?: number }) => {
|
||||
exitListener?.(event);
|
||||
},
|
||||
};
|
||||
}
|
||||
|
||||
describe("createPtyAdapter", () => {
|
||||
beforeEach(() => {
|
||||
spawnMock.mockReset();
|
||||
ptyKillMock.mockReset();
|
||||
killProcessTreeMock.mockReset();
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
vi.resetModules();
|
||||
vi.clearAllMocks();
|
||||
});
|
||||
|
||||
it("forwards explicit signals to node-pty kill", async () => {
|
||||
spawnMock.mockReturnValue(createStubPty());
|
||||
const { createPtyAdapter } = await import("./pty.js");
|
||||
|
||||
const adapter = await createPtyAdapter({
|
||||
shell: "bash",
|
||||
args: ["-lc", "sleep 10"],
|
||||
});
|
||||
|
||||
adapter.kill("SIGTERM");
|
||||
expect(ptyKillMock).toHaveBeenCalledWith("SIGTERM");
|
||||
expect(killProcessTreeMock).not.toHaveBeenCalled();
|
||||
});
|
||||
|
||||
it("uses process-tree kill for SIGKILL by default", async () => {
|
||||
spawnMock.mockReturnValue(createStubPty());
|
||||
const { createPtyAdapter } = await import("./pty.js");
|
||||
|
||||
const adapter = await createPtyAdapter({
|
||||
shell: "bash",
|
||||
args: ["-lc", "sleep 10"],
|
||||
});
|
||||
|
||||
adapter.kill();
|
||||
expect(killProcessTreeMock).toHaveBeenCalledWith(1234);
|
||||
expect(ptyKillMock).not.toHaveBeenCalled();
|
||||
});
|
||||
|
||||
it("resolves wait when exit fires before wait is called", async () => {
|
||||
const stub = createStubPty();
|
||||
spawnMock.mockReturnValue(stub);
|
||||
const { createPtyAdapter } = await import("./pty.js");
|
||||
|
||||
const adapter = await createPtyAdapter({
|
||||
shell: "bash",
|
||||
args: ["-lc", "exit 3"],
|
||||
});
|
||||
|
||||
expect(stub.onExit).toHaveBeenCalledTimes(1);
|
||||
stub.emitExit({ exitCode: 3, signal: 0 });
|
||||
await expect(adapter.wait()).resolves.toEqual({ code: 3, signal: null });
|
||||
});
|
||||
|
||||
it("keeps inherited env when no override env is provided", async () => {
|
||||
const stub = createStubPty();
|
||||
spawnMock.mockReturnValue(stub);
|
||||
const { createPtyAdapter } = await import("./pty.js");
|
||||
|
||||
await createPtyAdapter({
|
||||
shell: "bash",
|
||||
args: ["-lc", "env"],
|
||||
});
|
||||
|
||||
const spawnOptions = spawnMock.mock.calls[0]?.[2] as { env?: Record<string, string> };
|
||||
expect(spawnOptions?.env).toBeUndefined();
|
||||
});
|
||||
|
||||
it("passes explicit env overrides as strings", async () => {
|
||||
const stub = createStubPty();
|
||||
spawnMock.mockReturnValue(stub);
|
||||
const { createPtyAdapter } = await import("./pty.js");
|
||||
|
||||
await createPtyAdapter({
|
||||
shell: "bash",
|
||||
args: ["-lc", "env"],
|
||||
env: { FOO: "bar", COUNT: "12", DROP_ME: undefined },
|
||||
});
|
||||
|
||||
const spawnOptions = spawnMock.mock.calls[0]?.[2] as { env?: Record<string, string> };
|
||||
expect(spawnOptions?.env).toEqual({ FOO: "bar", COUNT: "12" });
|
||||
});
|
||||
|
||||
it("does not pass a signal to node-pty on Windows", async () => {
|
||||
const originalPlatform = Object.getOwnPropertyDescriptor(process, "platform");
|
||||
Object.defineProperty(process, "platform", { value: "win32", configurable: true });
|
||||
try {
|
||||
spawnMock.mockReturnValue(createStubPty());
|
||||
const { createPtyAdapter } = await import("./pty.js");
|
||||
|
||||
const adapter = await createPtyAdapter({
|
||||
shell: "powershell.exe",
|
||||
args: ["-NoLogo"],
|
||||
});
|
||||
|
||||
adapter.kill("SIGTERM");
|
||||
expect(ptyKillMock).toHaveBeenCalledWith(undefined);
|
||||
expect(killProcessTreeMock).not.toHaveBeenCalled();
|
||||
} finally {
|
||||
if (originalPlatform) {
|
||||
Object.defineProperty(process, "platform", originalPlatform);
|
||||
}
|
||||
}
|
||||
});
|
||||
|
||||
it("uses process-tree kill for SIGKILL on Windows", async () => {
|
||||
const originalPlatform = Object.getOwnPropertyDescriptor(process, "platform");
|
||||
Object.defineProperty(process, "platform", { value: "win32", configurable: true });
|
||||
try {
|
||||
spawnMock.mockReturnValue(createStubPty(4567));
|
||||
const { createPtyAdapter } = await import("./pty.js");
|
||||
|
||||
const adapter = await createPtyAdapter({
|
||||
shell: "powershell.exe",
|
||||
args: ["-NoLogo"],
|
||||
});
|
||||
|
||||
adapter.kill("SIGKILL");
|
||||
expect(killProcessTreeMock).toHaveBeenCalledWith(4567);
|
||||
expect(ptyKillMock).not.toHaveBeenCalled();
|
||||
} finally {
|
||||
if (originalPlatform) {
|
||||
Object.defineProperty(process, "platform", originalPlatform);
|
||||
}
|
||||
}
|
||||
});
|
||||
});
|
||||
197
src/process/supervisor/adapters/pty.ts
Normal file
197
src/process/supervisor/adapters/pty.ts
Normal file
@@ -0,0 +1,197 @@
|
||||
import type { ManagedRunStdin } from "../types.js";
|
||||
import { killProcessTree } from "../../kill-tree.js";
|
||||
|
||||
type PtyExitEvent = { exitCode: number; signal?: number };
|
||||
type PtyDisposable = { dispose: () => void };
|
||||
type PtySpawnHandle = {
|
||||
pid: number;
|
||||
write: (data: string | Buffer) => void;
|
||||
onData: (listener: (value: string) => void) => PtyDisposable | void;
|
||||
onExit: (listener: (event: PtyExitEvent) => void) => PtyDisposable | void;
|
||||
kill: (signal?: string) => void;
|
||||
};
|
||||
type PtySpawn = (
|
||||
file: string,
|
||||
args: string[] | string,
|
||||
options: {
|
||||
name?: string;
|
||||
cols?: number;
|
||||
rows?: number;
|
||||
cwd?: string;
|
||||
env?: Record<string, string>;
|
||||
},
|
||||
) => PtySpawnHandle;
|
||||
|
||||
type PtyModule = {
|
||||
spawn?: PtySpawn;
|
||||
default?: {
|
||||
spawn?: PtySpawn;
|
||||
};
|
||||
};
|
||||
|
||||
function toStringEnv(env?: NodeJS.ProcessEnv): Record<string, string> {
|
||||
if (!env) {
|
||||
return {};
|
||||
}
|
||||
const out: Record<string, string> = {};
|
||||
for (const [key, value] of Object.entries(env)) {
|
||||
if (value === undefined) {
|
||||
continue;
|
||||
}
|
||||
out[key] = String(value);
|
||||
}
|
||||
return out;
|
||||
}
|
||||
|
||||
export type PtyAdapter = {
|
||||
pid?: number;
|
||||
stdin?: ManagedRunStdin;
|
||||
onStdout: (listener: (chunk: string) => void) => void;
|
||||
onStderr: (listener: (chunk: string) => void) => void;
|
||||
wait: () => Promise<{ code: number | null; signal: NodeJS.Signals | number | null }>;
|
||||
kill: (signal?: NodeJS.Signals) => void;
|
||||
dispose: () => void;
|
||||
};
|
||||
|
||||
export async function createPtyAdapter(params: {
|
||||
shell: string;
|
||||
args: string[];
|
||||
cwd?: string;
|
||||
env?: NodeJS.ProcessEnv;
|
||||
cols?: number;
|
||||
rows?: number;
|
||||
name?: string;
|
||||
}): Promise<PtyAdapter> {
|
||||
const module = (await import("@lydell/node-pty")) as unknown as PtyModule;
|
||||
const spawn = module.spawn ?? module.default?.spawn;
|
||||
if (!spawn) {
|
||||
throw new Error("PTY support is unavailable (node-pty spawn not found).");
|
||||
}
|
||||
const pty = spawn(params.shell, params.args, {
|
||||
cwd: params.cwd,
|
||||
env: params.env ? toStringEnv(params.env) : undefined,
|
||||
name: params.name ?? process.env.TERM ?? "xterm-256color",
|
||||
cols: params.cols ?? 120,
|
||||
rows: params.rows ?? 30,
|
||||
});
|
||||
|
||||
let dataListener: PtyDisposable | null = null;
|
||||
let exitListener: PtyDisposable | null = null;
|
||||
let waitResult: { code: number | null; signal: NodeJS.Signals | number | null } | null = null;
|
||||
let resolveWait:
|
||||
| ((value: { code: number | null; signal: NodeJS.Signals | number | null }) => void)
|
||||
| null = null;
|
||||
let waitPromise: Promise<{ code: number | null; signal: NodeJS.Signals | number | null }> | null =
|
||||
null;
|
||||
|
||||
const settleWait = (value: { code: number | null; signal: NodeJS.Signals | number | null }) => {
|
||||
if (waitResult) {
|
||||
return;
|
||||
}
|
||||
waitResult = value;
|
||||
if (resolveWait) {
|
||||
const resolve = resolveWait;
|
||||
resolveWait = null;
|
||||
resolve(value);
|
||||
}
|
||||
};
|
||||
|
||||
exitListener =
|
||||
pty.onExit((event) => {
|
||||
const signal = event.signal && event.signal !== 0 ? event.signal : null;
|
||||
settleWait({ code: event.exitCode ?? null, signal });
|
||||
}) ?? null;
|
||||
|
||||
const stdin: ManagedRunStdin = {
|
||||
destroyed: false,
|
||||
write: (data, cb) => {
|
||||
try {
|
||||
pty.write(data);
|
||||
cb?.(null);
|
||||
} catch (err) {
|
||||
cb?.(err as Error);
|
||||
}
|
||||
},
|
||||
end: () => {
|
||||
try {
|
||||
const eof = process.platform === "win32" ? "\x1a" : "\x04";
|
||||
pty.write(eof);
|
||||
} catch {
|
||||
// ignore EOF errors
|
||||
}
|
||||
},
|
||||
};
|
||||
|
||||
const onStdout = (listener: (chunk: string) => void) => {
|
||||
dataListener =
|
||||
pty.onData((chunk) => {
|
||||
listener(chunk.toString());
|
||||
}) ?? null;
|
||||
};
|
||||
|
||||
const onStderr = (_listener: (chunk: string) => void) => {
|
||||
// PTY gives a unified output stream.
|
||||
};
|
||||
|
||||
const wait = async () => {
|
||||
if (waitResult) {
|
||||
return waitResult;
|
||||
}
|
||||
if (!waitPromise) {
|
||||
waitPromise = new Promise<{ code: number | null; signal: NodeJS.Signals | number | null }>(
|
||||
(resolve) => {
|
||||
resolveWait = resolve;
|
||||
if (waitResult) {
|
||||
const settled = waitResult;
|
||||
resolveWait = null;
|
||||
resolve(settled);
|
||||
}
|
||||
},
|
||||
);
|
||||
}
|
||||
return waitPromise;
|
||||
};
|
||||
|
||||
const kill = (signal: NodeJS.Signals = "SIGKILL") => {
|
||||
try {
|
||||
if (signal === "SIGKILL" && typeof pty.pid === "number" && pty.pid > 0) {
|
||||
killProcessTree(pty.pid);
|
||||
} else if (process.platform === "win32") {
|
||||
pty.kill();
|
||||
} else {
|
||||
pty.kill(signal);
|
||||
}
|
||||
} catch {
|
||||
// ignore kill errors
|
||||
}
|
||||
// Some PTY hosts do not emit `onExit` reliably after kill.
|
||||
// Ensure waiters can progress on forced termination.
|
||||
settleWait({ code: null, signal });
|
||||
};
|
||||
|
||||
const dispose = () => {
|
||||
try {
|
||||
dataListener?.dispose();
|
||||
} catch {
|
||||
// ignore disposal errors
|
||||
}
|
||||
try {
|
||||
exitListener?.dispose();
|
||||
} catch {
|
||||
// ignore disposal errors
|
||||
}
|
||||
dataListener = null;
|
||||
exitListener = null;
|
||||
settleWait({ code: null, signal: null });
|
||||
};
|
||||
|
||||
return {
|
||||
pid: pty.pid || undefined,
|
||||
stdin,
|
||||
onStdout,
|
||||
onStderr,
|
||||
wait,
|
||||
kill,
|
||||
dispose,
|
||||
};
|
||||
}
|
||||
24
src/process/supervisor/index.ts
Normal file
24
src/process/supervisor/index.ts
Normal file
@@ -0,0 +1,24 @@
|
||||
import type { ProcessSupervisor } from "./types.js";
|
||||
import { createProcessSupervisor } from "./supervisor.js";
|
||||
|
||||
let singleton: ProcessSupervisor | null = null;
|
||||
|
||||
export function getProcessSupervisor(): ProcessSupervisor {
|
||||
if (singleton) {
|
||||
return singleton;
|
||||
}
|
||||
singleton = createProcessSupervisor();
|
||||
return singleton;
|
||||
}
|
||||
|
||||
export { createProcessSupervisor } from "./supervisor.js";
|
||||
export type {
|
||||
ManagedRun,
|
||||
ProcessSupervisor,
|
||||
RunExit,
|
||||
RunRecord,
|
||||
RunState,
|
||||
SpawnInput,
|
||||
SpawnMode,
|
||||
TerminationReason,
|
||||
} from "./types.js";
|
||||
83
src/process/supervisor/registry.test.ts
Normal file
83
src/process/supervisor/registry.test.ts
Normal file
@@ -0,0 +1,83 @@
|
||||
import { describe, expect, it } from "vitest";
|
||||
import { createRunRegistry } from "./registry.js";
|
||||
|
||||
describe("process supervisor run registry", () => {
|
||||
it("finalize is idempotent and preserves first terminal metadata", () => {
|
||||
const registry = createRunRegistry();
|
||||
registry.add({
|
||||
runId: "r1",
|
||||
sessionId: "s1",
|
||||
backendId: "b1",
|
||||
state: "running",
|
||||
startedAtMs: 1,
|
||||
lastOutputAtMs: 1,
|
||||
createdAtMs: 1,
|
||||
updatedAtMs: 1,
|
||||
});
|
||||
|
||||
const first = registry.finalize("r1", {
|
||||
reason: "overall-timeout",
|
||||
exitCode: null,
|
||||
exitSignal: "SIGKILL",
|
||||
});
|
||||
const second = registry.finalize("r1", {
|
||||
reason: "manual-cancel",
|
||||
exitCode: 0,
|
||||
exitSignal: null,
|
||||
});
|
||||
|
||||
expect(first).not.toBeNull();
|
||||
expect(first?.firstFinalize).toBe(true);
|
||||
expect(first?.record.terminationReason).toBe("overall-timeout");
|
||||
expect(first?.record.exitCode).toBeNull();
|
||||
expect(first?.record.exitSignal).toBe("SIGKILL");
|
||||
|
||||
expect(second).not.toBeNull();
|
||||
expect(second?.firstFinalize).toBe(false);
|
||||
expect(second?.record.terminationReason).toBe("overall-timeout");
|
||||
expect(second?.record.exitCode).toBeNull();
|
||||
expect(second?.record.exitSignal).toBe("SIGKILL");
|
||||
});
|
||||
|
||||
it("prunes oldest exited records once retention cap is exceeded", () => {
|
||||
const registry = createRunRegistry({ maxExitedRecords: 2 });
|
||||
registry.add({
|
||||
runId: "r1",
|
||||
sessionId: "s1",
|
||||
backendId: "b1",
|
||||
state: "running",
|
||||
startedAtMs: 1,
|
||||
lastOutputAtMs: 1,
|
||||
createdAtMs: 1,
|
||||
updatedAtMs: 1,
|
||||
});
|
||||
registry.add({
|
||||
runId: "r2",
|
||||
sessionId: "s2",
|
||||
backendId: "b1",
|
||||
state: "running",
|
||||
startedAtMs: 2,
|
||||
lastOutputAtMs: 2,
|
||||
createdAtMs: 2,
|
||||
updatedAtMs: 2,
|
||||
});
|
||||
registry.add({
|
||||
runId: "r3",
|
||||
sessionId: "s3",
|
||||
backendId: "b1",
|
||||
state: "running",
|
||||
startedAtMs: 3,
|
||||
lastOutputAtMs: 3,
|
||||
createdAtMs: 3,
|
||||
updatedAtMs: 3,
|
||||
});
|
||||
|
||||
registry.finalize("r1", { reason: "exit", exitCode: 0, exitSignal: null });
|
||||
registry.finalize("r2", { reason: "exit", exitCode: 0, exitSignal: null });
|
||||
registry.finalize("r3", { reason: "exit", exitCode: 0, exitSignal: null });
|
||||
|
||||
expect(registry.get("r1")).toBeUndefined();
|
||||
expect(registry.get("r2")?.state).toBe("exited");
|
||||
expect(registry.get("r3")?.state).toBe("exited");
|
||||
});
|
||||
});
|
||||
154
src/process/supervisor/registry.ts
Normal file
154
src/process/supervisor/registry.ts
Normal file
@@ -0,0 +1,154 @@
|
||||
import type { RunRecord, RunState, TerminationReason } from "./types.js";
|
||||
|
||||
function nowMs() {
|
||||
return Date.now();
|
||||
}
|
||||
|
||||
const DEFAULT_MAX_EXITED_RECORDS = 2_000;
|
||||
|
||||
function resolveMaxExitedRecords(value?: number): number {
|
||||
if (typeof value !== "number" || !Number.isFinite(value) || value < 1) {
|
||||
return DEFAULT_MAX_EXITED_RECORDS;
|
||||
}
|
||||
return Math.max(1, Math.floor(value));
|
||||
}
|
||||
|
||||
export type RunRegistry = {
|
||||
add: (record: RunRecord) => void;
|
||||
get: (runId: string) => RunRecord | undefined;
|
||||
list: () => RunRecord[];
|
||||
listByScope: (scopeKey: string) => RunRecord[];
|
||||
updateState: (
|
||||
runId: string,
|
||||
state: RunState,
|
||||
patch?: Partial<Pick<RunRecord, "pid" | "terminationReason" | "exitCode" | "exitSignal">>,
|
||||
) => RunRecord | undefined;
|
||||
touchOutput: (runId: string) => void;
|
||||
finalize: (
|
||||
runId: string,
|
||||
exit: {
|
||||
reason: TerminationReason;
|
||||
exitCode: number | null;
|
||||
exitSignal: NodeJS.Signals | number | null;
|
||||
},
|
||||
) => { record: RunRecord; firstFinalize: boolean } | null;
|
||||
delete: (runId: string) => void;
|
||||
};
|
||||
|
||||
export function createRunRegistry(options?: { maxExitedRecords?: number }): RunRegistry {
|
||||
const records = new Map<string, RunRecord>();
|
||||
const maxExitedRecords = resolveMaxExitedRecords(options?.maxExitedRecords);
|
||||
|
||||
const pruneExitedRecords = () => {
|
||||
if (!records.size) {
|
||||
return;
|
||||
}
|
||||
let exited = 0;
|
||||
for (const record of records.values()) {
|
||||
if (record.state === "exited") {
|
||||
exited += 1;
|
||||
}
|
||||
}
|
||||
if (exited <= maxExitedRecords) {
|
||||
return;
|
||||
}
|
||||
let remove = exited - maxExitedRecords;
|
||||
for (const [runId, record] of records.entries()) {
|
||||
if (remove <= 0) {
|
||||
break;
|
||||
}
|
||||
if (record.state !== "exited") {
|
||||
continue;
|
||||
}
|
||||
records.delete(runId);
|
||||
remove -= 1;
|
||||
}
|
||||
};
|
||||
|
||||
const add: RunRegistry["add"] = (record) => {
|
||||
records.set(record.runId, { ...record });
|
||||
};
|
||||
|
||||
const get: RunRegistry["get"] = (runId) => {
|
||||
const record = records.get(runId);
|
||||
return record ? { ...record } : undefined;
|
||||
};
|
||||
|
||||
const list: RunRegistry["list"] = () => {
|
||||
return Array.from(records.values()).map((record) => ({ ...record }));
|
||||
};
|
||||
|
||||
const listByScope: RunRegistry["listByScope"] = (scopeKey) => {
|
||||
if (!scopeKey.trim()) {
|
||||
return [];
|
||||
}
|
||||
return Array.from(records.values())
|
||||
.filter((record) => record.scopeKey === scopeKey)
|
||||
.map((record) => ({ ...record }));
|
||||
};
|
||||
|
||||
const updateState: RunRegistry["updateState"] = (runId, state, patch) => {
|
||||
const current = records.get(runId);
|
||||
if (!current) {
|
||||
return undefined;
|
||||
}
|
||||
const updatedAtMs = nowMs();
|
||||
const next: RunRecord = {
|
||||
...current,
|
||||
...patch,
|
||||
state,
|
||||
updatedAtMs,
|
||||
lastOutputAtMs: current.lastOutputAtMs,
|
||||
};
|
||||
records.set(runId, next);
|
||||
return { ...next };
|
||||
};
|
||||
|
||||
const touchOutput: RunRegistry["touchOutput"] = (runId) => {
|
||||
const current = records.get(runId);
|
||||
if (!current) {
|
||||
return;
|
||||
}
|
||||
const ts = nowMs();
|
||||
records.set(runId, {
|
||||
...current,
|
||||
lastOutputAtMs: ts,
|
||||
updatedAtMs: ts,
|
||||
});
|
||||
};
|
||||
|
||||
const finalize: RunRegistry["finalize"] = (runId, exit) => {
|
||||
const current = records.get(runId);
|
||||
if (!current) {
|
||||
return null;
|
||||
}
|
||||
const firstFinalize = current.state !== "exited";
|
||||
const ts = nowMs();
|
||||
const next: RunRecord = {
|
||||
...current,
|
||||
state: "exited",
|
||||
terminationReason: current.terminationReason ?? exit.reason,
|
||||
exitCode: current.exitCode !== undefined ? current.exitCode : exit.exitCode,
|
||||
exitSignal: current.exitSignal !== undefined ? current.exitSignal : exit.exitSignal,
|
||||
updatedAtMs: ts,
|
||||
};
|
||||
records.set(runId, next);
|
||||
pruneExitedRecords();
|
||||
return { record: { ...next }, firstFinalize };
|
||||
};
|
||||
|
||||
const del: RunRegistry["delete"] = (runId) => {
|
||||
records.delete(runId);
|
||||
};
|
||||
|
||||
return {
|
||||
add,
|
||||
get,
|
||||
list,
|
||||
listByScope,
|
||||
updateState,
|
||||
touchOutput,
|
||||
finalize,
|
||||
delete: del,
|
||||
};
|
||||
}
|
||||
76
src/process/supervisor/supervisor.pty-command.test.ts
Normal file
76
src/process/supervisor/supervisor.pty-command.test.ts
Normal file
@@ -0,0 +1,76 @@
|
||||
import { beforeEach, describe, expect, it, vi } from "vitest";
|
||||
|
||||
const { createPtyAdapterMock } = vi.hoisted(() => ({
|
||||
createPtyAdapterMock: vi.fn(),
|
||||
}));
|
||||
|
||||
vi.mock("../../agents/shell-utils.js", () => ({
|
||||
getShellConfig: () => ({ shell: "sh", args: ["-c"] }),
|
||||
}));
|
||||
|
||||
vi.mock("./adapters/pty.js", () => ({
|
||||
createPtyAdapter: (...args: unknown[]) => createPtyAdapterMock(...args),
|
||||
}));
|
||||
|
||||
function createStubPtyAdapter() {
|
||||
return {
|
||||
pid: 1234,
|
||||
stdin: undefined,
|
||||
onStdout: (_listener: (chunk: string) => void) => {
|
||||
// no-op
|
||||
},
|
||||
onStderr: (_listener: (chunk: string) => void) => {
|
||||
// no-op
|
||||
},
|
||||
wait: async () => ({ code: 0, signal: null }),
|
||||
kill: (_signal?: NodeJS.Signals) => {
|
||||
// no-op
|
||||
},
|
||||
dispose: () => {
|
||||
// no-op
|
||||
},
|
||||
};
|
||||
}
|
||||
|
||||
describe("process supervisor PTY command contract", () => {
|
||||
beforeEach(() => {
|
||||
createPtyAdapterMock.mockReset();
|
||||
});
|
||||
|
||||
it("passes PTY command verbatim to shell args", async () => {
|
||||
createPtyAdapterMock.mockResolvedValue(createStubPtyAdapter());
|
||||
const { createProcessSupervisor } = await import("./supervisor.js");
|
||||
const supervisor = createProcessSupervisor();
|
||||
const command = `printf '%s\\n' "a b" && printf '%s\\n' '$HOME'`;
|
||||
|
||||
const run = await supervisor.spawn({
|
||||
sessionId: "s1",
|
||||
backendId: "test",
|
||||
mode: "pty",
|
||||
ptyCommand: command,
|
||||
timeoutMs: 1_000,
|
||||
});
|
||||
const exit = await run.wait();
|
||||
|
||||
expect(exit.reason).toBe("exit");
|
||||
expect(createPtyAdapterMock).toHaveBeenCalledTimes(1);
|
||||
const params = createPtyAdapterMock.mock.calls[0]?.[0] as { args?: string[] };
|
||||
expect(params.args).toEqual(["-c", command]);
|
||||
});
|
||||
|
||||
it("rejects empty PTY command", async () => {
|
||||
createPtyAdapterMock.mockResolvedValue(createStubPtyAdapter());
|
||||
const { createProcessSupervisor } = await import("./supervisor.js");
|
||||
const supervisor = createProcessSupervisor();
|
||||
|
||||
await expect(
|
||||
supervisor.spawn({
|
||||
sessionId: "s1",
|
||||
backendId: "test",
|
||||
mode: "pty",
|
||||
ptyCommand: " ",
|
||||
}),
|
||||
).rejects.toThrow("PTY command cannot be empty");
|
||||
expect(createPtyAdapterMock).not.toHaveBeenCalled();
|
||||
});
|
||||
});
|
||||
102
src/process/supervisor/supervisor.test.ts
Normal file
102
src/process/supervisor/supervisor.test.ts
Normal file
@@ -0,0 +1,102 @@
|
||||
import { describe, expect, it } from "vitest";
|
||||
import { createProcessSupervisor } from "./supervisor.js";
|
||||
|
||||
describe("process supervisor", () => {
|
||||
it("spawns child runs and captures output", async () => {
|
||||
const supervisor = createProcessSupervisor();
|
||||
const run = await supervisor.spawn({
|
||||
sessionId: "s1",
|
||||
backendId: "test",
|
||||
mode: "child",
|
||||
argv: [process.execPath, "-e", 'process.stdout.write("ok")'],
|
||||
timeoutMs: 2_000,
|
||||
stdinMode: "pipe-closed",
|
||||
});
|
||||
const exit = await run.wait();
|
||||
expect(exit.reason).toBe("exit");
|
||||
expect(exit.exitCode).toBe(0);
|
||||
expect(exit.stdout).toBe("ok");
|
||||
});
|
||||
|
||||
it("enforces no-output timeout for silent processes", async () => {
|
||||
const supervisor = createProcessSupervisor();
|
||||
const run = await supervisor.spawn({
|
||||
sessionId: "s1",
|
||||
backendId: "test",
|
||||
mode: "child",
|
||||
argv: [process.execPath, "-e", "setTimeout(() => {}, 10_000)"],
|
||||
timeoutMs: 5_000,
|
||||
noOutputTimeoutMs: 200,
|
||||
stdinMode: "pipe-closed",
|
||||
});
|
||||
const exit = await run.wait();
|
||||
expect(exit.reason).toBe("no-output-timeout");
|
||||
expect(exit.noOutputTimedOut).toBe(true);
|
||||
expect(exit.timedOut).toBe(true);
|
||||
});
|
||||
|
||||
it("cancels prior scoped run when replaceExistingScope is enabled", async () => {
|
||||
const supervisor = createProcessSupervisor();
|
||||
const first = await supervisor.spawn({
|
||||
sessionId: "s1",
|
||||
backendId: "test",
|
||||
scopeKey: "scope:a",
|
||||
mode: "child",
|
||||
argv: [process.execPath, "-e", "setTimeout(() => {}, 10_000)"],
|
||||
timeoutMs: 10_000,
|
||||
stdinMode: "pipe-open",
|
||||
});
|
||||
|
||||
const second = await supervisor.spawn({
|
||||
sessionId: "s1",
|
||||
backendId: "test",
|
||||
scopeKey: "scope:a",
|
||||
replaceExistingScope: true,
|
||||
mode: "child",
|
||||
argv: [process.execPath, "-e", 'process.stdout.write("new")'],
|
||||
timeoutMs: 2_000,
|
||||
stdinMode: "pipe-closed",
|
||||
});
|
||||
|
||||
const firstExit = await first.wait();
|
||||
const secondExit = await second.wait();
|
||||
expect(firstExit.reason === "manual-cancel" || firstExit.reason === "signal").toBe(true);
|
||||
expect(secondExit.reason).toBe("exit");
|
||||
expect(secondExit.stdout).toBe("new");
|
||||
});
|
||||
|
||||
it("applies overall timeout even for near-immediate timer firing", async () => {
|
||||
const supervisor = createProcessSupervisor();
|
||||
const run = await supervisor.spawn({
|
||||
sessionId: "s-timeout",
|
||||
backendId: "test",
|
||||
mode: "child",
|
||||
argv: [process.execPath, "-e", "setTimeout(() => {}, 10_000)"],
|
||||
timeoutMs: 1,
|
||||
stdinMode: "pipe-closed",
|
||||
});
|
||||
const exit = await run.wait();
|
||||
expect(exit.reason).toBe("overall-timeout");
|
||||
expect(exit.timedOut).toBe(true);
|
||||
});
|
||||
|
||||
it("can stream output without retaining it in RunExit payload", async () => {
|
||||
const supervisor = createProcessSupervisor();
|
||||
let streamed = "";
|
||||
const run = await supervisor.spawn({
|
||||
sessionId: "s-capture",
|
||||
backendId: "test",
|
||||
mode: "child",
|
||||
argv: [process.execPath, "-e", 'process.stdout.write("streamed")'],
|
||||
timeoutMs: 2_000,
|
||||
stdinMode: "pipe-closed",
|
||||
captureOutput: false,
|
||||
onStdout: (chunk) => {
|
||||
streamed += chunk;
|
||||
},
|
||||
});
|
||||
const exit = await run.wait();
|
||||
expect(streamed).toBe("streamed");
|
||||
expect(exit.stdout).toBe("");
|
||||
});
|
||||
});
|
||||
282
src/process/supervisor/supervisor.ts
Normal file
282
src/process/supervisor/supervisor.ts
Normal file
@@ -0,0 +1,282 @@
|
||||
import crypto from "node:crypto";
|
||||
import type {
|
||||
ManagedRun,
|
||||
ProcessSupervisor,
|
||||
RunExit,
|
||||
RunRecord,
|
||||
SpawnInput,
|
||||
TerminationReason,
|
||||
} from "./types.js";
|
||||
import { getShellConfig } from "../../agents/shell-utils.js";
|
||||
import { createSubsystemLogger } from "../../logging/subsystem.js";
|
||||
import { createChildAdapter } from "./adapters/child.js";
|
||||
import { createPtyAdapter } from "./adapters/pty.js";
|
||||
import { createRunRegistry } from "./registry.js";
|
||||
|
||||
const log = createSubsystemLogger("process/supervisor");
|
||||
|
||||
type ActiveRun = {
|
||||
run: ManagedRun;
|
||||
scopeKey?: string;
|
||||
};
|
||||
|
||||
function clampTimeout(value?: number): number | undefined {
|
||||
if (typeof value !== "number" || !Number.isFinite(value) || value <= 0) {
|
||||
return undefined;
|
||||
}
|
||||
return Math.max(1, Math.floor(value));
|
||||
}
|
||||
|
||||
function isTimeoutReason(reason: TerminationReason) {
|
||||
return reason === "overall-timeout" || reason === "no-output-timeout";
|
||||
}
|
||||
|
||||
export function createProcessSupervisor(): ProcessSupervisor {
|
||||
const registry = createRunRegistry();
|
||||
const active = new Map<string, ActiveRun>();
|
||||
|
||||
const cancel = (runId: string, reason: TerminationReason = "manual-cancel") => {
|
||||
const current = active.get(runId);
|
||||
if (!current) {
|
||||
return;
|
||||
}
|
||||
registry.updateState(runId, "exiting", {
|
||||
terminationReason: reason,
|
||||
});
|
||||
current.run.cancel(reason);
|
||||
};
|
||||
|
||||
const cancelScope = (scopeKey: string, reason: TerminationReason = "manual-cancel") => {
|
||||
if (!scopeKey.trim()) {
|
||||
return;
|
||||
}
|
||||
for (const [runId, run] of active.entries()) {
|
||||
if (run.scopeKey !== scopeKey) {
|
||||
continue;
|
||||
}
|
||||
cancel(runId, reason);
|
||||
}
|
||||
};
|
||||
|
||||
const spawn = async (input: SpawnInput): Promise<ManagedRun> => {
|
||||
const runId = input.runId?.trim() || crypto.randomUUID();
|
||||
if (input.replaceExistingScope && input.scopeKey?.trim()) {
|
||||
cancelScope(input.scopeKey, "manual-cancel");
|
||||
}
|
||||
const startedAtMs = Date.now();
|
||||
const record: RunRecord = {
|
||||
runId,
|
||||
sessionId: input.sessionId,
|
||||
backendId: input.backendId,
|
||||
scopeKey: input.scopeKey?.trim() || undefined,
|
||||
state: "starting",
|
||||
startedAtMs,
|
||||
lastOutputAtMs: startedAtMs,
|
||||
createdAtMs: startedAtMs,
|
||||
updatedAtMs: startedAtMs,
|
||||
};
|
||||
registry.add(record);
|
||||
|
||||
let forcedReason: TerminationReason | null = null;
|
||||
let settled = false;
|
||||
let stdout = "";
|
||||
let stderr = "";
|
||||
let timeoutTimer: NodeJS.Timeout | null = null;
|
||||
let noOutputTimer: NodeJS.Timeout | null = null;
|
||||
const captureOutput = input.captureOutput !== false;
|
||||
|
||||
const overallTimeoutMs = clampTimeout(input.timeoutMs);
|
||||
const noOutputTimeoutMs = clampTimeout(input.noOutputTimeoutMs);
|
||||
|
||||
const setForcedReason = (reason: TerminationReason) => {
|
||||
if (forcedReason) {
|
||||
return;
|
||||
}
|
||||
forcedReason = reason;
|
||||
registry.updateState(runId, "exiting", { terminationReason: reason });
|
||||
};
|
||||
|
||||
let cancelAdapter: ((reason: TerminationReason) => void) | null = null;
|
||||
|
||||
const requestCancel = (reason: TerminationReason) => {
|
||||
setForcedReason(reason);
|
||||
cancelAdapter?.(reason);
|
||||
};
|
||||
|
||||
const touchOutput = () => {
|
||||
registry.touchOutput(runId);
|
||||
if (!noOutputTimeoutMs || settled) {
|
||||
return;
|
||||
}
|
||||
if (noOutputTimer) {
|
||||
clearTimeout(noOutputTimer);
|
||||
}
|
||||
noOutputTimer = setTimeout(() => {
|
||||
requestCancel("no-output-timeout");
|
||||
}, noOutputTimeoutMs);
|
||||
};
|
||||
|
||||
try {
|
||||
if (input.mode === "child" && input.argv.length === 0) {
|
||||
throw new Error("spawn argv cannot be empty");
|
||||
}
|
||||
const adapter =
|
||||
input.mode === "pty"
|
||||
? await (async () => {
|
||||
const { shell, args: shellArgs } = getShellConfig();
|
||||
const ptyCommand = input.ptyCommand.trim();
|
||||
if (!ptyCommand) {
|
||||
throw new Error("PTY command cannot be empty");
|
||||
}
|
||||
return await createPtyAdapter({
|
||||
shell,
|
||||
args: [...shellArgs, ptyCommand],
|
||||
cwd: input.cwd,
|
||||
env: input.env,
|
||||
});
|
||||
})()
|
||||
: await createChildAdapter({
|
||||
argv: input.argv,
|
||||
cwd: input.cwd,
|
||||
env: input.env,
|
||||
windowsVerbatimArguments: input.windowsVerbatimArguments,
|
||||
input: input.input,
|
||||
stdinMode: input.stdinMode,
|
||||
});
|
||||
|
||||
registry.updateState(runId, "running", { pid: adapter.pid });
|
||||
|
||||
const clearTimers = () => {
|
||||
if (timeoutTimer) {
|
||||
clearTimeout(timeoutTimer);
|
||||
timeoutTimer = null;
|
||||
}
|
||||
if (noOutputTimer) {
|
||||
clearTimeout(noOutputTimer);
|
||||
noOutputTimer = null;
|
||||
}
|
||||
};
|
||||
|
||||
cancelAdapter = (_reason: TerminationReason) => {
|
||||
if (settled) {
|
||||
return;
|
||||
}
|
||||
adapter.kill("SIGKILL");
|
||||
};
|
||||
|
||||
if (overallTimeoutMs) {
|
||||
timeoutTimer = setTimeout(() => {
|
||||
requestCancel("overall-timeout");
|
||||
}, overallTimeoutMs);
|
||||
}
|
||||
if (noOutputTimeoutMs) {
|
||||
noOutputTimer = setTimeout(() => {
|
||||
requestCancel("no-output-timeout");
|
||||
}, noOutputTimeoutMs);
|
||||
}
|
||||
|
||||
adapter.onStdout((chunk) => {
|
||||
if (captureOutput) {
|
||||
stdout += chunk;
|
||||
}
|
||||
input.onStdout?.(chunk);
|
||||
touchOutput();
|
||||
});
|
||||
adapter.onStderr((chunk) => {
|
||||
if (captureOutput) {
|
||||
stderr += chunk;
|
||||
}
|
||||
input.onStderr?.(chunk);
|
||||
touchOutput();
|
||||
});
|
||||
|
||||
const waitPromise = (async (): Promise<RunExit> => {
|
||||
const result = await adapter.wait();
|
||||
if (settled) {
|
||||
return {
|
||||
reason: forcedReason ?? "exit",
|
||||
exitCode: result.code,
|
||||
exitSignal: result.signal,
|
||||
durationMs: Date.now() - startedAtMs,
|
||||
stdout,
|
||||
stderr,
|
||||
timedOut: isTimeoutReason(forcedReason ?? "exit"),
|
||||
noOutputTimedOut: forcedReason === "no-output-timeout",
|
||||
};
|
||||
}
|
||||
settled = true;
|
||||
clearTimers();
|
||||
adapter.dispose();
|
||||
active.delete(runId);
|
||||
|
||||
const reason: TerminationReason =
|
||||
forcedReason ?? (result.signal != null ? ("signal" as const) : ("exit" as const));
|
||||
const exit: RunExit = {
|
||||
reason,
|
||||
exitCode: result.code,
|
||||
exitSignal: result.signal,
|
||||
durationMs: Date.now() - startedAtMs,
|
||||
stdout,
|
||||
stderr,
|
||||
timedOut: isTimeoutReason(forcedReason ?? reason),
|
||||
noOutputTimedOut: forcedReason === "no-output-timeout",
|
||||
};
|
||||
registry.finalize(runId, {
|
||||
reason: exit.reason,
|
||||
exitCode: exit.exitCode,
|
||||
exitSignal: exit.exitSignal,
|
||||
});
|
||||
return exit;
|
||||
})().catch((err) => {
|
||||
if (!settled) {
|
||||
settled = true;
|
||||
clearTimers();
|
||||
active.delete(runId);
|
||||
adapter.dispose();
|
||||
registry.finalize(runId, {
|
||||
reason: "spawn-error",
|
||||
exitCode: null,
|
||||
exitSignal: null,
|
||||
});
|
||||
}
|
||||
throw err;
|
||||
});
|
||||
|
||||
const managedRun: ManagedRun = {
|
||||
runId,
|
||||
pid: adapter.pid,
|
||||
startedAtMs,
|
||||
stdin: adapter.stdin,
|
||||
wait: async () => await waitPromise,
|
||||
cancel: (reason = "manual-cancel") => {
|
||||
requestCancel(reason);
|
||||
},
|
||||
};
|
||||
|
||||
active.set(runId, {
|
||||
run: managedRun,
|
||||
scopeKey: input.scopeKey?.trim() || undefined,
|
||||
});
|
||||
return managedRun;
|
||||
} catch (err) {
|
||||
registry.finalize(runId, {
|
||||
reason: "spawn-error",
|
||||
exitCode: null,
|
||||
exitSignal: null,
|
||||
});
|
||||
log.warn(`spawn failed: runId=${runId} reason=${String(err)}`);
|
||||
throw err;
|
||||
}
|
||||
};
|
||||
|
||||
return {
|
||||
spawn,
|
||||
cancel,
|
||||
cancelScope,
|
||||
reconcileOrphans: async () => {
|
||||
// Deliberate no-op: this supervisor uses in-memory ownership only.
|
||||
// Active runs are not recovered after process restart in the current model.
|
||||
},
|
||||
getRecord: (runId: string) => registry.get(runId),
|
||||
};
|
||||
}
|
||||
96
src/process/supervisor/types.ts
Normal file
96
src/process/supervisor/types.ts
Normal file
@@ -0,0 +1,96 @@
|
||||
export type RunState = "starting" | "running" | "exiting" | "exited";
|
||||
|
||||
export type TerminationReason =
|
||||
| "manual-cancel"
|
||||
| "overall-timeout"
|
||||
| "no-output-timeout"
|
||||
| "spawn-error"
|
||||
| "signal"
|
||||
| "exit";
|
||||
|
||||
export type RunRecord = {
|
||||
runId: string;
|
||||
sessionId: string;
|
||||
backendId: string;
|
||||
scopeKey?: string;
|
||||
pid?: number;
|
||||
processGroupId?: number;
|
||||
startedAtMs: number;
|
||||
lastOutputAtMs: number;
|
||||
createdAtMs: number;
|
||||
updatedAtMs: number;
|
||||
state: RunState;
|
||||
terminationReason?: TerminationReason;
|
||||
exitCode?: number | null;
|
||||
exitSignal?: NodeJS.Signals | number | null;
|
||||
};
|
||||
|
||||
export type RunExit = {
|
||||
reason: TerminationReason;
|
||||
exitCode: number | null;
|
||||
exitSignal: NodeJS.Signals | number | null;
|
||||
durationMs: number;
|
||||
stdout: string;
|
||||
stderr: string;
|
||||
timedOut: boolean;
|
||||
noOutputTimedOut: boolean;
|
||||
};
|
||||
|
||||
export type ManagedRun = {
|
||||
runId: string;
|
||||
pid?: number;
|
||||
startedAtMs: number;
|
||||
stdin?: ManagedRunStdin;
|
||||
wait: () => Promise<RunExit>;
|
||||
cancel: (reason?: TerminationReason) => void;
|
||||
};
|
||||
|
||||
export type SpawnMode = "child" | "pty";
|
||||
|
||||
export type ManagedRunStdin = {
|
||||
write: (data: string, cb?: (err?: Error | null) => void) => void;
|
||||
end: () => void;
|
||||
destroy?: () => void;
|
||||
destroyed?: boolean;
|
||||
};
|
||||
|
||||
type SpawnBaseInput = {
|
||||
runId?: string;
|
||||
sessionId: string;
|
||||
backendId: string;
|
||||
scopeKey?: string;
|
||||
replaceExistingScope?: boolean;
|
||||
cwd?: string;
|
||||
env?: NodeJS.ProcessEnv;
|
||||
timeoutMs?: number;
|
||||
noOutputTimeoutMs?: number;
|
||||
/**
|
||||
* When false, stdout/stderr are streamed via callbacks only and not retained in RunExit payload.
|
||||
*/
|
||||
captureOutput?: boolean;
|
||||
onStdout?: (chunk: string) => void;
|
||||
onStderr?: (chunk: string) => void;
|
||||
};
|
||||
|
||||
type SpawnChildInput = SpawnBaseInput & {
|
||||
mode: "child";
|
||||
argv: string[];
|
||||
windowsVerbatimArguments?: boolean;
|
||||
input?: string;
|
||||
stdinMode?: "inherit" | "pipe-open" | "pipe-closed";
|
||||
};
|
||||
|
||||
type SpawnPtyInput = SpawnBaseInput & {
|
||||
mode: "pty";
|
||||
ptyCommand: string;
|
||||
};
|
||||
|
||||
export type SpawnInput = SpawnChildInput | SpawnPtyInput;
|
||||
|
||||
export interface ProcessSupervisor {
|
||||
spawn(input: SpawnInput): Promise<ManagedRun>;
|
||||
cancel(runId: string, reason?: TerminationReason): void;
|
||||
cancelScope(scopeKey: string, reason?: TerminationReason): void;
|
||||
reconcileOrphans(): Promise<void>;
|
||||
getRecord(runId: string): RunRecord | undefined;
|
||||
}
|
||||
Reference in New Issue
Block a user