* fix: prune stale session entries, cap entry count, and rotate sessions.json
The sessions.json file grows unbounded over time. Every heartbeat tick (default: 30m)
triggers multiple full rewrites, and session keys from groups, threads, and DMs
accumulate indefinitely with large embedded objects (skillsSnapshot,
systemPromptReport). At >50MB the synchronous JSON parse blocks the event loop,
causing Telegram webhook timeouts and effectively taking the bot down.
Three mitigations, all running inside saveSessionStoreUnlocked() on every write:
1. Prune stale entries: remove entries with updatedAt older than 30 days
(configurable via session.maintenance.pruneDays in openclaw.json)
2. Cap entry count: keep only the 500 most recently updated entries
(configurable via session.maintenance.maxEntries). Entries without updatedAt
are evicted first.
3. File rotation: if the existing sessions.json exceeds 10MB before a write,
rename it to sessions.json.bak.{timestamp} and keep only the 3 most recent
backups (configurable via session.maintenance.rotateBytes).
All three thresholds are configurable under session.maintenance in openclaw.json
with Zod validation. No env vars.
Existing tests updated to use Date.now() instead of epoch-relative timestamps
(1, 2, 3) that would be incorrectly pruned as stale.
27 new tests covering pruning, capping, rotation, and integration scenarios.
* feat: auto-prune expired cron run sessions (#12289)
Add TTL-based reaper for isolated cron run sessions that accumulate
indefinitely in sessions.json.
New config option:
cron.sessionRetention: string | false (default: '24h')
The reaper runs piggy-backed on the cron timer tick, self-throttled
to sweep at most every 5 minutes. It removes session entries matching
the pattern cron:<jobId>:run:<uuid> whose updatedAt + retention < now.
Design follows the Kubernetes ttlSecondsAfterFinished pattern:
- Sessions are persisted normally (observability/debugging)
- A periodic reaper prunes expired entries
- Configurable retention with sensible default
- Set to false to disable pruning entirely
Files changed:
- src/config/types.cron.ts: Add sessionRetention to CronConfig
- src/config/zod-schema.ts: Add Zod validation for sessionRetention
- src/cron/session-reaper.ts: New reaper module (sweepCronRunSessions)
- src/cron/session-reaper.test.ts: 12 tests covering all paths
- src/cron/service/state.ts: Add cronConfig/sessionStorePath to deps
- src/cron/service/timer.ts: Wire reaper into onTimer tick
- src/gateway/server-cron.ts: Pass config and session store path to deps
Closes#12289
* fix: sweep cron session stores per agent
* docs: add changelog for session maintenance (#13083) (thanks @skyfallsin, @Glucksberg)
* fix: add warn-only session maintenance mode
* fix: warn-only maintenance defaults to active session
* fix: deliver maintenance warnings to active session
* docs: add session maintenance examples
* fix: accept duration and size maintenance thresholds
* refactor: share cron run session key check
* fix: format issues and replace defaultRuntime.warn with console.warn
---------
Co-authored-by: Pradeep Elankumaran <pradeepe@gmail.com>
Co-authored-by: Glucksberg <markuscontasul@gmail.com>
Co-authored-by: max <40643627+quotentiroler@users.noreply.github.com>
Co-authored-by: quotentiroler <max.nussbaumer@maxhealth.tech>
* fix(tools): correct Grok response parsing for xAI Responses API
The xAI Responses API returns content in output[0].content[0].text,
not in output_text field. Updated GrokSearchResponse type and
runGrokSearch to extract content from the correct path.
Fixes the 'No response' issue when using Grok web search.
* fix(tools): harden Grok web_search parsing (#13049) (thanks @ereid7)
---------
Co-authored-by: erai <erai@erais-Mac-mini.local>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
* fix: cap Discord gateway reconnect attempts to prevent infinite loop
The Discord GatewayPlugin was configured with maxAttempts: Infinity,
which causes an unbounded reconnection loop when the Discord gateway
enters a persistent failure state (e.g. code 1005 with stalled HELLO).
In production, this manifested as 2,483+ reconnection attempts in a
single log file, starving the Node.js event loop and preventing cron,
heartbeat, and other subsystems from functioning.
Cap maxAttempts at 50, which provides ~25 minutes of retry time
(with 30s HELLO timeout between attempts) before cleanly exiting
via the existing "Max reconnect attempts" error handler.
Closes#11836
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Changelog: note Discord gateway reconnect cap (#12230) (thanks @Yida-Dev)
---------
Co-authored-by: Yida-Dev <reyifeijun@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Shadow <shadow@clawd.bot>
* Config: reload dotenv before env substitution on runtime loads
* Test: isolate config env var regression from host state env
* fix: keep dotenv vars resolvable on runtime config reloads (#12748) (thanks @rodrigouroz)
---------
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
* fix(web_search): Fix invalid model name sent to Perplexity
* chore: Only apply fix to direct Perplexity calls
* fix(web_search): normalize direct Perplexity model IDs
* fix: add changelog note for perplexity model normalization (#12795) (thanks @cdorsey)
* fix: align tests and fetch type for gate stability (#12795) (thanks @cdorsey)
* chore: keep #12795 scoped to web_search changes
---------
Co-authored-by: Sebastian <19554889+sebslight@users.noreply.github.com>
* Memory/QMD: symlink default model cache into custom XDG_CACHE_HOME
QmdMemoryManager overrides XDG_CACHE_HOME to isolate the qmd index
per-agent, but this also moves where qmd looks for its ML models
(~2.1GB). Since models are installed at the default location
(~/.cache/qmd/models/), every qmd invocation would attempt to
re-download them from HuggingFace and time out.
Fix: on initialization, symlink ~/.cache/qmd/models/ into the custom
XDG_CACHE_HOME path so the index stays isolated per-agent while the
shared models are reused. The symlink is only created when the default
models directory exists and the target path does not already exist.
Includes tests for the three key scenarios: symlink creation, existing
directory preservation, and graceful skip when no default models exist.
* Memory/QMD: skip model symlink warning on ENOENT
* test: stabilize warning-filter visibility assertion (#12114) (thanks @tyler6204)
* fix: add changelog entry for QMD cache reuse (#12114) (thanks @tyler6204)
* fix: handle plain context-overflow strings in compaction detection (#12114) (thanks @tyler6204)
* fix(cron): recover flat params when LLM omits job wrapper (#11310)
Non-frontier models (e.g. Grok) flatten job properties to the top level
alongside `action` instead of nesting them inside the `job` parameter.
The opaque schema (`Type.Object({}, { additionalProperties: true })`)
gives these models no structural hint, so they put name, schedule,
payload, etc. as siblings of action.
Add a flat-params recovery step in the cron add handler: when
`params.job` is missing or an empty object, scan for recognised job
property names on params and construct a synthetic job object before
passing to `normalizeCronJobCreate`. Recovery requires at least one
meaningful signal field (schedule, payload, message, or text) to avoid
false positives.
Added tests:
- Flat params with no job wrapper → recovered
- Empty job object + flat params → recovered
- Message shorthand at top level → inferred as agentTurn
- No meaningful fields → still throws 'job required'
- Non-empty job takes precedence over flat params
* fix(cron): floor nowMs to second boundary before croner lookback
Cron expressions operate at second granularity. When nowMs falls
mid-second (e.g. 12:00:00.500) and the pattern targets that exact
second (like '0 0 12 * * *'), a 1ms lookback still lands inside the
matching second. Croner interprets this as 'already past' and skips
to the next occurrence (e.g. the following day).
Fix: floor nowMs to the start of the current second before applying
the 1ms lookback. This ensures the reference always falls in the
*previous* second, so croner correctly identifies the current match.
Also compare the result against the floored nowSecondMs (not raw nowMs)
so that a match at the start of the current second is not rejected by
the >= guard when nowMs has sub-second offset.
Adds regression tests for 6-field cron patterns with specific seconds.
* fix: add changelog entries for cron fixes (#12124) (thanks @tyler6204)
* test: stabilize warning filter emit assertion (#12124) (thanks @tyler6204)
* feat: Implement Telegram video note support with tests and docs
* fixing lint
* feat: add doctor-state-integrity command, Telegram messaging, and PowerShell Docker setup scripts.
* Update src/telegram/send.video-note.test.ts
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* fix: Set video note follow-up text to undefined for empty input and adjust caption test expectation.
* test: add assertion for `sendMessage` with reply markup and HTML parse mode in `send.video-note` test.
* docs: add changelog entry for Telegram video notes
---------
Co-authored-by: Evgenii Utkin <thewulf7@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: CLAWDINATOR Bot <clawdinator[bot]@users.noreply.github.com>
* fix(skills): ignore Python venvs and caches in skills watcher
Add .venv, venv, __pycache__, .mypy_cache, .pytest_cache, build, and
.cache to the default ignored patterns for the skills watcher.
This prevents file descriptor exhaustion when a skill contains a Python
virtual environment with tens of thousands of files, which was causing
EBADF spawn errors on macOS.
Fixes#1056
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* docs: add changelog entry for skills watcher ignores
* docs: fill changelog PR number
---------
Co-authored-by: Kyle Howells <freerunnering@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: CLAWDINATOR Bot <clawdinator[bot]@users.noreply.github.com>
* Gateway: preserve Pi transcript parentId for injected messages
Thread: unknown
When: 2026-02-08 20:08 CST
Repo: https://github.com/openclaw/openclaw.git
Branch: codex/wip/2026-02-09/compact-post-compaction-parentid-fix
Problem
- Post-compaction turns sometimes lost the compaction summary + kept suffix in the *next* provider request.
- Root cause was session graph corruption: gateway appended "stopReason: injected" transcript lines via raw JSONL writes without `parentId`.
- Pi's `SessionManager.buildSessionContext()` walks the `parentId` chain from the current leaf; missing `parentId` can sever the active branch and hide the compaction entry.
Fix
- Use `SessionManager.appendMessage(...)` for injected assistant transcript writes so `parentId` is set to the current leaf.
- Route `chat.inject` through the same helper to avoid duplicating the broken raw JSONL append logic.
Why This Matters
- The compaction algorithm may be correct, but if the leaf chain is broken right after compaction, the provider payload cannot include the summary/suffix "shape" Pi expects.
Testing
- pnpm test src/agents/pi-embedded-helpers.post-compaction-shape.test.ts src/agents/pi-embedded-runner/run.overflow-compaction.post-context.test.ts
- pnpm build
Notes
- This is provider-shape agnostic: it fixes transcript structure so Anthropic/Gemini/etc all see the same post-compaction context.
Resume
- If post-compaction looks wrong again, inspect the session transcript for entries missing `parentId` immediately after `type: compaction`.
* Gateway: guardrail test for transcript parentId (chat.inject)
* Gateway: guardrail against raw transcript appends (chat.ts)
* Gateway: add local AGENTS.md note to preserve Pi transcript parentId chain
* Changelog: note gateway post-compaction amnesia fix
* Gateway: store injected transcript messages with valid stopReason
* Gateway: use valid stopReason in injected fallback
* macOS: honor Nix defaults suite; auto launch in Nix mode
Fixes repeated onboarding in Nix deployments by detecting nixMode from the stable defaults suite (ai.openclaw.mac) and bridging key settings into the current defaults domain.
Also enables LaunchAgent autostart by default in Nix mode (escape hatch: openclaw.nixAutoLaunchAtLogin=false).
* macOS: keep Nix mode fix focused
Drop the automatic launch-at-login behavior from the Nix defaults patch; keep this PR scoped to reliable nixMode detection + defaults bridging.
* macOS: simplify nixMode fix
Remove the defaults-bridging helper and rely on a single, stable defaults suite (ai.openclaw.mac) for nixMode detection when running as an app bundle. This keeps the fix focused on onboarding suppression and rename churn resilience.
* macOS: fix nixMode defaults suite churn (#12205)
* fix(paths): structurally resolve home dir to prevent Windows path bugs
Extract resolveRawHomeDir as a private function and gate the public
resolveEffectiveHomeDir through a single path.resolve() exit point.
This makes it structurally impossible for unresolved paths (missing
drive letter on Windows) to escape the function, regardless of how
many return paths exist in the raw lookup logic.
Simplify resolveRequiredHomeDir to only resolve the process.cwd()
fallback, since resolveEffectiveHomeDir now returns resolved values.
Fix shortenMeta in tool-meta.ts: the colon-based split for file:line
patterns (e.g. file.txt:12) conflicts with Windows drive letters
(C:\...) because indexOf(":") matches the drive colon first.
shortenHomeInString already handles file:line patterns correctly via
split/join, so the colon split was both unnecessary and harmful.
Update test assertions across all affected files to use path.resolve()
in expected values and input strings so they match the now-correct
resolved output on both Unix and Windows.
Fixes#12119
* fix(changelog): add paths Windows fix entry (#12125)
---------
Co-authored-by: Sebastian <19554889+sebslight@users.noreply.github.com>