mirror of
https://github.com/simstudioai/sim.git
synced 2026-04-28 03:00:29 -04:00
* fix(knowledge): enqueue connector docs per-batch to survive sync timeouts * fix(connectors): convert all connectors to contentDeferred pattern and fix validation issues All 10 connectors now use contentDeferred: true in listDocuments, returning lightweight metadata stubs instead of downloading content during listing. Content is fetched lazily via getDocument only for new/changed documents, preventing Trigger.dev task timeouts on large syncs. Connector-specific fixes from validation audit: - Google Drive: metadata-based contentHash, orderBy for deterministic pagination, precise maxFiles, byte-length size check with truncation warning - OneDrive: metadata-based contentHash, orderBy for deterministic pagination - SharePoint: metadata-based contentHash, byte-length size check - Dropbox: metadata-based contentHash using content_hash field - Notion: code/equation block extraction, empty page fallback to title, reduced CHILD_PAGE_CONCURRENCY to 5, syncContext parameter - Confluence: syncContext caching for cloudId, reduced label concurrency to 5 - Gmail: use joinTagArray for label tags - Obsidian: syncRunId-based stub hash for forced re-fetch, mtime-based hash in getDocument, .trim() on vaultUrl, lightweight validateConfig - Evernote: retryOptions threaded through apiFindNotesMetadata and apiGetNote - GitHub: added contentDeferred: false to getDocument, syncContext parameter Infrastructure: - sync-engine: added syncRunId to syncContext for Obsidian change detection - confluence/utils: replaced raw fetch with fetchWithRetry, added retryOptions - oauth: added supportsRefreshTokenRotation: false for Dropbox - Updated add-connector and validate-connector skills with contentDeferred docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(connectors): address PR review comments - metadata merge, retryOptions, UTF-8 safety - Sync engine: merge metadata from getDocument during deferred hydration, so Gmail/Obsidian/Confluence tags and metadata survive the stub→full transition - Evernote: pass retryOptions {retries:3, backoff:500} from listDocuments and getDocument callers into apiFindNotesMetadata and apiGetNote - Google Drive + SharePoint: safe UTF-8 truncation that walks back to the last complete character boundary instead of splitting multi-byte chars Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(evernote): use correct RetryOptions property names maxRetries/initialDelayMs instead of retries/backoff to match the RetryOptions interface from lib/knowledge/documents/utils. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(sync-engine): merge title from getDocument and skip unchanged docs after hydration - Merge title from getDocument during deferred hydration so Gmail documents get the email Subject header instead of the snippet text - After hydration, compare the hydrated contentHash against the stored DB hash — if they match, skip the update. This prevents Obsidian (and any connector with a force-refresh stub hash) from re-uploading and re-processing unchanged documents every sync Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(sync-engine): dedup externalIds, enable deletion reconciliation, merge sourceUrl Three sync engine gaps identified during audit: 1. Duplicate externalId guard: if a connector returns the same externalId across pages (pagination overlap), skip the second occurrence to prevent unique constraint violations on add and double-uploads on update. 2. Deletion reconciliation: previously required explicit fullSync or syncMode='full', meaning docs deleted from the source accumulated in the KB forever. Now runs on all non-incremental syncs (which return ALL docs). Includes a safety threshold: if >50% of existing docs (and >5 docs) would be deleted, skip and warn — protects against partial listing failures. Explicit fullSync bypasses the threshold. 3. sourceUrl merge: hydration now picks up sourceUrl from getDocument, falling back to the stub's sourceUrl if getDocument doesn't set one. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * lint * fix(connectors): confluence version metadata fallback and google drive maxFiles guard - Confluence: use `version?.number` directly (undefined) in metadata instead of `?? ''` (empty string) to prevent Number('') = 0 passing NaN check in mapTags. Hash still uses `?? ''` for string interpolation. - Google Drive: add early return when previouslyFetched >= maxFiles to prevent effectivePageSize <= 0 which violates the API's pageSize requirement (1-1000). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(connectors): blogpost labels and capped listing deletion reconciliation - Confluence: fetchLabelsForPages now tries both /pages/{id}/labels and /blogposts/{id}/labels, preventing label loss when getDocument hydrates blogpost content (previously returned empty labels on 404). - Sync engine: skip deletion reconciliation when listing was capped (maxFiles/maxThreads). Connectors signal this via syncContext.listingCapped. Prevents incorrect deletion of docs beyond the cap that still exist in source. fullSync override still forces deletion for explicit cleanup. - Google Drive & Gmail: set syncContext.listingCapped = true when cap is hit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(connectors): set syncContext.listingCapped in all connectors with caps OneDrive, Dropbox, SharePoint, Confluence (v2 + CQL), and Notion (3 listing functions) now set syncContext.listingCapped = true when their respective maxFiles/maxPages limit is hit. Without this, the sync engine's deletion reconciliation would run against an incomplete listing and incorrectly hard-delete documents that exist in the source but fell outside the cap window. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(evernote): thread retryOptions through apiListTags and apiListNotebooks All calls to apiListTags and apiListNotebooks in both listDocuments and getDocument now pass retryOptions for consistent retry protection across all Thrift RPC calls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>