Files
sim/apps
Waleed 4c474e03c1 fix(connectors): contentDeferred pattern + validation fixes across all connectors (#3793)
* fix(knowledge): enqueue connector docs per-batch to survive sync timeouts

* fix(connectors): convert all connectors to contentDeferred pattern and fix validation issues

All 10 connectors now use contentDeferred: true in listDocuments, returning
lightweight metadata stubs instead of downloading content during listing.
Content is fetched lazily via getDocument only for new/changed documents,
preventing Trigger.dev task timeouts on large syncs.

Connector-specific fixes from validation audit:
- Google Drive: metadata-based contentHash, orderBy for deterministic pagination,
  precise maxFiles, byte-length size check with truncation warning
- OneDrive: metadata-based contentHash, orderBy for deterministic pagination
- SharePoint: metadata-based contentHash, byte-length size check
- Dropbox: metadata-based contentHash using content_hash field
- Notion: code/equation block extraction, empty page fallback to title,
  reduced CHILD_PAGE_CONCURRENCY to 5, syncContext parameter
- Confluence: syncContext caching for cloudId, reduced label concurrency to 5
- Gmail: use joinTagArray for label tags
- Obsidian: syncRunId-based stub hash for forced re-fetch, mtime-based hash
  in getDocument, .trim() on vaultUrl, lightweight validateConfig
- Evernote: retryOptions threaded through apiFindNotesMetadata and apiGetNote
- GitHub: added contentDeferred: false to getDocument, syncContext parameter

Infrastructure:
- sync-engine: added syncRunId to syncContext for Obsidian change detection
- confluence/utils: replaced raw fetch with fetchWithRetry, added retryOptions
- oauth: added supportsRefreshTokenRotation: false for Dropbox
- Updated add-connector and validate-connector skills with contentDeferred docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(connectors): address PR review comments - metadata merge, retryOptions, UTF-8 safety

- Sync engine: merge metadata from getDocument during deferred hydration,
  so Gmail/Obsidian/Confluence tags and metadata survive the stub→full transition
- Evernote: pass retryOptions {retries:3, backoff:500} from listDocuments and
  getDocument callers into apiFindNotesMetadata and apiGetNote
- Google Drive + SharePoint: safe UTF-8 truncation that walks back to the last
  complete character boundary instead of splitting multi-byte chars

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(evernote): use correct RetryOptions property names

maxRetries/initialDelayMs instead of retries/backoff to match the
RetryOptions interface from lib/knowledge/documents/utils.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(sync-engine): merge title from getDocument and skip unchanged docs after hydration

- Merge title from getDocument during deferred hydration so Gmail
  documents get the email Subject header instead of the snippet text
- After hydration, compare the hydrated contentHash against the stored
  DB hash — if they match, skip the update. This prevents Obsidian
  (and any connector with a force-refresh stub hash) from re-uploading
  and re-processing unchanged documents every sync

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(sync-engine): dedup externalIds, enable deletion reconciliation, merge sourceUrl

Three sync engine gaps identified during audit:

1. Duplicate externalId guard: if a connector returns the same externalId
   across pages (pagination overlap), skip the second occurrence to prevent
   unique constraint violations on add and double-uploads on update.

2. Deletion reconciliation: previously required explicit fullSync or
   syncMode='full', meaning docs deleted from the source accumulated in
   the KB forever. Now runs on all non-incremental syncs (which return
   ALL docs). Includes a safety threshold: if >50% of existing docs
   (and >5 docs) would be deleted, skip and warn — protects against
   partial listing failures. Explicit fullSync bypasses the threshold.

3. sourceUrl merge: hydration now picks up sourceUrl from getDocument,
   falling back to the stub's sourceUrl if getDocument doesn't set one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* lint

* fix(connectors): confluence version metadata fallback and google drive maxFiles guard

- Confluence: use `version?.number` directly (undefined) in metadata instead
  of `?? ''` (empty string) to prevent Number('') = 0 passing NaN check in
  mapTags. Hash still uses `?? ''` for string interpolation.
- Google Drive: add early return when previouslyFetched >= maxFiles to prevent
  effectivePageSize <= 0 which violates the API's pageSize requirement (1-1000).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(connectors): blogpost labels and capped listing deletion reconciliation

- Confluence: fetchLabelsForPages now tries both /pages/{id}/labels and
  /blogposts/{id}/labels, preventing label loss when getDocument hydrates
  blogpost content (previously returned empty labels on 404).
- Sync engine: skip deletion reconciliation when listing was capped
  (maxFiles/maxThreads). Connectors signal this via syncContext.listingCapped.
  Prevents incorrect deletion of docs beyond the cap that still exist in source.
  fullSync override still forces deletion for explicit cleanup.
- Google Drive & Gmail: set syncContext.listingCapped = true when cap is hit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(connectors): set syncContext.listingCapped in all connectors with caps

OneDrive, Dropbox, SharePoint, Confluence (v2 + CQL), and Notion (3 listing
functions) now set syncContext.listingCapped = true when their respective
maxFiles/maxPages limit is hit. Without this, the sync engine's deletion
reconciliation would run against an incomplete listing and incorrectly
hard-delete documents that exist in the source but fell outside the cap window.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(evernote): thread retryOptions through apiListTags and apiListNotebooks

All calls to apiListTags and apiListNotebooks in both listDocuments and
getDocument now pass retryOptions for consistent retry protection across
all Thrift RPC calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-26 19:04:54 -07:00
..