Waleed 1acafe8763 feat(knowledge): add token, sentence, recursive, and regex chunkers (#4102)
* feat(knowledge): add token, sentence, recursive, and regex chunkers

* fix(chunkers): standardize token estimation and use emcn dropdown

- Refactor all existing chunkers (Text, JsonYaml, StructuredData, Docs) to use shared utils
- Fix inconsistent token estimation (JsonYaml used tiktoken, StructuredData used /3 ratio)
- Fix DocsChunker operator precedence bug and hard-coded 300-token limit
- Fix JsonYamlChunker isStructuredData false positive on plain strings
- Add MAX_DEPTH recursion guard to JsonYamlChunker
- Replace @/components/ui/select with emcn DropdownMenu in strategy selector

* fix(chunkers): address research audit findings

- Expand RecursiveChunker recipes: markdown adds horizontal rules, code
  fences, blockquotes; code adds const/let/var/if/for/while/switch/return
- RecursiveChunker fallback uses splitAtWordBoundaries instead of char slicing
- RegexChunker ReDoS test uses adversarial strings (repeated chars, spaces)
- SentenceChunker abbreviation list adds St/Rev/Gen/No/Fig/Vol/months
  and single-capital-letter lookbehind
- Add overlap < maxSize validation in Zod schema and UI form
- Add pattern max length (500) validation in Zod schema
- Fix StructuredDataChunker footer grammar
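
The two validation rules in the list above (overlap smaller than maxSize, pattern length capped at 500) can be sketched as a plain function. This is only an illustration: the commit implements the rules in a Zod schema and in the UI form, and the `ChunkOptions` shape and `validateChunkOptions` name below are hypothetical.

```typescript
// Hypothetical option shape; the real Zod schema may use different field names.
interface ChunkOptions {
  maxSize: number;
  overlap: number;
  pattern?: string;
}

// Limit taken from the commit message ("pattern max length (500)").
const MAX_PATTERN_LENGTH = 500;

// Returns a list of human-readable problems; an empty list means the options pass.
function validateChunkOptions(opts: ChunkOptions): string[] {
  const errors: string[] = [];
  if (opts.overlap >= opts.maxSize) {
    errors.push("overlap must be smaller than maxSize");
  }
  if (opts.pattern !== undefined && opts.pattern.length > MAX_PATTERN_LENGTH) {
    errors.push(`pattern must be at most ${MAX_PATTERN_LENGTH} characters`);
  }
  return errors;
}
```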

* fix(chunkers): fix remaining audit issues across all chunkers

- DocsChunker: extract headers from cleaned content (not raw markdown)
  to fix position mismatch between header positions and chunk positions
- DocsChunker: strip export statements and JSX expressions in cleanContent
- DocsChunker: fix table merge dedup using equality instead of includes
- JsonYamlChunker: preserve path breadcrumbs when nested value fits in
  one chunk, matching LangChain RecursiveJsonSplitter behavior
- StructuredDataChunker: detect 2-column CSV (lowered threshold from >2
  to >=1) and use 20% relative tolerance instead of absolute +/-2
- TokenChunker: use sliding window overlap (matching LangChain/Chonkie)
  where chunks stay within chunkSize instead of exceeding it
- utils: splitAtWordBoundaries accepts optional stepChars for sliding
  window overlap; addOverlap uses newline join instead of space

* chore(chunkers): lint formatting

* updated styling

* fix(chunkers): audit fixes and comprehensive tests

- Fix SentenceChunker regex: lookbehinds now include the period to correctly handle abbreviations (Mr., Dr., etc.), initials (J.K.), and decimals
- Fix RegexChunker ReDoS: reset lastIndex between adversarial test iterations, add poisoned-suffix test strings
- Fix DocsChunker: skip code blocks during table boundary detection to prevent false positives from pipe characters
- Fix JsonYamlChunker: oversized primitive leaf values now fall back to text chunking instead of emitting a single chunk
- Fix TokenChunker: pass 0 to buildChunks for overlap metadata since sliding window handles overlap inherently
- Add defensive guard in splitAtWordBoundaries to prevent infinite loops if step is 0
- Add tests for utils, TokenChunker, SentenceChunker, RecursiveChunker, RegexChunker (236 total tests, 0 failures)
- Fix existing test expectations for updated footer format and isStructuredData behavior
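
To illustrate the SentenceChunker item above, here is a minimal sketch of a split pattern whose lookbehinds include the period. It is not the regex from the commit: the abbreviation list is a small sample, and `splitSentences` is a hypothetical name.

```typescript
// Sample abbreviations only; the real list is longer (titles, months, etc.).
const ABBREVIATIONS = ["Mr", "Mrs", "Ms", "Dr", "St", "Fig", "Vol", "No"];

// Split on whitespace that follows ., ! or ?, unless that punctuation is the
// period of a known abbreviation ("Dr.") or of a single-capital initial ("K.").
// Decimals such as 3.5 are never split because no whitespace follows the period.
const SENTENCE_BOUNDARY = new RegExp(
  "(?<!\\b(?:" + ABBREVIATIONS.join("|") + ")\\.)" +
    "(?<!\\b[A-Z]\\.)" +
    "(?<=[.!?])\\s+",
  "g"
);

function splitSentences(text: string): string[] {
  return text
    .split(SENTENCE_BOUNDARY)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}
```

One trade-off of the single-capital rule in this sketch: a sentence that really ends in a lone capital letter ("Plan B. Next step") is not split there.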

* chore(chunkers): remove unnecessary comments and dead code

Strip 445 lines of redundant TSDoc, math calculation comments,
implementation rationale notes, and assertion-restating comments
across all chunker source and test files.

* fix(chunkers): address PR review comments

- Fix regex fallback path: use sliding window for overlap instead of
  passing chunkOverlap to buildChunks without prepended overlap text
- Fix misleading strategy label: "Text (hierarchical splitting)" →
  "Text (word boundary splitting)"

* fix(chunkers): use consistent overlap pattern in regex fallback

Use addOverlap + buildChunks(chunks, overlap) in the regex fallback
path to match the main path and all other chunkers (TextChunker,
RecursiveChunker). The sliding window approach was inconsistent.

* fix(chunkers): prevent content loss in word boundary splitting

When splitAtWordBoundaries snaps end back to a word boundary, advance
pos from end (not pos + step) in non-overlapping mode. The step-based
advancement is preserved for the sliding window case (TokenChunker).
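
The advancement rule described above can be sketched roughly as follows. This is an assumption-laden illustration, not the code in utils: the signature is guessed from the commit message, sizes are in characters, and only spaces count as word boundaries.

```typescript
// Sketch: without stepChars the next chunk starts exactly where the previous
// one ended (no content loss after snapping back to a word boundary); with
// stepChars the window slides by a fixed step, which produces overlap.
function splitAtWordBoundaries(
  text: string,
  chunkChars: number,
  stepChars?: number
): string[] {
  const size = Math.max(1, chunkChars); // defensive: never use an empty window
  const chunks: string[] = [];
  let pos = 0;

  while (pos < text.length) {
    let end = Math.min(pos + size, text.length);
    if (end < text.length) {
      // Snap back to the last space inside the window, if there is one.
      const lastSpace = text.lastIndexOf(" ", end);
      if (lastSpace > pos) end = lastSpace;
    }

    const piece = text.slice(pos, end).trim();
    if (piece.length > 0) chunks.push(piece);
    if (end >= text.length) break;

    if (stepChars === undefined) {
      pos = Math.max(end, pos + 1); // advance from end, not from pos + step
    } else {
      pos += Math.max(1, stepChars); // guard against a step of 0
    }
  }
  return chunks;
}
```

In the non-overlapping mode, joining the returned chunks with single spaces reproduces a single-spaced input, which is a quick way to check that nothing was dropped.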

* fix(chunkers): restore structured data token ratio and overlap joiner

- Restore /3 token estimation for StructuredDataChunker (structured data
  is denser than prose, ~3 chars/token vs ~4)
- Change addOverlap joiner from \n to space to match original TextChunker
  behavior

* lint

* fix(chunkers): fall back to character-level overlap in sentence chunker

When no complete sentence fits within the overlap budget,
fall back to character-level word-boundary overlap from the
previous group's text. This ensures buildChunks metadata is
always correct.

* fix(chunkers): fix log message and add missing month abbreviations

- Fix regex fallback log: "character splitting" → "word-boundary splitting"
- Add Jun and Jul to sentence chunker abbreviation list

* lint

* fix(chunkers): restore structured data detection threshold to > 2

avgCount >= 1 was too permissive — prose with consistent comma usage
would be misclassified as CSV. Restore original > 2 threshold while
keeping the improved proportional tolerance.
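
A rough sketch of the detection heuristic as described: the average delimiter count per line must exceed 2, and every line must sit within 20% of that average. The function name, the comma-only delimiter, and scanning all lines rather than a sample are assumptions made for illustration.

```typescript
// Sketch of a CSV-likeness check; not the StructuredDataChunker source.
function looksLikeCsv(text: string): boolean {
  const lines = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (lines.length < 2) return false;

  // Number of commas on each line.
  const counts = lines.map((line) => line.split(",").length - 1);
  const avgCount = counts.reduce((sum, c) => sum + c, 0) / counts.length;

  // Restored threshold: more than 2 delimiters per line on average.
  if (avgCount <= 2) return false;

  // Proportional tolerance: each line within 20% of the average.
  const tolerance = avgCount * 0.2;
  return counts.every((c) => Math.abs(c - avgCount) <= tolerance);
}
```

Under this sketch a two-column CSV (one comma per line) is not detected, which is the trade-off the restored threshold accepts in exchange for not misclassifying comma-heavy prose.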

* fix(chunkers): pass chunkOverlap to buildChunks in TokenChunker

* fix(chunkers): restore separator-as-joiner pattern in splitRecursively

Separator was unconditionally prepended to parts after the first,
leaving leading punctuation on chunks after a boundary reset.
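
The separator-as-joiner pattern can be illustrated with a small merge loop. This is a simplified sketch, not splitRecursively itself: `mergeParts` is a hypothetical helper, sizes are in characters, and an oversized single part is kept as-is instead of being split further with a finer separator.

```typescript
// Merge already-split parts back into chunks of at most maxChars characters.
// The separator is only placed between two parts inside the same chunk, so a
// chunk that starts after a boundary reset never begins with the separator.
function mergeParts(parts: string[], separator: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";

  for (const part of parts) {
    const candidate = current === "" ? part : current + separator + part;
    if (candidate.length <= maxChars || current === "") {
      current = candidate;
    } else {
      chunks.push(current);
      current = part; // boundary reset: start fresh without a leading separator
    }
  }

  if (current !== "") chunks.push(current);
  return chunks;
}
```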

* feat(knowledge): add JSONL file support for knowledge base uploads

Parses JSON Lines files by splitting on newlines and converting to a
JSON array, which then flows through the existing JsonYamlChunker.
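
A minimal sketch of that conversion, assuming blank lines are skipped and a malformed line raises an error (neither behavior is stated in the commit message; `parseJsonl` is a hypothetical name):

```typescript
// Parse JSON Lines content into an array of values, one per non-empty line.
function parseJsonl(content: string): unknown[] {
  const records: unknown[] = [];
  const lines = content.split(/\r?\n/);

  lines.forEach((line, index) => {
    const trimmed = line.trim();
    if (trimmed === "") return; // assumption: blank lines are ignored
    try {
      records.push(JSON.parse(trimmed));
    } catch {
      // assumption: fail loudly and report the 1-based line number
      throw new Error(`Invalid JSON on line ${index + 1}`);
    }
  });

  return records;
}
```

The resulting array can then be serialized with `JSON.stringify` or handed directly to whatever consumes JSON arrays, which in this commit is the JsonYamlChunker path.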

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 21:33:29 -07:00