sim/apps/docs/content/docs/en/knowledgebase/index.mdx
Waleed 37443a7b77 improvement(kb): improve chunkers, respect user-specified chunk configurations, added tests (#2539)
* improvement(kb): improve chunkers, respect user-specified chunk configurations, added tests

* ack PR commnets

* updated docs

* cleanup
2025-12-22 20:47:29 -08:00


---
title: Overview
description: Upload, process, and search through your documents with intelligent vector search and chunking
---
import { Video } from '@/components/ui/video'
import { Image } from '@/components/ui/image'
The knowledgebase lets you upload documents and search them with vector search. Sim automatically extracts the text from each supported file type, splits it into chunks, and embeds those chunks so they are searchable. You can then view and edit individual chunks and query them in natural language.
## Upload and Processing
Upload your documents to get started. Sim processes them in the background: it extracts the text, breaks it into searchable chunks, and creates embeddings for each chunk.
<div className="mx-auto w-full overflow-hidden rounded-lg">
<Video src="knowledgebase-1.mp4" width={700} height={450} />
</div>
The system handles the entire processing pipeline for you:
1. **Text Extraction**: Content is extracted from your documents using specialized parsers for each file type
2. **Intelligent Chunking**: Documents are broken into meaningful chunks with configurable size and overlap
3. **Embedding Generation**: Vector embeddings are created for semantic search capabilities
4. **Processing Status**: Track the progress as your documents are processed
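The stages above can be pictured as a small pipeline. The sketch below is purely illustrative and is not Sim's implementation: the status names, the `process_document` function, and the toy `embed` function are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical status names, for illustration only.
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class Document:
    name: str
    raw: bytes
    status: Status = Status.PENDING
    chunks: list[str] = field(default_factory=list)
    embeddings: list[list[float]] = field(default_factory=list)


def extract_text(raw: bytes) -> str:
    # Stand-in for a per-file-type parser: treat the bytes as UTF-8 text.
    return raw.decode("utf-8")


def split_into_chunks(text: str) -> list[str]:
    # Stand-in for the chunker: one chunk per non-empty paragraph.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def embed(chunk: str) -> list[float]:
    # Toy two-dimensional "embedding" (length and vowel ratio).
    # A real system would call an embedding model here.
    vowels = sum(ch in "aeiou" for ch in chunk.lower())
    return [float(len(chunk)), vowels / len(chunk)]


def process_document(doc: Document) -> Document:
    """Run extraction, chunking, and embedding, updating the status as it goes."""
    doc.status = Status.PROCESSING
    try:
        text = extract_text(doc.raw)
        doc.chunks = split_into_chunks(text)
        doc.embeddings = [embed(c) for c in doc.chunks]
        doc.status = Status.COMPLETED
    except Exception:
        doc.status = Status.FAILED
    return doc
```

Tracking the status on the document itself is one simple way to let a UI show progress while the work runs in the background.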
## Supported File Types
Sim supports PDF, Word (DOC/DOCX), plain text (TXT), Markdown (MD), HTML, Excel (XLS/XLSX), PowerPoint (PPT/PPTX), and CSV files. Files can be up to 100 MB each, with optimal performance for files under 50 MB. You can upload multiple documents simultaneously. Scanned PDF files can be processed with OCR when an OCR provider is configured.
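If you upload files from a script, it can help to check them against these limits first. The following is a minimal client-side sketch, not part of Sim: the `check_upload` helper is hypothetical, the extension list is derived from the formats named above, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

MB = 1024 * 1024                  # assumption: binary megabytes
MAX_FILE_BYTES = 100 * MB         # per-file limit stated above
RECOMMENDED_BYTES = 50 * MB       # "optimal performance" threshold stated above
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".txt", ".md", ".html",
    ".xls", ".xlsx", ".ppt", ".pptx", ".csv",
}


def check_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (accepted, message) for a candidate upload."""
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return False, f"unsupported file type: {ext or '(none)'}"
    if size_bytes > MAX_FILE_BYTES:
        return False, "file exceeds the 100 MB limit"
    if size_bytes > RECOMMENDED_BYTES:
        return True, "accepted, but files under 50 MB perform best"
    return True, "ok"
```

For example, `check_upload("report.pdf", 10 * MB)` is accepted, while a 101 MB CSV or an `.mp4` file is rejected before any upload is attempted.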
## Viewing and Editing Chunks
Once your documents are processed, you can view and edit the individual chunks. This gives you full control over how your content is organized and searched.
<Image src="/static/knowledgebase/knowledgebase.png" alt="Document chunks view showing processed content" width={800} height={500} />
### Chunk Configuration
When creating a knowledge base, you can configure how documents are split into chunks:
| Setting | Unit | Default | Range | Description |
|---------|------|---------|-------|-------------|
| **Max Chunk Size** | tokens | 1,024 | 100-4,000 | Maximum size of each chunk (1 token ≈ 4 characters) |
| **Min Chunk Size** | characters | 1 | 1-2,000 | Minimum chunk size to avoid tiny fragments |
| **Overlap** | characters | 200 | 0-500 | Context overlap between consecutive chunks |
Chunking uses **hierarchical splitting**, which respects document structure (sections, paragraphs, sentences).
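To see how the three settings interact, here is a simplified sketch. It validates the ranges from the table and then applies a plain fixed-window splitter with overlap. This is deliberately simpler than hierarchical splitting and is not Sim's chunker. The `ChunkConfig` and `chunk_text` names are hypothetical, and the rule that overlap must be smaller than the maximum chunk size is an assumption added so the window always moves forward.

```python
from dataclasses import dataclass

CHARS_PER_TOKEN = 4  # approximation from the table above


@dataclass
class ChunkConfig:
    max_size_tokens: int = 1024   # range 100-4,000
    min_size_chars: int = 1       # range 1-2,000
    overlap_chars: int = 200      # range 0-500

    def __post_init__(self) -> None:
        if not 100 <= self.max_size_tokens <= 4000:
            raise ValueError("max chunk size must be 100-4,000 tokens")
        if not 1 <= self.min_size_chars <= 2000:
            raise ValueError("min chunk size must be 1-2,000 characters")
        if not 0 <= self.overlap_chars <= 500:
            raise ValueError("overlap must be 0-500 characters")
        # Assumption for this sketch: the window must advance on every step.
        if self.overlap_chars >= self.max_size_chars:
            raise ValueError("overlap must be smaller than the max chunk size")

    @property
    def max_size_chars(self) -> int:
        return self.max_size_tokens * CHARS_PER_TOKEN


def chunk_text(text: str, cfg: ChunkConfig) -> list[str]:
    """Fixed-window split: each chunk starts `overlap_chars` before the previous one ends."""
    step = cfg.max_size_chars - cfg.overlap_chars
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + cfg.max_size_chars]
        # Drop fragments shorter than the configured minimum.
        if len(piece) >= cfg.min_size_chars:
            chunks.append(piece)
        if start + cfg.max_size_chars >= len(text):
            break
    return chunks
```

With the defaults, a chunk holds up to about 4,096 characters (1,024 tokens × 4) and shares its first 200 characters with the end of the previous chunk.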
### Editing Capabilities
- **Edit chunk content**: Modify the text content of individual chunks
- **Adjust chunk boundaries**: Merge or split chunks as needed
- **Add metadata**: Enhance chunks with additional context
- **Bulk operations**: Manage multiple chunks efficiently
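Conceptually, adjusting chunk boundaries comes down to two list operations. The sketch below models them on plain strings. The `merge_chunks` and `split_chunk` helpers are hypothetical and only illustrate the idea; in Sim these edits are made through the chunk editor.

```python
def merge_chunks(chunks: list[str], index: int, separator: str = "\n") -> list[str]:
    """Merge the chunk at `index` with the one after it and return a new list."""
    if not 0 <= index < len(chunks) - 1:
        raise IndexError("need a chunk and a following neighbor to merge")
    merged = chunks[index] + separator + chunks[index + 1]
    return chunks[:index] + [merged] + chunks[index + 2:]


def split_chunk(chunks: list[str], index: int, position: int) -> list[str]:
    """Split the chunk at `index` in two at character offset `position`."""
    if not 0 <= index < len(chunks):
        raise IndexError("chunk index out of range")
    chunk = chunks[index]
    if not 0 < position < len(chunk):
        raise ValueError("split position must fall inside the chunk")
    return chunks[:index] + [chunk[:position], chunk[position:]] + chunks[index + 1:]
```

Both helpers return a new list and leave the input untouched, which makes an edit easy to preview or undo.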
## Advanced PDF Processing
For PDF documents, Sim offers enhanced processing capabilities:
### OCR Support
When configured with Azure or [Mistral OCR](https://docs.mistral.ai/ocr/):
- **Scanned document processing**: Extract text from image-based PDFs
- **Mixed content handling**: Process PDFs with both text and images
- **High accuracy**: Advanced AI models help produce accurate text extraction
## Using the Knowledge Block in Workflows
Once your documents are processed, you can use them in your AI workflows through the Knowledge block. This enables Retrieval-Augmented Generation (RAG), allowing your AI agents to access and reason over your document content to provide more accurate, contextual responses.
<Image src="/static/knowledgebase/knowledgebase-2.png" alt="Using Knowledge Block in Workflows" width={800} height={500} />
### Knowledge Block Features
- **Semantic search**: Find relevant content using natural language queries
- **Context integration**: Automatically include relevant chunks in agent prompts
- **Dynamic retrieval**: Search happens in real-time during workflow execution
- **Relevance scoring**: Results ranked by semantic similarity
### Integration Options
- **System prompts**: Provide context to your AI agents
- **Dynamic context**: Search and include relevant information during conversations
- **Multi-document search**: Query across your entire knowledgebase
- **Filtered search**: Combine with tags for precise content retrieval
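The general RAG pattern behind these options is to filter the retrieved chunks by tag, keep the best-scoring ones, and add them to the agent's prompt. Here is a minimal sketch of that pattern. The `Chunk` type, the tag names, and both helper functions are assumptions for illustration and do not reflect the Knowledge block's internal API.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    score: float                                   # similarity score from a prior search
    tags: dict[str, str] = field(default_factory=dict)


def filter_by_tags(chunks: list[Chunk], required: dict[str, str]) -> list[Chunk]:
    """Keep chunks whose tags contain every required key/value pair."""
    return [c for c in chunks if all(c.tags.get(k) == v for k, v in required.items())]


def build_system_prompt(base: str, chunks: list[Chunk], limit: int = 3) -> str:
    """Append the highest-scoring chunks to a base system prompt."""
    top = sorted(chunks, key=lambda c: c.score, reverse=True)[:limit]
    if not top:
        return base
    context = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(top, start=1))
    return f"{base}\n\nContext:\n{context}"
```

Numbering the chunks in the prompt gives the agent a simple way to refer back to the passage it relied on.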
## Vector Search Technology
Sim uses vector search powered by [pgvector](https://github.com/pgvector/pgvector) to understand the meaning and context of your content:
### Semantic Understanding
- **Contextual search**: Finds relevant content even when exact keywords don't match
- **Concept-based retrieval**: Understands relationships between ideas
- **Multi-language support**: Works across different languages
- **Synonym recognition**: Finds related terms and concepts
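Semantic matching works because texts with similar meaning are mapped to nearby vectors, so relevance can be measured geometrically instead of by keyword overlap. The sketch below ranks documents by cosine similarity using tiny hand-written vectors. Real embeddings have hundreds or thousands of dimensions, and whether Sim uses cosine similarity specifically is an assumption here. For reference, pgvector exposes cosine distance (1 minus this similarity) through its `<=>` operator.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query: list[float], docs: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Return (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A query vector pointing in almost the same direction as a document vector scores close to 1, even if the two texts share no exact words.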
### Search Capabilities
- **Natural language queries**: Ask questions in plain language
- **Similarity search**: Find conceptually similar content
- **Hybrid search**: Combines vector and traditional keyword search
- **Configurable results**: Control the number and relevance threshold of results
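One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below uses it to illustrate hybrid search along with a result limit and a score threshold. This document does not say how Sim merges the two result sets, so RRF, the constant `k = 60`, and both function names are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse several ranked ID lists; each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def top_results(scores: dict[str, float], top_k: int, min_score: float = 0.0) -> list[str]:
    """Return up to `top_k` IDs whose fused score is at least `min_score`."""
    kept = [(doc_id, s) for doc_id, s in scores.items() if s >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in kept[:top_k]]
```

A chunk that appears near the top of both rankings ends up ahead of one that appears in only a single ranking, which is the behavior hybrid search is after.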
## Document Management
### Organization Features
- **Bulk upload**: Upload multiple files at once via the asynchronous API
- **Processing status**: Real-time updates on document processing
- **Search and filter**: Find documents quickly in large collections
- **Metadata tracking**: Automatic capture of file information and processing details
### Security and Privacy
- **Secure storage**: Documents stored with enterprise-grade security
- **Access control**: Workspace-based permissions
- **Processing isolation**: Each workspace has isolated document processing
- **Data retention**: Configure document retention policies
## Getting Started
1. **Navigate to your knowledgebase**: Access from your workspace sidebar
2. **Upload documents**: Drag and drop or select files to upload
3. **Monitor processing**: Watch as documents are processed and chunked
4. **Explore chunks**: View and edit the processed content
5. **Add to workflows**: Use the Knowledge block to integrate with your AI agents
The knowledgebase transforms your static documents into an intelligent, searchable resource that your AI workflows can leverage for more informed and contextual responses.