
Configure how your documents are prepared and split before indexing.
This step defines how your raw documents are transformed into searchable chunks. Document processing happens in three phases: preprocessing cleans your text, chunking splits it into manageable pieces, and metadata extraction adds contextual information.
Preprocessing prepares your raw document text for indexing. The goal is to clean and normalize content so the embedding model can create accurate representations.
| Setting | What It Does |
|---|---|
| Collapse Spaces | Reduces multiple consecutive spaces into a single space, cleaning up inconsistent formatting |
| Remove Blank Lines | Removes empty lines between paragraphs to reduce noise |
| Normalize Unicode | Converts accented and other Unicode characters to their closest ASCII equivalents (é → e), improving consistency |
| Lowercase | Converts all text to lowercase, which can improve matching for case-insensitive searches |
| Remove Non-ASCII | Strips characters outside the standard ASCII range |
| Spellcheck | Enables spell checking and automatic correction |
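
To make the effect of these options concrete, here is a minimal sketch of how the first five settings could be applied to a string. The function name `preprocess`, its parameters, and the order of operations are assumptions made for illustration; they are not the product's actual implementation, and spellchecking is left out.

```python
import re
import unicodedata


def preprocess(text, collapse_spaces=True, remove_blank_lines=True,
               normalize_unicode=True, lowercase=False, remove_non_ascii=False):
    """Apply general text-cleaning options in a fixed, illustrative order."""
    if normalize_unicode:
        # Decompose accented characters (é becomes e + combining accent),
        # then drop the combining marks so only the base letter remains.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if remove_non_ascii:
        # Silently discard anything outside the ASCII range.
        text = text.encode("ascii", errors="ignore").decode("ascii")
    if collapse_spaces:
        # Runs of spaces or tabs become a single space; newlines are untouched.
        text = re.sub(r"[ \t]{2,}", " ", text)
    if remove_blank_lines:
        text = "\n".join(line for line in text.splitlines() if line.strip())
    if lowercase:
        text = text.lower()
    return text
```

With the defaults above, `preprocess("Café   au  lait\n\n\nSecond line")` returns `"Cafe au lait\nSecond line"`. Note that Lowercase and Remove Non-ASCII discard information, which is why this sketch leaves them off unless requested.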

For PDF documents, the following settings are available:

| Setting | What It Does |
|---|---|
| Enable OCR | Extracts text from scanned PDF pages using optical character recognition |
| OCR Engine | Selects the OCR engine (Tesseract, EasyOCR, PaddleOCR, TrOCR) — different engines handle different document types better |
| Page Range | Limits processing to specific pages instead of the entire document |
| Remove Headers | Detects and removes repeated page headers that add noise |
| Remove Footers | Detects and removes repeated page footers that add noise |
| Merge Lines to Paragraphs | Combines broken lines into complete paragraphs for better readability |
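
The header, footer, and line-merging options can be pictured with a small sketch. It assumes the PDF text has already been extracted into one string per page, and it treats a first or last line as a header or footer when the identical line appears on most pages. The function names and the `min_share` threshold are hypothetical; real detection is usually more tolerant (for example, of page numbers that change from page to page).

```python
from collections import Counter


def remove_repeated_edges(pages, min_share=0.6):
    """Drop first/last lines that repeat on at least `min_share` of pages."""
    split_pages = [page.splitlines() for page in pages]
    non_empty = [lines for lines in split_pages if lines]
    firsts = Counter(lines[0].strip() for lines in non_empty)
    lasts = Counter(lines[-1].strip() for lines in non_empty)
    threshold = min_share * len(non_empty)
    # A line counts as a header/footer only if it occurs on more than one page.
    headers = {text for text, n in firsts.items() if n >= threshold and n > 1}
    footers = {text for text, n in lasts.items() if n >= threshold and n > 1}

    cleaned = []
    for lines in split_pages:
        if lines and lines[0].strip() in headers:
            lines = lines[1:]
        if lines and lines[-1].strip() in footers:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned


def merge_lines(text):
    """Join hard-wrapped lines; blank lines remain paragraph breaks."""
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Running `remove_repeated_edges` before `merge_lines` matters in this sketch: once lines are merged into paragraphs, a header is no longer a separate line and cannot be detected.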

For HTML content, the following settings are available:

| Setting | What It Does |
|---|---|
| Normalize Whitespace | Cleans up inconsistent spacing in HTML content |
| Strip Script Tags | Removes `<script>` tags and their content |
| Strip Style Tags | Removes `<style>` tags and their content |
| Remove Navigation | Removes navigation bars and sidebars |
| Remove Ads | Removes advertisement content |
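
As an illustration of what stripping tags "and their content" means, the sketch below uses Python's standard `html.parser` module to keep only visible text while skipping everything inside `<script>`, `<style>`, and `<nav>` elements. The class and function names are hypothetical, and treating `<nav>` as the only navigation marker is a simplifying assumption; ad removal is not covered because it depends on site-specific markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything nested inside skipped tags."""

    SKIPPED_TAGS = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    # Normalize whitespace: collapse every run of whitespace to one space.
    return " ".join(" ".join(parser.parts).split())
```

For example, `html_to_text("<style>p{color:red}</style><nav>Home</nav><p>Hello   <b>world</b></p>")` returns `"Hello world"`.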

For Markdown documents, the following settings are available:

| Setting | What It Does |
|---|---|
| Chunk by Headings | Splits chunks at heading boundaries to preserve document structure |
| Max Heading Level | Controls the deepest heading level respected for chunking (e.g., level 3 = H1/H2/H3) |
| Remove Code Blocks | Strips fenced code blocks from the processed text |
| Flatten Bullet Lists | Converts nested bullet lists to flat lists for simpler processing |
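
The interaction between Chunk by Headings and Max Heading Level can be sketched as follows. This is a simplified illustration that recognizes only `#`-style headings and ignores `#` lines inside fenced code blocks; the function name and the `max_level` parameter are assumptions, not the product's API.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+\S")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def chunk_by_headings(markdown, max_level=3):
    """Start a new chunk at every heading whose level is <= max_level."""
    chunks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match and len(match.group(1)) <= max_level and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

With `max_level=3`, an H4 heading does not start a new chunk; it stays inside the chunk opened by the nearest H1, H2, or H3 above it.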

For tabular data, the following settings are available:

| Setting | What It Does |
|---|---|
| Remove Empty Rows | Skips rows that contain no data |
| Trim Numeric Precision | Limits decimal places for numeric values |
| Collapse Duplicate Rows | Removes rows that are exact duplicates |
| Max Row Limit | Sets the maximum number of rows to process |
| Drop Columns | Specifies column indices to exclude from processing |
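
The row and column options can be combined in a single pass over the data. The sketch below assumes CSV input and uses Python's standard `csv` module; the function names, the zero-based column indices, and the choice to detect duplicates on the raw row (before columns are dropped or numbers trimmed) are all illustrative assumptions.

```python
import csv
import io


def _trim(cell, precision):
    """Round decimal values to `precision` places; leave other cells alone."""
    if "." not in cell:
        return cell  # integers and plain text are returned unchanged
    try:
        number = float(cell)
    except ValueError:
        return cell
    return f"{number:.{precision}f}"


def clean_rows(csv_text, drop_columns=(), precision=2, max_rows=None):
    seen, cleaned = set(), []
    for row in csv.reader(io.StringIO(csv_text)):
        if not any(cell.strip() for cell in row):
            continue  # Remove Empty Rows
        key = tuple(row)
        if key in seen:
            continue  # Collapse Duplicate Rows (exact matches only)
        seen.add(key)
        kept = [cell for i, cell in enumerate(row) if i not in drop_columns]  # Drop Columns
        cleaned.append([_trim(cell, precision) for cell in kept])  # Trim Numeric Precision
        if max_rows is not None and len(cleaned) >= max_rows:
            break  # Max Row Limit
    return cleaned
```

In this sketch the header row counts toward `max_rows`, because it is treated like any other row.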
Chunking determines how your documents are split into smaller pieces for embedding. This is one of the most important settings because chunk boundaries decide which text can be retrieved together, so it directly affects retrieval quality.
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-Size | Splits text at regular token intervals | Consistent, uniform documents |
| Semantic | Uses AI to find natural topic boundaries | Complex documents with clear sections |
| Recursive | Tries larger boundaries first, falls back to smaller | Mixed content types |
| Document-Based | Uses existing document structure | Well-structured documents with headings |
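
The Recursive strategy is the least obvious of the four, so here is a minimal sketch of the idea: split on the largest boundary first (blank lines between paragraphs), and only fall back to smaller boundaries (single newlines, then spaces) for pieces that are still too large. Sizes are measured in whitespace-separated words as a stand-in for tokens, and the function name, separator list, and greedy merging are assumptions for illustration.

```python
def recursive_split(text, max_words=50, separators=("\n\n", "\n", " ")):
    """Split on the largest separator first, recursing into oversized pieces."""
    if len(text.split()) <= max_words:
        return [text.strip()] if text.strip() else []
    if not separators:
        # No boundaries left to try: cut at fixed word counts.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    head, *rest = separators
    pieces = [piece for piece in text.split(head) if piece.strip()]
    chunks, buffer = [], ""
    for piece in pieces:
        candidate = f"{buffer}{head}{piece}" if buffer else piece
        if len(candidate.split()) <= max_words:
            buffer = candidate  # still fits: keep merging neighbours
            continue
        if buffer:
            chunks.append(buffer.strip())
            buffer = ""
        if len(piece.split()) <= max_words:
            buffer = piece
        else:
            # The piece alone is too large: retry with smaller boundaries.
            chunks.extend(recursive_split(piece, max_words, tuple(rest)))
    if buffer:
        chunks.append(buffer.strip())
    return chunks
```

Because small paragraphs are merged and only oversized ones are broken down further, this approach adapts to documents that mix short and long sections, which is why it suits mixed content types.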
Chunk size controls how many tokens each chunk contains. Larger chunks provide more context but may include irrelevant content. Smaller chunks are more precise but may miss connections between related ideas.
Chunk overlap ensures that content near chunk boundaries isn't lost. A 10% overlap means the last 10% of each chunk repeats at the start of the next one, preserving context that would otherwise be split.
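
The arithmetic behind size and overlap is easy to get wrong by one, so a short sketch may help. It assumes the text has already been tokenized into a list, and the names `chunk_with_overlap`, `chunk_size`, and `overlap_ratio` are hypothetical. Each chunk starts `chunk_size - overlap` tokens after the previous one, so the last `overlap` tokens of one chunk are the first `overlap` tokens of the next.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap_ratio=0.10):
    """Fixed-size chunks where each chunk repeats the tail of the previous one."""
    overlap = int(chunk_size * overlap_ratio)
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks
```

For 25 tokens with `chunk_size=10` and `overlap_ratio=0.2`, the overlap is 2 tokens and the step is 8, giving three chunks that start at positions 0, 8, and 16. With the defaults, a 512-token chunk and 10% overlap repeat 51 tokens between neighbours.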
Respect Sentence Boundaries ensures that chunks split at natural sentence ends rather than mid-sentence. This is important for technical documents containing code, formulas, or structured data, where a break in the middle of a sentence loses meaning.
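
One way to picture sentence-aware chunking is as packing whole sentences into a chunk until the next sentence would exceed the size limit. The sketch below is illustrative only: it counts words instead of tokens, uses a naive regular expression to find sentence ends (it will mis-split abbreviations such as "e.g."), and the name `pack_sentences` is hypothetical.

```python
import re


def pack_sentences(text, max_words=40):
    """Group whole sentences into chunks of at most `max_words` words."""
    # Split after ., ! or ? when followed by whitespace; punctuation is kept.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than max_words still becomes its own chunk.
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The trade-off is that chunk sizes become uneven: a chunk may end well short of the limit because the next sentence would not fit.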
Keeping paragraphs intact places each entire paragraph in a single chunk, even if that means chunks vary in size. This maintains the logical flow of ideas.
Metadata adds contextual information to each chunk, making search results more informative and enabling filtering.
| Setting | What It Does |
|---|---|
| Extract Metadata | Enables or disables all metadata extraction |
| Include Document Title | Adds the source document title to each chunk's metadata |
| Include Chunk Index | Adds the chunk's position number within the source document |
| Include Timestamps | Adds processing timestamps to track when chunks were created |
Add custom metadata fields to capture domain-specific information that helps with filtering and organization. For example, you could add fields like "department," "document version," or "author."
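
To show how the built-in and custom fields might sit side by side, here is a small sketch that attaches metadata to a list of chunk strings. The record layout and the key names (`document_title`, `chunk_index`, `processed_at`) are assumptions chosen for illustration, not the product's actual schema.

```python
from datetime import datetime, timezone


def attach_metadata(chunks, title, custom_fields=None, include_title=True,
                    include_index=True, include_timestamp=True):
    """Return one record per chunk with the selected metadata fields."""
    processed_at = datetime.now(timezone.utc).isoformat()
    records = []
    for index, text in enumerate(chunks):
        # Custom fields go in first; built-in fields use their own keys.
        metadata = dict(custom_fields or {})
        if include_title:
            metadata["document_title"] = title
        if include_index:
            metadata["chunk_index"] = index
        if include_timestamp:
            metadata["processed_at"] = processed_at
        records.append({"text": text, "metadata": metadata})
    return records


records = attach_metadata(
    ["First chunk.", "Second chunk."],
    title="Onboarding Guide",
    custom_fields={"department": "HR", "document_version": "2"},
)
```

Each record can then be filtered at query time, for example by keeping only chunks whose `department` is `"HR"`.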
For technical documents, enable "Respect Sentence Boundaries" to avoid splitting code snippets or formulas mid-sentence.