RAG Wizard

Step 3 — Document Processing

Configure how your documents are prepared and split before indexing.

This step defines how your raw documents are transformed into searchable chunks. Document processing happens in three phases: preprocessing cleans your text, chunking splits it into manageable pieces, and metadata extraction adds contextual information.

Preprocessing

Preprocessing prepares your raw document text for indexing. The goal is to clean and normalize content so the embedding model can create accurate representations.

General Text Cleaning

Setting	What It Does
Collapse Spaces	Reduces multiple consecutive spaces into a single space, cleaning up inconsistent formatting
Remove Blank Lines	Removes empty lines between paragraphs to reduce noise
Normalize Unicode	Converts accented and other Unicode characters to their closest ASCII equivalents (é → e), so variant spellings of the same word are indexed consistently
Lowercase	Converts all text to lowercase, which can improve matching for case-insensitive searches
Remove Non-ASCII	Strips characters outside the standard ASCII range, including accented letters and non-Latin scripts, so use it with care on multilingual content
Spellcheck	Enables spell checking and automatic correction
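
The cleaning settings above can be pictured as a chain of small text transforms. The sketch below is only an illustration of what each toggle might do, not the wizard's actual implementation: the function name `clean_text`, its parameter names, and the order in which the steps run are all assumptions. Spellcheck is left out because it needs a dictionary.

```python
import re
import unicodedata


def clean_text(text, collapse_spaces=True, remove_blank_lines=True,
               normalize_unicode=False, lowercase=False, remove_non_ascii=False):
    """Apply the selected cleaning steps (the order here is an assumption)."""
    if normalize_unicode:
        # Decompose accented characters, then drop the combining marks (é -> e).
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if remove_non_ascii:
        # Anything that cannot be encoded as ASCII is silently dropped.
        text = text.encode("ascii", errors="ignore").decode("ascii")
    if lowercase:
        text = text.lower()
    if collapse_spaces:
        # Runs of two or more spaces become one space.
        text = re.sub(r" {2,}", " ", text)
    if remove_blank_lines:
        text = "\n".join(line for line in text.splitlines() if line.strip())
    return text


print(clean_text("Café   menu\n\n\nOpen  daily", normalize_unicode=True, lowercase=True))
```

Running Normalize Unicode before Remove Non-ASCII matters in this sketch: in the opposite order, "é" would be deleted instead of converted to "e".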

PDF Processing

Setting	What It Does
Enable OCR	Extracts text from scanned PDF pages using optical character recognition
OCR Engine	Selects the OCR engine (Tesseract, EasyOCR, PaddleOCR, TrOCR) — different engines handle different document types better
Page Range	Limits processing to specific pages instead of the entire document
Remove Headers	Detects and removes repeated page headers that add noise
Remove Footers	Detects and removes repeated page footers that add noise
Merge Lines to Paragraphs	Combines broken lines into complete paragraphs for better readability
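
To make the Remove Headers and Remove Footers settings more concrete, here is one possible heuristic: treat the first or last line of a page as a header or footer when the same line (ignoring digits such as page numbers) recurs on most pages. The function name `strip_repeated_edges` and the `min_share` threshold are hypothetical, and the wizard may detect headers differently.

```python
import re
from collections import Counter


def _edge_key(line):
    # Mask digits so "Page 1" and "Page 2" count as the same footer.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages, min_share=0.6):
    """Drop a page's first/last line when that line recurs on most pages."""
    split_pages = [page.splitlines() for page in pages]
    firsts = Counter(_edge_key(lines[0]) for lines in split_pages if lines)
    lasts = Counter(_edge_key(lines[-1]) for lines in split_pages if lines)
    # Require at least two occurrences so single-page documents are untouched.
    threshold = max(2, int(len(pages) * min_share))
    cleaned = []
    for lines in split_pages:
        if lines and firsts[_edge_key(lines[0])] >= threshold:
            lines = lines[1:]
        if lines and lasts[_edge_key(lines[-1])] >= threshold:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned


pages = [
    "ACME Report\nIntro text\nPage 1",
    "ACME Report\nMore text\nPage 2",
    "ACME Report\nEnd text\nPage 3",
]
print(strip_repeated_edges(pages))
```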

HTML Processing

Setting	What It Does
Normalize Whitespace	Cleans up inconsistent spacing in HTML content
Strip Script Tags	Removes `<script>` tags and their content
Strip Style Tags	Removes `<style>` tags and their content
Remove Navigation	Removes navigation bars and sidebars
Remove Ads	Removes advertisement content
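
As a rough illustration of the tag-stripping settings, the sketch below uses Python's standard `html.parser` module to keep visible text while skipping `<script>`, `<style>`, and `<nav>` content. The class name `TextExtractor` and the helper `html_to_text` are assumptions made for this example. Ad removal is not covered, since it usually depends on site-specific heuristics.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything inside skipped tags."""

    SKIPPED = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining and re-splitting normalizes any leftover whitespace.
    return " ".join(" ".join(parser.parts).split())


sample = (
    "<html><head><style>p{color:red}</style><script>var a = 1;</script></head>"
    "<body><nav><a href='/'>Home</a></nav><p>Hello   <b>world</b></p></body></html>"
)
print(html_to_text(sample))
```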

Markdown Processing

Setting	What It Does
Chunk by Headings	Splits chunks at heading boundaries to preserve document structure
Max Heading Level	Controls the deepest heading level respected for chunking (e.g., level 3 = H1/H2/H3)
Remove Code Blocks	Strips fenced code blocks from the processed text
Flatten Bullet Lists	Converts nested bullet lists to flat lists for simpler processing
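
The interaction between Chunk by Headings and Max Heading Level can be sketched as follows. This is a simplified illustration that only recognizes `#`-style headings and ignores headings inside fenced code blocks. `split_by_headings` and `max_level` are hypothetical names.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+\S")
FENCE = "`" * 3  # Marker that opens or closes a fenced code block.


def split_by_headings(markdown, max_level=3):
    """Start a new chunk at every heading whose level is <= max_level."""
    chunks, current = [], []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match and len(match.group(1)) <= max_level and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


doc = "# A\nintro\n## B\ntext\n#### D\ndeep\n# C\nend"
for chunk in split_by_headings(doc, max_level=2):
    print(repr(chunk))
```

With `max_level=2`, the H4 heading does not start a new chunk, so its text stays attached to the enclosing H2 section.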

CSV Processing

Setting	What It Does
Remove Empty Rows	Skips rows that contain no data
Trim Numeric Precision	Limits decimal places for numeric values
Collapse Duplicate Rows	Removes rows that are exact duplicates
Max Row Limit	Sets the maximum number of rows to process
Drop Columns	Specifies column indices to exclude from processing
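
The CSV settings map naturally onto a single pass over the rows. The sketch below uses Python's standard `csv` module. It assumes that the first row is a header, that column indices are zero-based, and that duplicates are detected after numeric trimming. None of these details are confirmed by the wizard, and all function and parameter names are illustrative.

```python
import csv
import io


def _trim(cell, decimals):
    # Only touch cells that contain a decimal point and parse as numbers.
    if "." not in cell:
        return cell
    try:
        return f"{float(cell):.{decimals}f}"
    except ValueError:
        return cell


def clean_csv(text, drop_columns=(), max_rows=None, decimals=None):
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    dropped = set(drop_columns)
    keep = [i for i in range(len(header)) if i not in dropped]
    seen, rows = set(), []
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # Remove Empty Rows
        row = [row[i] for i in keep if i < len(row)]  # Drop Columns
        if decimals is not None:
            row = [_trim(cell, decimals) for cell in row]  # Trim Numeric Precision
        key = tuple(row)
        if key in seen:
            continue  # Collapse Duplicate Rows
        seen.add(key)
        rows.append(row)
        if max_rows is not None and len(rows) >= max_rows:
            break  # Max Row Limit
    return [header[i] for i in keep], rows


raw = "id,name,score\n1,a,3.14159\n,,\n1,a,3.14159\n2,b,2.5\n"
print(clean_csv(raw, drop_columns=[0], decimals=2))
```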

Chunking

Chunking determines how your documents are split into smaller pieces for embedding. This is one of the most important settings because it directly impacts retrieval quality.

Chunking Strategies

Strategy	How It Works	Best For
Fixed-Size	Splits text at regular token intervals	Consistent, uniform documents
Semantic	Uses AI to find natural topic boundaries	Complex documents with clear sections
Recursive	Tries larger boundaries first, falls back to smaller	Mixed content types
Document-Based	Uses existing document structure	Well-structured documents with headings
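
To show what "tries larger boundaries first, falls back to smaller" can mean in practice, here is a minimal sketch of a recursive splitter. It measures length in characters instead of tokens to stay dependency-free, and the separator list and the name `recursive_split` are assumptions, not the wizard's actual configuration.

```python
def recursive_split(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the largest boundary available, recursing into oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No boundary left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, max_len, rest)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # Keep merging while the chunk still fits.
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too long: retry with smaller boundaries.
            chunks.extend(recursive_split(piece, max_len, rest))
    if current:
        chunks.append(current)
    return [chunk for chunk in chunks if chunk.strip()]


print(recursive_split("aaaa bbbb\n\ncccc dddd eeee", max_len=10))
```

In the example, the paragraph break is tried first. The second paragraph is still too long, so it is split again at spaces.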

Chunk Size

Controls how many tokens each chunk contains. Larger chunks provide more context but may include irrelevant content. Smaller chunks are more precise but may miss connections between related ideas.

Overlap

Ensures that content near chunk boundaries isn't lost. A 10% overlap means the last 10% of each chunk repeats at the start of the next one, preserving context that would otherwise be split.
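
As a worked example of how chunk size and overlap interact: with a chunk size of 512 tokens and a 10% overlap, 51 tokens repeat, so each new chunk starts 461 tokens after the previous one. The sketch below shows this sliding window over a list of tokens. The function name and the decision to round the overlap down are assumptions.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap_ratio=0.10):
    """Fixed-size chunks where the tail of each chunk repeats in the next."""
    overlap = int(chunk_size * overlap_ratio)  # Rounded down (assumption).
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the chunk size")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The last chunk already reaches the end of the document.
    return chunks


tokens = list(range(25))
chunks = chunk_with_overlap(tokens, chunk_size=10, overlap_ratio=0.10)
for chunk in chunks:
    print(chunk[0], "...", chunk[-1])
```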

Respect Sentence Boundaries

Ensures chunks split at natural sentence ends rather than mid-sentence. This is important for technical documents containing code, formulas, or structured data where breaking mid-sentence loses meaning.
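
One way to picture this setting is a chunker that packs whole sentences into a chunk until the size budget is reached, so a split can only fall between sentences. The sketch below counts words instead of tokens and uses a deliberately naive sentence splitter, so abbreviations such as "e.g." would be split incorrectly. Both function names are hypothetical.

```python
import re


def split_sentences(text):
    # Naive rule: a sentence ends at ., !, or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(text, max_words=50):
    """Group whole sentences into chunks of at most max_words words."""
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than max_words still becomes its own chunk.
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six! Seven eight? Nine."
print(pack_sentences(text, max_words=5))
```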

Preserve Paragraphs

Keeps entire paragraphs together in a single chunk, even if it means chunks vary in size. This maintains the logical flow of ideas.

Metadata Extraction

Metadata adds contextual information to each chunk, making search results more informative and enabling filtering.

Default Metadata

Setting	What It Does
Extract Metadata	Enables or disables all metadata extraction
Include Document Title	Adds the source document title to each chunk's metadata
Include Chunk Index	Adds the chunk's position number within the source document
Include Timestamps	Adds processing timestamps to track when chunks were created
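
The effect of these toggles on a stored chunk can be sketched like this. The field names (`document_title`, `chunk_index`, `processed_at`), the zero-based index, and the function name `attach_metadata` are assumptions for illustration, so the wizard's actual schema may differ.

```python
from datetime import datetime, timezone


def attach_metadata(chunks, title, include_title=True, include_index=True,
                    include_timestamps=True, custom=None):
    """Wrap each chunk in a record that carries the selected metadata."""
    processed_at = datetime.now(timezone.utc).isoformat()
    records = []
    for index, text in enumerate(chunks):
        metadata = dict(custom or {})  # Custom fields are copied per chunk.
        if include_title:
            metadata["document_title"] = title
        if include_index:
            metadata["chunk_index"] = index  # Zero-based (assumption).
        if include_timestamps:
            metadata["processed_at"] = processed_at
        records.append({"text": text, "metadata": metadata})
    return records


records = attach_metadata(["first chunk", "second chunk"], "User Guide",
                          custom={"department": "legal"})
print(records[1]["metadata"])
```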

Custom Metadata Fields

Add custom metadata fields to capture domain-specific information that helps with filtering and organization. For example, you could add fields like "department," "document version," or "author."
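
To illustrate why custom fields help with filtering, the sketch below narrows a set of chunk records to those whose metadata matches the requested values. The record layout and the `filter_chunks` helper are hypothetical and only meant to show the idea.

```python
def filter_chunks(records, **required):
    """Keep records whose metadata matches every requested field value."""
    return [
        record for record in records
        if all(record["metadata"].get(key) == value for key, value in required.items())
    ]


sample_records = [
    {"text": "Leave policy", "metadata": {"department": "hr", "author": "Kim"}},
    {"text": "NDA template", "metadata": {"department": "legal", "author": "Lee"}},
    {"text": "Contract FAQ", "metadata": {"department": "legal", "author": "Kim"}},
]
legal = filter_chunks(sample_records, department="legal")
print([record["text"] for record in legal])
```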

Tip: For technical documents, enable "Respect Sentence Boundaries" in the Chunking settings so that code snippets and formulas are not split mid-sentence.
