Guided Mind v2.4
RAG Wizard

Step 3 — Document Processing

Configure how your documents are prepared and split before indexing.

This step defines how your raw documents are transformed into searchable chunks. Document processing happens in three phases: preprocessing cleans your text, chunking splits it into manageable pieces, and metadata extraction adds contextual information.

Preprocessing

Preprocessing prepares your raw document text for indexing. The goal is to clean and normalize content so the embedding model can create accurate representations.

General Text Cleaning

Setting | What It Does
Collapse Spaces | Reduces runs of consecutive spaces to a single space, cleaning up inconsistent formatting
Remove Blank Lines | Removes empty lines between paragraphs to reduce noise
Normalize Unicode | Converts Unicode characters to their closest ASCII equivalents (é → e), improving consistency
Lowercase | Converts all text to lowercase, which can improve matching for case-insensitive searches
Remove Non-ASCII | Strips characters outside the standard ASCII range
Spellcheck | Enables spell checking and automatic correction
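
To make the toggles above concrete, here is a minimal Python sketch of how they might combine. It is not the wizard's actual implementation: the function name `clean_text`, its parameters, and the order in which the steps run are all assumptions, and accent stripping through NFKD decomposition is only one way to approximate "Normalize Unicode". Spellcheck is left out.

```python
import re
import unicodedata


def clean_text(text, collapse_spaces=True, remove_blank_lines=True,
               normalize_unicode=True, lowercase=False, remove_non_ascii=False):
    """Apply the cleaning toggles in a fixed order (the order is an assumption)."""
    if normalize_unicode:
        # Decompose accented characters (é becomes e + combining accent),
        # then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if remove_non_ascii:
        # Anything that still has no ASCII form is dropped.
        text = text.encode("ascii", errors="ignore").decode("ascii")
    if collapse_spaces:
        # Only runs of spaces are collapsed; newlines are left alone.
        text = re.sub(r" {2,}", " ", text)
    if remove_blank_lines:
        text = "\n".join(line for line in text.split("\n") if line.strip())
    if lowercase:
        text = text.lower()
    return text


print(clean_text("Café   au  lait\n\n\nNext  line"))
```

Lowercasing and non-ASCII removal discard information, which is why this sketch leaves them off by default.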

PDF Processing

Setting | What It Does
Enable OCR | Extracts text from scanned PDF pages using optical character recognition
OCR Engine | Selects the OCR engine (Tesseract, EasyOCR, PaddleOCR, or TrOCR); each engine handles some document types better than others
Page Range | Limits processing to specific pages instead of the entire document
Remove Headers | Detects and removes repeated page headers that add noise
Remove Footers | Detects and removes repeated page footers that add noise
Merge Lines to Paragraphs | Combines broken lines into complete paragraphs for better readability
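
As an illustration of the header, footer, and line-merging settings, the sketch below works on text that has already been extracted, with each page given as a list of lines. The function names, the `min_fraction` threshold, and the "first or last line that repeats on most pages" heuristic are assumptions made for this example; the wizard's own detection logic is not documented here.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop first/last lines that repeat on most pages (likely headers/footers)."""
    first = Counter(page[0] for page in pages if page)
    last = Counter(page[-1] for page in pages if page)
    threshold = min_fraction * len(pages)
    headers = {line for line, n in first.items() if n >= threshold}
    footers = {line for line, n in last.items() if n >= threshold}
    cleaned = []
    for page in pages:
        lines = list(page)
        if lines and lines[0] in headers:
            lines = lines[1:]
        if lines and lines[-1] in footers:
            lines = lines[:-1]
        cleaned.append(lines)
    return cleaned


def merge_lines(lines):
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Footers that change on every page, such as "Page 1" and "Page 2", would not be caught by this exact-match heuristic.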

HTML Processing

Setting | What It Does
Normalize Whitespace | Cleans up inconsistent spacing in HTML content
Strip Script Tags | Removes <script> tags and their content
Strip Style Tags | Removes <style> tags and their content
Remove Navigation | Removes navigation bars and sidebars
Remove Ads | Removes advertisement content
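
The tag-stripping settings can be sketched with Python's standard `html.parser` module. The class name `TextExtractor`, the helper `html_to_text`, and the choice to treat `<nav>` and `<aside>` as "navigation bars and sidebars" are assumptions for illustration. Ad removal is not attempted, since it depends on site-specific markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of selected tags."""

    SKIPPED = {"script", "style", "nav", "aside"}  # assumed tag set

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Normalize whitespace: collapse every run of whitespace to one space.
    return " ".join(" ".join(parser.parts).split())
```

Joining text fragments with a space keeps adjacent block elements apart, at the cost of occasionally inserting a space where an inline tag sits in the middle of a word.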

Markdown Processing

Setting | What It Does
Chunk by Headings | Splits chunks at heading boundaries to preserve document structure
Max Heading Level | Controls the deepest heading level respected for chunking (e.g., level 3 = H1/H2/H3)
Remove Code Blocks | Strips fenced code blocks from the processed text
Flatten Bullet Lists | Converts nested bullet lists to flat lists for simpler processing
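
To show how "Chunk by Headings" and "Max Heading Level" interact, here is a small sketch that starts a new chunk at every `#`-style heading whose level is within the limit. The function name and the handling of fenced code blocks are assumptions, not the wizard's documented behavior.

```python
import re


def chunk_by_headings(markdown, max_level=3):
    """Start a new chunk at every '#' heading of level <= max_level."""
    heading = re.compile(r"^(#{1,6})\s+\S")
    chunks, current = [], []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' lines inside code fences are not headings
        match = None if in_fence else heading.match(line)
        if match and len(match.group(1)) <= max_level and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

With `max_level=3`, an H4 heading stays inside the chunk of its parent section instead of opening a new one.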

CSV Processing

Setting | What It Does
Remove Empty Rows | Skips rows that contain no data
Trim Numeric Precision | Limits decimal places for numeric values
Collapse Duplicate Rows | Removes rows that are exact duplicates
Max Row Limit | Sets the maximum number of rows to process
Drop Columns | Specifies column indices to exclude from processing
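
The CSV settings can be sketched with Python's standard `csv` module. The function `clean_csv` and its parameters are hypothetical, and several details are assumptions: the first row is treated as a header, column indices are zero-based, and the row limit is counted after empty and duplicate rows have been removed.

```python
import csv
import io


def clean_csv(text, drop_columns=(), max_rows=None, precision=2):
    """Return (header, rows) after applying the CSV settings described above."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    keep = [i for i in range(len(header)) if i not in drop_columns]
    rows, seen = [], set()
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # Remove Empty Rows
        key = tuple(row)
        if key in seen:
            continue  # Collapse Duplicate Rows (exact matches only)
        seen.add(key)
        cells = []
        for i in keep:
            cell = row[i] if i < len(row) else ""
            if "." in cell:
                try:
                    # Trim Numeric Precision
                    cell = str(round(float(cell), precision))
                except ValueError:
                    pass  # not a number, keep the text as-is
            cells.append(cell)
        rows.append(cells)
        if max_rows is not None and len(rows) >= max_rows:
            break  # Max Row Limit
    return [header[i] for i in keep], rows
```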

Chunking

Chunking determines how your documents are split into smaller pieces for embedding. This is one of the most important settings because it directly impacts retrieval quality.

Chunking Strategies

Strategy | How It Works | Best For
Fixed-Size | Splits text at regular token intervals | Consistent, uniform documents
Semantic | Uses AI to find natural topic boundaries | Complex documents with clear sections
Recursive | Tries larger boundaries first, falls back to smaller | Mixed content types
Document-Based | Uses existing document structure | Well-structured documents with headings
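
The Recursive strategy is the least obvious of the four, so here is a rough sketch of the idea: split on the largest boundary first, pack pieces together while they fit, and retry any oversized piece with the next smaller boundary. The function name, the separator list, and the use of whitespace-separated words as a stand-in for tokens are assumptions for this example.

```python
def recursive_split(text, max_words=50, separators=("\n\n", "\n", " ")):
    """Split on the largest boundary first, falling back to smaller ones."""
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= max_words or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate.split()) <= max_words:
            current = candidate  # still fits: keep growing this chunk
            continue
        if current.strip():
            chunks.append(current.strip())
        if len(part.split()) <= max_words:
            current = part
        else:
            # This piece alone is too big: retry with the next smaller boundary.
            chunks.extend(recursive_split(part, max_words, rest))
            current = ""
    if current.strip():
        chunks.append(current.strip())
    return chunks


print(recursive_split("one two three\n\nfour five six seven eight\n\nnine", max_words=3))
```

In the example call, the first and last paragraphs fit as they are, while the middle paragraph is too long and is broken at word boundaries instead.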

Chunk Size

Controls how many tokens each chunk contains. Larger chunks provide more context but may include irrelevant content. Smaller chunks are more precise but may miss connections between related ideas.

Overlap

Ensures that content near chunk boundaries isn't lost. A 10% overlap means the last 10% of each chunk repeats at the start of the next one, preserving context that would otherwise be split.
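
As a worked example of the arithmetic: with a chunk size of 200 tokens and a 10% overlap, 20 tokens repeat, so each new chunk starts 180 tokens after the previous one. The sketch below shows this for a fixed-size splitter; the function name and parameters are hypothetical, and `tokens` can be any list, such as the output of a tokenizer.

```python
def fixed_size_chunks(tokens, chunk_size=200, overlap_ratio=0.10):
    """Split a token list into fixed-size chunks that share an overlap."""
    overlap = round(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far each new chunk advances
    if step <= 0:
        raise ValueError("overlap must be smaller than the chunk size")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end
    return chunks


chunks = fixed_size_chunks(list(range(25)), chunk_size=10, overlap_ratio=0.2)
print(chunks[0][-2:], chunks[1][:2])
```

Here the last two tokens of the first chunk are the same as the first two tokens of the second.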

Respect Sentence Boundaries

Ensures chunks split at natural sentence ends rather than mid-sentence. This is important for technical documents containing code, formulas, or structured data where breaking mid-sentence loses meaning.
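
One way to picture this setting is a splitter that packs whole sentences into a chunk until the size limit would be exceeded. The sketch below is an assumption-laden illustration: the function name is hypothetical, words stand in for tokens, and the regular expression is a naive sentence detector that will mis-split abbreviations such as "e.g." followed by a space.

```python
import re


def sentence_chunks(text, max_words=40):
    """Pack whole sentences into chunks of at most max_words words."""
    # Naive boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `max_words` is kept whole in this sketch, so chunk sizes can exceed the limit in that case.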

Preserve Paragraphs

Keeps entire paragraphs together in a single chunk, even if it means chunks vary in size. This maintains the logical flow of ideas.

Metadata Extraction

Metadata adds contextual information to each chunk, making search results more informative and enabling filtering.

Default Metadata

Setting | What It Does
Extract Metadata | Enables or disables all metadata extraction
Include Document Title | Adds the source document title to each chunk's metadata
Include Chunk Index | Adds the chunk's position number within the source document
Include Timestamps | Adds processing timestamps to track when chunks were created
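
The sketch below shows what a chunk record might look like once these settings are applied. The record layout and the field names `document_title`, `chunk_index`, and `processed_at` are assumptions, as is the zero-based index; the wizard's actual schema may differ.

```python
from datetime import datetime, timezone


def attach_metadata(chunks, document_title, extract_metadata=True,
                    include_title=True, include_index=True,
                    include_timestamps=True):
    """Wrap each chunk in a record carrying the enabled metadata fields."""
    processed_at = datetime.now(timezone.utc).isoformat()
    records = []
    for index, text in enumerate(chunks):
        metadata = {}
        if extract_metadata:  # master switch for all fields
            if include_title:
                metadata["document_title"] = document_title
            if include_index:
                metadata["chunk_index"] = index  # zero-based (assumption)
            if include_timestamps:
                metadata["processed_at"] = processed_at
        records.append({"text": text, "metadata": metadata})
    return records
```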

Custom Metadata Fields

Add custom metadata fields to capture domain-specific information that helps with filtering and organization. For example, you could add fields like "department," "document version," or "author."
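
To illustrate how custom fields enable filtering, here is a minimal sketch that keeps only the chunks whose metadata matches every requested value. The record layout, the helper `filter_chunks`, and the sample values are hypothetical; only the field names "department" and "author" come from the examples above.

```python
def filter_chunks(records, **required):
    """Keep records whose metadata matches every required field exactly."""
    return [
        record for record in records
        if all(record.get("metadata", {}).get(key) == value
               for key, value in required.items())
    ]


records = [
    {"text": "Refund policy", "metadata": {"department": "finance", "author": "Ann"}},
    {"text": "Onboarding steps", "metadata": {"department": "hr", "author": "Bob"}},
]
finance_only = filter_chunks(records, department="finance")
print([record["text"] for record in finance_only])
```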

Tip: For technical documents, enable "Respect Sentence Boundaries" to avoid splitting code snippets or formulas mid-sentence.