Document Processing & Chunking

Configure chunking strategy and text processing in Step 3 of the wizard.

Chunking divides your documents into smaller pieces that can be embedded and searched efficiently. The chunking strategy you choose directly impacts retrieval quality.

Chunking Methods

Method	Best For	Chunk Size	Overlap
Fixed-Size	Uniform documents, manuals	256-2048 tokens	10-20%
Recursive	Mixed content, technical docs	256-2048 tokens	10-20%
Semantic	Articles, blog posts	Variable	N/A
Document	Short docs, FAQ entries	Full document	N/A

Fixed-Size Chunking

Splits text into equal-sized chunks with configurable overlap. Simple and predictable.

Config example:

{
  "chunking": {
    "method": "fixed",
    "chunk_size": 512,
    "chunk_overlap": 50,
    "chunk_min_size": 128,
    "chunk_max_size": 768
  }
}

Best for:

Technical manuals with consistent structure
Legal documents
Large PDFs with uniform content
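
The sliding-window idea behind fixed-size chunking can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the function name fixed_size_chunks is hypothetical, and the "tokens" are plain list items rather than output from a real tokenizer.

```python
def fixed_size_chunks(tokens, chunk_size=512, chunk_overlap=50):
    """Split a token list into equal-sized windows sharing `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Placeholder tokens (an assumption); a real pipeline would use a tokenizer.
tokens = [f"t{i}" for i in range(1200)]
chunks = fixed_size_chunks(tokens, chunk_size=512, chunk_overlap=50)
print([len(c) for c in chunks])
```

Each window starts chunk_size minus chunk_overlap tokens after the previous one, so adjacent chunks share exactly chunk_overlap tokens and only the last chunk can be shorter.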

Recursive Chunking

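Splits text on a prioritized list of separators (paragraphs → sentences → words), falling back to a finer separator only when a piece is still larger than the chunk size. This keeps chunks aligned with semantic boundaries where possible.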

Config example:

{
  "chunking": {
    "method": "recursive",
    "chunk_size": 1024,
    "chunk_overlap": 100,
    "separators": ["\n\n", "\n", ". ", " "]
  }
}

Best for:

Mixed content types
Documents with varying section lengths
General-purpose use (recommended default)
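
As a rough sketch of how a recursive splitter behaves, the Python below tries the coarsest separator first and recurses with finer ones only for pieces that are still too large. It makes several assumptions: the names recursive_split and count_tokens are hypothetical, whitespace-separated words stand in for tokens, and overlap is left out to keep the example short.

```python
def count_tokens(text):
    """Approximate token count using whitespace-separated words (an assumption)."""
    return len(text.split())


def recursive_split(text, chunk_size, separators):
    """Split on the coarsest separator; recurse with finer ones for oversized pieces."""
    if count_tokens(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard split on words.
        words = text.split()
        return [" ".join(words[i:i + chunk_size])
                for i in range(0, len(words), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if count_tokens(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if count_tokens(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current.strip():
        chunks.append(current)
    return chunks


doc = ("Intro paragraph one.\n\n"
       "Second paragraph is longer. It has two sentences.\n\n"
       "Third.")
chunks = recursive_split(doc, chunk_size=6, separators=["\n\n", "\n", ". ", " "])
for chunk in chunks:
    print(repr(chunk))
```

In this sketch the separator itself is dropped wherever a split happens, so the first sentence of the second paragraph loses its trailing period. A production splitter would normally keep it.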

Semantic Chunking

Uses embedding similarity to detect topic boundaries and split where meaning changes.

Best for:

Blog posts and articles
Meeting transcripts
Content with clear topic transitions
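
To illustrate the idea of splitting where meaning changes, the sketch below compares each sentence with the next and starts a new chunk when their similarity drops below a threshold. Everything here is an assumption for demonstration: the embed function is a toy bag-of-words stand-in for a real embedding model, the 0.2 threshold is arbitrary, and semantic_chunks is a hypothetical name.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk when adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            groups.append([cur])      # topic boundary detected
        else:
            groups[-1].append(cur)    # same topic, extend the current chunk
    return [" ".join(group) for group in groups]


sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Quarterly revenue grew strongly.",
    "Revenue growth beat forecasts.",
]
chunks = semantic_chunks(sentences)
print(chunks)
```

With a real embedding model the threshold would need tuning against your own content, since similarity values depend heavily on the model.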

Document-Based Chunking

Treats each document (or major section) as a single chunk.

Best for:

FAQ entries
Short product descriptions
Documents under 512 tokens

Processing Options

Text Preprocessing

Option	Description	Default
Remove extra whitespace	Collapse multiple spaces/newlines	Enabled
Normalize Unicode	Convert special characters to a standard form	Enabled
Strip HTML tags	Remove HTML markup from content	Enabled
OCR for scanned PDFs	Extract text from images in PDFs	Disabled

Enable OCR only for scanned PDFs. OCR adds significant processing time and is unnecessary for text-extractable PDFs.
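
The three text-based options can be approximated with the Python standard library, as in the sketch below. It is not the product's pipeline: the preprocess function and its parameter names are hypothetical, NFKC is only one plausible choice of Unicode normalization form, and OCR is out of scope here.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


def preprocess(text, strip_tags=True, normalize_unicode=True,
               collapse_whitespace=True):
    if strip_tags:
        text = strip_html(text)
    if normalize_unicode:
        # NFKC folds compatibility characters (e.g. fullwidth letters) to standard ones.
        text = unicodedata.normalize("NFKC", text)
    if collapse_whitespace:
        text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs -> one space
        text = re.sub(r"\n{3,}", "\n\n", text)   # 3+ newlines -> one blank line
        text = text.strip()
    return text


raw = "<p>Hello   <b>world</b></p>\n\n\n\n<p>\uff21\uff22\uff23</p>"
clean = preprocess(raw)
print(repr(clean))
```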

Content Filtering

Filter	Description
Skip empty chunks	Remove chunks with no meaningful content
Skip boilerplate	Filter out headers, footers, page numbers
Minimum chunk size	Drop chunks below token threshold
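
A minimal sketch of how these three filters might combine is shown below. The function filter_chunks, the page-number pattern, and the default threshold of 5 are all illustrative assumptions, and word counts stand in for token counts.

```python
import re

# Hypothetical boilerplate pattern: lines such as "7" or "Page 3 of 12".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def filter_chunks(chunks, min_tokens=5, boilerplate_patterns=(PAGE_NUMBER,)):
    kept = []
    for chunk in chunks:
        # Skip boilerplate: drop individual lines that match a pattern.
        lines = [line for line in chunk.splitlines()
                 if not any(p.match(line) for p in boilerplate_patterns)]
        text = "\n".join(lines).strip()
        if not text:
            continue  # skip empty chunks
        if len(text.split()) < min_tokens:
            continue  # below the minimum chunk size
        kept.append(text)
    return kept


chunks = [
    "Page 3 of 12\nChunking divides documents into smaller searchable pieces.",
    "   \n\n",
    "7",
    "Too short.",
]
kept = filter_chunks(chunks)
print(kept)
```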

Chunk Preview

After configuring chunking, use the Chunk Preview tool to see how your documents will be split:

Select a sample document from your uploads
Click Preview Chunks
Review the generated chunks and adjust settings

The preview shows:

Chunk content and token count
Overlap between adjacent chunks
Total chunk count for the document

Similarity Score Test

Use the Similarity Score Test tool to verify that your chunking produces good retrieval results:

Enter a test query
View matching chunks with similarity scores
Adjust chunking settings if scores are low

Score ranges:

0.85+ — Excellent match
0.70-0.84 — Good match
0.50-0.69 — Acceptable
Below 0.50 — Review chunking or content quality

Low similarity scores usually mean chunks are too large (diluting meaning) or too small (losing context). Try adjusting chunk_size and chunk_overlap.
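
The score bands above translate directly into a small helper, sketched here in Python. The names rate_score and cosine_similarity are hypothetical. The sketch also assumes the tool reports cosine similarity between query and chunk embeddings, which this page does not state.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rate_score(score):
    """Map a similarity score to the bands listed above."""
    if score >= 0.85:
        return "Excellent match"
    if score >= 0.70:
        return "Good match"
    if score >= 0.50:
        return "Acceptable"
    return "Review chunking or content quality"


# Made-up two-dimensional vectors, purely for illustration.
score = cosine_similarity([1.0, 0.0], [1.0, 1.0])
print(round(score, 3), rate_score(score))
```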

Recommended Settings by Document Type

Document Type	Method	Chunk Size (tokens)	Overlap (tokens)
Technical Manual	Recursive	512-1024	50-100
Legal Contract	Fixed	256-512	25-50
Blog Post	Semantic	Variable	N/A
FAQ Entry	Document	Full doc	N/A
Research Paper	Recursive	1024-2048	100-200
CSV Data	Fixed	128-256	0
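
One way to apply this table programmatically is a lookup that returns a ready-to-send config, sketched below. The specific values are picks from the low end of each recommended range. The method identifiers "semantic" and "document" are assumptions, since only "fixed" and "recursive" appear in the config examples on this page. The name chunking_config is hypothetical.

```python
import json

# Starting points taken from the low end of each recommended range.
# "semantic" and "document" are assumed identifiers, not confirmed ones.
RECOMMENDED = {
    "Technical Manual": {"method": "recursive", "chunk_size": 512, "chunk_overlap": 50},
    "Legal Contract": {"method": "fixed", "chunk_size": 256, "chunk_overlap": 25},
    "Blog Post": {"method": "semantic"},
    "FAQ Entry": {"method": "document"},
    "Research Paper": {"method": "recursive", "chunk_size": 1024, "chunk_overlap": 100},
    "CSV Data": {"method": "fixed", "chunk_size": 128, "chunk_overlap": 0},
}


def chunking_config(document_type):
    """Return a config dict shaped like the examples on this page."""
    try:
        return {"chunking": dict(RECOMMENDED[document_type])}
    except KeyError:
        raise ValueError(f"no recommendation for {document_type!r}") from None


config = chunking_config("Legal Contract")
print(json.dumps(config, indent=2))
```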

Config via API

Pass chunking config when uploading:

curl -X POST "https://api.guidedmind.ai/rag/upload-and-process" \
  -H "X-API-Key: rk_your_key_here" \
  -F "file=@document.pdf" \
  -F 'config={
    "chunking": {
      "method": "recursive",
      "chunk_size": 512,
      "chunk_overlap": 50
    }
  }'
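
For callers who prefer Python over curl, the same multipart request can be assembled with the standard library, as sketched below. The URL, header name, and form field names are copied from the curl example above. The function build_upload_request is hypothetical, and the response format is not covered here, so the sketch only builds the request and does not send it.

```python
import json
import urllib.request
import uuid

API_URL = "https://api.guidedmind.ai/rag/upload-and-process"


def build_upload_request(file_bytes, filename, config, api_key):
    """Build (but do not send) a multipart/form-data upload request."""
    boundary = uuid.uuid4().hex

    def part(headers, payload):
        return (b"--" + boundary.encode() + b"\r\n"
                + headers.encode() + b"\r\n\r\n" + payload + b"\r\n")

    body = (
        part(f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
             "Content-Type: application/octet-stream", file_bytes)
        + part('Content-Disposition: form-data; name="config"',
               json.dumps(config).encode())
        + b"--" + boundary.encode() + b"--\r\n"   # closing boundary
    )
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "X-API-Key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


config = {"chunking": {"method": "recursive", "chunk_size": 512, "chunk_overlap": 50}}
request = build_upload_request(b"%PDF-1.4 placeholder", "document.pdf",
                               config, "rk_your_key_here")
# To send it: urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```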

Next Step

After configuring chunking, move to Pipeline Configuration to select your embedding model and retrieval method.

Do

Use the Chunk Preview to verify settings before processing
Test similarity scores with real queries
Start with Recursive chunking for mixed content
Adjust overlap if chunks feel disconnected

Don't

Use 100% overlap — it wastes storage and slows search
Chunk at exactly token limits — leave room for metadata
Process all documents before testing one
Ignore low similarity scores — they indicate problems
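
Some of these checks can be automated before any documents are processed. The sketch below is one possible pre-flight check, not a feature of the product. The name validate_chunking and the embedding_token_limit parameter are hypothetical, and the 20% ceiling is taken from the overlap range in the Chunking Methods table.

```python
def validate_chunking(config, embedding_token_limit=None):
    """Return a list of warnings for a chunking config dict."""
    settings = config["chunking"]
    size = settings.get("chunk_size")
    overlap = settings.get("chunk_overlap", 0)
    warnings = []
    if size is None:
        return warnings  # methods without a fixed size have nothing to check here
    if overlap >= size:
        warnings.append("chunk_overlap must be smaller than chunk_size")
    elif overlap > 0.2 * size:
        warnings.append("chunk_overlap exceeds 20% of chunk_size")
    if embedding_token_limit is not None and size >= embedding_token_limit:
        warnings.append("chunk_size leaves no room for metadata under the token limit")
    return warnings


ok = validate_chunking(
    {"chunking": {"method": "recursive", "chunk_size": 512, "chunk_overlap": 50}})
bad = validate_chunking(
    {"chunking": {"method": "fixed", "chunk_size": 512, "chunk_overlap": 512}})
print(ok, bad)
```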
