Guided Mind v2.4

Document Processing & Chunking

Configure chunking strategy and text processing in Step 3 of the wizard.

Chunking divides your documents into smaller pieces that can be embedded and searched efficiently. The chunking strategy you choose directly impacts retrieval quality.

Chunking Methods

Method     | Best For                      | Chunk Size      | Overlap
Fixed-Size | Uniform documents, manuals    | 256-2048 tokens | 10-20%
Recursive  | Mixed content, technical docs | 256-2048 tokens | 10-20%
Semantic   | Articles, blog posts          | Variable        | N/A
Document   | Short docs, FAQ entries       | Full document   | N/A

Fixed-Size Chunking

Splits text into equal-sized chunks with configurable overlap. Simple and predictable.

Config example:

{
  "chunking": {
    "method": "fixed",
    "chunk_size": 512,
    "chunk_overlap": 50,
    "chunk_min_size": 128,
    "chunk_max_size": 768
  }
}

Best for:

  • Technical manuals with consistent structure
  • Legal documents
  • Large PDFs with uniform content
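
To make the sliding-window behavior concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only, not Guided Mind's implementation: the function name `fixed_size_chunks` is hypothetical, placeholder strings stand in for real tokenizer output, and the `chunk_min_size` and `chunk_max_size` keys from the config are not modeled.

```python
def fixed_size_chunks(tokens, chunk_size=512, chunk_overlap=50):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Assumption: placeholder strings stand in for real tokens.
tokens = [f"w{i}" for i in range(1200)]
chunks = fixed_size_chunks(tokens, chunk_size=512, chunk_overlap=50)
print([len(c) for c in chunks])  # [512, 512, 276]
```

Each window advances by `chunk_size - chunk_overlap` tokens, so the last 50 tokens of one chunk reappear as the first 50 tokens of the next.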

Recursive Chunking

Splits text on a prioritized list of separators (paragraphs → sentences → words), trying the coarsest separator first so that chunks break at natural boundaries where possible.

Config example:

{
  "chunking": {
    "method": "recursive",
    "chunk_size": 1024,
    "chunk_overlap": 100,
    "separators": ["\n\n", "\n", ". ", " "]
  }
}

Best for:

  • Mixed content types
  • Documents with varying section lengths
  • General-purpose use (recommended default)
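
The separator fallback can be sketched as follows. This is a simplified illustration under stated assumptions, not the product's algorithm: `recursive_split` is a hypothetical name, size is measured in whitespace-separated words rather than tokens, and overlap is left out to keep the recursion easy to follow.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing into pieces that are still too long."""
    if len(text.split()) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard split on words.
        words = text.split()
        return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    head, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(head):
        pieces.extend(recursive_split(part, chunk_size, rest))
    # Greedily merge neighbors that fit together, so chunks are not needlessly small.
    merged = []
    for piece in pieces:
        if merged and len(merged[-1].split()) + len(piece.split()) <= chunk_size:
            merged[-1] = merged[-1] + head + piece
        else:
            merged.append(piece)
    return merged


text = (
    "Intro sentence one. Intro sentence two.\n\n"
    "Second paragraph is here and it is long. It has two sentences in it.\n\n"
    "End."
)
chunks = recursive_split(text, chunk_size=8)
for chunk in chunks:
    print(repr(chunk))
```

With an 8-word limit, the first paragraph stays whole, the second is split at its sentence boundary, and the short final paragraph is merged into the preceding piece. Note that `str.split` drops the separator between pieces that are not merged back together.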

Semantic Chunking

Uses embedding similarity to detect topic boundaries and split where meaning changes.

Best for:

  • Blog posts and articles
  • Meeting transcripts
  • Content with clear topic transitions
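
As a rough illustration of boundary detection, the sketch below starts a new chunk whenever two adjacent sentences fall below a similarity threshold. Everything here is an assumption for demonstration: `embed` is a toy bag-of-words stand-in for a real embedding model, the `threshold` value is arbitrary, and the product may detect boundaries differently.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append([curr])  # topic changed: open a new chunk
        else:
            chunks[-1].append(curr)
    return [" ".join(c) for c in chunks]


sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Stock prices rose sharply today.",
    "Stock markets and prices closed higher.",
]
result = semantic_chunks(sentences)
print(len(result))  # 2
```

The two cat sentences share vocabulary and stay together, while the shift to stock prices shares no words with the previous sentence and triggers a split.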

Document-Based Chunking

Treats each document (or major section) as a single chunk.

Best for:

  • FAQ entries
  • Short product descriptions
  • Documents under 512 tokens

Processing Options

Text Preprocessing

Option                  | Description                                    | Default
Remove extra whitespace | Collapse multiple spaces/newlines              | Enabled
Normalize Unicode       | Convert special characters to a standard form  | Enabled
Strip HTML tags         | Remove HTML markup from content                | Enabled
OCR for scanned PDFs    | Extract text from images in PDFs               | Disabled

Enable OCR only for scanned PDFs. OCR adds significant processing time and is unnecessary for text-extractable PDFs.
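
The three options that are enabled by default can be approximated in a few lines of standard-library Python. This is a sketch under assumptions, not the pipeline's actual code: the function name `preprocess` is hypothetical, NFKC is assumed as the normalization form, the regex-based tag removal is only a rough approximation of real HTML parsing, and OCR is not covered.

```python
import html
import re
import unicodedata


def preprocess(text):
    """Approximate the default preprocessing options on a raw text string."""
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags (rough approximation)
    text = html.unescape(text)                   # decode entities such as &amp;
    text = unicodedata.normalize("NFKC", text)   # assumed form: folds ligatures, NBSP, etc.
    # Collapse runs of spaces/tabs and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    text = "\n".join(lines)
    text = re.sub(r"\n{3,}", "\n\n", text)       # keep at most one blank line in a row
    return text.strip()


raw = "<p>\ufb01le   name</p>\n\n\n\n<b>Tom &amp; Jerry</b>"
clean = preprocess(raw)
print(repr(clean))  # 'file name\n\nTom & Jerry'
```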

Content Filtering

Filter             | Description
Skip empty chunks  | Remove chunks with no meaningful content
Skip boilerplate   | Filter out headers, footers, page numbers
Minimum chunk size | Drop chunks below token threshold
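
A filter pass along these lines might look like the sketch below. The names `keep_chunk` and `PAGE_NUMBER` are hypothetical, whitespace-separated words stand in for tokens, the `min_tokens` default is arbitrary, and only page-number boilerplate is detected (spotting repeated headers and footers would need cross-page comparison).

```python
import re

# Matches bare page markers such as "42", "Page 3", or "page 3 of 12".
PAGE_NUMBER = re.compile(r"^(page\s+)?\d+(\s+of\s+\d+)?$", re.IGNORECASE)


def keep_chunk(chunk, min_tokens=5):
    """Return True if the chunk passes all three filters."""
    text = chunk.strip()
    if not text:
        return False  # skip empty chunks
    if PAGE_NUMBER.match(text):
        return False  # skip page-number boilerplate
    return len(text.split()) >= min_tokens  # enforce the minimum chunk size


candidates = [
    "",
    "   ",
    "Page 3 of 12",
    "42",
    "Too short",
    "This chunk has enough words to keep.",
]
kept = [c for c in candidates if keep_chunk(c)]
print(kept)  # ['This chunk has enough words to keep.']
```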

Chunk Preview

After configuring chunking, use the Chunk Preview tool to see how your documents will be split:

  1. Select a sample document from your uploads
  2. Click Preview Chunks
  3. Review the generated chunks and adjust settings

The preview shows:

  • Chunk content and token count
  • Overlap between adjacent chunks
  • Total chunk count for the document

Similarity Score Test

Use the Similarity Score Test tool to verify your chunking produces good retrieval:

  1. Enter a test query
  2. View matching chunks with similarity scores
  3. Adjust chunking settings if scores are low

Score ranges:

  • 0.85+: Excellent match
  • 0.70-0.84: Good match
  • 0.50-0.69: Acceptable
  • Below 0.50: Review chunking or content quality

Low similarity scores usually mean chunks are too large (diluting meaning) or too small (losing context). Try adjusting chunk_size and chunk_overlap.
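
For readers who want to reproduce the score bands outside the tool, the sketch below computes a similarity and maps it to the labels listed above. It assumes the scores are cosine similarities between query and chunk embeddings, which this page does not state; `cosine_similarity` and `score_band` are hypothetical helper names and the vectors are made-up examples.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_band(score):
    """Map a similarity score to the bands listed above."""
    if score >= 0.85:
        return "Excellent match"
    if score >= 0.70:
        return "Good match"
    if score >= 0.50:
        return "Acceptable"
    return "Review chunking or content quality"


query_vec = [1.0, 0.0, 1.0]  # made-up embedding values
chunk_vec = [1.0, 1.0, 1.0]
score = cosine_similarity(query_vec, chunk_vec)
print(round(score, 3), score_band(score))  # 0.816 Good match
```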

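Recommended settings by document type:

Document Type    | Method    | Chunk Size (tokens) | Overlap (tokens)
Technical Manual | Recursive | 512-1024            | 50-100
Legal Contract   | Fixed     | 256-512             | 25-50
Blog Post        | Semantic  | Variable            | N/A
FAQ Entry        | Document  | Full doc            | N/A
Research Paper   | Recursive | 1024-2048           | 100-200
CSV Data         | Fixed     | 128-256             | 0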

Config via API

Pass chunking config when uploading:

curl -X POST "https://api.guidedmind.ai/rag/upload-and-process" \
  -H "X-API-Key: rk_your_key_here" \
  -F "file=@document.pdf" \
  -F 'config={
    "chunking": {
      "method": "recursive",
      "chunk_size": 512,
      "chunk_overlap": 50
    }
  }'
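
The methods table gives overlap as a percentage, while the config examples use absolute numbers. The sketch below shows one way to build the JSON string for the `config` form field from a percentage. It assumes `chunk_overlap` is measured in tokens, which the examples suggest but this page does not state; `build_chunking_config` is a hypothetical helper, not part of any Guided Mind SDK, and it does not send a request.

```python
import json


def build_chunking_config(method="recursive", chunk_size=512, overlap_pct=10):
    """Return the JSON string for the `config` form field.

    Assumption: chunk_overlap is in tokens, so a percentage is converted first.
    """
    if not 0 <= overlap_pct < 100:
        raise ValueError("overlap_pct must be in the range [0, 100)")
    chunk_overlap = round(chunk_size * overlap_pct / 100)
    return json.dumps(
        {
            "chunking": {
                "method": method,
                "chunk_size": chunk_size,
                "chunk_overlap": chunk_overlap,
            }
        }
    )


config_field = build_chunking_config(chunk_size=512, overlap_pct=10)
print(config_field)
```

For a 512-token chunk, 10% works out to 51 tokens, close to the value of 50 used in the request above.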

Next Step

After configuring chunking, move to Pipeline Configuration to select your embedding model and retrieval method.

Do

  • Use the Chunk Preview to verify settings before processing
  • Test similarity scores with real queries
  • Start with Recursive chunking for mixed content
  • Adjust overlap if chunks feel disconnected

Don't

  • Use 100% overlap; it wastes storage and slows search
  • Size chunks at exactly the token limit; leave headroom for metadata
  • Process all documents before testing one
  • Ignore low similarity scores; they indicate problems